[MERGE] Comparison cache to speed up diff
Aaron Bentley
aaron.bentley at utoronto.ca
Fri Jul 20 22:46:15 BST 2007
John Arbash Meinel wrote:
> I wonder if you are hitting this 2x. Once to check the file sha1, and then you
> read the files again when you actually compare the content.
>
> This is one of those cases where we might be better off if _iter_changes()
> could report that something might be changed, but it isn't sure, rather than
> having it actually read the file.
As I have said many times, iter_changes doesn't need to read the file to
know whether it has changed, in the vast majority of cases. If the WT4
implementation does, then please fix it so that it works the way the
default implementation works.
Specifically, the default implementation determines the size of the
file, and if the file sizes differ, it knows the file has changed.
Similarly, if it stats the file, and finds it matches the cached SHA1
stat, then it knows that the file has not changed.
This covers the vast majority of cases. In the rare case where the file
stat differs but the size does not, it will need to calculate the SHA1.
The reason _iter_changes does not yield the sha1 is precisely so that it
does not have to know what the sha1 is.
> In general, it doesn't seem helpful to have 'DirState.update_entry' actually
> read the file and generate the sha1. Instead we should have it just return the
> information it has (after doing an out-of-date check). And have a way to tell
> it later "here is the actual information for that file, keep track of it for me".
I have no idea about any of this. WorkingTree4 is largely a black box
to me.
> Anyway, I'm not sure if you have ideas what needs to be done here, but I'm
> hesitant to add a cache that only improves performance by 1s (out of 4.5s) that
> costs 1.5s (out of 4.5s) to create.
I guess I think that repeat diffs are more common than initial diffs, so
there's still some win, and more so once other commands start populating
the cache.
But I hear your concern and I will see if I can reduce the cost of the
initial diff.
> I think you mentioned that 'bzr merge' can populate this, which could be really
> nice. Not to mention that many people run 'bzr diff' before they run 'bzr commit'.
Yes, the additional command support will really make the caching shine.
We can also have revert (to historical versions) update the cache.
> I'm a little concerned about the layering to get all of that working (you have
> to pass the tree.get_matching_blocks down into the VersionedFile.add_lines()
> layer so that it can re-use the blocks while committing, etc).
Well, this is targeted API work, at least as I understand the term.
tree.get_matching_blocks is key to making diff fast. If the targeted
APIs reveal problems with the layering, we may have to fix those.
> You seem to be adding the cache to all working tree formats (even very old
> ones). Is this reasonable?
Well, I want it to be tested on all working tree implementations, so
that future working tree formats will support caching properly. Adding
support to all formats seemed to be the simplest way, and I couldn't see
any harm.
> Certainly older bzr won't worry about having an
> out-of-date cache (it will just ignore the file), but it also won't clear the
> cache on commit.
That seems like an acceptable situation to me. We've done similar
things with the inventory cache.
> Do we want to clear the whole cache on commit? What about a
> partial commit?
It's a 95% solution. If we want, we can salvage some comparisons in
some cases. It just doesn't seem like a high priority.
> + comparison = tree._get_cached_comparison('file-id', 'sha1', 'rev-id@')
> + tree._clear_comparison_cache()
>
> ^- You seem to have a 'comparison = tree._get_cached_comparison' that you don't
> do anything with.
>
> I'm guessing that you wanted to add one at the end, such that
> _clear_comparison_cache() ensures that '_get_cached_comparison' returns None.
Yeah. Don't know how that got dropped.
> You seem to be asserting that it can indeed generate a 'cache_path' for unicode
> paths, but you don't actually test what the result is:
> + def test_cache_paths(self):
> + tree = self.make_branch_and_tree('.')
> + self.assertEqual('comparison-cache/a%2B+c',
> + tree._comparison_cache_path('a+', 'c'))
> + self.assertEqual('comparison-cache/a+%2Fc',
> + tree._comparison_cache_path('a', '/c'))
> + tree._comparison_cache_path(u'\u1234', u'\u1234')
> + self.assertRaises(UnicodeDecodeError,
> + tree._comparison_cache_path, '\xff', u'\xff')
>
> Specifically, what happens when the filesystem cannot properly represent these
> paths?
The generated paths are all ASCII, so why would they be a problem?
> If these are revision_ids and file_ids, (sha hash as well?) then you have to be
> aware that in memory revision_ids and file_ids are utf-8 strings, not Unicode.
I'd forgotten, actually. I'll change that test to use the utf-8 form.
> That will certainly not be representable on some filesystems. (well, maybe as
> 8-bit strings, which should be okay since they aren't paths that are observed
> by the user).
These paths are URL-quoted, and URL quoting doesn't produce non-ASCII characters.
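A sketch of the idea, using the stdlib quoter rather than bzrlib's (the helper name and `comparison-cache/` prefix are taken from the quoted test; the in-memory ids are utf-8 byte strings, and quoting them always yields pure ASCII):

```python
from urllib.parse import quote

def cache_path(file_id, revision_id):
    # Hypothetical mirror of _comparison_cache_path: quote each id with
    # no safe characters, so '+', '/', and all non-ASCII bytes are
    # percent-escaped into an ASCII-only path segment.
    return 'comparison-cache/%s+%s' % (
        quote(file_id, safe=''), quote(revision_id, safe=''))
```

Since percent-escaping only ever emits characters from the ASCII range, the resulting names are representable on any filesystem.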
> You are adding a new parameter to 'internal_diff' and 'external_diff'
> (file_id). I'm not sure if you are testing that it still does the right thing
> if file_id = None. Certainly I didn't see a direct test for
> 'get_matching_blocks(file_id=None)'.
Okay.
> Also, internal_diff is used when comparing two committed trees (bzr diff -r
> 10..11). I could have just missed the part where you implemented
> 'get_matching_blocks()' to compare between any 2 committed trees. (Or it may
> just be in your other patch).
This patch contains everything I've done so far. I haven't done any
special work for get_matching_blocks between two historical trees yet,
so they fall back to the default implementation:
InterTrees.get_matching_blocks().
> + def _set_cached_comparison(self, file_id, sha1, revision_id, blocks):
> + path = self._comparison_cache_path(file_id, revision_id)
> + if blocks is None:
> + self._control_files._transport.delete(path)
> + return
>
> ^- This looks like you will get NoSuchFile if you delete something that happens
> to not be cached. (Or some other process already cleared it, or some other such
> thing).
Okay, will fix.
> You should have some tests for the explicit serialization of the cached blocks,
> so that we guarantee that future versions of bzr are compatible with all
> versions that support this feature.
> Arguably the files should contain a file-version header, so that we can make
> sure we are reading what we think we are reading.
In some cases, a header would double the size of the files!
> And it gives us a way to change the cache format, and have old clients fail
> gracefully, rather than returning either bogus data, or just falling over.
Sigh, I guess we can do that.
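A one-line format marker keeps the overhead small while still letting future clients reject a cache format they don't understand. A hypothetical sketch (the header string and function names are invented for illustration):

```python
_CACHE_HEADER = b'bzr comparison cache v1\n'  # hypothetical format marker

def serialize_cache(payload):
    """Prefix the bencoded payload with a version header."""
    return _CACHE_HEADER + payload

def deserialize_cache(data):
    """Strip and verify the header; old clients reading a newer format
    fail cleanly here instead of returning bogus block data."""
    if not data.startswith(_CACHE_HEADER):
        raise ValueError('unrecognized comparison-cache format')
    return data[len(_CACHE_HEADER):]
```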
> ^- You might consider put_bytes_non_atomic() which allows you to supply
> 'create_parent_dir=True' if you so desire.
>
> We may not want to, though. 'put_bytes_non_atomic()' will create the file in
> the target location, and write to it directly. Rather than writing to a temp
> file, and renaming into place.
It's important for these operations to be atomic.
> As you don't have an 'end-of-content' marker, we don't have a way to know if
> the write finished successfully.
All bencode encodings are self-terminating, so there's no reason to do
extra work.
> We can get away with it for knits, because all
> data is only referenced/considered valid when we know that the write succeeded.
But it's a different situation, because we append to knits. These are
atomic-write files. They can't get renamed into place unless we've
finished writing them.
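The write-to-temp-then-rename pattern being discussed looks roughly like this (a generic sketch, not bzrlib's AtomicFile):

```python
import os
import tempfile

def atomic_write(path, data):
    """The final name only ever refers to a completely written file:
    readers see either the old content or the new, never a partial
    write.  os.replace is atomic on POSIX (and on Windows within a
    volume)."""
    directory = os.path.dirname(path) or '.'
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        os.replace(tmp, path)  # the atomic step
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

This is why the cache files don't need an end-of-content marker: a half-written temp file is simply never renamed into place.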
> Code-wise, there are a few things to clean up. I'm not convinced that the cache
> is a net-win, but I realize you have more experience in the 'diff' world.
I'm not a big cache-lover, but sequence-matching algorithms are
different for me, because of the way they scale. The best way to
improve the speed of an O(n^2) algorithm is to avoid it completely.
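The avoid-it-completely idea can be illustrated with difflib, whose SequenceMatcher is worst-case quadratic: memoize the matching blocks keyed by the two texts' hashes, so a repeat diff of the same pair never runs the matcher at all. (A toy sketch; the cache here is in-memory, whereas the patch persists blocks on disk.)

```python
import difflib
import hashlib

_block_cache = {}  # (sha1_of_old, sha1_of_new) -> matching blocks

def cached_matching_blocks(old_lines, new_lines):
    key = (hashlib.sha1(''.join(old_lines).encode()).hexdigest(),
           hashlib.sha1(''.join(new_lines).encode()).hexdigest())
    if key not in _block_cache:
        # The O(n^2) step, run at most once per distinct text pair.
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
        _block_cache[key] = matcher.get_matching_blocks()
    return _block_cache[key]
```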
> I think the current real killers for us are the time it takes to extract data
> from knits (which we need to address anyway), and the time it takes to do the
> 'diff' of 2 texts (which we also need to address).
Well, my plan for reducing the time it takes to diff texts is to avoid
it where possible.
> Then again, you've written the code, and if we can make it at least perform
> close to comparable in the "worst" case, and have a large win otherwise, then
> it may be worth it.
>
> If 'get_file()' is costing 50% of the total time to do 'bzr diff', do you have
> any indications of why it is slow?
Knit indices. Poor locality of reference. Basically, we need a new
repository format.
> It certainly seems like we can save a lot
> more time by making that faster, than doing diff caching.
That's 50% of 3.556 seconds, or about 1.8 seconds. The diff caching
saves about 1 second. This is not an either/or situation: these gains
are in the same order of magnitude, and both are worth having, imho.
Aaron