[MERGE] Comparison cache to speed up diff

John Arbash Meinel john@arbash-meinel.com
Fri Jul 20 21:31:51 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> Hi all,
> 
> This implements the first part of my diff-speedup plans:  A cache of
> SequenceMatcher.get_matching_blocks output.  This is useful because file
> comparisons have poor scaling properties.
> 
> So a list of matching blocks is associated with the working tree SHA1
> sum and the basis tree's file revision.
> 
> Caches are created when you run diff, reused the next time you run diff,
> and cleared when you commit.
> 
> I tested against a bzr.dev tree after doing "revert -r -50"
> 
> For vanilla bzr, the best diff time was:
> real    0m4.538s
> user    0m4.020s
> sys     0m0.296s
> 
> When writing the cache for the first time, the best result was 1.3x slower:
> real    0m6.114s
> user    0m5.276s
> sys     0m0.492s
> 
> When reusing the cached data, the best result was 1.7x faster:
> real    0m3.556s
> user    0m2.856s
> sys     0m0.308s

That looks like it is only 1.27x faster than not using a cache at all
(4.538s / 3.556s). It is 1.7x faster than the run that had to write the cache,
but only marginally faster than having no cache in the first place.

> 
> As usual, file access time is a significant factor: According to
> lsprof/kcachegrind: get_file takes 49.61 of 103.05 ticks.
> 
> Aaron

I wonder if you are paying this cost twice: once to check the file sha1, and
then again when you read the files to actually compare the content.

This is one of those cases where we might be better off if _iter_changes()
could report that something might be changed, but it isn't sure, rather than
having it actually read the file.

In general, it doesn't seem helpful to have 'DirState.update_entry' actually
read the file and generate the sha1. Instead, we should have it just return the
information it has (after doing an out-of-date check), and have a way to tell
it later "here is the actual information for that file, keep track of it for me".


Anyway, I'm not sure whether you have ideas about what needs to be done here,
but I'm hesitant to add a cache that improves performance by only 1s (out of
4.5s) while costing 1.5s (out of 4.5s) to create.

I think you mentioned that 'bzr merge' can populate this, which could be really
nice. Not to mention that many people run 'bzr diff' before they run 'bzr commit'.

I'm a little concerned about the layering to get all of that working (you have
to pass the tree.get_matching_blocks down into the VersionedFile.add_lines()
layer so that it can re-use the blocks while committing, etc).


Anyway, on to more of the code-level review:

=== added file 'bzrlib/interbasisworking.py'
...

You don't have a copyright statement, or a docstring at the top of this file.

And you should have 2 blank lines between the imports and the first class definition.

+    @needs_read_lock
+    def get_matching_blocks(self, file_id):
+        return None
+

You have this function defined twice for InterBasisWorking, once as a no-op and
once fully defined.


=== added file 'bzrlib/tests/intertree_implementations/test_diff.py'

you also need 2 blank lines at the start of this file



You seem to be adding the cache to all working tree formats (even very old
ones). Is this reasonable? Certainly older bzr won't worry about having an
out-of-date cache (it will just ignore the file), but it also won't clear the
cache on commit. Do we want to clear the whole cache on commit? What about a
partial commit?


=== added file 'bzrlib/tests/workingtree_implementations/test_compare_cache.py'
...
+    def test_basic_caching(self):
+        tree = self.make_branch_and_tree('.')
+        # empty caches can be cleared
+        tree._clear_comparison_cache()
+        tree.lock_read()
+        self.addCleanup(tree.unlock)
+        comparison = tree._get_cached_comparison('file-id', 'sha1', 'rev-id@')
+        self.assertIs(None, comparison)
+        blocks = [(0, 0, 5)]
+        tree._set_cached_comparison('file-id', 'sha1', 'rev-id@', blocks)
+        tree = workingtree.WorkingTree.open('.')
+        comparison = tree._get_cached_comparison('file-id', 'sha1', 'rev-id@')
+        self.assertEqual(blocks, comparison)
+        comparison = tree._get_cached_comparison('file-id', 'sha1', 'rev-id@2')
+        self.assertIs(None, comparison)
+        comparison = tree._get_cached_comparison('file-id2', 'sha1', 'rev-id@')
+        self.assertIs(None, comparison)
+        comparison = tree._get_cached_comparison('file-id', 'sha2', 'rev-id@')
+        self.assertIs(None, comparison)
+        tree._set_cached_comparison('file-id', 'sha1', 'rev-id@', None)
+        comparison = tree._get_cached_comparison('file-id', 'sha1', 'rev-id@')
+        self.assertIs(None, comparison)
+        tree._set_cached_comparison('file-id', 'sha1', 'rev-id@', blocks)
+        comparison = tree._get_cached_comparison('file-id', 'sha1', 'rev-id@')
+        tree._clear_comparison_cache()

^- You seem to have a 'comparison = tree._get_cached_comparison' that you don't
do anything with.

I'm guessing that you wanted to add one at the end, such that
_clear_comparison_cache() ensures that '_get_cached_comparison' returns None.
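Something like this at the end (untested, but it just mirrors the earlier
assertions):

        comparison = tree._get_cached_comparison('file-id', 'sha1', 'rev-id@')
        self.assertIs(None, comparison)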


You seem to be asserting that it can indeed generate a 'cache_path' for unicode
paths, but you don't actually test what the result is:
+    def test_cache_paths(self):
+        tree = self.make_branch_and_tree('.')
+        self.assertEqual('comparison-cache/a%2B+c',
+                         tree._comparison_cache_path('a+', 'c'))
+        self.assertEqual('comparison-cache/a+%2Fc',
+                         tree._comparison_cache_path('a', '/c'))
+        tree._comparison_cache_path(u'\u1234', u'\u1234')
+        self.assertRaises(UnicodeDecodeError,
+            tree._comparison_cache_path, '\xff', u'\xff')

Specifically, what happens when the filesystem cannot properly represent these
paths?

If these are revision_ids and file_ids (and sha hashes as well?), then you have
to be aware that in-memory revision_ids and file_ids are utf-8 strings, not
Unicode. Unicode paths will certainly not be representable on some filesystems.
(Well, maybe as 8-bit strings, which should be okay since these aren't paths
the user ever sees.)



You are adding a new parameter to 'internal_diff' and 'external_diff'
(file_id). I'm not sure if you are testing that it still does the right thing
if file_id = None. Certainly I didn't see a direct test for
'get_matching_blocks(file_id=None)'.

I know there are plugins/etc that use internal_diff directly, so we need to
make sure there is a valid fallback when no file_id (and thus no cached blocks)
is available.
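A test along these lines would cover the fallback (just a sketch; I'm guessing
at the new signature from your patch):

    from StringIO import StringIO

    from bzrlib import diff

    def test_internal_diff_no_file_id(self):
        # With file_id=None we should fall back to the plain
        # SequenceMatcher path, since plugins call this directly.
        output = StringIO()
        diff.internal_diff('old', ['a\n'], 'new', ['b\n'], output,
                           file_id=None)
        self.assertContainsRe(output.getvalue(), '-a')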

Also, internal_diff is used when comparing two committed trees (bzr diff -r
10..11). I could have just missed the part where you implemented
'get_matching_blocks()' to compare between any 2 committed trees. (Or it may
just be in your other patch).


+    @staticmethod
+    def _comparison_cache_path(file_id, revision_id):
+        file_part = urllib.quote(file_id.encode('utf-8'), '')
+        revision_part = urllib.quote(revision_id.encode('utf-8'), '')
+        return 'comparison-cache/%s+%s' % (file_part, revision_part)

^- This is actually wrong, as file_id and revision_id should already be
utf-8 strings. (See osutils.safe_file_id() and osutils.safe_revision_id().)

The reason it *works* is that you can take an ascii string, decode it to
unicode, and encode it to utf-8, which is what str.encode() does. (It first
decodes using the site.py default encoding.)
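Something like this instead (untested):

    @staticmethod
    def _comparison_cache_path(file_id, revision_id):
        # Normalize to the utf-8 strings we use in memory, rather
        # than calling .encode() on what may already be utf-8.
        file_part = urllib.quote(osutils.safe_file_id(file_id), '')
        revision_part = urllib.quote(
            osutils.safe_revision_id(revision_id), '')
        return 'comparison-cache/%s+%s' % (file_part, revision_part)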


+    def _set_cached_comparison(self, file_id, sha1, revision_id, blocks):
+        path = self._comparison_cache_path(file_id, revision_id)
+        if blocks is None:
+            self._control_files._transport.delete(path)
+            return

^- This looks like you will get NoSuchFile if you delete something that happens
not to be cached. (Or some other process already cleared it, or some such
thing.)
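Probably just (untested):

        if blocks is None:
            try:
                self._control_files._transport.delete(path)
            except errors.NoSuchFile:
                pass  # never cached, or already cleared
            return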


You should have some tests for the explicit serialization of the cached blocks,
so that we guarantee that future versions of bzr are compatible with all
versions that support this feature.
Arguably the files should contain a file-version header, so that we can make
sure we are reading what we think we are reading.

And it gives us a way to change the cache format, and have old clients fail
gracefully, rather than returning either bogus data, or just falling over.
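Just to sketch the idea (the format here is made up, not a concrete proposal):

    CACHE_HEADER = 'bzr comparison cache v1\n'

    def _serialize_blocks(blocks):
        lines = [CACHE_HEADER]
        lines.extend(['%d %d %d\n' % block for block in blocks])
        lines.append('end\n')  # doubles as an end-of-content marker
        return ''.join(lines)

    def _deserialize_blocks(bytes):
        lines = bytes.splitlines(True)
        if lines[0] != CACHE_HEADER:
            raise ValueError('unknown comparison cache format')
        if lines[-1] != 'end\n':
            raise ValueError('truncated comparison cache')
        return [tuple([int(n) for n in line.split()])
                for line in lines[1:-1]]

The trailing 'end' line would also address the end-of-content concern I raise
below.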



+        try:
+            self._control_files._transport.put_bytes(path, bytes)
+        except errors.NoSuchFile:
+            self._control_files._transport.mkdir('comparison-cache')
+            self._control_files._transport.put_bytes(path, bytes)

^- You might consider put_bytes_non_atomic() which allows you to supply
'create_parent_dir=True' if you so desire.
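i.e. something like:

        self._control_files._transport.put_bytes_non_atomic(
            path, bytes, create_parent_dir=True)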

We may not want to, though. 'put_bytes_non_atomic()' will create the file in
the target location and write to it directly, rather than writing to a temp
file and renaming it into place.

As you don't have an 'end-of-content' marker, we don't have a way to know if
the write finished successfully. We can get away with it for knits, because all
data is only referenced/considered valid when we know that the write succeeded.
(.knit is referenced from .kndx, and .kndx is only valid if we read the
trailing ':' at the end of the line).

+        transport = self._control_files._transport.clone('comparison-cache')
+        try:
+            filenames = transport.list_dir('.')
+        except errors.NoSuchFile:
+            return
+        for filename in filenames:
+            if filename == '.':
+                continue
+            transport.delete(filename)

^- Interesting, I would have thought 'transport.list_dir()' wouldn't report
'.'. But if it does, then certainly we need to skip it.


Code-wise, there are a few things to clean up. I'm not convinced that the cache
is a net win, but I realize you have more experience in the 'diff' world.

I think the current real killers for us are the time it takes to extract data
from knits (which we need to address anyway), and the time it takes to do the
'diff' of 2 texts (which we also need to address).

Then again, you've written the code, and if we can make it at least perform
close to comparable in the "worst" case, and have a large win otherwise, then
it may be worth it.

If 'get_file()' is costing 50% of the total time to do 'bzr diff', do you have
any indication of why it is slow? It certainly seems like we can save a lot
more time by making that faster than by caching diffs.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGoRu3JdeBCYSNAAMRAnZZAJ4nVxIC/mxVzz8PtkUGFBiLc1JQWQCgot0f
tjtwFlJmc+WzSHL95/9xXuE=
=yW/k
-----END PGP SIGNATURE-----
