[RFC/PLEASETEST] Per-file graph heads detection during commit for pack repositories.

John Arbash Meinel john at arbash-meinel.com
Wed Nov 14 17:09:53 GMT 2007


Robert Collins wrote:
> This is a call for testing.
>
...

>
> I think we can make it really fast; partly I want to fix the index
> layer, and graph.heads() still makes way too many calls into the index
> layer, which means investment there has a high multiplier for
> performance.
>
> Another thing to try, though, is to use the per-file graph again; this
> hopefully means less graph distance to process, at the tradeoff of
> reading the per-file graph data again (which could actually be less
> data overall in some situations).
>
> So the attached patch (not complete - there's a new public API without
> enough tests) speeds up commit a bit in this case by going for the
> per-file graph approach.
>
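
To make "heads over the per-file graph" concrete: heads() answers, for a
set of candidate revisions, which of them are not ancestors of any other
candidate. Here is a toy sketch over a plain-dict per-file graph - the real
bzrlib Graph walks lazily through a parents provider, so the names and
eager walk here are illustrative only:

def heads(parent_map, candidates):
    """Return the candidates that are not ancestors of other candidates.

    parent_map: dict mapping revision_id -> tuple of parent revision ids.
    This eagerly walks the whole ancestry; bzrlib's Graph.heads() is lazy
    and can stop early, but the answer is the same.
    """
    result = set(candidates)
    for start in candidates:
        # Any candidate reachable from start's parents is dominated by
        # start and therefore not a head.
        todo = list(parent_map.get(start, ()))
        seen = set()
        while todo:
            rev = todo.pop()
            if rev in seen:
                continue
            seen.add(rev)
            result.discard(rev)
            todo.extend(parent_map.get(rev, ()))
    return result

# A per-file graph is usually far smaller than the whole revision graph,
# which is why walking it instead can touch far fewer index pages.
per_file_graph = {
    'rev-a': (),
    'rev-b': ('rev-a',),
    'rev-c': ('rev-a',),
}
print(heads(per_file_graph, ['rev-b', 'rev-c']))  # both are heads
print(heads(per_file_graph, ['rev-a', 'rev-b']))  # only 'rev-b'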


In my quick test with a merge of bzr.dev into a feature branch, it made a huge
difference. I ran it under --lsprof and have attached the callgrinds; 'time'
said the difference (under --lsprof) was:

bzr commit -m "test"  57.98s user 15.43s system 99% cpu 1:13.42 total

versus

bzr commit -m "test" 2.76s user 0.54s system 100% cpu 3.291 total

Yes, that is 73s versus 3.3s.

I also made sure to run the first commit twice, just to get some of the
indexes loaded into memory.

The callgrind output seems to put 98% of the time in bzrlib.graph.heads(). I'm
actually thinking that part of the problem is that we don't cache any of the
graph information, so every call has to go all the way down and bisect
through the index data. On that hunch, I went ahead and put together this patch:

=== modified file 'bzrlib/repofmt/pack_repo.py'
--- bzrlib/repofmt/pack_repo.py 2007-11-13 21:39:37 +0000
+++ bzrlib/repofmt/pack_repo.py 2007-11-14 17:01:10 +0000
@@ -1510,6 +1510,7 @@
         self._reconcile_does_inventory_gc = True
         self._reconcile_fixes_text_parents = False
         self._reconcile_backsup_inventory = False
+        self._revision_parents = {}

     def _abort_write_group(self):
         self._pack_collection._abort_write_group()
@@ -1571,17 +1572,23 @@
         self._pack_collection.ensure_loaded()
         index = self._pack_collection.revision_index.combined_index
         search_keys = set()
+        found_parents = {_mod_revision.NULL_REVISION:[]}
         for revision_id in revision_ids:
             if revision_id != _mod_revision.NULL_REVISION:
-                search_keys.add((revision_id,))
-        found_parents = {_mod_revision.NULL_REVISION:[]}
-        for index, key, value, refs in index.iter_entries(search_keys):
-            parents = refs[0]
-            if not parents:
-                parents = (_mod_revision.NULL_REVISION,)
-            else:
-                parents = tuple(parent[0] for parent in parents)
-            found_parents[key[0]] = parents
+                if revision_id in self._revision_parents:
+                    found_parents[revision_id] = \
+                        self._revision_parents[revision_id]
+                else:
+                    search_keys.add((revision_id,))
+        if search_keys:
+            for index, key, value, refs in index.iter_entries(search_keys):
+                parents = refs[0]
+                if not parents:
+                    parents = (_mod_revision.NULL_REVISION,)
+                else:
+                    parents = tuple(parent[0] for parent in parents)
+                found_parents[key[0]] = parents
+                self._revision_parents[key[0]] = parents
         result = []
         for revision_id in revision_ids:
             try:

And for that, I ended up with:
bzr commit -m "test"  12.51s user 3.41s system 99% cpu 15.928 total
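
The gain here is pure memoization: the first lookup pays the bisection
cost, and every later heads() call hits the dict. The same pattern, pulled
out of the repository class into a standalone wrapper (the
CachingParentsWrapper name and get_parents interface are invented for
illustration, not existing bzrlib API):

class CachingParentsWrapper(object):
    """Memoize parent lookups in front of an expensive provider."""

    def __init__(self, underlying):
        # underlying: any object with get_parents(revision_ids) that
        # returns one parents tuple per requested id, in order.
        self._underlying = underlying
        self._cache = {}

    def get_parents(self, revision_ids):
        missing = [r for r in revision_ids if r not in self._cache]
        if missing:
            # One batched lookup for everything not yet cached, mirroring
            # the single iter_entries(search_keys) call in the patch.
            for rev_id, parents in zip(
                    missing, self._underlying.get_parents(missing)):
                self._cache[rev_id] = parents
        return [self._cache[r] for r in revision_ids]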

So the per-file graph is still better in this particular case (3s versus 16s
versus 73s), but caching the parents makes a big difference. That also
indicates to me that grabbing the revision parents is a bit too expensive in
general. At the very least, shouldn't the index layer be caching this sort of
thing (at least the pages it has had to read so far)?
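
As a sketch of what "caching the pages it has had to read so far" could
look like - assuming, for illustration, a seekable file and fixed-size
pages, which is not how GraphIndex actually reads (it fetches byte ranges
via the transport layer):

class PageCache(object):
    """Memoize fixed-size pages of an index file by page number."""

    def __init__(self, fileobj, page_size=4096):
        self._file = fileobj
        self._page_size = page_size
        self._pages = {}  # page number -> bytes, filled on demand

    def read_page(self, page_number):
        data = self._pages.get(page_number)
        if data is None:
            # The first access pays the seek+read; any bisection that
            # lands on this page again is served from memory.
            self._file.seek(page_number * self._page_size)
            data = self._file.read(self._page_size)
            self._pages[page_number] = data
        return data

Bisecting lookups tend to revisit the same root and mid-level pages, so
even a cache this simple would amortize most of the repeated I/O across
heads() calls.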

John
=:->
Attachments (callgrind output, bzip2-compressed):
- commit_merge_robert.callgrind.bz2 (31196 bytes):
  https://lists.ubuntu.com/archives/bazaar/attachments/20071114/a96ca103/attachment-0003.obj
- commit_merge_dev.callgrind.bz2 (31251 bytes):
  https://lists.ubuntu.com/archives/bazaar/attachments/20071114/a96ca103/attachment-0004.obj
- commit_merge_parent_cache.callgrind.bz2 (31295 bytes):
  https://lists.ubuntu.com/archives/bazaar/attachments/20071114/a96ca103/attachment-0005.obj

