[RFC] Change to use 'merge_sort' for per-file-log

Fri Sep 19 04:49:24 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The attached patch overrides my previous work, by changing how we determine
what revisions to display. This is RFC because of the behavior change. Though
I think most will probably *prefer* this behavior.

So, what is the behavior change. The new code shows merges *towards mainline*
that include the a new change to the file, but ignore merges away. So for example:

A-.    # A creates 'foo'
|\ \
B C D  # C modifies 'foo'
|/  |
E   |  # Pulls in 'foo' verbatim from C
|\  |
| \ |
|  \|
|   F  # Pulls in 'foo' verbatim from E
|  /
| /
|/
G

The old algorithm would show A, C, E, and F, the new algorithm will not show F.

You can argue for it either way, but if you look at the output of "bzr log
bzr" you'll wonder why the first lines are:

    ------------------------------------------------------------
    revno: 3677.1.6
    committer: Vincent Ladeuil <v.ladeuil+lp at free.fr>
    branch nick: 233817-missing
    timestamp: Thu 2008-09-11 21:36:38 +0200
    message:
      merge bzr.dev

It happens because this revision merges new changes to 'bzr' from bzr.dev into
vila's feature branch.

I personally believe the new algorithm is overall better, because these sorts
of changes are 'noise'.

I'm also attaching 2 files, one is the output of "bzr log bzr" with the
existing algorithm, the other is the new algorithm.

Also, I have one more addendum, which works around issues with
GraphIndex.get_parent_map(lots_of_keys). Basically, by requesting keys in 1000
key "chunks", I avoid the "buffer_all" overhead, but also avoid some of the
extra bisection overhead (especially for smaller indexes.)

=== modified file 'bzrlib/log.py'
- --- bzrlib/log.py	2008-09-19 03:21:39 +0000
+++ bzrlib/log.py	2008-09-19 03:41:15 +0000
@@ -562,12 +562,16 @@
     # Do a direct lookup of all possible text keys, and figure out which ones
     # are actually present, and then convert it back to revision_ids, since the
     # file_id prefix is shared by everything.
- -    text_parent_map = branch.repository.texts.get_parent_map(text_keys)
     # Using a set of revisions instead of a set of keys saves about 1MB (out of
     # say 400). Not a huge deal, but still "better".
- -    modified_text_revisions = set(key[1] for key in text_parent_map)
- -    del text_parent_map
- -    del text_keys
+    get_parent_map = branch.repository.texts.get_parent_map
+    modified_text_revisions = set()
+    chunk_size = 1000
+    for start in xrange(0, len(text_keys), chunk_size):
+        next_keys = text_keys[start:start + chunk_size]
+        modified_text_revisions.update(
+            [k[1] for k in get_parent_map(next_keys)])
+    del text_keys, next_keys

     result = []
     if direction == 'forward':

With unlimited chunk-size (all at once) we trigger buffer_all (the current
code). Memory is around 244MB (down from 400MB with the earlier code), and it
takes 28s to process.

With chunk size of 1 (the other extreme), we never buffer all, memory
consumption stays low (82MB), but we still take 23s.

Chunk_size = 100  we are at 82MB and 12s.

Chunk_size = 1000 and we get buffer_all on the smaller indexes, and it takes
10.9s and 94MB.

I don't really like hacking around GraphIndex in log.py code, as it is likely
to be worse for BTreeIndex when we get around to them. However, it does show a
dramatic improvement, so I think we should consider doing *something*.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFI0yFEJdeBCYSNAAMRAoIaAKCmG5GuoFAaFkTU13KskFLgMvyurgCeI76r
HmXM86EqOcXsWpKSLeufvII=
=ebjR
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lighter_file_log3.patch
Type: text/x-diff
Size: 22446 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080918/97ded1ae/attachment-0001.bin 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bzr_log_bzr.txt
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20080918/97ded1ae/attachment-0002.txt 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: new_bzr_log_bzr.txt
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20080918/97ded1ae/attachment-0003.txt