brisbane-core+ric-commit results

John Arbash Meinel john at arbash-meinel.com
Thu Mar 26 17:43:29 GMT 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
> Comparing the full python 3.0 repo using latest bzr.dev
> (1.9 format) against latest brisbane-core (gc-chk255-big format)
> using my work-in-progress usertest branch, we're looking
> pretty good now on the whole. John is working on the
> branch-outside-a-shared-repo case and I believe still has
> some unlanded stuff that ought to help ls.
> 
> See attached for the summary results.
> 
> Ian C.
> 
> PS: my brisbane-core has Robert's pending commit patch applied
> (with the necessary subtree-related one line tweak).

Branch:unshared	67.9	360.6 bzr-btree branch $work_basename
                               repo/${work_basename}


^- So at least one reason this is slow, is because we are still using
the code which iterates inventories 2 times to do branch. One time early
on to find the text keys we want to copy, and then a second time to
actually copy the inventory.

bzr.dev has an optimized 'Packer' code path which avoids this. We were
trying to get away from doing it again, since it is god for Streaming
support. However, this has a much bigger performance penalty for CHK
repositories, because iterating the inventory is a *lot* slower.

At a minimum, XML inventories never apply deltas, they extract the text
keys from the raw deltas. At a minimum for CHK we have to completely
unpack the root node fulltext figure out which of the X in 255 leaf
nodes have changed, and then unpack X leaf node fulltexts, and then
figure out which of the 1 in ~20 items has changed.

So to find the N keys for 1 revision, requires deserializing (255 +
N*20) lines.

We would still end up paying this cost when we actually transfer the chk
pages, but at least we wouldn't pay it 2 times. (First time we need the
text keys, second time we only need the chk pages, but we don't know
whether references are LeafNodes or InternalNodes until we've parsed them.)

I did also track down a line that has poor scaling. Namely:
=== modified file 'bzrlib/chk_map.py'
- --- bzrlib/chk_map.py   2009-03-24 19:36:34 +0000
+++ bzrlib/chk_map.py   2009-03-26 17:26:44 +0000
@@ -1436,8 +1436,11 @@
             # only care about external references.
             node = _deserialise(bytes, record.key,
                                 search_key_func=None)
             if isinstance(node, InternalNode):
- -                chks = set(node.refs())
- -                chks.difference_update(all_uninteresting_chks)
+                chks = set(node.refs()) - all_uninteresting_chks
                 # Is set() and .difference_update better than:
                 # chks = [chk for chk in node.refs()
                 #              if chk not in all_uninteresting_chks]

The issue here is that python's "x.difference_update(y)" takes time
proportional to 'y' regardless the size of x. While 'x.intersection(y)'
takes time proportional to min(x,y). As does x.difference(y) (or x - y).

Just that one line change changes "bzr branch bzr.dev" from 3m51s to 2m41s.

I'm looking at doing a larger overhaul of 'iter_interesting_nodes()',
because I'd like to reformulate it a bit (so that it can return records
that we can copy across as-is, and not have to buffer anything, and
still remove the refcycles, etc).

I was hoping that Robert or Andrew could help with changing the fetch
code, since it integrates deeply with their streaming work. Which isn't
something that I fully grasp.

How would you define a new network_name() so that you can request data
be streamed in 'groupcompress' order, etc. I've already written a rather
decent optimized streaming in GCPacker (which does revs, inventories,
chk pages in a good layout, and then texts). I just don't know how to
hook that in to Fetch.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknLvsEACgkQJdeBCYSNAANpAgCgp9Aju1c6/EaEAg9RdjNVy2wI
vcEAn3Iomqhw+llSGgJ8CckEJZl64s1r
=wkrE
-----END PGP SIGNATURE-----



More information about the bazaar mailing list