[CHK/MERGE] CHKMap.iter_interesting_nodes() and fetch updates

John Arbash Meinel john at arbash-meinel.com
Thu Nov 20 20:05:34 GMT 2008



Attached is a patch for the chk (split-inventory) branch (brisbane-core).

It incorporates everything I have that hasn't been merged yet.

1) CHKMap._dump_tree() which has proven quite useful in figuring out how
pages are actually broken up.
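
For reference, using it is just a matter of building a CHKMap over the
chk store and printing the result. Roughly (the exact constructor
arguments may differ from what is in the branch):

    from bzrlib import chk_map

    def show_map(chk_store, root_key):
        # Print how the map's keys are split across Internal/Leaf pages.
        print chk_map.CHKMap(chk_store, root_key)._dump_tree()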

2) Corrections to CHKMap.map() (and InternalNode.map()) so that at least
map() seems to be stable and correct. (We still need to address .unmap()).

3) Also, as part of now tracking self._prefix, I think it would be
reasonable to always cache the common prefix (even in LeafNodes), as it
is something you can keep up to date by comparing the current cached
prefix with the single new key, rather than having to compare all keys
at once. It also gives us a good point to start changing the serializer
to factor out the common prefix in internal and leaf nodes.
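
To illustrate the incremental update (the names here are illustrative,
not the actual chk_map attributes): when map() adds a single key, the
cached prefix only has to be shrunk against that one key:

    def update_common_prefix(prefix, serialised_key):
        # prefix is the cached common prefix, or None for an empty node;
        # serialised_key is the single key being added by map().
        if prefix is None:
            return serialised_key
        pos = 0
        while (pos < len(prefix) and pos < len(serialised_key)
               and prefix[pos] == serialised_key[pos]):
            pos += 1
        return prefix[:pos]

That is O(len(prefix)) work per map() call, instead of re-scanning every
key in the node.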

4) The biggest update is chk_map.iter_interesting_nodes(), which takes
some root keys to exclude (uninteresting) and some root keys to include
(interesting), and then walks only what it needs to find the subset that
is actually interesting.
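
From the caller's side it looks roughly like this (argument names and
the exact shape of what gets yielded are approximate; the attached patch
has the real signature):

    from bzrlib import chk_map

    def copy_interesting(chk_store, interesting_root_keys,
                         uninteresting_root_keys, callback):
        # The root keys are the inventory roots we want vs. the ones we
        # already have locally.
        for record, items in chk_map.iter_interesting_nodes(
                chk_store, interesting_root_keys, uninteresting_root_keys):
            # 'record' is a CHK page to copy; 'items' are key/value
            # entries that only exist on the interesting side.
            callback(record, items)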

At the moment, it does one "deep" pass on the uninteresting side, and
then does the minimal pass it can on all of the interesting keys. We
need the full set of uninteresting keys because the values to be
excluded may only show up in a LeafNode several levels down.

However, at the very root page, it *is* smart enough to filter out any
exact matches it finds at that level.

The algorithm itself could be tweaked significantly, using some info
from how prefixes work, etc, so that it can filter out sub-trees that
can't possibly be interesting/uninteresting.

I have the feeling that the current algorithm is actually sufficient for
the code to be "production ready", though. If the trees really are
similar, then that first culling can filter out most of the subtrees.
And after that we do one full-depth search, and everything else is truly
minimal. Though it is a bit of a shame the full-depth pass is on the
uninteresting side, as we know we don't want those pages.
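
In toy form, the current shape is roughly the following. This is a
simplified model where the "store" is just a dict mapping a page key to
(child_refs, items); the real code works on node/record objects (see the
patch), but the idea is the same:

    def iter_interesting(store, interesting_roots, uninteresting_roots):
        # 1) Cheap cull at the root: pages referenced by both sides
        #    cannot contribute anything new.
        interesting_roots = set(interesting_roots) - set(uninteresting_roots)
        # 2) Deep pass: expand *every* uninteresting page, because a
        #    value may only be excluded by a LeafNode several levels down.
        uninteresting_keys = set()
        uninteresting_items = set()
        pending = list(uninteresting_roots)
        while pending:
            key = pending.pop()
            if key in uninteresting_keys:
                continue
            uninteresting_keys.add(key)
            child_refs, items = store[key]
            uninteresting_items.update(items)
            pending.extend(child_refs)
        # 3) Minimal pass: walk the interesting side, skipping anything
        #    the deep pass already marked as uninteresting.
        seen = set(uninteresting_keys)
        pending = [k for k in interesting_roots if k not in seen]
        while pending:
            key = pending.pop()
            if key in seen:
                continue
            seen.add(key)
            child_refs, items = store[key]
            yield key, [i for i in items if i not in uninteresting_items]
            pending.extend(c for c in child_refs if c not in seen)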

5) iter_interesting_nodes() works very well for both
item_keys_introduced_by() and general fetching, and has been hooked into
both places, replacing the customized _find_file_keys* helpers that
Andrew and I put together.

The big win here is that doing "bzr branch bzrtools-d4 xxx" takes 8s
instead of >1min. (Doing it with existing formats takes 4-5s, so there
are still bits we can improve upon.)
One of the big issues is that we still read all of the inventory bits
twice: once for item_keys_introduced_by, and once for actually copying
the pages. That should be addressed by better generic fetch code.

The way it is written, we *do* end up buffering the records from
get_record_stream(), which is generally not preferred. At the peak, that
will be 1 CHK page for every revision. So for bzr.dev that could be
approx 25,000 * 4kB = 100MB, and 260MB for MySQL.

Again, the algorithm can be tweaked so it doesn't have to process as
deeply, or buffer as much. I can't think of a great way to avoid *any*
buffering that doesn't end up causing us to read the pages twice. If we
*really* didn't want to buffer the records, the algorithm could be
trivially tweaked to track only the keys rather than the whole record
objects.
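
Building on the toy sketch above, the "keys only" variant would just
remember page keys rather than record objects, so the peak cost is a set
of small tuples instead of ~4kB of page text per revision; the fetch
would presumably then make one get_record_stream() request over those
keys afterwards, which is where the second read comes from:

    def interesting_page_keys(store, interesting_roots, uninteresting_roots):
        # Same walk as iter_interesting() above, but nothing bigger than
        # a key is ever kept around.
        keys = set()
        for key, _items in iter_interesting(store, interesting_roots,
                                            uninteresting_roots):
            keys.add(key)
        return keys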

John
=:->
Attachment: brisbane-fetch.patch
https://lists.ubuntu.com/archives/bazaar/attachments/20081120/fc880ec6/attachment-0001.diff

