[CHK/MERGE] CHKMap.iter_interesting_nodes() and fetch updates

John Arbash Meinel john at arbash-meinel.com
Thu Nov 20 20:36:49 GMT 2008


This is a small update on the earlier patch. It changes the logic to
significantly reduce buffering for interesting nodes. And in the "create
a new branch" case, it should not need to buffer any nodes. There are
still edge cases that would cause more buffering than desired, but the
"common" cases should be better behaved.

John
=:->


John Arbash Meinel wrote:
> Attached is a patch for the chk (split-inventory) branch (brisbane-core).
> 
> It incorporates everything I have that hasn't been merged yet.
> 
> 1) CHKMap._dump_tree() which has proven quite useful in figuring out how
> pages are actually broken up.
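[Editorial sketch: a toy illustration of what a `_dump_tree`-style helper does. The node layout below is invented for illustration; it is not bzrlib's actual CHKMap code.]

```python
# Hypothetical sketch of a _dump_tree-style debugging helper: walk a
# nested node structure and emit one indented line per node, so you can
# see at a glance how pages are broken up.  The node representation
# (dict = internal node, list of items = leaf) is a stand-in, not
# bzrlib's real classes.

def dump_tree(node, indent=0):
    """Return the tree rooted at `node` as a list of indented text lines."""
    pad = '  ' * indent
    if isinstance(node, dict):              # stand-in for an InternalNode
        lines = [pad + 'InternalNode']
        for prefix in sorted(node):
            lines.append(pad + '  %r:' % prefix)
            lines.extend(dump_tree(node[prefix], indent + 2))
        return lines
    # stand-in for a LeafNode: a list of (key, value) items
    lines = [pad + 'LeafNode']
    for key, value in node:
        lines.append(pad + '  %r %r' % (key, value))
    return lines

tree = {'a': [(('apple',), '1')], 'b': [(('ball',), '2')]}
print('\n'.join(dump_tree(tree)))
```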
> 
> 2) Corrections to CHKMap.map() (and InternalNode.map()) so that at least
> map() seems to be stable and correct. (We still need to address .unmap()).
> 
> 3) Also, as part of now tracking self._prefix, I think it would be
> reasonable to always cache the common prefix (even in LeafNodes), as it
> is something you can keep up-to-date by comparing the current common
> prefix with each single new key, rather than having to compare all keys
> at once. It also gives us a good point to start changing the serializer
> to factor out the common prefix in internal and leaf nodes.
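[Editorial sketch: the incremental common-prefix idea in miniature. When a key is mapped into a node, the cached prefix only needs to be compared against that one new key, never against every key already present. Names are illustrative, not bzrlib's actual code.]

```python
def common_prefix(a, b):
    """Longest common leading substring of two serialized keys."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return a[:n]

class Node(object):
    """Toy node that maintains a cached common prefix incrementally."""

    def __init__(self):
        self._prefix = None     # cached common prefix of all keys so far
        self._keys = []

    def map(self, key):
        self._keys.append(key)
        if self._prefix is None:
            self._prefix = key  # first key: the prefix is the key itself
        else:
            # O(len(prefix)) per insert, vs. re-scanning every key.
            self._prefix = common_prefix(self._prefix, key)

node = Node()
for k in ('foo/bar', 'foo/baz', 'foo/qux'):
    node.map(k)
print(node._prefix)   # 'foo/'
```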
> 
> 4) The biggest update is chk_map.iter_interesting_nodes() which takes
> some root keys to exclude and some keys to include, and then can walk
> what it needs to find the subset that is interesting.
> 
> At the moment, it does one "deep" pass on the uninteresting side, and
> then does the minimal pass it can on all of the interesting keys. We
> need the full set of uninteresting keys, because the values may only
> exist in a deep LeafNode.
> 
> However, at the very root page, it *is* smart enough to filter out any
> exact matches it finds at that level.
> 
> The algorithm itself could be tweaked significantly, using some info
> from how prefixes work, etc, so that it can filter out sub-trees that
> can't possibly be interesting/uninteresting.
> 
> I have the feeling that the current algorithm is actually sufficient for
> the code to be "production ready", though. If the trees really are
> similar, then that first culling can filter out most of the subtrees.
> And after that we do one full-depth search, and everything else is truly
> minimal. Though it is a bit of a shame that the full-depth pass is on
> the uninteresting side, as we know we don't want those pages.
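[Editorial sketch: the two-pass shape described above, on a toy page graph. One exhaustive walk over the uninteresting roots, then a pruned walk over the interesting roots that skips anything already seen. The graph shape and names are invented for illustration, not bzrlib's implementation.]

```python
def walk(graph, root):
    """Yield every node key reachable from root (depth-first)."""
    stack = [root]
    seen = set()
    while stack:
        key = stack.pop()
        if key in seen:
            continue
        seen.add(key)
        yield key
        stack.extend(graph.get(key, ()))

def iter_interesting(graph, uninteresting_roots, interesting_roots):
    # Pass 1: full-depth walk of the uninteresting side, because a
    # value may only exist in a deep LeafNode.
    uninteresting = set()
    for root in uninteresting_roots:
        uninteresting.update(walk(graph, root))
    # Pass 2: walk the interesting side, pruning shared subtrees as
    # soon as a key is known to be on the uninteresting side.
    stack = [r for r in interesting_roots if r not in uninteresting]
    emitted = set()
    while stack:
        key = stack.pop()
        if key in uninteresting or key in emitted:
            continue
        emitted.add(key)
        yield key
        stack.extend(graph.get(key, ()))

graph = {'root-old': ['A', 'B'], 'root-new': ['A', 'C'],
         'B': ['L1'], 'C': ['L2']}
# Subtree 'A' is shared, so only the genuinely new pages come out.
print(sorted(iter_interesting(graph, ['root-old'], ['root-new'])))
```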
> 
> 5) iter_interesting_nodes() works very well for both
> item_keys_introduced_by() and general fetching, and has been hooked into
> both places, replacing the customized _find_file_keys* helpers that
> Andrew and I put together.
> 
> The big win here is that doing "bzr branch bzrtools-d4 xxx" takes 8s
> instead of >1min. (Doing it with existing formats takes 4-5s, so there
> are still bits that we can improve upon.)
> One of the big issues is that we still read all of the inventory bits
> twice. Once for item_keys_introduced_by, and once for actually copying
> the pages. Which should be addressed by better generic fetch code.
> 
> The way it is written, we *do* end up buffering the records from
> get_record_stream() which is generally not preferred. At the peak, that
> will be 1 CHK page for every revision. So for bzr.dev that could be
> approx 25,000*4k = 100MB, and 260MB for MySQL.
> 
> Again, the algorithm can be tweaked, so it doesn't have to process as
> deep, or buffer as much. I can't think of a great way to avoid *any*
> buffering that doesn't end up causing us to read the pages twice. If we
> *really* didn't want to buffer the records, the algorithm could be
> trivially tweaked to only track the keys rather than the whole record
> object.
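[Editorial sketch: the "track keys, not records" tweak in miniature. Holding full record objects keeps all the page bytes in memory; holding only the small key tuples keeps peak memory tiny, at the cost of re-reading the page bytes once we know which ones we want. The Record class is an invented stand-in, not bzrlib's record type.]

```python
class Record(object):
    """Stand-in for a get_record_stream() record: a key plus page bytes."""

    def __init__(self, key, bytes):
        self.key = key
        self.bytes = bytes

def buffered(records):
    """Buffer whole records: peak memory is roughly the total page bytes."""
    held = list(records)
    return held, sum(len(r.bytes) for r in held)

def keys_only(records):
    """Buffer only the keys: peak memory is a few bytes per page, but the
    page bytes must be requested a second time when they are needed."""
    held = [r.key for r in records]
    return held, sum(len(k) for k in held)

# Three 4k pages, as in the email's back-of-the-envelope estimate.
records = [Record(('sha1:%d' % i,), b'x' * 4096) for i in range(3)]
_, full = buffered(records)
_, small = keys_only(records)
print(full, small)   # 12288 bytes buffered vs. 3 one-element key tuples
```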
> 
> John
> =:->
Attachment: brisbane-fetch.patch
https://lists.ubuntu.com/archives/bazaar/attachments/20081120/89184156/attachment-0001.diff

