[MERGE] Updates to generic fetch logic

Wed Feb 18 19:05:31 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

As part of the brisbane-core work, I put together an update to the
generic fetching logic.

As it stands today, we do conversions by computing a delta to the last
revision we just converted, and then applying that. This turns out to be
an okay solution, but we can do better.

In the attached patch, we keep a cache of recently converted trees, and
always try to compute a delta versus one of the parents. I went even
further and made it so that we compute a delta versus all available
parents, and then select the one that generates the smallest delta.

I don't expect this to have much impact outside of the brisbane-core
work, but the patch applies cleanly to bzr.dev, so I'd rather get it
merged into trunk.

When I tested this in the past, the breakdowns I saw was:

1) For ~50% of the revisions, the last converted revision was the
optimal base. This is fairly obvious, given that it will be true for any
linear sequence of commits.

2) For ~35% of the revisions, the first parent is the preferred base

3) For the remaining 15% of revisions, the second parent is the
preferred base.

These numbers will be slightly skewed, in that when parents aren't
present, I just use the last conversion, etc.

My guess on those numbers is that:

1) 50% of the time you have simple linear commits
2) 35% of the time you merge feature branches into trunk
3) 15% of the time you merge trunk into a feature branch

(Note these numbers are from my recollection of the mysql conversion,
but should be close.)

It makes sense that "trunk" which aggregates many features is going to
factor into the equation. Consider a merge between feature-foo and trunk
(in either direction).

If you compute the delta versus "trunk", you have only the changes
introduced by "feature". If you compute the delta versus "feature" you
have all of the changes from all features that "trunk" has integrated.

Thus when merging features => trunk, you want to use the primary parent
(trunk), and when merging trunk => features you want to use the second
parent (again trunk).

Anyway, for converting the first 1k revs of mysql from 1.9 => dev5h255
this drops the time from about 2m30s to 2m20s. However, if we are
willing to change the xml8_serializer code so that we don't copy
InventoryFile objects, it improves even more. Because then the
"_make_delta" code is able to use "x is y" for 90% of the inventory
objects, rather than having to compare any attributes. The tradeoff with
this code is that by doing an extra compare, you may get a better delta,
meaning you have fewer chk pages to evaluate. By making the delta
computation cheaper, the tradeoff is a net win.

Oh, I should also mention that if we change the xml8 serializer, it also
decreases memory consumption, and makes it much more reasonable to both
extract many trees, as well as cache them during conversion.

John
=:->

PS> For Ian-

With this patch, the look-before-you-leap patch, and changing the xml8
serializer, it changes the time to convert "mysql-5.1 -r 525" from 2m30s
down to 1m38s, and decreases the disk consumption from 72M to 57M
(though that includes one round of obsolete packs).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkmcW/sACgkQJdeBCYSNAAOEnwCeJnKfxfieOIbJgJJ1vdogdOVY
XtwAoNi1hERu1VDoWDegDHD7OjgHXaQ7
=QE/m
-----END PGP SIGNATURE-----
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: generic_fetch_delta_selection.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20090218/02e692bd/attachment-0001.diff