[MERGE] Improve knit extraction time

John Arbash Meinel john at arbash-meinel.com
Wed Jan 31 00:50:05 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

As some of you have been seeing on bazaar-commits, I've been working on
a list structure that can keep track of a set of deltas, rather that
having to modify a whole list.

I've been performance testing it with my mozilla conversion. Where the
last inventory has 458 deltas to a fulltext, and 55,000 lines in the
final inventory.

I'd like to do a few more performance comparisons on smaller trees, but
in general any solution works when things are small enough :)

There were a few things that were not obvious after I got everything
working. It turns out that when you are applying a large number of
deltas you end up with a lot of small patches. And that is when the
linked-list starts becoming a bottleneck. So you still want to collapse
back to a full text from time to time.

What I found was that if you collapse when the linked list grows above
500 entries I got the best tradeoff of performance. With 458 patches, it
collapses 7 times.

To benchmark this, I used:
python -m timeit \
 -s "import bzrlib.branch; print bzrlib.__file__" \
 -s "b = bzrlib.branch.Branch.open('bzr/mozilla/branches/HEAD')" \
 -s "last_rev = b.last_revision();" \
  "txt = b.repository.get_inventory_xml(last_rev)"

In other words, open a branch, and time how long it takes to extract the
last inventory text.

With bzr.dev this takes around                10.0s
If I never collapse this takes                 9.2s
If I collapse when num links > 2000 (2 times)  7.8s
              when num links > 1000 (3 times)  7.0s
              when num links > 750  (4 times)  6.7s
              when num links > 500  (7 times)  6.5s
              when num links > 250  (13 times) 6.6s
              when num links > 100  (28 times) 6.7s

So I picked 500 because it seemed like a nice low spot.

I'm going to be running some more benchmarks, doing something like
extracting every inventory text from bzr.dev. Just to make sure it is a
general improvement, and not just an improvement when num lines is huge
and num deltas is bigger than normal.

But one thing I would like to end up doing after this patch, is to
change the default max deltas from 200 to 500 or maybe even higher,
because that can really shrink the size of inventory.knit.

I also have a few more ideas about how things could be optimized, but I
think the patch as it stands is reasonable to incorporate.

There is one update which is not reflected in the above numbers. I was
using 2 lists (one for content ranges, the other for a 'next node'
pointer). Collapsing that into a single list improved performance by
another 100ms. So current time is 6.4s.

I believe this will improve 'bzr checkout --lightweight' (aka
build_tree) time. But that benchmark will take a while.

By the way, under --lsprof the bulk of the remaining time is actually
spent reading the knit .kndx file. (get_inventory_xml takes 23s, but of
that time get_inventory_weave() takes 18.5s, extracting the text itself
by applying 450 patches is only 5.5s). So I think spending some energy
to improve our .kndx format could certainly be worthwhile. Admittedly
most .kndx files don't have 215,660 lines. But trees with long histories
will. (Real world testing shows branch.repository.get_inventory_weave()
takes from 7-11s, which is rather long).

John
=:->


PS> This is also a case where --lsprof was a savior and inaccurate at
the same time. It revealed the fact that num links was growing to long.
But it claimed that num_links == 100 was much worse. I guess because it
penalizes __iter__ more than _find_nodes_for_pos(). Maybe because one is
a generator and the other is a list with no measurable function calls.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFv+e8JdeBCYSNAAMRAqmoAJ99dVkWJGcLpOQAGQRyezpEwhqxiACePNXc
sooLdk6ODsD21A6rEoX0vKU=
=94xO
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: list_patch.patch
Type: text/x-patch
Size: 297252 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070130/4214c066/attachment-0001.bin 


More information about the bazaar mailing list