[RFC] cache serialized form of Inventory objects

Fri Jan 19 21:36:03 GMT 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'm working on making some of our conversion code (cvsps-import) a
little bit faster, so I've started doing some hacking on bzr, looking
for some of the low-hanging fruit for optimization.

We know that inventory handling is our Achilles' heel, so I'm trying to
work on a few things that I know are a little slow.

The first one I decided to tackle was our serialization overhead. Right
now for every commit we build up a complete Inventory (for the moz tree,
that is 50k objects), and then we serialize it out to XML, and then we
do a diff on those lines versus the basis.
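
To make the per-commit cost concrete, here is a minimal sketch of the
"serialize everything, then diff the lines against the basis" step. It
is an illustration only: the one-line-per-entry format and the helper
names (serialize_entry, serialize_inventory, changed_ranges) are
hypothetical, and difflib stands in for whatever diff code is actually
used.

```python
import difflib
from xml.sax.saxutils import quoteattr


def serialize_entry(file_id, name, revision):
    # Hypothetical one-line-per-entry XML format, not bzr's real schema.
    return '<file file_id=%s name=%s revision=%s />\n' % (
        quoteattr(file_id), quoteattr(name), quoteattr(revision))


def serialize_inventory(entries):
    # Every entry is re-serialized on every commit, even unchanged ones.
    return [serialize_entry(*entry) for entry in entries]


def changed_ranges(basis_lines, new_lines):
    # Keep only the opcodes that describe a difference from the basis.
    matcher = difflib.SequenceMatcher(None, basis_lines, new_lines)
    return [op for op in matcher.get_opcodes() if op[0] != 'equal']


basis = serialize_inventory([("a-id", "a.txt", "rev-1"),
                             ("b-id", "b.txt", "rev-1")])
new = serialize_inventory([("a-id", "a.txt", "rev-1"),
                           ("b-id", "b.txt", "rev-2")])
delta = changed_ranges(basis, new)
```

Only one of the two lines differs here, yet both entries were
serialized again to find that out. With 50k entries that repeated work
is the overhead in question.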

The attached patch changes the serializer so that, when it reads an
inventory, it saves the XML text of each entry. If an InventoryEntry is
later modified in memory, the caller is required to set
ie.serialized_form = None. This is required because IE isn't really
meant to have a huge amount of layering, as performance is important.
(I could override every member of IE so that we monitor for changes and
reset 'serialized_form' whenever one happens, but that would perform
very poorly.)

The saved form should include a marker indicating what serializer was
used, so that the next serializer knows if it is safe to use the cache.
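
As a rough sketch of the idea (not the code in the patch), the cached
form can be stored as a (marker, text) pair and reused only when the
marker matches the serializer doing the writing. The class names, the
format_marker attribute and the serialize_calls counter below are
assumptions made for the example. For brevity the cache is filled on
first write here, whereas the patch fills it while reading.

```python
from xml.sax.saxutils import quoteattr


class Entry(object):
    """Hypothetical stand-in for an InventoryEntry."""

    def __init__(self, file_id, name):
        self.file_id = file_id
        self.name = name
        # Either None or a (format_marker, xml_text) pair.
        self.serialized_form = None


class Serializer(object):
    """Hypothetical serializer that trusts only its own cached text."""

    format_marker = "sketch-v1"

    def __init__(self):
        self.serialize_calls = 0  # counts real (uncached) serializations

    def _build_text(self, entry):
        self.serialize_calls += 1
        return '<file file_id=%s name=%s />\n' % (
            quoteattr(entry.file_id), quoteattr(entry.name))

    def entry_to_text(self, entry):
        cached = entry.serialized_form
        if cached is not None and cached[0] == self.format_marker:
            return cached[1]  # safe: written by the same format
        text = self._build_text(entry)
        entry.serialized_form = (self.format_marker, text)
        return text


class OtherSerializer(Serializer):
    format_marker = "sketch-v2"  # a different format must not reuse v1 text


s = Serializer()
e = Entry("a-id", "a.txt")
first = s.entry_to_text(e)
second = s.entry_to_text(e)    # served from the cache
e.name = "b.txt"
e.serialized_form = None       # the caller must invalidate by hand
third = s.entry_to_text(e)
```

If the caller forgets the reset after changing e.name, the stale text
is written out, which is exactly the risk of modifying the data
structure directly.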

It is sort of a negative side effect of how we treat inventories that we
have to do it this way, but that is what we get for directly modifying a
data structure.

Anyway, the attached patch actually passes the complete test suite. So
at least in bzrlib everything seems to be safe.

In converting a small project (800 revs, ~150 files) the savings weren't
huge. --lsprof claims a fairly large difference, saying that 100,000
calls to _append_entry now take only 5.6s rather than 20.4s (roughly a
3.6x speedup). This even shows up in the overall time (178s versus 195s
== 17s saved), but some of that is random effect.

But as I have stated before, I think --lsprof overly penalizes our
serializer. Under real-world testing the time difference is only a
couple of seconds out of a total conversion time of 70s.

I still need to test it on the Mozilla source tree, to see if I can see
a big difference there. I honestly expect a difference, but I don't
expect it to be huge. (Reading is probably more expensive than writing,
and copying is also expensive, and this doesn't help with either of
those.)

I thought we might at least like to discuss the ideas it brings up. I
think far better would be working in tuples when possible, and not
having to load the whole Inventory every time. But we are a long way
from there. This can be hacked in now.

I'm going to work on one other "hack" and hopefully post it later.
(Caching the last inserted inventory text, so we don't have to rebuild
it by applying patches)
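
To show what that second hack might look like, here is a sketch under
my own assumptions rather than actual bzr code: a toy store that keeps
each text as a line delta against its parent and remembers the full
lines of the most recent insert. DeltaStore, make_delta, apply_delta
and the patches_applied counter are all hypothetical names.

```python
import difflib


def make_delta(old, new):
    # Record (start, end, replacement_lines) for each non-equal region.
    ops = difflib.SequenceMatcher(None, old, new).get_opcodes()
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in ops if tag != 'equal']


def apply_delta(old, delta):
    # Rebuild the new text by splicing replacements into the old lines.
    out, pos = [], 0
    for start, end, lines in delta:
        out.extend(old[pos:start])
        out.extend(lines)
        pos = end
    out.extend(old[pos:])
    return out


class DeltaStore(object):
    def __init__(self):
        self._deltas = {}         # rev_id -> (parent_id, delta)
        self._last = None         # (rev_id, lines) of the latest insert
        self.patches_applied = 0  # how often a text had to be rebuilt

    def get_lines(self, rev_id):
        if rev_id is None:
            return []
        if self._last is not None and self._last[0] == rev_id:
            return list(self._last[1])  # cache hit, no patching needed
        parent_id, delta = self._deltas[rev_id]
        self.patches_applied += 1
        return apply_delta(self.get_lines(parent_id), delta)

    def add_lines(self, rev_id, parent_id, lines):
        basis = self.get_lines(parent_id)
        self._deltas[rev_id] = (parent_id, make_delta(basis, lines))
        self._last = (rev_id, list(lines))


store = DeltaStore()
store.add_lines("r1", None, ["a\n"])
store.add_lines("r2", "r1", ["a\n", "b\n"])
store.add_lines("r3", "r2", ["a\n", "c\n"])
patches_during_import = store.patches_applied
```

During a linear conversion the parent of each new text is the one just
inserted, so in this sketch no patches are applied at all while
importing. Asking for an older text still walks the delta chain.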

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFsTnDJdeBCYSNAAMRAjh+AJ9kuGI7nIqN0pzYKU1BLKz7Dj9I1gCfTpfk
yfZna8rOBEpwrI2y+E93uhg=
=ok6X
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cache_serialized.patch
Type: text/x-patch
Size: 13631 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070119/3016ab3c/attachment.bin