[MERGE] Add an InventoryEntry cache for xml deserialization
John Arbash Meinel
john at arbash-meinel.com
Wed Dec 10 23:08:42 GMT 2008
The attached patch adds a cache when we deserialize InventoryEntries.
The good news is that with all optimizations in play, I can get down to
50% of the time to do "repository.revision_trees(last_100_revs)". The
safe code is still 60% of the current time.
I spent a while profiling it, and here are the basics of what I found.
1) First an assumption. For a given (file_id, revision_id) pair, the
InventoryEntry must be the same.
2) Adding a cache has the potential to speed things up dramatically.
Probably because we have something like a 90-99% hit rate.
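Points (1) and (2) can be sketched as follows. The element and field
names here are hypothetical stand-ins, not bzrlib's actual deserializer
API; the point is only that a (file_id, revision_id) hit skips
rebuilding the entry entirely:

```python
# Sketch of the idea, not bzrlib's actual code: cache deserialized
# entries keyed on (file_id, revision_id).  By assumption (1) the entry
# for a given pair never changes, so a cache hit can return the
# previously built entry without touching the XML element again.
_entry_cache = {}

def unpack_entry(elt):
    """Turn one parsed XML element (modelled here as a dict) into an
    entry, consulting the cache first."""
    key = (elt['file_id'], elt['revision'])
    cached = _entry_cache.get(key)
    if cached is not None:
        return cached          # the 90-99% case
    entry = (elt['file_id'], elt['revision'], elt['name'])
    _entry_cache[key] = entry
    return entry
```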
3) The time spent in cElementTree.fromstring() is actually only a tiny
fraction of the total time spent. XML is not really the problem, getting
it from the XML representation into the Inventory representation is what
is hurting us. (And the fact that we can't represent *part* of a tree
very well in XML form.)
4) Because of that hit rate, our ratio of cache hits to misses is very
skewed. elt.get() isn't a very expensive function, so the overhead of
recording every access as part of LRUCache.get() costs more than it
saves.
5) A FIFO cache is a lot better for this reason, but we have to take
more care when we actually fill the cache. 90% of what is in the cache
is what we want to re-use, and the records come to us in "sorted"
order. So once we start pushing out a couple of the records, we tend to
push out all of the interesting ones and effectively restart the cache.
A MySQL tree has 8k entries. With a cache of 10k, it gets flushed about
30 times for the last 100 revisions. Increasing the cache to 15k means
it never gets flushed.
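That flush cascade is easy to reproduce with a small, self-contained
simulation (not bzrlib code): replay the same keys in the same sorted
order through a FIFO cache and count misses.

```python
from collections import OrderedDict

def run_rounds(capacity, keys, rounds):
    """Replay `rounds` passes over `keys` (always in the same order)
    through a FIFO cache of the given capacity; return total misses."""
    cache = OrderedDict()
    misses = 0
    for _ in range(rounds):
        for k in keys:
            if k in cache:
                continue
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the oldest insertion
            cache[k] = True
    return misses

keys = list(range(100))
# Capacity just under the working set: the first eviction pushes out a
# key the next pass needs immediately, so every later pass misses on
# every single key.
small = run_rounds(99, keys, 5)
# With headroom, only the first pass misses.
big = run_rounds(150, keys, 5)
```

With capacity 99 the cascade never stops (500 misses over 5 passes);
with capacity 150 only the initial 100 misses occur, which mirrors the
8k-entry/15k-cache numbers above.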
I have ideas for how to write a C/Pyrex LRUCache with much less
overhead, but I didn't feel like solving that right now. In the
meantime, I added the resize() functionality to FIFOCache that I had
meant to add before.
I currently resize to 2x the size of an inventory, which seemed
reasonable. We could probably go down to 1.5x, but I figured holding 2
inventories in memory is perfectly reasonable (the code using this is
holding 100 in memory...)
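A minimal sketch of a FIFO cache with the resize() behavior described;
bzrlib's FIFOCache is richer, and the names here are illustrative:

```python
from collections import OrderedDict

class FIFOCache(OrderedDict):
    """FIFO cache: evicts the oldest insertion when over capacity.
    Unlike an LRU, get() records nothing, so hits stay cheap."""

    def __init__(self, max_size):
        super().__init__()
        self._max_size = max_size

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        while len(self) > self._max_size:
            self.popitem(last=False)   # drop the oldest insertion

    def resize(self, max_size):
        """Change the cap, evicting oldest entries if shrinking."""
        self._max_size = max_size
        while len(self) > max_size:
            self.popitem(last=False)
```

The deserializer would then call something like
`cache.resize(2 * len(inventory))` before unpacking, so one whole
inventory plus some churn always fits.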
6) Adding the FIFO cache brings us to 60-70% of the original cost of
inventory deserialization.
7) There is a fairly sizeable overhead in Inventory.add(), if only
because it adds 2 function calls. Inventory.add() also has to handle
things like entries that already have children, etc., which we know
cannot happen during inventory deserialization.
For safety, I left a couple of those checks in, especially because
"time" showed that removing them only changes things by about 1%.
But inlining Inventory.add() saves another 5-10% of total time.
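The inlined fast path amounts to something like the following sketch
(hypothetical dict-based structure, not bzrlib's Inventory): during
deserialization entries arrive parent-first with no children of their
own, so the general-purpose checks reduce to two plain dict writes.

```python
def add_entry_fast(by_id, entry):
    """Fast-path insert for deserialization only: assumes the parent
    was added first and the new entry carries no children, so none of
    Inventory.add()'s safety checks are needed."""
    by_id[entry['file_id']] = entry
    parent_id = entry['parent_id']
    if parent_id is not None:
        # the parent is guaranteed to be present already
        by_id[parent_id]['children'][entry['name']] = entry
```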
8) One other really big win that I didn't implement is getting rid of
the InventoryEntry.copy() call for files and symlinks. This is safe if
the Inventories are treated as readonly, or if "apply_delta" would
create new File records, but I'm not sure we can trust Inventory users
not to mess with the entries in our cache.
It *is* a fairly large win (25s => 20s, or about 20% of remaining time).
So I would consider adding a flag somewhere to tell the deserializer
that "I know I have to treat these as readonly".
(It is never safe to not copy InventoryDirectory, because its list of
children is different even when self.revision has not changed. The old
'last-modified-effects-directory' issue.)
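The proposed flag could look like this sketch (hypothetical helper, not
an existing bzrlib API): when the caller promises readonly access, hand
back the cached object directly instead of copying; directories would
still always be copied, for the reason just given.

```python
import copy

def cached_entry(cache, key, readonly=False):
    """Return the entry for `key`.  With readonly=True the caller
    promises not to mutate it, so the cached object is shared;
    otherwise a copy protects the cache from mutation."""
    entry = cache[key]
    if readonly:
        return entry
    return copy.copy(entry)
```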
9) _iter_inventory_xmls() is caching all of the fulltext bytes of all
inventories requested, which is a huge waste of memory. I made it a tiny
bit better by popping off records as-you-go, but this could be made far
more efficient.
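The as-you-go fix amounts to something like this sketch (hypothetical
names, not the actual _iter_inventory_xmls code): buffer only the
records that arrive ahead of the requested order, and pop each fulltext
as it is yielded so they do not all stay live at once.

```python
def iter_texts(record_stream, order):
    """Yield (key, text) pairs from `record_stream` in the order given
    by `order`, holding only out-of-order records in memory and popping
    each fulltext as soon as it is handed out."""
    pending = {}
    remaining = list(order)
    remaining.reverse()            # pop the next wanted key from the end
    for key, text in record_stream:
        pending[key] = text
        # drain everything we can now hand out in the requested order
        while remaining and remaining[-1] in pending:
            wanted = remaining.pop()
            yield wanted, pending.pop(wanted)
```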
John
=:->
Name: xml_cache.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20081210/c9e23d0d/attachment-0001.diff