[RFC] Caching in chk_map.py - advice needed

John Arbash Meinel john at arbash-meinel.com
Sat Mar 7 17:34:47 GMT 2009



Ian Clatworthy wrote:
> So deserialisation of nodes is taking 5-10% of the
> time in fast-import and I'm thinking "it shouldn't
> need to be doing that because I just serialised
> those nodes". We *are* caching things in chk_map.py
> but it's the mapping from
> 
>   key -> raw bytes
> 
> not the mapping from ??? -> nodes.
> 
> I've tried adding the latter using numerous attempts
> at a lookup key but they have all broken the test suite.
> Can anyone tell me whether that's because what I'm trying
> to do is conceptually wrong? Or are the tests
> breaking because the sample data in there is ultra
> simplistic and therefore causing unexpected clashes?

So my guess is that you should be using the sha1 as the key, since that
is what we use to find the bytes anyway.

Further, I think the problem you are having is that LeafNode and
InternalNode are mutable objects. (They directly .apply_delta() and
mutate their internals, and then save the result.) So they aren't safe
to just cache as-is. You need to copy them.
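A minimal sketch of what such a copy-on-access cache could look like. This is not the bzrlib implementation; the class and method names here are hypothetical, and it sidesteps the real LeafNode/InternalNode APIs by deep-copying whatever object is stored:

```python
import copy


class NodeCache(object):
    """Sketch: cache deserialized nodes keyed by their sha1.

    Because nodes are mutated in place (e.g. via apply_delta), we
    never hand out the cached object itself -- only copies -- so the
    cached original stays pristine.
    """

    def __init__(self):
        self._nodes = {}

    def add(self, sha1, node):
        # Store a private copy so later mutation of `node` by the
        # caller cannot corrupt the cached entry.
        self._nodes[sha1] = copy.deepcopy(node)

    def get(self, sha1):
        node = self._nodes.get(sha1)
        if node is None:
            return None
        # Return a fresh copy, never the cached object itself.
        return copy.deepcopy(node)
```

Deep-copying on both add and get is the brute-force way to make mutable objects cache-safe; a cheaper route would be a purpose-built copy() on the node classes that shares immutable internals.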

I've also seen _deserialize be a measurable fraction of the overall
time. I'd like to look into tuning/optimizing it before we throw in a
cache, though in the end I think we'll want a cache.

Specifically, the current format is not tuned so that a C parser could
process it as fast as possible, and I'd like to change that. For
example, probably storing the sha references as 20-byte binary blobs,
rather than ascii strings. That makes them fixed-width, and very fast
to read, and it means that we don't have to push our compressor to take
out the redundancy.
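For illustration, the size difference between the two forms is easy to see with the standard library (plain Python, not bzrlib code):

```python
import hashlib
import binascii

digest = hashlib.sha1(b'some node bytes').digest()   # 20 raw bytes
hexdigest = binascii.hexlify(digest)                 # 40 ascii bytes

assert len(digest) == 20
assert len(hexdigest) == 40

# Round-trip: the fixed-width binary form can be parsed with a plain
# 20-byte slice, no delimiter scanning or hex decoding needed.
assert binascii.unhexlify(hexdigest) == digest
```

Besides halving the stored width, the binary form removes the ascii redundancy that the compressor would otherwise have to squeeze out itself.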

I also noticed a decent amount of time spent in "_compute_prefix", which
we may be able to just write into the data, rather than computing it on
each read. (InternalNode should actually already know about this,
LeafNodes may need a bit more info.)
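To illustrate the kind of work that gets repeated on every read, a common-prefix computation over a node's keys might look like this (a sketch of the general technique, not the actual _compute_prefix code):

```python
def compute_common_prefix(keys):
    """Return the longest common prefix of a list of byte strings."""
    if not keys:
        return b''
    prefix = keys[0]
    for key in keys[1:]:
        # Shrink the candidate prefix until this key matches it too.
        while not key.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return b''
    return prefix
```

Writing the prefix into the serialized bytes would turn this scan over every key into a single read at deserialization time.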

Oh, and if you want to use an LRUSizeCache, you'll need to be careful
about what function is used for 'compute_size'. The default is len() but
for Leaf/Internal nodes that returns the number of items contained inside.
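The pitfall can be sketched like this, with a hypothetical stand-in node class (LRUSizeCache in bzrlib.lru_cache accepts a compute_size callable; the node class below is made up for illustration):

```python
class FakeLeafNode(object):
    """Stand-in: __len__ reports item count, like Leaf/InternalNode."""

    def __init__(self, items, raw_size):
        self._items = items
        self._raw_size = raw_size

    def __len__(self):
        # Number of contained items, NOT the memory/byte footprint.
        return len(self._items)


def node_size(node):
    """compute_size replacement: report something closer to real cost."""
    return node._raw_size


node = FakeLeafNode(items={'a': '1', 'b': '2'}, raw_size=4096)

# The default len()-based sizing would count this node as "2",
# badly underestimating how much memory the cache is holding:
assert len(node) == 2
assert node_size(node) == 4096
```

With the default, a size-capped cache would happily hold thousands of large nodes because each one "weighs" only a handful of items.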
> 
> In the last 24 hours, I've managed to get fast-import
> down from 49m to 3m on my sample data set. But I'm sure
> there's still plenty of scope for tuning the chk-map
> layer. Altogether 32% of the time is taken in there
> while the groupcompress layer takes 20% and the parsing
> layer takes 3.5%.
> 
> On the bright side, importing into a gc-chk255 branch
> is now twice as fast as importing into a 1.9 branch.
> It's still a bit slower, though, than fast-import
> was back in the old (pre-VersionedFiles) days: it used
> to take a mere 1m50s IIRC (and the parsing layer is 30s
> or more quicker now).
> 
> Ian C.
> 

As for groupcompress.py, have you tried setting _FAST=True? This changes
the delta computation to only look at the fulltext, which generally
saves a good amount of time. I have some ideas how to improve things
without that flag, but for now... When doing conversions I generally set
_FAST=True, and then when it is done, I set it to False and do a 'bzr pack'.

Deferring the _FAST=False pass to a final 'bzr pack' is also good
because autopack re-packs all the bytes, but since you haven't finished
converting yet, you can't get the optimal packing at that point anyway.

John
=:->
