Optimizing tree building

Fri Jun 8 01:38:45 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> Hi all,
> 
> Attacking tree-building from the other angle, I did a profiling run with
> kcachegrind.
> 
> We spend 72% of our time in workingtree_4.get_file_lines.
> We spend 26% of our time in _KnitIndex._load_data.
> We spend 49% of our time in _get_content_maps, reading the various
> records for knit data.

I have pyrex code for _load_data, which

a) needs review:
http://bundlebuggy.aaronbentley.com/request/%3C46422E07.2090800@arbash-meinel.com%3E

b) Probably needs to be cleaned up slightly for how we want to include pyrex
code in bzrlib.

It should help that 26% a lot at least.

Also, our indicies aren't really written for speed of parsing. So there are
probably things we could do. Also, because of dictionary compression, it is a
little bit tricky to read them backwards. If we started them with their line
number, that would help. So that we would have:

1 revision-xxyyzz
2 revision-zzaabb 1
...

Otherwise you have to already know a parent to know when you've actually read
the right parent record.

> 
> It seems essential to me that we avoid parsing entire indices, because
> the size of these indices scales linearly with the length of history.
> 
> I haven't got a clear idea how to do that.  One option would be to sort
> the indices, so that we can bisect into them.  It would help if they had
> fixed-length records.
> 
> But then we couldn't append to indices, and rewriting indices would have
> the same scaling problem.  So perhaps we'd have a series of indices?
> 
> But then how would we know which index held the information for a given
> revision?  Maybe we put a bloom filter in the filenames?

We looked into that, too. Bloom filters work okay when you use 1-byte per key
(1% mark is around 9.6bits, 10% around 4.?? bits). The problem is that with
case sensitivity you don't have 8-bits per byte that you can use. Also, you
have to be careful about filename collision when the filters start to fill up.
(A full filter ends up as 0xFFFFFF, which obviously collides).

Anyway, you end up able to store maybe 100 entries in a reasonable filename
with a 10% error rate. So it is better than nothing, but 100 isn't very much
(considering when you are dealing with 10000 revisions, or even 1M revisions).

You *could* put a decent sized bloom filter at the start of the file (1k is
pretty good, 10k is getting decent all around).

But then again, what would you gain if you stored a trie of the revision ids in
the first 10k. It wouldn't be nearly as compact as the bloom filter, but you
would be a long way in knowing *exactly* what revision ids are present.

> 
> I really don't know where this will lead to.  But even if we don't move
> to a blob format for storing revision data, the index problem has to be
> solved.
> 
> Anyone?
> 
> Aaron

I do think we will have an index. I thought we had discussed having it
write-once (so each blob gets its own index, and you don't append to existing
blobs, every so often you combine blobs into a new blob, and write out a new
index).

That has the nice property that you can optimize the index in any way that you
want (bisect, bloom filter, whatever).

It means you may need to read more total indices, but after the combining step,
there are fewer again.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGaKUVJdeBCYSNAAMRAmWrAJ92jF8SL0UVI4ZaHC/puDQ25v3FeQCgv1F+
xj3T5jjfn/jf5DR2BCBXL/s=
=qYax
-----END PGP SIGNATURE-----