Optimizing tree building

Aaron Bentley aaron.bentley at utoronto.ca
Mon Jun 18 14:26:26 BST 2007



John Arbash Meinel wrote:
> I have pyrex code for _load_data, which
> 
> a) needs review:
> http://bundlebuggy.aaronbentley.com/request/%3C46422E07.2090800@arbash-meinel.com%3E

I've been waiting until the pyrex inclusion policy was all settled.  Is
it all settled yet?

> Also, our indices aren't really written for speed of parsing. So there are
> probably things we could do. Also, because of dictionary compression, it is a
> little bit tricky to read them backwards. If we started them with their line
> number, that would help. So that we would have:
> 
> 1 revision-xxyyzz
> 2 revision-zzaabb 1
> ...
> 
> Otherwise you have to already know a parent to know when you've actually read
> the right parent record.

It seems like we could almost cache that data, were it not that some
indices get rewritten frequently (who, me? :-)
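
For illustration, here's roughly what parsing such a self-numbered
record might look like (names invented; nothing like our real index
code):

    def parse_record(line):
        # John's proposed layout: "<lineno> <revision-id> [parent-lineno ...]"
        # Because each record carries its own line number, a record read
        # from the middle or end of the file is self-identifying, so you
        # can resolve parent references without parsing from the top.
        parts = line.split()
        lineno = int(parts[0])
        revision_id = parts[1]
        parent_linenos = [int(p) for p in parts[2:]]
        return lineno, revision_id, parent_linenos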

>> But then how would we know which index held the information for a given
>> revision?  Maybe we put a bloom filter in the filenames?
> 
> We looked into that, too. Bloom filters work okay when you use 1 byte per key
> (the 1% false-positive mark is around 9.6 bits per key, 10% around 4.8 bits).
> The problem is that filenames can't rely on case sensitivity, so you don't
> have 8 usable bits per byte. Also, you have to be careful about filename
> collisions when the filters start to fill up. (A full filter ends up as
> 0xFFFFFF, which obviously collides.)

You can reduce error rates by looking for direct ancestors of the
revisions you want, so 4 bits per key could work.  This does give a
slight chance of false negatives due to ghosts.
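
To put numbers on it (standard bloom-filter arithmetic, nothing
bzr-specific):

    import math

    def bits_per_key(p):
        # Bits per key needed for false-positive rate p: ln(1/p) / (ln 2)^2
        return math.log(1.0 / p) / (math.log(2) ** 2)

    print(bits_per_key(0.01))  # ~9.6 bits/key for a 1% error rate
    print(bits_per_key(0.10))  # ~4.8 bits/key for 10%

    # Checking both a revision and its parent multiplies the two
    # (roughly independent) false-positive rates, so 10% per lookup
    # behaves like ~1% combined -- unless the parent is a ghost, in
    # which case a genuinely present revision can be missed.
    print(0.10 * 0.10)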

> Anyway, you end up able to store maybe 100 entries in a reasonable filename
> with a 10% error rate. So it is better than nothing, but 100 isn't very much
> (considering you may be dealing with 10,000 revisions, or even 1M revisions).

Well, it's better than four-files-per-revision.  But no one would be
crazy enough to do that, would they? :-)

> You *could* put a decent sized bloom filter at the start of the file (1k is
> pretty good, 10k is getting decent all around).

I suppose you could also do an index of indices.  That's basically what
I was using the filenames for.
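
Rough capacity for those in-file filters, using the same formula as
above:

    import math

    bits = lambda p: math.log(1.0 / p) / math.log(2) ** 2
    print(8192 / bits(0.01))    # 1 KiB filter: ~850 keys at a 1% error rate
    print(81920 / bits(0.01))   # 10 KiB filter: ~8500 keys at 1%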

> But then again, what would you gain if you stored a trie of the revision ids
> in the first 10k? It wouldn't be nearly as compact as the bloom filter, but
> you would go a long way toward knowing *exactly* which revision ids are
> present.

Still haven't really gotten my head around tries.
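
(For reference, since I had to look it up: a trie is just a tree keyed
by successive characters of the key.  A minimal in-memory sketch -- an
on-disk version for an index would obviously need a compact node
encoding:)

    class TrieNode(object):
        def __init__(self):
            self.children = {}    # char -> TrieNode
            self.is_key = False   # True if a revision id ends here

    def trie_insert(root, revision_id):
        node = root
        for ch in revision_id:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def trie_contains(root, revision_id):
        node = root
        for ch in revision_id:
            node = node.children.get(ch)
            if node is None:
                return False      # exact answer: no false positives
        return node.is_key

Shared prefixes (like "revision-") are stored once, but it's still
bigger than a bloom filter because it stores the actual keys.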

> I do think we will have an index. I thought we had discussed having it
> write-once (so each blob gets its own index, and you don't append to existing
> blobs; every so often you combine blobs into a new blob and write out a new
> index).
> 
> That has the nice property that you can optimize the index in any way that you
> want (bisect, bloom filter, whatever).
> 
> It means you may need to read more total indices, but after the combining step,
> there are fewer again.

Right, and I guess we're already scaling badly at that point, so the
indices hardly make a difference.
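
A minimal sketch of that lookup/combine scheme (hypothetical names, not
our actual pack code):

    def lookup(revision_id, indices):
        # Each index maps revision-id -> (blob, offset).  Search newest
        # first; cost grows with the number of live indices, which is
        # why the combining step matters.
        for index in reversed(indices):
            if revision_id in index:
                return index[revision_id]
        return None

    def combine(indices):
        # Merge several small write-once indices into one larger one;
        # later (newer) indices win on duplicate keys.
        merged = {}
        for index in indices:
            merged.update(index)
        return merged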

Aaron


