Optimizing tree building
aaron.bentley at utoronto.ca
Mon Jun 18 14:26:26 BST 2007
John Arbash Meinel wrote:
> I have pyrex code for _load_data, which
> a) needs review:
I've been waiting until the pyrex inclusion policy was settled. Is it
all settled yet?
> Also, our indices aren't really written for speed of parsing. So there are
> probably things we could do. Also, because of dictionary compression, it is a
> little bit tricky to read them backwards. If we started them with their line
> number, that would help. So that we would have:
> 1 revision-xxyyzz
> 2 revision-zzaabb 1
> Otherwise you have to already know a parent to know when you've actually read
> the right parent record.
It seems like we could almost cache that data, were it not that some
indices get rewritten frequently (who, me? :-)
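To make the numbered-index idea above concrete, here is a small sketch (my own illustration, not Bazaar's actual index code): each record starts with its line number, and parent references are line numbers, so a record can be resolved without already knowing the parent's revision id.

```python
# Hypothetical sketch of the numbered-index format quoted above:
# '<lineno> <revision-id> [parent-lineno ...]', one record per line.

def parse_numbered_index(text):
    """Return a dict mapping revision id -> list of parent revision ids."""
    by_line = {}
    for line in text.splitlines():
        fields = line.split()
        by_line[int(fields[0])] = (fields[1], [int(f) for f in fields[2:]])
    # Resolve parent line numbers back to revision ids.
    return {
        rev_id: [by_line[p][0] for p in parents]
        for rev_id, parents in by_line.values()
    }

sample = "1 revision-xxyyzz\n2 revision-zzaabb 1"
print(parse_numbered_index(sample))
# {'revision-xxyyzz': [], 'revision-zzaabb': ['revision-xxyyzz']}
```

Because records carry their own line numbers, a reader scanning from the end of the file can resolve parents as soon as it has seen the referenced lines, which is what makes backward reading workable despite the dictionary compression.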
>> But then how would we know which index held the information for a given
>> revision? Maybe we put a bloom filter in the filenames?
> We looked into that, too. Bloom filters work okay when you use 1-byte per key
> (1% mark is around 9.6bits, 10% around 4.?? bits). The problem is that with
> case sensitivity you don't have 8-bits per byte that you can use. Also, you
> have to be careful about filename collision when the filters start to fill up.
> (A full filter ends up as 0xFFFFFF, which obviously collides).
You can reduce error rates by looking for direct ancestors of the
revisions you want, so 4 bits could work. This does give a slight
chance of false negatives due to ghosts.
> Anyway, you end up able to store maybe 100 entries in a reasonable filename
> with a 10% error rate. So it is better than nothing, but 100 isn't very much
> (considering when you are dealing with 10000 revisions, or even 1M revisions).
Well, it's better than four-files-per-revision. But no one would be
crazy enough to do that, would they? :-)
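For readers following the bits-per-key numbers above, here is a minimal bloom filter sketch (illustrative only; the sizes and hash scheme are my assumptions, not Bazaar's). With m bits, k hashes, and n keys, the false-positive rate is roughly (1 - e^(-kn/m))^k, which is where ~9.6 bits/key for 1% comes from.

```python
# Minimal bloom filter sketch, sized at ~9.6 bits per key for a ~1%
# false-positive rate. Added keys are always found (no false negatives);
# absent keys are occasionally reported present.
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # integer used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(b"%d:%s" % (i, key.encode())).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter(num_bits=960, num_hashes=7)  # ~9.6 bits/key for 100 keys
for i in range(100):
    bf.add("revision-%d" % i)
assert "revision-42" in bf  # guaranteed: no false negatives
```

The filename-capacity problem in the quote follows directly: at ~4.8 bits/key for a 10% error rate, even a generous filename budget only buys room for on the order of 100 keys.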
> You *could* put a decent sized bloom filter at the start of the file (1k is
> pretty good, 10k is getting decent all around).
I suppose you could also do an index of indices. That's basically what
I was using the filenames for.
> But then again, what would you gain if you stored a trie of the revision ids in
> the first 10k. It wouldn't be nearly as compact as the bloom filter, but you
> would be a long way in knowing *exactly* what revision ids are present.
Still haven't really gotten my head around tries.
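For anyone else in the same boat, a trie is just a tree keyed one character at a time, so shared prefixes (like a common "revision-" prefix) are stored once. A toy sketch, not a proposal for the actual on-disk layout:

```python
# Toy trie over revision ids: nested dicts keyed by character, with an
# empty-string sentinel marking the end of a complete key.
def trie_insert(root, key):
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node[""] = True

def trie_contains(root, key):
    node = root
    for ch in key:
        if ch not in node:
            return False
        node = node[ch]
    return "" in node

root = {}
for rev in ("revision-xxyyzz", "revision-zzaabb"):
    trie_insert(root, rev)
assert trie_contains(root, "revision-zzaabb")
assert not trie_contains(root, "revision-aabbcc")
```

Unlike a bloom filter, a lookup gives an exact yes/no, which is the "knowing *exactly* what revision ids are present" property John mentions, at the cost of more space.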
> I do think we will have an index. I thought we had discussed having it
> write-once (so each blob gets its own index, and you don't append to existing
> blobs, every so often you combine blobs into a new blob, and write out a new
> [index]).
> That has the nice property that you can optimize the index in any way that you
> want (bisect, bloom filter, whatever).
> It means you may need to read more total indices, but after the combining step,
> there are fewer again.
Right, and I guess we're already scaling badly at that point, so the
indices hardly make a difference.
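The write-once-plus-combine scheme above can be sketched like this (my illustration; names and the threshold are assumptions, and real values would be file offsets or similar):

```python
# Sketch of write-once per-blob indices with a periodic combine step:
# indices are immutable once written; when enough small ones accumulate,
# they are merged into a single new index, so reads touch fewer files.
class IndexStore:
    def __init__(self, combine_threshold=4):
        self.indices = []  # one immutable dict per blob, oldest first
        self.combine_threshold = combine_threshold

    def add_blob_index(self, index):
        self.indices.append(dict(index))
        if len(self.indices) >= self.combine_threshold:
            merged = {}
            for idx in self.indices:
                merged.update(idx)  # newer entries win on duplicates
            self.indices = [merged]

    def lookup(self, rev_id):
        # Between combines, a lookup may consult several indices.
        for idx in reversed(self.indices):
            if rev_id in idx:
                return idx[rev_id]
        raise KeyError(rev_id)
```

This is the trade-off named in the quote: lookups may read more indices right before a combine, but each individual index is free to use whatever internal layout (bisectable, bloom-filtered, etc.) suits it, since it is never appended to.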