Bazaar, OLPC, and the "no-packs" repository format.

John Arbash Meinel john at arbash-meinel.com
Thu Dec 20 13:54:08 GMT 2007



Lukáš Lalinský wrote:
> On Št, 2007-12-20 at 06:25 -0600, John Arbash Meinel wrote:
>> Robert has at least conceptualized a hash-based index, so lookups become more
>> of an O(1) operation. Which makes it O(N) for N packs, but that is better than
>> O(N log(M)) for our current indexes. I wonder if you could even combine the
>> hash maps to make it O(1) across all indexes that you have read...
> 
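Combining the per-pack hash maps into a single lookup table, as wondered above, might look roughly like this (a hypothetical sketch; the function and argument names are mine, not bzrlib's):

```python
def combine_pack_indexes(pack_indexes):
    # pack_indexes: list of (pack_name, {key: offset}) pairs, oldest
    # pack first, so entries in newer packs shadow older ones.
    # A single dict probe then answers "which pack, what offset?" in
    # O(1), instead of probing each of the N pack indexes in turn.
    combined = {}
    for pack_name, index in pack_indexes:
        for key, offset in index.items():
            combined[key] = (pack_name, offset)
    return combined
```

The trade-off is paying the merge cost up front (and on every new pack) in exchange for constant-time lookups afterwards.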
> I implemented a prototype of a similar idea about two weeks ago, but
> gave up on finishing it because I found that the main bottleneck for
> me was reading data from all over the big pack files. The indexing code
> is in the attachment, but it's only a part of the plugin, which itself
> requires a patched bzrlib and breaks 'packs-0.92', so I didn't bother
> attaching the whole package (it's not very usable, anyway).

For what operations?

I'm mostly looking at operations like "get_revision_graph()" at the moment,
which doesn't have to go to the .pack file, and I would guess never needs to.

If you are talking about something like fetching, then probably the bulk data
is a bigger issue.

Again, we would like to inverse-topo-sort the data in the .pack file, though
most likely we will do a hybrid. Because of how our deltas work out, I'm
thinking to do:

 FullText, Delta1 based on FT, Delta2 based on FT or Delta1, ...

Where the FullTexts are inverse topo sorted (newest FT at the front), and then
all deltas based on the FT come after.
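That hybrid ordering could be sketched like so (a hypothetical illustration; `order_pack_texts` and its arguments are made-up names, not bzrlib API):

```python
from collections import defaultdict

def order_pack_texts(fulltexts, delta_basis):
    # fulltexts: keys stored as fulltexts, already inverse
    #   topologically sorted (newest fulltext first).
    # delta_basis: {delta_key: basis_key}, where the basis is either
    #   a fulltext or an earlier delta in the same chain.
    children = defaultdict(list)
    for key, basis in delta_basis.items():
        children[basis].append(key)

    order = []
    def emit(key):
        order.append(key)
        for child in children[key]:  # each delta follows its basis
            emit(child)

    for ft in fulltexts:
        emit(ft)
    return order
```

So every delta lands after the fulltext (and any intermediate delta) it builds on, while the fulltexts themselves stay newest-first.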


> 
> The general idea is that the index is split into segments, which are
> grouped by their left-hand history (similar to toposort, but backwards)
> and the index header contains 32 bit hashes of each key of each segment.
> The combined index also uses a global cache, so in most cases you don't
> search in all indexes, but only the ones that actually contain the key
> (for 15k revisions in bzr.dev I get no hash collisions, so it's
> something like O(1+very_small_constant*N)). But that helps only in cases
> where you have a few big indexes, not a *lot* of tiny ones.
> 
> Lukas
> 
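Lukas's global-cache idea above, roughly sketched (`crc32` stands in for whatever 32-bit hash the real index header would use; all names are hypothetical):

```python
from zlib import crc32

def build_key_filters(indexes):
    # Pair each index with the set of 32-bit hashes of its keys.
    return [(index, {crc32(key) for key in index}) for index in indexes]

def indexes_to_search(filters, key):
    # Only indexes whose hash set contains the key's hash need to be
    # searched; a rare collision just costs one wasted probe.
    h = crc32(key)
    return [index for index, hashes in filters if h in hashes]
```

As noted, this mainly pays off with a few big indexes, since with many tiny indexes you still hold one hash set per index.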

That seems like it would make it difficult to look up an arbitrary key, since
you don't know where it fits relative to other keys.

I believe Robert's idea was to have the hash at the beginning point to all keys
present in the index. To keep it small, the hash would be variable size, based
on however many bits it took to make things unique. Something like doing:

hash = sha1(key)
hash_bits = hash[:n_bits]

So you can be rather confident that with the full 20 bytes of SHA-1 hash you
will not get collisions, but you may only need 2 of those 20 bytes.
It also means that the raw hash could be the same across all files; you
just extract the right number of bits for each index.

Then again, it doesn't hurt us a lot to get a hash collision, so having a
16-bit or 32-bit hash could be reasonable.
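A rough sketch of that variable-size scheme, truncating to whole bytes rather than bits for simplicity (the names are hypothetical, not Robert's actual design):

```python
from hashlib import sha1

def build_truncated_index(entries):
    # entries: {key_bytes: offset}.  Keep the shortest whole-byte
    # prefix of each key's SHA-1 that still leaves every entry
    # distinct within this index.
    full = {sha1(key).digest(): offset for key, offset in entries.items()}
    for n_bytes in range(1, 21):
        if len({h[:n_bytes] for h in full}) == len(full):
            break
    return n_bytes, {h[:n_bytes]: offset for h, offset in full.items()}

def lookup(n_bytes, index, key):
    # The same raw SHA-1 works against every index; each index just
    # compares its own number of prefix bytes.
    return index.get(sha1(key).digest()[:n_bytes])
```

Since a collision only means one extra (wasted) record read, a fixed 16-bit or 32-bit prefix would avoid per-index prefix lengths entirely.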

John
=:->




