[MERGE] Repository.get_revision_graph improvements

John Arbash Meinel john at arbash-meinel.com
Tue Feb 26 22:16:55 GMT 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Mon, 2008-02-25 at 18:20 -0600, John Arbash Meinel wrote:
>> I think that is a strong argument for reworking our index layer.
> 
> Do you mean tweaking or completely redesigning? I have a branch that
> adds a compressed hashtable to the front of the index with the goal of
> reducing bisection in memory; but this is not a complete redesign, the
> basics:
>  - indices are write-once
>  - N static indices
> is retained.

Well, the point is to get away from our current scaling behavior. If the
hashtable lookup actually becomes O(1) we'll probably be doing great. We
certainly shouldn't be scaling 7x slower just because we have 3x as many
pack files.

> 
> Also we can do more with the same disk data- better locality of
> reference, query last-hit ones first, sort for initial-lookup by size
> (smaller first?) etc...
> 
> -Rob
> 

I *think* a global hashtable will do very well, since you can aggregate
all of the indexes that you've read. With a decent 32-bit hash we are
very unlikely to get collisions anyway (even with 1 million nodes, the
keyspace is 4 billion, so only about 1 in 4000 hash values is used.)

If we start mixing the hash spaces we might start getting collisions
(since there are a lot more (file_id, revision_id) combinations) but
that is still a pretty low chance.
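As a quick sanity check on those numbers (a back-of-envelope sketch; the
constants are just the ones from the paragraphs above, not measured data):

```python
# Load factor of a 32-bit hash space holding 1 million nodes, as described above.
nodes = 1_000_000
keyspace = 2 ** 32          # ~4.29 billion possible 32-bit hash values

fraction = keyspace // nodes
print(f"only 1 in ~{fraction} hash values is used")

# The chance that any given key shares its 32-bit hash with one of the
# other keys is roughly nodes / keyspace -- about 0.02%.
per_key_collision = nodes / keyspace
print(f"per-key collision chance: {per_key_collision:.4%}")
```

So an occasional collision just means one extra key comparison on lookup,
not a correctness problem.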

I'm guessing there are better systems than a hashtable for read
performance, and maybe for locality of reference on lookup performance.
However, it will be a large upgrade over what we currently have, and the
complexity is reasonable.

I've also been wondering about compressing (gzip/bzip2) a text hash
table, or using a binary hash table. I really like having something you
can read without specialized tools, but binary would give us a nice
fixed width for each hashtable entry, and if it means we read 1/2 the
data before we can start doing anything, that seems like a nice win.
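To get a rough feel for that text-vs-binary tradeoff, here is a quick
sketch with synthetic hash/offset pairs (illustrative only, not a real
bzr index layout):

```python
import zlib

# Synthetic (hash, offset) entries: pseudo-random 32-bit hashes, regular offsets.
entries = [(i * 2654435761 % 2 ** 32, i * 16) for i in range(10_000)]

# Text form: one "hexhash offset" line per entry, readable without tools.
text_form = "".join(f"{h:08x} {o}\n" for h, o in entries).encode()

# Binary form: two fixed-width 32-bit big-endian ints per entry.
binary_form = b"".join(h.to_bytes(4, "big") + o.to_bytes(4, "big")
                       for h, o in entries)

print(len(text_form), len(binary_form))   # binary is roughly half the size
print(len(zlib.compress(binary_form)))    # the regular offsets compress well
```

The hashes themselves look random and so compress poorly, but the
fixed-width binary form still starts out much smaller than the text form.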

I was thinking something like:

INDEX := HEADER HASH_TABLE INDEX_NODE*
HEADER := "Bazaar Pack index format foo vYYY\n"
HASH_TABLE := TABLE_LENGTH HASH_NODE*
HASH_NODE := HASH_KEY OFFSET
HASH_KEY := 32BIT_BIGEND_INT
OFFSET := 32BIT_BIGEND_INT
INDEX_NODE := (HASH_KEY?) KEY INFO

By making the hash table entries fixed size, you know how much space to
reserve for the table at the beginning of the file. You could also
compress it, and just reserve enough space to hold it uncompressed. You
would still have a fixed maximum size, and would simply read less data
when you actually go to read the file.
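A minimal sketch of that fixed-width HASH_NODE layout: each entry is a
32-bit big-endian HASH_KEY followed by a 32-bit big-endian OFFSET,
preceded by TABLE_LENGTH. The function names here are illustrative, not
anything from bzrlib:

```python
import struct

# One HASH_NODE: HASH_KEY then OFFSET, both 32-bit big-endian ints.
HASH_NODE = struct.Struct(">II")

def pack_table(entries):
    """Serialize (hash_key, offset) pairs as TABLE_LENGTH then HASH_NODE*."""
    out = [struct.pack(">I", len(entries))]
    for hash_key, offset in sorted(entries):
        out.append(HASH_NODE.pack(hash_key, offset))
    return b"".join(out)

def unpack_table(data):
    """Read TABLE_LENGTH, then that many fixed-width HASH_NODE entries."""
    (count,) = struct.unpack_from(">I", data, 0)
    pos = 4
    table = []
    for _ in range(count):
        table.append(HASH_NODE.unpack_from(data, pos))
        pos += HASH_NODE.size
    return table

table = pack_table([(0xDEADBEEF, 0), (0x00C0FFEE, 42)])
print(unpack_table(table))   # entries come back sorted by hash key
```

Because every entry is exactly 8 bytes, a reader can also bisect (or
hash directly) into the table without parsing it sequentially.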

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHxI/WJdeBCYSNAAMRAgJ9AJ9eQz4JWGteNHr1bolON/dcNMg/aACfTNB3
TnWqrQvJBEjCqAlSbl0l/ck=
=zjOE
-----END PGP SIGNATURE-----


