Prototype "improved_chk_index"

John Arbash Meinel john at arbash-meinel.com
Wed Oct 28 21:33:46 GMT 2009



A while back I put together a draft of a few ideas I had for improving
our index code to handle CHK pages. The draft is a bit dense with ideas,
and certainly not polished. It is in the bzr source as
'doc/developers/improved_chk_index.txt', or you can read it here:

  http://doc.bazaar-vcs.org/latest/developers/improved_chk_index.html


I've certainly wanted to work on it for a while, so I decided to take a
"slack day" away from memory work, etc, and poke around at it. I was
surprised to find that I had a basic proof-of-concept finished in about
a day.

  lp:~jameinel/bzr/chk-index

The code currently requires Python 2.5 or later, because I use
'struct.Struct'. At the moment, I've only implemented the 'first half'
of the spec. Notably:

1) All keys are stored in full, as 20-byte sha hashes. (In the doc I
   do a lot of theory about how we can 'cheat' and shrink them to as
   little as 6 bytes with a very low chance of collision, and any
   collisions could be trivially resolved.)
2) I made no move to change the 'groups' themselves to contain
   yet another index mapping offset => record entry, which would allow
   us to get rid of the per-entry "inner_start, inner_length" fields.


What *was* implemented was:

1) An index which separates 'group' records into their own collection

2) A single 'mini-index' section that maps sha prefixes to the offset
   where the entries matching that prefix can be found.
   The size of the mini-index is increased/decreased such that we get
   ranges that average ~4k bytes.

3) All of the records are stored as fixed-width binary integers. The
   actual size of each integer is declared in the header.
   So, for example, a 'group' record is 2 integers, 'start in pack' and
   'length of group'. However this can scale down to 'unsigned shorts'
   for trivially small pack files, and up to 'unsigned long long' if we
   have a pack file > 4GB in size.

   The *very good* part of this is that it is compact (zlib cannot
   extract any more compression, and bzip2 actually *expands* it
   slightly), fast to read, and means that you can seek instantly to
   any single record (fixed width).

   The bad part is that the file is complete garbage in a text editor.
   Though I have to say, btree indexes are zlib compressed on 4k
   boundaries, so their content is just as opaque. (and you can't just
   'gunzip' the index because it is zlib not gz, and based on repeated
   compression, rather than compressing the whole stream.)
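To make the mini-index idea concrete, here is a rough sketch (the
function names and the 28-byte record width are my own illustration,
not the actual code in the branch) of bucketing sorted sha1 hashes by
prefix so each bucket covers roughly 4KiB of fixed-width entries:

```python
RECORD_WIDTH = 28   # illustrative: 20-byte hash + 8 bytes of fields
TARGET_RANGE = 4096  # aim for ~4KiB of entries per mini-index bucket


def build_mini_index(sorted_hashes, record_width=RECORD_WIDTH):
    # Pick a prefix length so the average bucket is ~TARGET_RANGE bytes.
    prefix_len = 1
    total = len(sorted_hashes) * record_width
    while total / float(256 ** prefix_len) > TARGET_RANGE:
        prefix_len += 1
    offsets = {}
    for idx, sha in enumerate(sorted_hashes):
        # Record the offset of the *first* entry with each prefix.
        offsets.setdefault(sha[:prefix_len], idx * record_width)
    return prefix_len, offsets


def lookup_range(mini, prefix_len, sha, total_records,
                 record_width=RECORD_WIDTH):
    # Return the (start, end) byte range that must contain the entry,
    # or None if no entry with this prefix exists.
    start = mini.get(sha[:prefix_len])
    if start is None:
        return None
    later = [off for pre, off in mini.items() if pre > sha[:prefix_len]]
    end = min(later) if later else total_records * record_width
    return start, end
```

A lookup then only has to read one small, contiguous slice of the
entry section rather than bisecting the whole file.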
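The fixed-width scaling can be sketched with struct.Struct. This is a
minimal illustration of the idea, with made-up helper names; the real
header layout lives in the branch:

```python
import struct

def width_code(max_value):
    # Pick the smallest struct code that can hold max_value.
    # struct only offers 1/2/4/8-byte unsigned widths.
    for code, limit in (('B', 0xFF), ('H', 0xFFFF),
                        ('I', 0xFFFFFFFF), ('Q', 0xFFFFFFFFFFFFFFFF)):
        if max_value <= limit:
            return code
    raise ValueError('value too large for a fixed-width field')

def group_record_struct(pack_size):
    # A 'group' record is 2 integers: start-in-pack and group length.
    # The width of both is chosen from the largest possible offset.
    code = width_code(pack_size)
    return struct.Struct('>' + code + code)

small = group_record_struct(60000)        # tiny pack: 2 unsigned shorts
big = group_record_struct(5 * 1024 ** 3)  # >4GB pack: 2 'Q' fields
```

So a trivially small pack file pays 4 bytes per group record, while a
>4GB pack scales up to 16, without any change to the file format.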


The code is also simplified from 'btree_index' as I make no attempt to
cap the memory consumption. Once I read something, it stays in the
cache, and when building the index, I don't try to spill anything to
disk. Both could be done, but this was simple, and quite efficient. It
gains some of the memory savings by just being careful about what gets
stored. I also think that a C extension could save quite a bit more.
Right now an entry takes up a StaticTuple + 3 PyInts; it could easily
be transformed into a single 3-integer array, shrinking from around 48
bytes per record to 12 bytes per record. And a dict holding 400k sha1
hashes takes up 12MB for the dict and 18MB for the keys. This could
pretty easily be dropped to ~8MB with a custom data structure.
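The flat-array idea can be sketched in pure Python with the stdlib
array module (the class and its layout are my own illustration, not
code from the branch):

```python
from array import array

class FlatEntries(object):
    """Store (group, inner_start, inner_length) per record in one
    flat array of unsigned ints, instead of one tuple of Python ints
    per record. Illustrative sketch only."""

    def __init__(self):
        self._data = array('I')  # unsigned int, typically 4 bytes

    def add(self, group, inner_start, inner_length):
        self._data.extend((group, inner_start, inner_length))

    def get(self, idx):
        # Records are fixed width, so access is simple arithmetic.
        base = idx * 3
        return tuple(self._data[base:base + 3])

    def bytes_used(self):
        return len(self._data) * self._data.itemsize
```

A C extension would do the same thing with a plain C array, avoiding
even the array-module object overhead.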


The benchmarks so far are rather impressive. Specifically, I did a
'get_stream()' of content out of both a bzr repository, and a launchpad
repository. I instrumented what 'chk' pages were requested, and then
replayed that request against a single btree index, and a CHKIndex. The
result:

  btree bzr  0.979s  =>  chk bzr 0.4s
  btree lp   9.9s    =>  chk lp  1.7s

Note that this is also without a huge amount of performance tuning, and
more importantly does *not* include a C/Pyrex extension. (So roughly
5.8x faster for the Launchpad tree.)

As for the space-on-disk... At present the zlib-compressed btree code
gets about 32 bytes per node, which is already low. Given a 20-byte
hash + 4-byte group start + 4-byte group end + 4-byte inner start +
4-byte inner end, you would be at 36 bytes per node.

However, for Launchpad, all 400k keys compress into only 4k groups.
That means we can encode (group start, group end) in a table, and then
each record only needs 2 bytes to reference its group. Also, for
CHKMap entries, the value is never greater than 64kB, so the 'inner
length' can also be encoded as 2 bytes. So I'm able to encode each
entry as a 20-byte hash, 2-byte group number, 4-byte inner start, and
2-byte inner length, or 28 bytes total.
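That arithmetic can be checked directly with struct (the exact field
order here is my own illustration of the layout described above):

```python
import struct

# 20-byte sha1 + 2-byte group number + 4-byte inner start
# + 2-byte inner length. Big-endian, so no alignment padding.
ENTRY = struct.Struct('>20sHIH')

record = ENTRY.pack(b'\xab' * 20, 1234, 100000, 65535)
sha, group, inner_start, inner_len = ENTRY.unpack(record)
```

ENTRY.size comes out at 28 bytes, versus ~32 bytes per node for the
zlib-compressed btree.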

The numbers on disk:
  btree bzr 3.7MB  =>  chk bzr 3.0MB
  btree lp  16MB   =>  chk lp  12MB

So about 1.3:1 for Launchpad. Note that the structure scales both up
and down: by design, very small indexes use smaller fixed-width
fields. Also, the code doesn't support 'odd' sizes like 3-byte widths,
since 'struct' only supports 1, 2, 4, and 8, and 1 byte saved per
entry is ~0.5MB saved in the index. (And the 4-byte inner start could
almost certainly be a 3-byte value.)



As for 'scaling up': all of bzr's chk pages fit in 2.4k groups, and
all of Launchpad's in 4k groups. Some of my numbers in
'improved_chk_index.txt' are off, mostly because we don't fill the
groups fully, and that in turn is because compression is better when
we only combine related chk pages. However, it would still take a
*lot* of growth to go from 4k groups up to the 64k groups where we
would start needing 1 more byte to encode each group table entry.
Mostly to say, I expect us to stay at 27/28 bytes per entry even for
really huge projects.


Now the bulk of the math in my proposal was about 'cheating' and not
encoding the full 20 bytes of the sha hash. Assuming pure probability
(and not an active attack), with just 8 bytes of hash you can hold 100
million entries and have only about a 1 in 10,000 chance of a single
collision.
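That is just the standard birthday approximation; a quick sanity check
(my formula, not code from the branch) for n random values drawn from
a b-bit hash space:

```python
import math

def collision_probability(n_entries, hash_bits):
    # Birthday approximation: P(any collision) ~= 1 - exp(-n^2 / 2^(b+1))
    exponent = -(n_entries ** 2) / float(2 ** (hash_bits + 1))
    return 1.0 - math.exp(exponent)

# 100 million entries with an 8-byte (64-bit) hash prefix:
p = collision_probability(100 * 10 ** 6, 64)
```

The result is a few parts in ten thousand, the same ballpark as the
figure quoted above, and a single collision is cheap to detect and
resolve at insert time anyway.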

That would remove 12 bytes per record, and shave 6MB off the final
index size. So that 16MB btree becomes a 6MB chk index, and that
100M-entry btree drops from 3.2GB to ~1.6GB. The spec also discusses
removing redundancy: the 'mini-index' can hold a couple of bytes that
then don't need to be repeated on the final 'entry' line, which shaves
off another 100MB or so. In the 'ideal' spec, 100M entries would still
take ~1GB of disk space.

However, 100M chk entries represents a phenomenal amount of history:
something like 30M revisions, and that is being generous about the
number of changes per revision. 1-10M revisions is in the realm of
possibility.

Now, part of the motivation for improving the chk index was so that we
could also move the text content into chk pages, and that will change
the analysis a bit.

However, the proof of concept came together a lot faster than I expected
(makes me wish I had pushed for it as part of 2.0 :). And the
performance results are pretty good, so I figured I'd share the results
so far.

I may not get back to this for a while, mostly because we've stated
"no new formats" for 2.1. So while it could land as a dev format, it
would not get into active use for > 6 months. And IMO it is only
really worthwhile if we also look into the bit-stripping versions, to
get a significant size improvement along with the parsing improvement.

John
=:->


