[MERGE] Notes on a customized chk+gc index

Ian Clatworthy ian.clatworthy at canonical.com
Tue Mar 24 07:37:25 GMT 2009


John Arbash Meinel wrote:
> I was thinking about how our current indexes scale (approx 38 bytes per
> record after compression), and what it would take to get a better index
> for a chk+groupcompress index.

> For Launchpad, we would get a 13MB => 3.3MB cix, which has a 8.0MB btree
> .tix. Which is cutting about 10MB off the total size, which is almost
> 10% of the total size of a packed Launchpad repository, and is also 10%
> of the on-disk size of the python conversion (now down to 96MB).

> This certainly isn't something that should cause us to block on
> releasing brisbane-core, but it is certainly high up on the "nice to
> have" list.

Wow. What a great analysis and brain dump! I wish I were smart enough to
have written it. :-)

I don't know enough about our current index layer to comment too deeply
on this, but I think we can land this document and debate an implementation
and delivery timeline separately. BTW, at the macro level, I'm pretty torn
about the priority of this. 10% of size is nothing to sneeze at, and I
think we're fast approaching the point where index performance is
becoming the bottleneck on numerous operations. OTOH, I worry that
we need to focus on speed first and foremost, and we won't know what
speed gains we'll get from this until it's largely done.

bb: tweak

> +Our current btree style index is nice as a general index, but it is not optimal
> +for Content-Hask-Key based content. With CHK, the keys themselves are hashes,

s/Hask/Hash/

> +Scaling up
> +----------
> +
> +We have said we want to be able to scale to a tree with 1M files and 1M
> +commits. With a 255-way fan out for chk pages, you need a 2 internal nodes,

s/need a/need/
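For what it's worth, the fan-out claim checks out. A quick sketch (illustrative Python, not bzr code; it assumes each page holds up to fan_out entries/children, which is a simplification of the real chk map layout):

```python
import math

def tree_shape(num_entries, fan_out=255):
    """Return the page count per level, leaves first, for a
    fan_out-way tree holding num_entries leaf entries."""
    levels = []
    pages = math.ceil(num_entries / fan_out)  # leaf pages
    levels.append(pages)
    while pages > 1:
        pages = math.ceil(pages / fan_out)    # next internal level
        levels.append(pages)
    return levels

print(tree_shape(1_000_000))  # [3922, 16, 1]
```

So 1M files gives ~3922 leaf pages under 16 internal nodes plus a root, i.e. two internal levels, matching the text.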

> +As a point of reference, one of the largest projects today OOo, has only 170k
> +revisions, and something less than 100k files (and probably 4-5 changes per
> +commit, but their history has very few merges, being a conversion from CVS).
> +At 100k files, they are probably just starting to hit 2-internal nodes, so
> +they would end up with 10 pages per commit (as an fair-but-high estimate), and
> +at 170k revs, that would be 1.7M chk nodes.

s/as an fair/as a fair/
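The 1.7M figure is just the two quoted estimates multiplied out, e.g.:

```python
# Back-of-envelope from the quoted text, not measured data.
pages_per_commit = 10       # "fair-but-high estimate" at ~100k files
revisions = 170_000         # OOo history size
print(pages_per_commit * revisions)  # 1700000 chk nodes
```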

Personally, I think projects like OOo and Mozilla are *far* more interesting to
optimise for than the 1M x 1M case. After all, there's a good argument that
projects like OOo really ought to be restructured into smaller libraries pulled
together using nested trees.

> +To get the smallest index possible, we store only a 2-byte 'record indicator'
> +inside the index, and then assume that it can be decode once we've read the

s/decode/decoded/
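One way to picture the 2-byte record indicator (a hypothetical illustration only; the actual field split and meaning in the proposed format may differ):

```python
import struct

def pack_indicator(group_number, flags):
    """Pack a tiny per-record hint into 2 bytes; full decoding is
    deferred until the group itself is read. The 12-bit/4-bit split
    here is invented for illustration."""
    assert 0 <= group_number < 4096 and 0 <= flags < 16
    return struct.pack('>H', (group_number << 4) | flags)

def unpack_indicator(data):
    (value,) = struct.unpack('>H', data)
    return value >> 4, value & 0xF

print(unpack_indicator(pack_indicator(1234, 5)))  # (1234, 5)
```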

> +If we change the analysis
> +

You either need to finish this sentence or remove it.

Ian C.
