Prototype "improved_chk_index"

John Arbash Meinel john at arbash-meinel.com
Thu Oct 29 21:22:47 GMT 2009



Sorry about the double send of this email, my network went spotty right
as I was sending it, and my client claimed it didn't go through.

>> One thing to consider with the .tix index is what we want to end up
>> with. I can see indirecting via .cix for file content, but still we'd
>> want per file graphs, I think.
> 
>> -Rob
> 
> Yeah, I'm still trying to sort that out myself. I agree we want to keep
> the per-file graph, but I'm thinking we might consider storing it
> differently. I haven't really thought of it enough to give good answers
> here.
> 

...

> I can think of ways to compact the per-file ancestry. Like splitting it
> up by file-id, so you have a 'section' for a given file-id, so you
> don't have to repeat the actual file-id string. And a lookup table at
> the beginning for revision-ids, so they can be stored as simple
> integers. Then you can create the file-id graph as just a bunch of
> chained integers.
> 
> Also, if you change to store *just* the per-file graph, then you save
> the space for pointers into the .pack file. Instead, when streaming
> data, the sha hash for the file content is extracted from the chk
> pages, and sent as just more chk data. In my naive approach, we would
> still occasionally transmit redundant data (say a sha1 changed to X,
> then Y, then back to X; comparing against the parent would see 'X is
> new versus Y'), but not enough that I would worry about it.
> 
> John
> =:->
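
To make the quoted "split by file-id, revision-ids as integers" idea a
bit more concrete, here is a minimal Python sketch. It is purely
illustrative (the in-memory dicts and names are invented for the
example, and it ignores serialization entirely), not what any real
index format would need to look like:

  def compact_per_file_graph(per_file_parents):
      """Compact {(file_id, revision_id): parent_revision_ids} into a
      revision-id table plus one integer-only section per file-id."""
      rev_to_int = {}
      def intern_rev(revision_id):
          # Assign each revision-id a small integer, first-seen order.
          return rev_to_int.setdefault(revision_id, len(rev_to_int))
      sections = {}   # file_id -> [(rev_int, [parent_ints]), ...]
      for (file_id, revision_id), parents in sorted(per_file_parents.items()):
          entry = (intern_rev(revision_id),
                   [intern_rev(p) for p in parents])
          sections.setdefault(file_id, []).append(entry)
      # The revision-id strings are stored once, in this table; each
      # graph edge then costs a couple of small integers instead of a
      # full revision-id string.
      rev_table = sorted(rev_to_int, key=rev_to_int.get)
      return rev_table, sections

For the trivial graph {('f', 'r1'): (), ('f', 'r2'): ('r1',)} this
returns (['r1', 'r2'], {'f': [(0, []), (1, [0])]}).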

So I started a plugin to experiment with this:
  lp:~jameinel/+junk/bzr-per-file-graph

Basically, it lets me run:
  bzr build-per-file-graph --format=XXX ../repo output.idx

And then try a few different configurations to see how things compare.


1) Removing the 'value' field and *just* storing the
   (file_id,revision_id) keys was a fair win. For my bzr+plugins
   directory, it dropped the size roughly 4.3MB => 3.5MB.
   For Launchpad it was 9.8MB => 7.8MB.

2) Removing the "TREE_ROOT" entries is also a big win here. We talked
   about it before; removing all of them gives:
     bzr 3.5MB => 2.5MB
     lp  7.8MB => 6.1MB

3) Creating 1 BTreeGraphIndex for each file-id is a rather large loss. I
   think this is primarily because the indexes for most files are tiny.
   You lose any cross-graph compression, and *gain* a lot of Header
   overhead. (Evidenced by the fact that running 'gzip' over the final
   index saves a lot of space)
         btree  chained  chained+gz
     bzr 3.5MB  4.3MB    3.5MB
     lp  7.8MB  11MB     8.1MB

4) Just creating a btree holding the (file_id, revision_id) pairs
   without their parent pointers is:
         btree  noparent
     bzr 3.5MB  2.4MB
     lp  7.8MB  5.1MB

   Creating a btree that just holds the (file_id,) and (revision_id,)
   lists is:
     bzr 0.8MB
     lp  1.3MB

5) Back-of-the-envelope for a possible binary form
   a) Some amount of overhead to map each file-id string to an integer,
      and each revision-id to an integer.
      For my two repos:
              unique file_ids   unique rev_ids   (file_id, revision_id) keys
      bzr+    5,120             38,817           119,640
      lp      11,491            65,919           297,118

      From the above, the btree for bzr would be ~0.8MB + integer cost,
      and lp would be 1.3MB + integer cost. We might even want a
      bi-directional mapping for it to be really useful, which would
      double that cost.

   b) Actually storing the graph information. If we used fixed-width
      fields, we would need 2 bytes for the file-id, and 2-3 bytes for
      the revision-id. We also need a way to indicate how many parents
      are present for a given entry; we could do that in 1 byte.
      So each row would be 2+3+1+N*(2+3) bytes. I assume the average
      number of parents is actually pretty close to 1, which makes that
      11 bytes per row.
      bzr has 119640 text keys, lp has 297118.

      So bzr would be ~1.2MB and lp would be ~3.1MB. If you then add
      back in the mapping from id => integer, you get (a short script
      reproducing these numbers follows after 5c):
        bzr  > 2.0MB   (> 2.8MB with a bi-directional map)
        lp   > 4.4MB   (> 5.7MB with a bi-directional map)

   c) So if we want to avoid loading the whole id=>int mapping, we need
      the bidirectional map, and we save very little versus a regular
      index.
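
   (Back-of-the-envelope only: the short script below just recomputes
   the 5b estimate from the key counts and the measured mapping sizes
   above; the 2/3/1 byte widths and the "about one parent per entry"
   guess are the same assumptions as in 5b. It lands within a tenth of
   a MB of the figures quoted there.)

     MB = float(2 ** 20)

     def estimate(num_text_keys, mapping_mb, file_id_bytes=2,
                  rev_id_bytes=3, parent_count_bytes=1, avg_parents=1):
         # Each row: file-id int + revision-id int + parent count,
         # plus one (file-id, revision-id) integer pair per parent.
         row_bytes = (file_id_bytes + rev_id_bytes + parent_count_bytes
                      + avg_parents * (file_id_bytes + rev_id_bytes))
         graph_mb = num_text_keys * row_bytes / MB
         # Add the id => integer mapping once, or twice for a
         # bi-directional map.
         return graph_mb, graph_mb + mapping_mb, graph_mb + 2 * mapping_mb

     for name, text_keys, mapping_mb in [('bzr+', 119640, 0.8),
                                         ('lp', 297118, 1.3)]:
         graph, one_way, bidi = estimate(text_keys, mapping_mb)
         print('%-4s graph=%.1fMB +map=%.1fMB +bidi=%.1fMB'
               % (name, graph, one_way, bidi))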

6) Rethink it a bit. What about a global index using a small-integer
   mapping? We would map all possible revision-ids and all possible
   file-ids into small integers that are shared between all of the
   indexes, or possibly merge all the indexes into one larger index.

   a) Currently there is a lot of overlap in the content of both the
      revision index and the inventory index. They both repeat all the
      same keys and just point to different offsets in the pack file.
      Signatures would have mostly the same set of keys, though they
      don't have the ancestry.
      The file-texts index is unique in that its 'key width' is
      different.
      Conceptually we could even merge revisions, inventories and
      signatures into a single index that points at 3 offsets. Probably
      hard to do internally, though.

   b) If you had a single index mapping long names to short names (aka
      ints), it could be shared between all of the indexes for that pack
      file. You could get a bi-directional mapping for Launchpad in
      about 2MB, and then all indexes could be shrunk. (3 bytes per
      revision-id maps to about 0.5MB for a revision-id index, down from
      the current 2.3MB.) A rough sketch of such a mapping follows after
      this list.

   c) You could map things through a hash. Then one half of the mapping
      is implied by the hash function (file_id => hash), and the other
      direction is stored in an index. You could walk the ancestry in
      hashes, and then look up all associated revision-ids with a single
      multi-hash lookup. And the hash size could be 'shrunk' to 8 bytes,
      etc.

   d) These generally feel like "ETOOCOMPLEX". I think part of the issue
      is that we allow 'free-form' file-ids and revision-ids, which
      means there is no real upper bound on length or complexity.

      I will note that when looking at the simple btree index without
      values, 7.8MB for 297k keys is 26.3 bytes per entry, which makes
      it *cheaper* than an equivalent sha-hash to parent-sha-hashes
      scheme. (hg's fixed 2-parent sha hashes would be 60 bytes per
      entry, though you could apply the same '10-byte minimum' and drop
      it to 30 bytes; with some sort of 'expect 1 parent' you could drop
      it to something like 20 bytes, or 16 if you went for 8-byte
      hashes.)
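
   (Illustrative only: a rough sketch of the shared "long name => small
   integer" mapping from 6b, plus the truncated-hash variant from 6c.
   The class and function names are invented for the example and have
   nothing to do with bzrlib's real index classes.)

     import hashlib
     import struct

     class NameMap(object):
         """Bi-directional id <=> small-int map, shared per pack file."""
         def __init__(self):
             self._name_to_int = {}
             self._int_to_name = []
         def intern(self, name):
             idx = self._name_to_int.get(name)
             if idx is None:
                 idx = len(self._int_to_name)
                 self._name_to_int[name] = idx
                 self._int_to_name.append(name)
             return idx
         def name(self, idx):
             return self._int_to_name[idx]

     def pack_refs(name_map, names, width=3):
         """Encode ids as fixed-width big-endian integers (6b)."""
         # 3 bytes is enough for ~16.7 million distinct ids per pack.
         return b''.join(struct.pack('>I', name_map.intern(n))[-width:]
                         for n in names)

     def short_hash(name, width=8):
         """Truncated hash of an id (6c): the name => hash direction is
         implied by the hash function; only hash => name needs storing."""
         return hashlib.sha1(name).digest()[:width]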


So I think we *could* do better on size if we want to put a fairly
significant amount of effort into it. The "easy" fixes would be:

1) Move the text content into .cix, and then only have the per-file
   graph available in .tix. That lets us remove the 'value' field,
   saving about 1.25:1.
2) Fix the per-file graph for root nodes to not require a node for
   every revision that came from a non-rich-root source. That saves
   another 1.28:1, for a total 1.6:1 space savings in .tix. (A rough
   sketch of both fixes follows after this list.)
3) Think about some way to combine .rix and .iix. Possibly just dropping
   the inventory records entirely. We talked about doing that in the
   past. The most significant issue is stacked branches needing the
   'parent inventories but *not* the parent revisions'. Though we could
   do that with a simple flag in the index that says "this revision is
   not considered 'present'"...
   This is 2.3MB of the 30MB in indexes for LP, so <10% of total space,
   but it becomes a more significant fraction if we shrink .cix and .tix.
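
(For fixes 1 and 2, the sketch below shows the shape of the filtering,
over generic (key, value, references) index nodes rather than bzrlib's
real serialized form. Note that it drops *every* root node, as in
experiment 2 earlier in this mail; the real fix would only skip the
nodes synthesized from non-rich-root sources.)

  def slim_text_index_nodes(nodes, root_id='TREE_ROOT'):
      """Yield .tix nodes with no value and no root entries.

      nodes: iterable of ((file_id, revision_id), value, references)
      """
      for key, value, references in nodes:
          file_id, revision_id = key
          if file_id == root_id:
              # Fix 2 (approximated): don't store root-graph nodes.
              continue
          # Fix 1: the pack offsets move to .cix, so drop the value.
          yield key, '', references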

John
=:->


