Prototype "improved_chk_index"

John Arbash Meinel john at arbash-meinel.com
Fri Oct 30 01:59:34 GMT 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
> John Arbash Meinel wrote:
> 
>> So I think we *could* do better about size if we want to put a fairly
>> significant amount of effort into it. The "easy" fixes would be:
>>
>> 1) Move the text content into .cix, and then only have the per-file
>>    graph available in .tix. (Allowing us to remove the 'value' field),
>>    saving about 1.25:1
>> 2) Fix the per-file graph for root nodes to not require a node for every
>>    revision that came from a non-rich-root source. That saves another
>>    1.28:1 for a total 1.6:1 space savings in .tix
>> 3) Think about some way to combine .rix and .iix. Possibly just dropping
>>    the inventory records entirely. We talked about doing that in the
>>    past. The most significant issue is stacked branches needing the
>>    'parent inventories but *not* the parent revisions'. Though we could
>>    do that with a simple flag in the index that said "this revision not
>>    considered 'present'"...
>>    This is 2.3MB of the 30MB in indexes for LP, so <10% total space. But
>>    becoming a more significant fraction if we shrink .cix and .tix.
> 
> As another data point, the FireFox 3.5 import shows:
> 
> * 123M pack file
> * 13M indices
> * 11M checkout/dirstate
> * 4.1M checkout/merge-hashes
> 
> The index sizes are:
> 
> 6.1M	.bzr/repository/indices/43a941041bdf68b431bc9c73b9004fd1.cix
> 1.2M	.bzr/repository/indices/43a941041bdf68b431bc9c73b9004fd1.iix
> 1.2M	.bzr/repository/indices/43a941041bdf68b431bc9c73b9004fd1.rix
> 4.0K	.bzr/repository/indices/43a941041bdf68b431bc9c73b9004fd1.six
> 4.3M	.bzr/repository/indices/43a941041bdf68b431bc9c73b9004fd1.tix
> 
> Looking inside the matching .git import (after pack -adf --window=250):
> 
> 4.0K	.git/branches
> 4.0K	.git/COMMIT_EDITMSG
> 4.0K	.git/config
> 4.0K	.git/description
> 4.0K	.git/HEAD
> 48K	.git/hooks
> 4.1M	.git/index
> 16K	.git/info
> 88K	.git/logs
> 123M	.git/objects
> 332K	.git/refs
> 
> .git/objects is the matching pack file.
> 
> So head-to-head, both tools have a 123M pack file. Beyond the pack file,
> git's overheads are 4.1M and ours are 28.1M. That certainly suggests we
> have room for improvement in this area.
> 
> Ian C.
> 

Not quite. ".git/index" is actually the staging area, aka 'checkout'.
.git/objects/**/*.idx are the index files, and .git/objects/**/*.pack
are the pack files.

My guess is that, of that 123M, the .pack file is >118MB and the .idx
files are ~5MB. ISTR that git index files scale at about 24-28 bytes per
sha. Each entry is a sha-hash and an offset in the .pack file. The
compression parents are in the data stream, not the index. I don't know
if the 8-byte offset is always triggered in newer versions, or only if
the .pack is big enough.
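
As a rough sketch of that arithmetic (the function names are made up for
illustration, and the 24-28 bytes per entry is only my recollection, not
something read from the git format docs), you could estimate it like this:

```python
# Back-of-the-envelope estimate of git .idx size from the object count.
# Assumption: each indexed object costs a fixed number of bytes (a sha
# plus an offset, possibly a little more), somewhere in the 24-28 range.
# Any fixed header/trailer overhead in the file is ignored here.

def estimate_idx_size(num_objects, bytes_per_entry):
    """Return the approximate .idx size in bytes."""
    return num_objects * bytes_per_entry


def objects_for_idx_size(idx_bytes, bytes_per_entry):
    """Return roughly how many objects fit in an .idx of idx_bytes."""
    return idx_bytes // bytes_per_entry


if __name__ == "__main__":
    guessed_idx_bytes = 5 * 1024 * 1024  # the ~5MB guess above
    for per_entry in (24, 28):
        count = objects_for_idx_size(guessed_idx_bytes, per_entry)
        print("%d bytes/entry -> about %d objects" % (per_entry, count))
```

So a ~5MB index would correspond to something on the order of 200k
objects, if the per-entry guess is right.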

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkrqSIYACgkQJdeBCYSNAAMyDACcCyoxf8eN15gxnq64Nw2HX+d5
P74AmwcoazKLD5vUZE87Ge0KyiAULKqN
=K4OY
-----END PGP SIGNATURE-----


