split-inventory auto-pack doesn't get rid of '.cix'

John Arbash Meinel john at arbash-meinel.com
Thu Dec 4 17:56:02 GMT 2008


...

> More critically, I'm seeing a *lot* of waste in the pack files. I went
> ahead and hacked in some code to give the size of the packs being packed
> versus the size of the final pack. And I'm attaching a trimmed log file.
> 
> The key parts are lines like this:
> 35166.635  Auto-packing ... which has 20 pack files, containing 38000
> 35184.372  Auto-packing ... completed 101.269MB => 14.456MB
> 36581.414  Auto-packing ... which has 21 pack files, containing 39000
> 36601.117  Auto-packing ... completed 117.795MB => 17.208MB
> 
> That means that in those 10 pack files that we decided we needed to
> recompact, we had 86% waste.
> 
> I don't know exactly why yet. Whether it was because we weren't applying
> deltas properly, so it was causing us to rebuild the entire inventory
> each time (which obviously has mostly overlap with the previous
> inventories), or whether something else weird was happening.
> 

So I've now changed my mind about the cause of the wasted space. What is
happening is that the new differing-serializer fetch code doesn't worry
about copying texts multiple times. Specifically, the code is:

basis_tree = FIRST_TREE
for tree in self.source.revision_trees(revisions):
  # Delta against whatever tree came immediately before in the
  # topological order -- not necessarily this revision's parent.
  delta = tree.inventory._make_delta(basis_tree.inventory)
  for old_path, new_path, file_id, entry in delta:
    ...
    # Queue the text for copying, with no check for whether this
    # (file_id, revision) key has already been copied.
    text_keys.add((file_id, entry.revision))
  ...
  basis_tree = tree
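
To make the ordering issue concrete, here is a toy sketch (the names
and the sort helper are invented for illustration, not bzrlib code) of
a history where a topological sort puts two siblings next to each
other, so the loop above would delta one against the other rather than
against its parent:

```python
# Toy illustration (invented names, not bzrlib code) of why a
# topological sort does not keep children next to their parents.

# History: 'A' is the root; 'B' and 'C' are both children of 'A'.
parents = {'A': [], 'B': ['A'], 'C': ['A']}

def topo_sort(parents):
    """Return the revisions with every parent before its children."""
    order = []
    done = set()
    def visit(rev):
        if rev in done:
            return
        for parent in parents[rev]:
            visit(parent)
        done.add(rev)
        order.append(rev)
    for rev in sorted(parents):
        visit(rev)
    return order

order = topo_sort(parents)
# 'C' follows 'B' in the result, so the fetch loop would delta C's
# inventory against its sibling B instead of its parent A, touching
# many texts that were already copied for earlier trees.
```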

Revisions are sorted topologically, so it is somewhat likely that a
child will follow directly after its parent, but that is by no means
guaranteed. So the deltas can, essentially, be against arbitrary
ancestors. Because we don't apply the filter "if entry.revision ==
current_revision_id", we can easily queue lots of already-copied text
keys again. We could keep a "copied_text_keys = set()" to suppress the
duplicates, though that grows without bound. Then again, I don't know
that it would take a huge amount of space: there are 160k text keys
after 39k revisions of mysql. If the file_id and revision_id strings
are interned, each entry is really only the overhead of a tuple, which
is approximately 30 bytes, or about 4.8MB in total.
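
The unbounded-set idea could be sketched like this; the function name
and the simplified delta tuples are my own stand-ins, not bzrlib's
real API:

```python
# Sketch of suppressing duplicate text keys with an unbounded set.
# Each delta is modeled as a list of (old_path, new_path, file_id,
# entry_revision) tuples, a simplification of the real inventory delta.

def collect_text_keys(deltas):
    """Yield each (file_id, revision_id) text key at most once."""
    copied_text_keys = set()  # grows without bound, ~30 bytes per key
    for delta in deltas:
        for old_path, new_path, file_id, entry_revision in delta:
            key = (file_id, entry_revision)
            if key in copied_text_keys:
                continue  # already queued once; skip the duplicate
            copied_text_keys.add(key)
            yield key
```

With two deltas that both mention the same (file_id, revision) pair,
the second mention is skipped instead of being copied again.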

We could also use an LRUCache here: while we don't like copying a text
twice, it isn't terrible if we occasionally do, and we are mostly
worried about copying recent texts multiple times anyway.
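
A bounded variant along those lines might look like the following
sketch, hand-rolled on OrderedDict rather than using bzrlib's actual
LRUCache class, with invented names:

```python
from collections import OrderedDict

# A bounded "recently copied" filter. Unlike the unbounded set, memory
# is capped; an evicted key may get copied twice, which is wasteful
# but harmless.

class RecentlyCopied:
    def __init__(self, max_size=10000):
        self.max_size = max_size
        self._keys = OrderedDict()

    def seen(self, key):
        """Return True if key was seen recently; otherwise record it."""
        if key in self._keys:
            self._keys.move_to_end(key)  # refresh recency
            return True
        self._keys[key] = None
        if len(self._keys) > self.max_size:
            self._keys.popitem(last=False)  # evict the least recent
        return False
```

The fetch loop would then skip a text key when seen() returns True,
trading a little duplicated copying for a fixed memory bound.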


(This also explains why autopack and plain pack shrink things so much:
when a node has been copied multiple times, repacking keeps only one
copy.)

John
=:->


