split-inventory auto-pack doesn't get rid of '.cix'

John Arbash Meinel john at arbash-meinel.com
Mon Dec 8 15:03:50 GMT 2008


John Arbash Meinel wrote:
> 
> ...
> 
>> More critically, I'm seeing a *lot* of waste in the pack files. I went
>> ahead and hacked in some code to give the size of the packs being packed
>> versus the size of the final pack. And I'm attaching a trimmed log file.
> 
>> The key parts are lines like this:
>> 35166.635  Auto-packing ... which has 20 pack files, containing 38000
>> 35184.372  Auto-packing ... completed 101.269MB => 14.456MB
>> 36581.414  Auto-packing ... which has 21 pack files, containing 39000
>> 36601.117  Auto-packing ... completed 117.795MB => 17.208MB
> 
>> That means that in those 10 pack files that we decided we needed to
>> recompact, we had 86% waste.
> 
>> I don't know exactly why yet. Whether it was because we weren't applying
>> deltas properly, so it was causing us to rebuild the entire inventory
>> each time (which obviously has mostly overlap with the previous
>> inventories), or whether something else weird was happening.
> 
> 
> So I've now changed my mind as to the cause of the wasted space. What is
> happening is that the new differing-serializer-fetch code doesn't worry
> about copying texts multiple times. Specifically the code is:
...


> We could also use an LRUCache here, because while we don't like copying
> twice, it isn't terrible if we do, and we are mostly only worried about
> copying recent texts multiple times anyway.
> 
> 
> (This also explains why autopack and plain pack shrink things, because
> if there are multiple copies of a node, it just removes one.)
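The LRU idea above can be sketched as follows. This is only an illustration of the approach, not bzrlib's actual LRUCache or pack-copy code; all names here are made up:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU set of recently-seen keys (illustrative stand-in)."""
    def __init__(self, max_size=10000):
        self._cache = OrderedDict()
        self._max_size = max_size

    def add(self, key):
        self._cache[key] = True
        self._cache.move_to_end(key)
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict the oldest key

    def __contains__(self, key):
        return key in self._cache

def copy_texts(text_keys, seen):
    """Copy only keys not recently seen.  Old keys that have fallen out
    of the cache may still be copied twice, which is acceptable waste:
    we mostly only care about not re-copying *recent* texts."""
    copied = []
    for key in text_keys:
        if key in seen:
            continue
        copied.append(key)  # stand-in for the actual pack copy
        seen.add(key)
    return copied
```

So a recent duplicate is skipped cheaply, while a key evicted from the cache just gets copied again and cleaned up by the next autopack.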


So it turns out that this was indeed part of the problem, but my
original assumption still holds true *as well*.

My recent InterDifferingSerializer patch fixes the text key problem by
doing:

if entry.revision == current_revision_id:
  text_keys.add((file_id, entry.revision))
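A sketch of where that check sits; the loop structure and names around it are my reconstruction, not the actual patch:

```python
def gather_text_keys(inventories):
    """Collect (file_id, revision) text keys, but only for texts that
    were introduced in the revision being fetched (a reconstruction of
    the patch's intent, not the real bzrlib code).

    `inventories` is an iterable of (revision_id, entries) pairs, where
    entries is a list of (file_id, entry_revision) pairs.
    """
    text_keys = set()
    for current_revision_id, entries in inventories:
        for file_id, entry_revision in entries:
            # Without this check, every fetched inventory re-adds the
            # keys of all its unchanged files, so each text gets queued
            # for copying many times over.
            if entry_revision == current_revision_id:
                text_keys.add((file_id, entry_revision))
    return text_keys
```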

However, even with that fix, I see a significant amount of wasted space.

Without that fix, in the first 1000 revisions of mysql, I see something
like 32MB => 21MB. With the text-key fix it changes to 29MB => 21MB.

I added a debug statement that logged the size whenever it hit a
duplicate key. In the first 1k revs, I see 20,251 duplicate chk pages,
amounting to 8MB (which matches what we see above). The individual
values are all quite small (175 to 1,372 bytes, averaging 400 bytes);
there are just 20k of them.
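The accounting that debug statement does can be approximated like this (a hedged sketch; the function and its inputs are hypothetical, not the instrumented bzrlib code):

```python
def count_duplicates(pages):
    """Count duplicate chk page keys and the bytes they waste.

    `pages` is an iterable of (key, size_in_bytes) pairs, in the order
    the pages would be written into the pack file.
    """
    seen = set()
    dup_count = 0
    dup_bytes = 0
    for key, size in pages:
        if key in seen:
            # This page was already written once; every further copy
            # is pure waste that autopack later has to squeeze out.
            dup_count += 1
            dup_bytes += size
        else:
            seen.add(key)
    return dup_count, dup_bytes
```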


I believe the fundamental cause is the loop around
add_inventory_by_delta, which always uses the previous tree as the
basis for the next tree, even when it isn't a parent. We *do* sort in
topological order, so some of the time it is a direct parent.
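Schematically, that loop looks something like this (names and data shapes are my invention; deletions are ignored to keep the sketch short):

```python
def convert_inventories(revisions, trees):
    """Sketch of the add_inventory_by_delta loop: each inventory is
    stored as a delta against whichever tree came immediately before
    it in topological order, whether or not that tree is a parent.

    `revisions` is a topologically sorted list of revision ids;
    `trees` maps revision id -> {path: value}.
    """
    basis_id = None
    deltas = []
    for rev_id in revisions:
        basis = trees.get(basis_id, {})
        current = trees[rev_id]
        # Entries added or changed relative to the (possibly
        # unrelated) basis tree.
        delta = {k: v for k, v in current.items() if basis.get(k) != v}
        deltas.append((basis_id, rev_id, delta))
        basis_id = rev_id  # the previous tree becomes the next basis
    return deltas
```

When the basis happens not to be a parent, the delta re-records content the real parent already carried, which is where the duplicate pages come from.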

It doesn't help that we have 2 maps per inventory, or that the average
tree depth is around 8 nodes. So if a change to a Leaf node collides,
you are likely to collide at every internal node on the way up the tree.

I don't believe we can prevent it, because of cases like this:

A
|\
B C
|/
D

Imagine B and C each add a different file. Then whether you use B or C
as the basis for D, you'll add one new file versus that basis, and you
are likely to end up with an identical leaf either way.
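This collision can be simulated by content-addressing the leaves, the way chk pages are keyed by the hash of their content (the serialization and names below are a simplified sketch, not the real chk_map format):

```python
import hashlib

def leaf_key(entries):
    """Content-address a leaf: identical content gives the identical
    page key, no matter which basis the delta was applied against."""
    payload = '\n'.join('%s %s' % (name, value)
                        for name, value in sorted(entries.items()))
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

# A is the common ancestor; B and C each add a different file.
a = {'README': 'rev-a'}
b = dict(a, **{'new-b': 'rev-b'})  # B adds new-b
c = dict(a, **{'new-c': 'rev-c'})  # C adds new-c

# D merges both additions.  Building D's leaf from basis B or from
# basis C produces byte-identical content, hence the same page key,
# and the same page gets written into the pack twice.
d_from_b = dict(b, **{'new-c': 'rev-c'})
d_from_c = dict(c, **{'new-b': 'rev-b'})
```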

So at this point I'm going to focus on something else, though it still
concerns me a bit. I'll do a bit more work on the conversion and see
whether we are still around 80% waste, or whether it has dropped now
that we don't duplicate the texts.

John
=:->



