[RFC] switching to content-based storage for texts

Fri Mar 20 02:18:43 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

With the new compression results in hand, I've been exploring our storage
a bit. One surprise is how often we have duplicate content in our
repository. As an example, the launchpad tree:

>>> from bzrlib import branch
>>> b = branch.Branch.open('launchpad-chk255big/devel')
>>> b.lock_read()
>>> r = b.repository
>>> k = r.texts.keys()
>>> len(k)
241442
>>> s1 = r.texts.get_sha1s(k)
>>> len(s1)
241442
>>> unique_s1 = set(s1.itervalues())
>>> len(unique_s1)
141231

If I understand that correctly, there are 241k (file_id, revision_id)
pairs, but they only define 141k unique pieces of content, i.e. about 58%
of the texts are unique. Now, if the duplicate texts end up close together
in the GC stream, they probably compress well, and our sort order makes
that likely.
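
To make the counting concrete, here is a small self-contained sketch of
the same calculation on toy data. The keys and contents are made up, and
`unique_fraction` is just a helper name for this example; `sha1_by_key`
stands in for what get_sha1s() returned above.

```python
import hashlib


def unique_fraction(sha1_by_key):
    """Return (total_keys, unique_texts, unique_texts / total_keys)."""
    total = len(sha1_by_key)
    unique = len(set(sha1_by_key.values()))
    return total, unique, float(unique) / total


# Hypothetical (file_id, revision_id) keys mapped to their content.
texts = {
    ('file-a', 'rev-1'): b'hello\n',
    ('file-a', 'rev-2'): b'hello world\n',
    ('file-b', 'rev-1'): b'hello\n',  # same content as file-a at rev-1
    ('dir-c', 'rev-1'): b'',          # directory entries have empty content
    ('dir-d', 'rev-1'): b'',
}
sha1_by_key = dict((key, hashlib.sha1(text).hexdigest())
                   for key, text in texts.items())
total, unique, fraction = unique_fraction(sha1_by_key)
```

Five keys with only three distinct contents is the same shape as the
launchpad numbers, just at a much smaller scale.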

However, I wonder whether this duplication is at least part of the reason
LZMA gives us 30% more compression (again using LP as the example:
123MB => 100MB).

I just realized that a large portion of these duplicates are probably
directory and symlink entries, since their content is always the empty
string. That doesn't change the delta size much, though, as encoding
"f\x00" is rather cheap.

I wrote some code so that, within a given group, if we have already seen
a text's sha, we skip storing it again and re-use the existing index
position. (This seemed fine and safe once I really thought about it.)
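
Roughly, the idea looks like the sketch below. This is not the actual
groupcompress code: `GroupBuilder`, `add_text`, `get_text` and the
(start, length) index positions are hypothetical names for illustration,
and a real group would delta-compress texts rather than just concatenate
them.

```python
import hashlib


class GroupBuilder(object):
    """Toy stand-in for one compression group (hypothetical, not bzrlib API)."""

    def __init__(self):
        self._chunks = []        # raw bytes appended to the group
        self._size = 0           # total bytes stored so far
        self._pos_by_sha1 = {}   # sha1 -> (start, length) already in the group
        self.index = {}          # key -> (start, length)
        self.deduped = 0         # how many texts re-used an existing position

    def add_text(self, key, text):
        """Record `key`, storing `text` only if its sha1 is new to this group."""
        sha1 = hashlib.sha1(text).hexdigest()
        pos = self._pos_by_sha1.get(sha1)
        if pos is None:
            # First time we see this content: append it and remember where.
            pos = (self._size, len(text))
            self._chunks.append(text)
            self._size += len(text)
            self._pos_by_sha1[sha1] = pos
        else:
            # Seen before: skip the bytes, point the index at the old copy.
            self.deduped += 1
        self.index[key] = pos
        return pos

    def get_text(self, key):
        """Return the bytes recorded for `key`."""
        start, length = self.index[key]
        return b''.join(self._chunks)[start:start + length]
```

Two keys with identical content end up sharing one index position, so the
second one costs an index entry but no bytes in the group.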

Testing it on 'bzr-gc255-big-nolabel' showed the size going from 2.8MB
down to 2.78MB.

I think there is something to be gained by having a special record for
empty entries, since they are actually pretty common (mostly because of
directories and directory renames, but probably also the occasional empty
file). I'll be poking at both a little bit more. The in-group dedup
dropped the LP size by only 3MB (123MB => 120MB), even though it reported:
  Total deduped = 43395, total empty = 56411
out of 241k texts.
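
As a sketch of what a special record for empty entries could look like,
the index could map every empty text to a sentinel position so that
nothing is written to the group for it at all. `EMPTY`, `place_texts` and
the stats keys are assumptions made up for this example, not existing
code, and the counters only loosely mirror the "deduped" and "empty"
totals above.

```python
import hashlib

# Hypothetical sentinel position shared by every empty text. A non-empty
# text always has length > 0, so it can never collide with this value.
EMPTY = (0, 0)


def place_texts(items):
    """Assign an index position to each (key, text) pair in one group.

    Returns (index, stats), where index maps key -> (start, length) and
    stats counts how many texts were stored, deduped, or empty.
    """
    index = {}
    pos_by_sha1 = {}
    offset = 0
    stats = {'stored': 0, 'deduped': 0, 'empty': 0}
    for key, text in items:
        if not text:
            # Empty content: no hashing, no bytes, just the sentinel.
            index[key] = EMPTY
            stats['empty'] += 1
            continue
        sha1 = hashlib.sha1(text).hexdigest()
        if sha1 in pos_by_sha1:
            # Duplicate content: re-use the earlier position.
            index[key] = pos_by_sha1[sha1]
            stats['deduped'] += 1
            continue
        pos = (offset, len(text))
        pos_by_sha1[sha1] = pos
        index[key] = pos
        offset += len(text)
        stats['stored'] += 1
    return index, stats
```

Counting the empty texts separately from the other duplicates also makes
it easy to see how much of the duplication is just directories.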

My guess is that the big win would be cross-group removal of duplicates,
since within a given group we probably already have great delta
compression. It also occurs to me that if two file_ids had identical
content at some point in their ancestry, they will likely compress well
together before and after that point...

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknC/QMACgkQJdeBCYSNAAOHCQCfWX+TCk+Xd06g569HyWeLHNbu
RYoAniLPR0IRkMngp/aCJoEy6PUqMCpc
=67hG
-----END PGP SIGNATURE-----