Compressing weaved revisions?
John A Meinel
john at arbash-meinel.com
Fri Sep 30 23:34:47 BST 2005
Aaron Bentley wrote:
> Hi all,
>
> For trees with, e.g. 500 revisions, the revision storage may actually be
> larger than tree storage.
>
> The new format doesn't weave revisions, because they're almost
> completely different for each revision. They also don't compress well.
> For my example tree, the revisions are 47 k uncompressed and 32 k
> compressed.
>
> But if they're tarred and gzipped, the tarfile is 7.2k, and if they're
> tarred and bzipped, that's 5.7k.
>
> I expect if revisions were weaved and gzipped, we'd see roughly the same
> level of compression, i.e. 6x
>
> They would have the disadvantages of weaves, but we've already accepted
> those disadvantages for texts and inventories.
>
> Aaron
Well, just to get the numbers straight, I went ahead and created a
plugin, which can turn the current revision-store into a weave file. It
requires that the branch be upgraded, simply because I trusted the
newformat weave code more than the old weave code.
I also tried my append-only weave code, in 2 different flavors.
The summary is that a weave isn't much bigger than the raw text bytes,
especially if we reformat them to put committer and timezone on a line
by themselves. It doesn't compress quite as well as a tarball, but you
can get down to about 1/3rd the size.
The plugin is available from:
http://bzr.arbash-meinel.com/plugins/revstore2weave/
Anyway, here are the numbers, for 1782 revisions:
$ du -ks .bzr/revision-store/
7255 .bzr/revision-store/
$ du -ks --apparent .bzr/revision-store/
879 .bzr/revision-store/
$ ll revisions.*
859138 revisions.zip
1904640 revisions.tar
246171 revisions.tar.gz
163373 revisions.tar.bz2
982332 revisions.weave-plain
292994 revisions.weave-plain.gz
217492 revisions.weave-plain.bz2
897148 revisions.weave-reformat
298318 revisions.weave-reformat.gz
214849 revisions.weave-reformat.bz2
1011324 revisions.weave-append
294991 revisions.weave-append.bz2
344249 revisions.weave-append.gz
989592 revisions.weave-append-reuse
292164 revisions.weave-append-reuse.bz2
341579 revisions.weave-append-reuse.gz
The .tar is just a tarball of the revisions, and then compressed as normal.
I used 2 formats for the weavefile. The -plain is just storing it as is.
The -reformat is to insert a newline before the space for each element.
For example:
<revision committer="John A Meinel" revision_id="XXYYZZ" ...>
becomes
<revision
committer="John A Meinel"
revision_id="XXYYZZ"
etc.
The -append is my append-only weaves, and the -append-reuse is a version
which keeps a mapping between lines that have existed, and their
numbers, and if it gets the same line twice (even if it was created in
different history), it reuses it (this is possible because all lines are
looked up by number).
In general, you can see that reuse doesn't help much. For the revisions,
usually there are a bunch of commits in a row by the same person, so the
reuse only helps the transition periods. The extra notation that my
append format uses easily drowns out any reuse advantages.
From the above, you can see that reformatting helps slightly (982k vs
897k), but not much.
Compression changes it from ~900k down to 300k, so not quite 6x, more
like 3x.
However, the actual disk storage is 879k which is smaller than either of
the weaves, and smaller than the tar (remember, tar uses 512byte blocks,
and pads with null).
So reformatting the weaves (which allows reuse of the committer and
timezone lines), basically cancels out the extra overhead of using a
weave versus just plain text packed next to eachother. Compressing gives
us another 3x.
using a zipfile is another reasonable possibility, as the file is
indexed, giving us a decent time to get to the entries. You can append
to it (though if you crash in the middle, most readers will consider it
completely corrupted).
I think the biggest reason we didn't switch, though, was because the
time to add a new version is somewhere in O(N). Certainly the current
code has to read the whole file, then scan through all the lines with a
stack of where we are so that it can reproduce the ancestors, and then
it needs to diff against them, and do an insert into a list (which is a
contiguous array.) Does anyone know of a python linked-list implementation?
I don't know about Aaron's weavediff format, which let you only read
portions of the file, rather than unpacking the entire weave into
memory. It certainly could be < O(N) for inserting a new entry.
John
=:->
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 253 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20050930/2398b06f/attachment.pgp
More information about the bazaar
mailing list