Compressing weaved revisions?

John A Meinel john at arbash-meinel.com
Fri Sep 30 23:34:47 BST 2005


Aaron Bentley wrote:
> Hi all,
> 
> For trees with, e.g. 500 revisions, the revision storage may actually be
> larger than tree storage.
> 
> The new format doesn't weave revisions, because they're almost
> completely different for each revision.  They also don't compress well.
>  For my example tree, the revisions are 47 k uncompressed and 32 k
> compressed.
> 
> But if they're tarred and gzipped, the tarfile is 7.2k, and if they're
> tarred and bzipped, that's 5.7k.
> 
> I expect if revisions were weaved and gzipped, we'd see roughly the same
> level of compression, i.e. 6x
> 
> They would have the disadvantages of weaves, but we've already accepted
> those disadvantages for texts and inventories.
> 
> Aaron

Well, just to get the numbers straight, I went ahead and created a 
plugin which can turn the current revision-store into a weave file. It 
requires that the branch be upgraded, simply because I trusted the 
new-format weave code more than the old weave code.

I also tried my append-only weave code, in 2 different flavors.

The summary is that a weave isn't much bigger than the raw text bytes, 
especially if we reformat the revisions to put committer, timezone, 
etc. on lines by themselves. It doesn't compress quite as well as a 
tarball, but you can still get down to about 1/3rd the size.


The plugin is available from:
http://bzr.arbash-meinel.com/plugins/revstore2weave/

Anyway, here are the numbers, for 1782 revisions:
$ du -ks .bzr/revision-store/
7255    .bzr/revision-store/
$ du -ks --apparent .bzr/revision-store/
879     .bzr/revision-store/

$ ll revisions.*
   859138 revisions.zip

  1904640 revisions.tar
   246171 revisions.tar.gz
   163373 revisions.tar.bz2

   982332 revisions.weave-plain
   292994 revisions.weave-plain.gz
   217492 revisions.weave-plain.bz2

   897148 revisions.weave-reformat
   298318 revisions.weave-reformat.gz
   214849 revisions.weave-reformat.bz2

  1011324 revisions.weave-append
   294991 revisions.weave-append.bz2
   344249 revisions.weave-append.gz

   989592 revisions.weave-append-reuse
   292164 revisions.weave-append-reuse.bz2
   341579 revisions.weave-append-reuse.gz


The .tar is just a tarball of the revision files, compressed as normal.
I used two formats for the weave file. The -plain variant stores the 
revision XML as-is. The -reformat variant inserts a newline before each 
attribute of the element.
For example:
<revision committer="John A Meinel" revision_id="XXYYZZ" ...>
becomes
<revision
  committer="John A Meinel"
  revision_id="XXYYZZ"
etc.
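The reformatting step is just a textual transformation of the opening tag. As a rough sketch (not the plugin's actual code), something like this regex-based pass would do it:

```python
import re

def reformat_revision(text):
    # Hypothetical sketch: put each attribute of the opening <revision ...>
    # tag on its own line, so that lines for committer, timezone, etc. can
    # be shared between adjacent revisions in the weave.
    def split_attrs(match):
        attrs = re.findall(r'\S+="[^"]*"', match.group(1))
        return '<revision\n  ' + '\n  '.join(attrs)
    return re.sub(r'<revision ([^>]*)', split_attrs, text)
```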

The -append is my append-only weave code, and -append-reuse is a version 
which keeps a mapping between every line that has existed and its 
number; if it sees the same line twice (even if it was created in 
different history), it reuses the existing number (this is possible 
because all lines are looked up by number).
In general, you can see that reuse doesn't help much. For the revisions, 
there are usually a bunch of commits in a row by the same person, so the 
reuse only helps during the transition periods. The extra notation that 
my append format uses easily drowns out any reuse advantage.
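The reuse idea can be sketched roughly like this (hypothetical code, not the actual append-weave implementation): every distinct line is interned once, and each version is stored as a list of line numbers, so a repeated committer line costs nothing extra:

```python
class LineStore:
    # Hypothetical sketch of the "reuse" mapping: every distinct line
    # gets a number the first time it is seen; identical lines (even
    # from unrelated history) share that number thereafter.
    def __init__(self):
        self._numbers = {}   # line text -> line number
        self._lines = []     # line number -> line text

    def intern(self, line):
        if line not in self._numbers:
            self._numbers[line] = len(self._lines)
            self._lines.append(line)
        return self._numbers[line]

    def add_version(self, lines):
        # A version is just a list of line numbers.
        return [self.intern(l) for l in lines]
```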

From the above, you can see that reformatting helps slightly (982k vs 
897k), but not much.
Compression changes it from ~900k down to 300k, so not quite 6x, more 
like 3x.

However, the actual disk storage (apparent size) is 879k, which is 
smaller than either of the weaves, and smaller than the tar (remember, 
tar uses 512-byte blocks and pads with null bytes).
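That tar overhead is easy to estimate: each member gets a 512-byte header, and its data is rounded up to a 512-byte boundary (this sketch ignores the trailing end-of-archive blocks):

```python
def tar_padded_size(n):
    # Rough per-member cost in a tar archive: a 512-byte header plus
    # the data rounded up to a whole 512-byte block. With 1782 small
    # revision files, this is why the raw tarball (1904640 bytes) is
    # so much bigger than the apparent size of the files themselves.
    return 512 + ((n + 511) // 512) * 512
```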

So reformatting the weaves (which allows reuse of the committer and 
timezone lines) basically cancels out the extra overhead of using a 
weave versus plain texts packed next to each other. Compressing gives 
us another 3x.

Using a zipfile is another reasonable possibility, as the file is 
indexed, giving us decent access time to individual entries. You can 
also append to it (though if you crash in the middle, most readers will 
consider the whole file corrupted).

I think the biggest reason we didn't switch, though, is that the time 
to add a new version is at least O(N). Certainly the current code has 
to read the whole file, then scan through all the lines (keeping a 
stack of where we are) so that it can reproduce the ancestor texts, 
then diff against them, and finally insert into a list (which is a 
contiguous array). Does anyone know of a Python linked-list 
implementation?
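There is no standard-library linked list that would help here, but a minimal sketch of the kind of structure that makes insertion-after-a-known-node O(1) is easy enough:

```python
class Node:
    # A singly linked list node; insertion after a known node is O(1),
    # unlike list.insert() on Python's contiguous-array lists, which
    # has to shift every following element.
    __slots__ = ('line', 'next')

    def __init__(self, line, next=None):
        self.line = line
        self.next = next

def insert_after(node, new_line):
    # Splice a new node in after `node` without moving anything else.
    node.next = Node(new_line, node.next)
    return node.next

def to_list(head):
    # Walk the chain for inspection.
    out = []
    while head is not None:
        out.append(head.line)
        head = head.next
    return out
```

The trade-off is that you lose O(1) indexed access, so the rest of the weave code would need to work from node references rather than line indices.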

I don't know about Aaron's weavediff format, which lets you read only 
portions of the file rather than unpacking the entire weave into 
memory. It could certainly do better than O(N) for inserting a new 
entry.

John
=:->