Compressing weaved revisions?

Sat Oct 1 02:49:40 BST 2005

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John A Meinel wrote:
> Well, just to get the numbers straight, I went ahead and created a
> plugin, which can turn the current revision-store into a weave file.

Cool.  Thanks for testing this.

> The summary is that a weave isn't much bigger than the raw text bytes,
> especially if we reformat them to put committer and timezone on a line
> by themselves. It doesn't compress quite as well as a tarball, but you
> can get down to about 1/3rd the size.

Hrm.  Odd.
> The -reformat is to insert a newline before the space for each element.

Disappointing that didn't give more compression.  It might do really
well on inventories, though.

> For example:
> <revision committer="John A Meinel" revision_id="XXYYZZ" ...>
> becomes
> <revision
>  committer="John A Meinel"
>  revision_id="XXYYZZ"
> etc.
> 
> The -append is my append-only weaves, and the -append-reuse is a version
> which keeps a mapping between lines that have existed, and their
> numbers, and if it gets the same line twice (even if it was created in
> different history), it reuses it (this is possible because all lines are
> looked up by number).

Maybe I should mention that weave merge works on the principle that
lines have three states: unborn -> alive -> dead.  If you're
resurrecting lines, you'd probably need another layer (to treat
resurrected lines as new lines) to do weave-merge.

> However, the actual disk storage is 879k which is smaller than either of
> the weaves, and smaller than the tar (remember, tar uses 512byte blocks,
> and pads with null).

OOOoh.  I was wondering why the tar file was more than twice the
apparent-size.

> using a zipfile is another reasonable possibility, as the file is
> indexed, giving us a decent time to get to the entries. You can append
> to it (though if you crash in the middle, most readers will consider it
> completely corrupted).

Wasn't it roughly the same as apparent-size?

> I think the biggest reason we didn't switch, though, was because the
> time to add a new version is somewhere in O(N).

> Does anyone know of a python linked-list implementation?

Dunno.  Actually, I was never certain whether the underlying
implementation of list was a linked list or an array.

> I don't know about Aaron's weavediff format, which let you only read
> portions of the file, rather than unpacking the entire weave into
> memory. It certainly could be < O(N) for inserting a new entry.

Yes, but compression time would be O(n) scaling, so the complexity stays
at O(n).  Also, you still need to go back as far as the oldest revision
that contributed a line to the revision, which will often be the first
revision.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDPes00F+nu1YWqI0RAqPtAJ4tP2EuNYvAlNVCl2y4caucPfJLiACeKjoX
Wai7ZRVbfRsCi1HmWws7nDg=
=AM9M
-----END PGP SIGNATURE-----