Compressing weaved revisions?
John A Meinel
john at arbash-meinel.com
Sat Oct 1 05:25:26 BST 2005
Aaron Bentley wrote:
> John A Meinel wrote:
>
>>>Well, just to get the numbers straight, I went ahead and created a
>>>plugin, which can turn the current revision-store into a weave file.
>
>
> Cool. Thanks for testing this.
>
>
>>>The summary is that a weave isn't much bigger than the raw text bytes,
>>>especially if we reformat them to put committer and timezone on a line
>>>by themselves. It doesn't compress quite as well as a tarball, but you
>>>can get down to about 1/3rd the size.
>
>
> Hrm. Odd.
>
>>>The -reformat is to insert a newline before the space for each element.
>
>
> Disappointing that didn't give more compression. It might do really
> well on inventories, though.
I would be curious about inventories. Since filenames would stay
constant, even across content changing. But I wouldn't expect it to add
a whole lot.
>
>
>>>For example:
>>><revision committer="John A Meinel" revision_id="XXYYZZ" ...>
>>>becomes
>>><revision
>>> committer="John A Meinel"
>>> revision_id="XXYYZZ"
>>>etc.
>>>
>>>The -append is my append-only weaves, and the -append-reuse is a version
>>>which keeps a mapping between lines that have existed, and their
>>>numbers, and if it gets the same line twice (even if it was created in
>>>different history), it reuses it (this is possible because all lines are
>>>looked up by number).
>
>
> Maybe I should mention that weave merge works on the principle that
> lines have three states: unborn -> alive -> dead. If you're
> resurrecting lines, you'd probably need another layer (to treat
> resurrected lines as new lines) to do weave-merge.
>
Well, the current implementation lets you just recreate the current
weave format in memory.
>
>>>However, the actual disk storage is 879k which is smaller than either of
>>>the weaves, and smaller than the tar (remember, tar uses 512byte blocks,
>>>and pads with null).
>
>
> OOOoh. I was wondering why the tar file was more than twice the
> apparent-size.
>
>
>>>using a zipfile is another reasonable possibility, as the file is
>>>indexed, giving us a decent time to get to the entries. You can append
>>>to it (though if you crash in the middle, most readers will consider it
>>>completely corrupted).
>
>
> Wasn't it roughly the same as apparent-size?
>
It is approximately apparent-size. But since du -sh is 7+MB, you can
save quite bit of disk space. Whether people actually care or not, I'm
not sure. But people who check using "du -sh" are going to think bzr is
very wasteful.
>
>>>I think the biggest reason we didn't switch, though, was because the
>>>time to add a new version is somewhere in O(N).
>
>
>>>Does anyone know of a python linked-list implementation?
>
>
> Dunno. Actually, I was never certain whether the underlying
> implementation of list was a linked list or an array.
Well, if you just to:
$ python -m timeit "x = []" "for i in range(10000):" " x.insert(0, i)"
10 loops, best of 3: 74.3 msec per loop
$ python -m timeit "x = []" "for i in range(10000):" " x.append(i)"
100 loops, best of 3: 4.85 msec per loop
Which shows that inserting (which moves all elements) is 20x worse than
appending. (Making it 100k rows is even worse).
>
>
>>>I don't know about Aaron's weavediff format, which let you only read
>>>portions of the file, rather than unpacking the entire weave into
>>>memory. It certainly could be < O(N) for inserting a new entry.
>
>
> Yes, but compression time would be O(n) scaling, so the complexity stays
> at O(n). Also, you still need to go back as far as the oldest revision
> that contributed a line to the revision, which will often be the first
> revision.
Well, I'm talking about "compression" in the sense of putting everything
in one file.
Also, I would think that it would be at most O(num_lines), where you
probably can skip a bunch of the intermediate revisions that don't add
any lines. (they've all been deleted)
John
=:->
>
> Aaron
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 253 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20050930/c8c6ec7e/attachment.pgp
More information about the bazaar
mailing list