Compressing weaved revisions?

John A Meinel john at arbash-meinel.com
Sat Oct 1 05:25:26 BST 2005


Aaron Bentley wrote:
> John A Meinel wrote:
> 
>>>Well, just to get the numbers straight, I went ahead and created a
>>>plugin, which can turn the current revision-store into a weave file.
> 
> 
> Cool.  Thanks for testing this.
> 
> 
>>>The summary is that a weave isn't much bigger than the raw text bytes,
>>>especially if we reformat them to put committer and timezone on a line
>>>by themselves. It doesn't compress quite as well as a tarball, but you
>>>can get down to about 1/3rd the size.
> 
> 
> Hrm.  Odd.
> 
>>>The -reformat is to insert a newline before the space for each element.
> 
> 
> Disappointing that didn't give more compression.  It might do really
> well on inventories, though.

I would be curious about inventories. Since filenames would stay 
constant, even across content changing. But I wouldn't expect it to add 
a whole lot.

> 
> 
>>>For example:
>>><revision committer="John A Meinel" revision_id="XXYYZZ" ...>
>>>becomes
>>><revision
>>> committer="John A Meinel"
>>> revision_id="XXYYZZ"
>>>etc.
>>>
>>>The -append is my append-only weaves, and the -append-reuse is a version
>>>which keeps a mapping between lines that have existed, and their
>>>numbers, and if it gets the same line twice (even if it was created in
>>>different history), it reuses it (this is possible because all lines are
>>>looked up by number).
> 
> 
> Maybe I should mention that weave merge works on the principle that
> lines have three states: unborn -> alive -> dead.  If you're
> resurrecting lines, you'd probably need another layer (to treat
> resurrected lines as new lines) to do weave-merge.
> 

Well, the current implementation lets you just recreate the current 
weave format in memory.

> 
>>>However, the actual disk storage is 879k which is smaller than either of
>>>the weaves, and smaller than the tar (remember, tar uses 512byte blocks,
>>>and pads with null).
> 
> 
> OOOoh.  I was wondering why the tar file was more than twice the
> apparent-size.
> 
> 
>>>using a zipfile is another reasonable possibility, as the file is
>>>indexed, giving us a decent time to get to the entries. You can append
>>>to it (though if you crash in the middle, most readers will consider it
>>>completely corrupted).
> 
> 
> Wasn't it roughly the same as apparent-size?
> 

It is approximately apparent-size. But since du -sh is 7+MB, you can 
save quite bit of disk space. Whether people actually care or not, I'm 
not sure. But people who check using "du -sh" are going to think bzr is 
very wasteful.

> 
>>>I think the biggest reason we didn't switch, though, was because the
>>>time to add a new version is somewhere in O(N).
> 
> 
>>>Does anyone know of a python linked-list implementation?
> 
> 
> Dunno.  Actually, I was never certain whether the underlying
> implementation of list was a linked list or an array.

Well, if you just to:

$ python -m timeit "x = []" "for i in range(10000):" "  x.insert(0, i)"
10 loops, best of 3: 74.3 msec per loop
$ python -m timeit "x = []" "for i in range(10000):" "  x.append(i)"
100 loops, best of 3: 4.85 msec per loop

Which shows that inserting (which moves all elements) is 20x worse than 
appending. (Making it 100k rows is even worse).

> 
> 
>>>I don't know about Aaron's weavediff format, which let you only read
>>>portions of the file, rather than unpacking the entire weave into
>>>memory. It certainly could be < O(N) for inserting a new entry.
> 
> 
> Yes, but compression time would be O(n) scaling, so the complexity stays
> at O(n).  Also, you still need to go back as far as the oldest revision
> that contributed a line to the revision, which will often be the first
> revision.

Well, I'm talking about "compression" in the sense of putting everything 
in one file.

Also, I would think that it would be at most O(num_lines), where you 
probably can skip a bunch of the intermediate revisions that don't add 
any lines. (they've all been deleted)

John
=:->

> 
> Aaron
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 253 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20050930/c8c6ec7e/attachment.pgp 


More information about the bazaar mailing list