backward diffs in knits?
John Arbash Meinel
john at arbash-meinel.com
Tue Apr 10 15:54:10 BST 2007
Matthieu Moy wrote:
...
>> For a forward delta, this produces
...
>
> A snapshot of "X" at 1
> An insert of "X" at n \in 2..10
> A snapshot of XXXXXXXXXXX at 11
> An insert of "X" at n \in 12..15
>
>> For a backward delta, this produces
...
>
> A snapshot of X^15 at 15
> A snapshot of X^5 at 5
> A deletion at n \in 1..4, 6..14
>
>> You can extend this example in either direction, and backwards deltas
>> retain their three-letter advantage (but the relative advantage decreases).
>
> The relative advantage decreases _if_ you extend the example with
> no-ops. But if you extend it with insertion, the relative advantage
> should be constant (proportional to the snapshot interval).
>
>> It seems like there's some win, but I'm not sure whether it's enough to
>> justify a "repack" command.
>
> Probably not in itself.
>
> Indeed, what I'd like to see is the equivalent of git's pack files,
> which would provide a way to get a fast initial checkout over a dumb
> protocol.
>
I think that in the common case of files growing, backwards deltas are
slightly smaller, and they optimize for extracting the TIP revision,
which helps for things like "bzr checkout", "bzr update", and "bzr diff".
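To make that concrete, here is a toy sketch (plain Python, not knit code)
that just counts the characters stored by the two layouts in Matthieu's
example above. It assumes a snapshot every 10 revisions, and that a
backward delta which only deletes lines stores no text; both assumptions
are mine, inferred from the example.

  # A file grows by one "X" per revision, 15 revisions total.
  SNAPSHOT_INTERVAL = 10
  NUM_REVISIONS = 15

  def forward_chain_size():
      """Snapshot the first revision of each group, then store forward inserts."""
      size = 0
      for rev in range(1, NUM_REVISIONS + 1):
          if rev % SNAPSHOT_INTERVAL == 1:
              size += rev      # full snapshot: rev copies of "X"
          else:
              size += 1        # forward delta: one inserted "X"
      return size

  def backward_chain_size():
      """Snapshot every SNAPSHOT_INTERVAL-th revision counting back from the tip."""
      size = 0
      for rev in range(1, NUM_REVISIONS + 1):
          if (NUM_REVISIONS - rev) % SNAPSHOT_INTERVAL == 0:
              size += rev      # full snapshots land at revisions 15 and 5
          # other revisions are pure deletions, which store no text
      return size

  print(forward_chain_size())   # 25  (snapshots at 1 and 11, plus 13 inserts)
  print(backward_chain_size())  # 20  (snapshots at 15 and 5, deletions free)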
When I was discussing this stuff with Martin and Robert, I didn't think
the idea was to actually store backwards deltas in Knits, but a
different storage format.
I don't think backwards deltas should be generated "on the fly", because
of the potential for newly introduced corruption to lose old data.
However, it could make sense for an "archive" command, where you
package up a bunch of older revisions to make them denser and sort of
put them off to the side. They should still be accessible, but maybe
they wouldn't be in the default search every time.
I do think it should be genuinely evaluated with real data before we
settle on anything. (Just do a backwards delta compression on bzr.dev's
repository and see how it changes things). It is unfortunate that you
need to do a bit of coding before you can know whether it is worth
finishing.
Another thing which seems to help git is that they can do cross-file
deltas. So when they 'repack', they actually have some sort of
"similarity" index between files. My first impression was that it is
just whether the files have the same name across directories (so
Makefile gets compacted well, but foo.c and bar.c don't get joined),
which I would guess is fairly kernel-centric. I don't know if many
projects have commonly named files in different directories other than
the build files (Makefile.in, Makefile.am, etc.).
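As a rough illustration (my guess at the heuristic, not git's actual
repack code), grouping delta candidates by basename could look
something like this:

  import os
  from collections import defaultdict

  def delta_candidate_groups(paths):
      """Group paths by basename; files in the same group are delta candidates."""
      groups = defaultdict(list)
      for path in paths:
          groups[os.path.basename(path)].append(path)
      # Only groups with more than one member offer any cross-file delta win.
      return dict((name, members) for name, members in groups.items()
                  if len(members) > 1)

  paths = ["Makefile", "drivers/net/Makefile", "fs/ext3/Makefile", "fs/ext3/super.c"]
  print(delta_candidate_groups(paths))
  # {'Makefile': ['Makefile', 'drivers/net/Makefile', 'fs/ext3/Makefile']}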
I don't know how much compression one could get, but the Xorg folks
said that their packed history is approximately the same size as a
checkout. And I can say that bzr.dev is 54MB for .bzr/ and 11MB for
working files (including .pyc, etc.), so I'm curious whether we can do
better than 5x.
Getting to 1x would make it trivial to accept downloading the whole
history versus just having the working files. (Considering you can
download the history and then generate the files locally.) Though
admittedly we won't ever get down to the size of a .tar.bz2 of the
project.
Speaking of which, we might also consider using .7z as a storage
format. I think I've mentioned it before, but it is sort of a hybrid
between .zip and .tar, in that it can compress nearby files together
but still has an index. So extracting "foo.py" might require
decompressing a chunk with other data, but you don't have to extract
the entire archive. (Plus LZMA has better compression characteristics
than bz2: higher compression and faster decompression speed, at the
expense of slower compression speed... good for archives.)
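To sketch the chunk-plus-index idea (this is only an approximation in
Python using the stdlib lzma module; the real .7z format is of course
more involved), extracting one file costs decompressing just its chunk:

  import io
  import lzma

  def pack(files, chunk_size=256 * 1024):
      """files: dict of {name: bytes}. Returns (compressed chunks, index)."""
      chunks, index = [], {}
      buf = io.BytesIO()
      for name, data in files.items():
          index[name] = (len(chunks), buf.tell(), len(data))  # (chunk, offset, length)
          buf.write(data)
          if buf.tell() >= chunk_size:
              chunks.append(lzma.compress(buf.getvalue()))
              buf = io.BytesIO()
      if buf.tell():
          chunks.append(lzma.compress(buf.getvalue()))
      return chunks, index

  def extract(chunks, index, name):
      """Decompress only the chunk containing `name`, not the whole archive."""
      chunk_no, offset, length = index[name]
      plain = lzma.decompress(chunks[chunk_no])
      return plain[offset:offset + length]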
John
=:->