[MERGE-REQ] bzr.newformat

Fri Sep 16 09:56:35 BST 2005

> > I've checked out the weave branch, and had a general overview
> > on it.  The new storage scheme looks very promising indeed.
> > One point I've found a bit dubious though is how the storage
> > format depends on newlines as an enforced boundary for each
> > chunk. This would make the format a bit unwieldy for binary
> > files; it would be impossible, for instance, to have chunks
> > of fixed sizes.
> 
> For text files obviously you do want to split it on newlines, as the
> sensible unit for annotation and merging.

Certainly.

> The weave code just works on a sequence of strings, without any
> requirement that they be terminated by newlines or even printable.  It
> should work fine with binary files (though it needs more tests).  You
> do need some way to chunk the binary file; \n will work ok on many
> files but might give uneven chunk sizes on some.  Better would
> probably be to use a rolling checksum.

What worries me is that the chunk unit of the file format itself
depends on newlines. I can't just pass it a chunk (line) with an
embedded newline and hope it will work. To make the point clear, think
about what the weave file would look like in the pathological case of
a 1MB file containing nothing but newlines. With the proposed change,
once we're able to identify a binary file in bzr, we could at least
split it into chunks of a specific size. Using a rolling checksum
would certainly be a plus.
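
To put rough numbers on that, here is a minimal sketch of the three
chunking strategies being discussed. None of this is bzr code: the
function names, the window and mask values, and the use of a plain
byte sum as the "rolling checksum" are all assumptions made for
illustration only.

```python
def split_on_newlines(data):
    """Split after every b'\\n', keeping the newline with its chunk."""
    chunks = []
    start = 0
    while start < len(data):
        end = data.find(b"\n", start)
        end = len(data) if end == -1 else end + 1
        chunks.append(data[start:end])
        start = end
    return chunks


def split_fixed(data, size=4096):
    """Split into chunks of at most `size` bytes, ignoring content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def split_rolling(data, window=48, mask=0x0FFF, min_size=64, max_size=8192):
    """Cut where a windowed byte sum matches `mask` (hypothetical scheme).

    The sum covers at most the last `window` bytes of the current chunk.
    `max_size` forces a cut when the content never produces a boundary.
    """
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        length = i + 1 - start
        at_boundary = length >= min_size and (rolling & mask) == mask
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


# The pathological case: 1MB containing nothing but newlines.
pathological = b"\n" * (1024 * 1024)
print(len(split_on_newlines(pathological)))  # 1048576 one-byte chunks
print(len(split_fixed(pathological, 4096)))  # 256 chunks
```

Splitting on newlines turns that file into over a million one-byte
chunks, while a fixed size keeps the count predictable. The rolling
variant only pays off when the content varies, which is why the
sketch carries a max_size fallback.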

> The current storage format is also line oriented, but uses ',' to mark
> data lines with no trailing newline present.  This seems to work OK to
> store binaries.  Doing it this way seemed reasonably efficient in
> Python, and has the advantage of making the weaves more human
> readable.

Yes, it looks really interesting.
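
For what it's worth, this is how I picture the line-oriented layout,
as a rough sketch and not the actual weave code. The ',' marker is
the one described above; the '.' marker for ordinary lines and the
function names are my own assumptions. It also shows the limitation
I mentioned: a chunk with an embedded newline does not survive the
round trip.

```python
def encode_chunks(chunks):
    """Write one chunk per stored line, with a two-byte marker prefix."""
    out = []
    for chunk in chunks:
        if chunk.endswith(b"\n"):
            out.append(b". " + chunk)          # assumed marker: newline present
        else:
            out.append(b", " + chunk + b"\n")  # ',' marks no trailing newline
    return b"".join(out)


def decode_chunks(blob):
    """Read the stored lines back into the original chunks."""
    chunks = []
    start = 0
    while start < len(blob):
        end = blob.find(b"\n", start)
        end = len(blob) if end == -1 else end + 1
        line = blob[start:end]
        start = end
        marker, body = line[:2], line[2:]
        if marker == b". ":
            chunks.append(body)
        elif marker == b", ":
            chunks.append(body[:-1])  # strip the newline added by the encoder
        else:
            raise ValueError("unrecognised stored line: %r" % line)
    return chunks


# Chunks split on newlines round-trip, with or without a final newline.
ok = [b"first\n", b"\x00\x01binary\n", b"no newline at end"]
assert decode_chunks(encode_chunks(ok)) == ok

# A chunk with an embedded newline is broken apart when read back.
try:
    decode_chunks(encode_chunks([b"ab\ncd"]))
except ValueError as exc:
    print("cannot round-trip:", exc)
```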

Thanks for checking it out.

-- 
Gustavo Niemeyer
http://niemeyer.net