[RFC] Blob format

John Arbash Meinel john at arbash-meinel.com
Wed Jun 28 15:14:19 BST 2006


Johan Rydberg wrote:
> As some of you know already I've been toying with the idea of a blob
> history format.  I wrote down a few lines to start a discussion:
> 
>   http://bazaar-vcs.org/BzrBlobFormat
> 
> There is also a branch available where I've done some work (see the
> spec for URL.)
> 
> The golden goal of the blob format is to eliminate a lot of RTTs to a
> dumb remote server, and to get rid of the problem of identifying
> what to fetch.
> 
> My current work is focusing on having a model where a blob can contain
> several revisions.  Blobs are immutable, but can be combined into a
> new blob.  This is done during a push/pull/merge, meaning the
> number of blobs will be reduced when data is shared between repos.
> 
> The blob also contains meta-data about what changes a revision
> introduces to a file, to speed up operations such as 'bzr log FILE'
> (i.e., Repository.get_revision_delta)
> 
> Currently I use simple .zip-files as containers for the blobs.  That
> may be far from optimal though, since I get the feeling that you need
> to skip around a lot to read the contents of a zipfile.  So a
> custom-made container may be more suitable.

Well, from my experience with zipfile, you read the very tail of the
file to find out where the main index is. They did something smart and
put the pointer to the index at the very end, but then they screwed up
and followed it with a variable-size comment field whose length field
comes *before* the comment string. So you can't know exactly where that
record starts: you have to read a couple of K, scan for the "PK\x05\x06"
end-of-central-directory signature to find where the index starts, jump
back to there, and read to the end of the file.
At that point, you have the table of contents telling you where all
the other hunks are. Then you can jump around to the other hunks, read
the short local header for each, verify it, and then read the rest of
the hunk.
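
Just to make the dance concrete, here is roughly what a reader has to
do (plain Python sketch, not bzrlib code; the 8k tail size is only a
guess):

import struct

EOCD_SIG = b'PK\x05\x06'   # end-of-central-directory signature

def find_central_directory(f, tail_size=8192):
    # Read a chunk off the tail and scan backwards for the signature,
    # since the variable-length comment means we cannot know the exact
    # position of the record up front.
    f.seek(0, 2)
    file_size = f.tell()
    f.seek(max(0, file_size - tail_size))
    tail = f.read()
    pos = tail.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError('end-of-central-directory record not found')
    # Fixed fields after the signature: disk and entry counts (8 bytes),
    # then central directory size, offset, and comment length.
    cd_size, cd_offset, comment_len = struct.unpack(
        '<IIH', tail[pos + 12:pos + 22])
    return cd_offset, cd_size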

So a smarter implementation would guess at the size of the table of
contents, read a few K off the end of the file, and hope it got enough
that it doesn't have to go back and request more of the tail to
assemble the whole table of contents.

And we should be smart enough not to record a big comment string in
the first place (the comment length is stored in an unsigned 16-bit
integer, so the maximum is 65535 bytes). We know the average size of a
single TOC entry, and I think it would be reasonable to make the first
request 8k or so. Maybe more; I'm not sure what would be optimal for
round trips versus bandwidth.
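
As a sketch of the one-round-trip version (read_range(start, length)
here is just a stand-in for whatever the transport offers; it is not an
existing bzrlib call):

import struct

def read_toc_in_one_request(read_range, file_size, guess=8192):
    # Ask for the last 'guess' bytes in a single request and hope both
    # the end record and the whole table of contents are inside it.
    start = max(0, file_size - guess)
    tail = read_range(start, file_size - start)
    pos = tail.rfind(b'PK\x05\x06')
    if pos < 0:
        if start == 0:
            raise ValueError('no end-of-central-directory record found')
        # Unlucky (huge comment, or guess too small): read a bigger tail.
        return read_toc_in_one_request(read_range, file_size, guess * 4)
    cd_size, cd_offset, _ = struct.unpack('<IIH', tail[pos + 12:pos + 22])
    if cd_offset >= start:
        # The whole table of contents was already in the first read.
        return tail[cd_offset - start:cd_offset - start + cd_size]
    # Otherwise it costs exactly one more round trip.
    return read_range(cd_offset, cd_size)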


> 
> Another open question is how to store information in the blob store
> about which blobs contain which revisions.  One simple way would be to
> have an index of revision->blob mappings.  That index could also
> contain parent information for the revisions, to speed up repository
> graph and ancestry retrieval.  The downside is that when inserting a
> blob in the blob-store it would have to be inspected to figure out
> what revisions are present.  That data could possibly be deduced from
> the fetching process, though.
> 
> ~j


Well, if we named everything properly, the final TOC could serve as
the index of which revisions are present, and we wouldn't have to
maintain a separate index.
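
Something like this is all the 'index' lookup would amount to if the
member names carried the revision ids (the 'revision-' prefix is just
made up for illustration):

import zipfile

def revisions_in_blob(blob_path):
    blob = zipfile.ZipFile(blob_path, 'r')
    try:
        # The zip's own table of contents doubles as the revision index.
        return [name[len('revision-'):] for name in blob.namelist()
                if name.startswith('revision-')]
    finally:
        blob.close()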

The other thing to watch out for with a zipfile, though, is that the
proper way to add data is to write over the current TOC with the new
file records, and then regenerate the TOC when you are done. If
something interrupts you in the middle, you've corrupted the file. You
could recover by re-scanning the file for "PK\x03\x04" headers, which
mark the start of each record, and hoping that none of the compressed
data randomly produces that sequence; the other integrity fields would
let you reject a false match, though.

It might also be possible to start writing *after* the current TOC,
and write a new TOC when you finish. You end up leaving garbage (the
old TOC) in the file, which may not be small if you have thousands of
individual files in there.
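
For what it's worth, as far as I can tell Python's own zipfile module
takes the first approach when you open in append mode: it positions
itself where the old TOC starts, writes the new records over it, and
only writes a fresh TOC at close(). Roughly:

import zipfile

def add_revision(blob_path, revision_id, text):
    # A crash between the writestr() and the close() leaves the blob
    # without a valid TOC, which is the failure mode described above.
    blob = zipfile.ZipFile(blob_path, 'a', zipfile.ZIP_DEFLATED)
    try:
        blob.writestr('revision-%s' % revision_id, text)
    finally:
        blob.close()   # the replacement TOC is only written here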

I would be more interested in a 'solid' format like 7z, which
compresses multiple files per hunk but keeps an index to the beginning
of each hunk. With tar.gz you have to decompress through everything in
front of the file you want. With 7z, you can bound how many files you
have to read through.
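
Very roughly, the layout I mean is something like this (pure
illustration, not an existing format):

import zlib

def pack_solid(texts, texts_per_hunk=16):
    # 'texts' maps name -> bytes.  Several texts are compressed into
    # each hunk, and the index records (hunk number, offset, length)
    # so a reader only decompresses one bounded hunk per lookup.
    hunks, index = [], {}
    items = sorted(texts.items())
    for i in range(0, len(items), texts_per_hunk):
        raw, offset = [], 0
        for name, text in items[i:i + texts_per_hunk]:
            index[name] = (len(hunks), offset, len(text))
            raw.append(text)
            offset += len(text)
        hunks.append(zlib.compress(b''.join(raw)))
    return hunks, index

def read_text(hunks, index, name):
    hunk_no, offset, length = index[name]
    data = zlib.decompress(hunks[hunk_no])   # one hunk, not the archive
    return data[offset:offset + length]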

Most likely, I would like to see us improve the bundle format, so that
it includes a bzip2 + base64 hunk of all the knit deltas at the bottom,
with the rollup at the top. That bzip2 hunk could then become the basis
for a blob storage format.
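
Sketch of what I mean, with made-up section markers (the point is only
that the deltas travel as one bzip2+base64 hunk after the readable
rollup):

import bz2, base64

def write_bundle(out, rollup_text, knit_delta_bytes):
    # Human-readable rollup first, then all the knit deltas compressed
    # into a single bzip2 hunk and base64-encoded so it survives mail.
    out.write(rollup_text)
    out.write('\n# BEGIN DELTAS (bzip2+base64)\n')
    out.write(base64.encodebytes(bz2.compress(knit_delta_bytes)).decode('ascii'))
    out.write('# END DELTAS\n')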

John
=:->