[RFC] Blob format

Johan Rydberg jrydberg at gnu.org
Wed Jun 28 16:30:32 BST 2006


John Arbash Meinel <john at arbash-meinel.com> writes:

> [...]
> And if we should be smart enough to not record a 16k comment string (I
> think that is the max length. IIRC, it uses a 16-bit integer to store
> the length, so it might be 32k or 65k if unsigned)
> We know the average size of a single TOC entry. And I think it would be
> reasonable to make the first request 8k or so. Maybe more, I'm not sure
> what would be optimal for round-trip versus bandwidth.

Sounds like zipfiles are borked.  The upside with zipfiles, though, is
that almost everything can read them.  No need for special tools.

But I do not really understand your round-trip vs bandwidth reasoning;
if you want to fetch and install a blob into your local repository,
you would of course fetch the whole file first, before inspecting it,
either directly into the store or into a temporary directory.

My current code has two fetchers:

 1. a copying fetcher that simply retrieves blobs from the remote
    repository and installs them as is.

 2. a merging fetcher that retrieves remote blobs, combines them
    into a new blob, and installs that into the blob store.

    It does this by copying the retrieved blob into a local file
    before starting to inspect it.  The reasoning behind this is
    that _all_ content of the blob will be inspected, so you may
    as well retrieve all content to begin with.
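In rough pseudo-Python it looks something like the sketch below; the
remote_open/store_add/merge_blobs names are only placeholders for
whatever transport and store API we end up with:

    import os
    import shutil
    import tempfile

    def copying_fetch(remote_open, store_add, blob_name):
        # 1. copying fetcher: install the remote blob as-is.
        store_add(blob_name, remote_open(blob_name))

    def merging_fetch(remote_open, store_add, blob_names, merge_blobs):
        # 2. merging fetcher: copy each remote blob into a local
        #    temporary file first (all of its content will be
        #    inspected anyway), then combine them into one new blob.
        local_paths = []
        for name in blob_names:
            fd, path = tempfile.mkstemp(suffix='.blob')
            tmp = os.fdopen(fd, 'wb')
            shutil.copyfileobj(remote_open(name), tmp)
            tmp.close()
            local_paths.append(path)
        new_name, new_file = merge_blobs(local_paths)
        store_add(new_name, new_file)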

> Well, if we named everything properly, the final TOC could contain the
> index of what revisions are present. Then we wouldn't have to maintain a
> separate index.

I was thinking of a small index file that contains revision->blob
mappings, so you do not have to iterate through all blobs to find
the revision you are looking for.
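Something as trivial as one "revision-id blob-name" pair per line
would do; an untested sketch, with the exact format still to be
decided:

    def read_index(index_file):
        # Return a dict mapping revision id -> blob name, one
        # mapping per line of the index file.
        mapping = {}
        for line in index_file:
            revision_id, blob_name = line.rstrip('\n').split(' ', 1)
            mapping[revision_id] = blob_name
        return mapping

    def add_to_index(index_path, revision_id, blob_name):
        # Record that REVISION_ID lives in BLOB_NAME by appending
        # one line to the index file.
        f = open(index_path, 'a')
        try:
            f.write('%s %s\n' % (revision_id, blob_name))
        finally:
            f.close()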

> The other thing to watch out for with a zipfile, though, is that the
> proper way to add data is to write over the current TOC with the new
> file records, and then to regenerate the TOC when you are done. If
> something messes you up in the middle, you've messed up the file. You
> could fix it by re-scanning the file for "PK34" headers which indicate
> the start of a new record, and hope that none of the compressed data
> randomly generates a PK34 entry. Though you could also use the other
> integrity fields to disregard this as a real entry.

If we make blobs immutable this is a non-problem, since you will never
open an existing blob for writing.  But that leaves out signatures: if
a user wants to sign a revision that is contained in a blob, should we
insert the signature into the blob, or should signatures be stored in
a separate signature store?

> I would be more interested in a 'solid' format like 7z, which compresses
> multiple files per hunk, but keeps an index to the beginning of the
> hunk. With tar.gz you have to seek through all files to find the one you
> want. With 7z, you can bound how many files you have to seek through.

I did not know of 7z.  I will look at it when I have a few minutes to
spare.  Do you know if there is a Python port/SDK available?

> Most likely, I would like to see us improve the Bundle format, so that
> it includes a bzip2 + base64 of all the knit deltas at the bottom, with
> the rollup at the top. And then the bzip2 hunk could become a basis for
> a blob storage format.

Ideally we could write a "blob serializer" that only includes the
deltas.  I suppose this is a small matter of coding :)
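Something along these lines, perhaps; very hand-wavy, and the blob
writer interface and knit accessors below are assumptions rather than
existing API:

    def serialize_deltas(blob_writer, knit, version_ids):
        # Write only the knit delta for each version, never the
        # full text; the receiving side reconstructs the texts
        # against bases it already has.
        for version_id in version_ids:
            parents = knit.get_parents(version_id)
            delta = knit.get_delta(version_id)
            blob_writer.add(version_id, parents, delta)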

~j




