[RFC] Blob format

Johan Rydberg jrydberg at gnu.org
Wed Jun 28 17:49:32 BST 2006


John Arbash Meinel <john at arbash-meinel.com> writes:

>> But I do not really understand your round-trip vs bandwidth reasoning;
>> if you want to fetch and install a blob into your local repository,
>> you of course go ahead and fetch the whole file at first, before
>> inspecting it.  Either directly into the store, or to a temporary
>> directory.
>> 
>
> How do you know if anything is missing. If there is a 10MB blob sitting
> there, you may have everything but 2MB of it.
> The idea is that you can read the blob's TOC to see what is missing, and
> then issue another readv() for just those hunks that you need.

Good call.
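
To make that concrete, here is a minimal sketch of the fetch path I
have in mind; read_toc() and install_hunk() are made-up helpers, and I
am assuming a readv() that takes a list of (offset, length) pairs:

    # Sketch only: read the blob's TOC first, then issue one readv()
    # for just the hunks that are missing locally.
    def fetch_missing(transport, blob_name, present_revisions):
        toc = read_toc(transport, blob_name)   # revision -> (offset, length)
        wanted = [(offset, length)
                  for revision, (offset, length) in toc.items()
                  if revision not in present_revisions]
        for offset, data in transport.readv(blob_name, wanted):
            install_hunk(blob_name, offset, data)

That keeps it to two round trips in the common case: one for the TOC
and one for the missing hunks.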

>> My current code has two fetchers:
>> 
>>  1. a copying fetcher that simply retrieves blobs from the remote
>>     repository and installs them as is.
>> 
>>  2. a merging fetcher that retrieves remote blobs and combine them
>>     into a new blob, and install that into the blob store.
>> 
>>     It does this by copying the retrieved blob into a local file
>>     before starting to inspect it.  The reasoning behind this is
>>     that _all_ content of the blob will be inspected, so you may
>>     as well retrieve all content to begin with.
>> 
>>> Well, if we named everything properly, the final TOC could contain the
>>> index of what revisions are present. Then we wouldn't have to maintain a
>>> separate index.
>> 
>> I was thinking of a small index file that contains revision->blob
>> mappings, so you do not have to iterate through all blobs to find
>> the revision you are looking for.
>> 
>
> I don't think this index file will end up 'small'. Depending on what you
> want to index. I suppose if it is just a revision_id => blob filename
> mapping, that might work.

That was the original idea, but ..

The normal access pattern while doing a pull is to fetch the revision
history file (to determine last_revision) and then to fetch the
ancestry of the remote repository to calculate what revisions need to
be fetched.  

So fast access to the ancestry is important.  Therefore I might want
to include it in the index.
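
For illustration only (the field layout here is just an assumption),
one line per revision carrying both the blob mapping and the parent
ids would give us the ancestry from the same read:

    # Sketch: parse index lines of the form
    #   <revision-id> <blob-name> <parent-id> <parent-id> ...
    # so that a single read yields both the revision->blob mapping and
    # the ancestry graph.
    def parse_index(lines):
        blob_of = {}
        parents_of = {}
        for line in lines:
            fields = line.split()
            if not fields:
                continue
            revision_id, blob_name = fields[0], fields[1]
            blob_of[revision_id] = blob_name
            parents_of[revision_id] = fields[2:]
        return blob_of, parents_of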

> One thing to think about, though, is if it is possible to only read part
> of this index file.
> Right now with knit indexes we read the whole thing. But theoretically
> we could just read the last X bytes, and figure out if we need to read
> any more for what we want to do.
> This is mostly for the revision.knit/inventory.knit.
> I think we need to get the branch's last_revision, and then if we don't
> have it locally, we need to read revision.knit and inventory.knit until
> we find it, and then figure out what ancestry we are missing.
> Most times it will be a win to just read the last few K of the kndx
> files, since you will have the rest of the data locally.
>
> I think this is a nice property, and we should try to get the blob index
> to perform similarly.

Absolutely.  We would be a lot better off if we had fixed-size file
and revision identifiers, but it is a bit too late to change that.  So
we probably need a format with back-pointers to earlier entries in the
index.

last_revision is _most likely_ (unless it is a very old branch in a
shared repository) located near the end of the index, so it should
probably be enough to start out by reading -N bytes, and then double N
each time the entry is not found.

HTTP byte ranges support negative offsets (suffix ranges).  Not sure
our readv() interface does, though.
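
A sketch of that probing loop, against a local file for simplicity;
try_parse_tail() is an assumed helper that returns None until the tail
both starts on a record boundary and contains the wanted revision:

    import os

    # Sketch only: read the last N bytes of the index and double N
    # each time the wanted entry is not found.
    def find_entry(index_path, revision_id, initial=1024):
        size = os.path.getsize(index_path)
        n = min(initial, size)
        f = open(index_path, 'rb')
        try:
            while True:
                f.seek(size - n)
                entry = try_parse_tail(f.read(n), revision_id)
                if entry is not None:
                    return entry
                if n == size:
                    raise KeyError(revision_id)
                n = min(n * 2, size)
        finally:
            f.close()

Over HTTP the -N read maps directly onto a suffix range, e.g.
"Range: bytes=-1024".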

>>> The other thing to watch out for with a zipfile, though, is that the
>>> proper way to add data is to write over the current TOC with the new
>>> file records, and then to regenerate the TOC when you are done. If
>>> something messes you up in the middle, you've messed up the file. You
>>> could fix it by re-scanning the file for "PK34" headers which indicate
>>> the start of a new record, and hope that none of the compressed data
>>> randomly generates a PK34 entry. Though you could also use the other
>>> integrity fields to disregard this as a real entry.
>> 
>> If we make blobs immutable this is a non-problem since you will never
>> open an existing blob for writing.  But that lives out signatures; if
>> a user want to sign a revision that is contained in a blob, should we
>> insert the signature into the blob; or should signature be stored in a
>> separate signature store?
>
> s/lives/leaves/ right?

Correct.  Sorry.

> Either immutable or append only is fine with me.

Having thought about it for a while, I believe we are probably better
off with our own format of some sort.  It does not have to be very
complex.

The blob starts out with the "file" contents; after that follows a
section of per-file headers (name, crc, ...), and finally a general
file header.  The general file header can contain a back-pointer to an
earlier general header in the file, to implement append.  Example:

    offset 0:    <file 'foo' content>
    offset 300:  <file 'bar' content>
    offset 600:  <header 'foo', offset=0, size=300>
    offset 6XX:  <header 'bar', offset=300, size=300>
    offset 6XX:  <general header; index-offset=600>

    (though, for the indexes, it might be better with relative
     offsets)

To get an index of the blob, read -N bytes, verify the general
header, and start reading the per-file headers.  If -N is not enough,
do another access to get more data.
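
As a sketch of the reader side (the record layout, the index-size
field and the magic value are all assumptions, just to show the shape
of it):

    import struct

    # Assumed fixed-size general header: index-offset, index-size,
    # back-pointer to the previous general header (0 if none), magic.
    GENERAL_FMT = '>QQQ8s'
    GENERAL_SIZE = struct.calcsize(GENERAL_FMT)
    MAGIC = 'BZRBLOB1'

    def read_blob_toc(f, file_size):
        entries = {}
        pos = file_size - GENERAL_SIZE      # start with the tail read
        while True:
            f.seek(pos)
            index_offset, index_size, back_ptr, magic = \
                struct.unpack(GENERAL_FMT, f.read(GENERAL_SIZE))
            if magic != MAGIC:
                raise ValueError('bad or truncated general header')
            f.seek(index_offset)
            # parse_file_headers: assumed helper returning
            # {name: (offset, size, crc)} from the per-file headers
            headers = parse_file_headers(f.read(index_size))
            for name, entry in headers.items():
                entries.setdefault(name, entry)   # newest header wins
            if not back_ptr:
                return entries
            pos = back_ptr

A nice property of appending this way is that an interrupted append
never touches the previous general header, so the old index stays
valid.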

> But I can't vouch for it at all. The nice thing about gzip/bzip2 is they
> are part of python standard library. But if we can get much better
> compression with LZMA, I think it is worth investigating.

We could use any compression algorithm with the scheme outlined above.
 
>>> Most likely, I would like to see us improve the Bundle format, so that
>>> it includes a bzip2 + base64 of all the knit deltas at the bottom, with
>>> the rollup at the top. And then the bzip2 hunk could become a basis for
>>> a blob storage format.
>> 
>> Optimally we could write a "blob serializer" that only includes the
>> deltas.  I suppose this is a small matter of coding :)
>
> Well, at some point you need a full-text. Or your O(N) scaling starts to
> get a really big N.

Either that, or use skip-deltas.  
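
(Skip-deltas roughly as Subversion does them: revision n gets stored
as a delta against the revision whose number is n with its lowest set
bit cleared, so reconstructing any revision walks at most O(log n)
deltas back to a full text.)

    # Skip-delta basis, roughly the Subversion scheme: clear the
    # lowest set bit of the revision number.  E.g. 7 -> 6 -> 4 -> 0,
    # so any revision is reachable from the full text at revision 0
    # in O(log n) steps.
    def skip_delta_basis(n):
        return n & (n - 1)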

Actually, regarding the "blob serializer": I was mostly thinking with
respect to the idea of having the blob format tightly coupled with
knits, and only using blobs as a "network format", with quick fetchers
to and from knits.

~j




