[RFC] Blob format
Johan Rydberg
jrydberg at gnu.org
Wed Jun 28 17:49:32 BST 2006
John Arbash Meinel <john at arbash-meinel.com> writes:
>> But I do not really understand your round-trip vs bandwidth reasoning;
>> if you want to fetch and install a blob into your local repository,
>> you of course go ahead and fetch the whole file at first, before
>> inspecting it. Either directly into the store, or to a temporary
>> directory.
>>
>
> How do you know if anything is missing? If there is a 10MB blob sitting
> there, you may have everything but 2MB of it.
> The idea is that you can read the blob's TOC to see what is missing, and
> then issue another readv() for just those hunks that you need.
Good call.
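To make the TOC idea concrete, here is a minimal sketch (the TOC layout and the names are invented for illustration, not actual bzrlib API): diff the blob's TOC against what is present locally, and batch the remaining hunks into a single readv()-style request.

```python
def missing_hunks(toc, have):
    """Given a blob TOC mapping name -> (offset, size), return the
    (offset, size) spans for entries not yet present locally,
    sorted by offset so they can go into one readv() request."""
    return sorted(span for name, span in toc.items() if name not in have)

# Hypothetical 10MB-ish blob where only 'rev-2' is missing locally:
toc = {'rev-1': (0, 300), 'rev-2': (300, 2000000), 'rev-3': (2000300, 500)}
hunks = missing_hunks(toc, {'rev-1', 'rev-3'})
# hunks is [(300, 2000000)] -- one 2MB hunk instead of the whole blob
```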
>> My current code has two fetchers:
>>
>> 1. a copying fetcher that simply retrieves blobs from the remote
>> repository and installs them as is.
>>
>> 2. a merging fetcher that retrieves remote blobs, combines them
>> into a new blob, and installs that into the blob store.
>>
>> It does this by copying the retrieved blob into a local file
>> before starting to inspect it. The reasoning behind this is
>> that _all_ content of the blob will be inspected, so you may
>> as well retrieve all content to begin with.
>>
>>> Well, if we named everything properly, the final TOC could contain the
>>> index of what revisions are present. Then we wouldn't have to maintain a
>>> separate index.
>>
>> I was thinking of a small index file that contains revision->blob
>> mappings, so you do not have to iterate through all blobs to find
>> the revision you are looking for.
>>
>
> I don't think this index file will end up 'small'. Depending on what you
> want to index. I suppose if it is just a revision_id => blob filename
> mapping, that might work.
That was the original idea, but...
The normal access pattern while doing a pull is to fetch the revision
history file (to determine last_revision) and then to fetch the
ancestry of the remote repository to calculate what revisions need to
be fetched.
Fast access to the ancestry is therefore important, so I might want
to include it in the index.
> One thing to think about, though, is if it is possible to only read part
> of this index file.
> Right now with knit indexes we read the whole thing. But theoretically
> we could just read the last X bytes, and figure out if we need to read
> any more for what we want to do.
> This is mostly for the revision.knit/inventory.knit.
> I think we need to get the branch's last_revision, and then if we don't
> have it locally, we need to read revision.knit and inventory.knit until
> we find it, and then figure out what ancestry we are missing.
> Most times it will be a win to just read the last few K of the kndx
> files, since you will have the rest of the data locally.
>
> I think this is a nice property, and we should try to get the blob index
> to perform similarly.
Absolutely. We would be a lot better off if we had fixed-size file
and revision identifiers, but it is a bit too late to change that. So
we probably need a format with back-pointers to earlier entries in the
index.
last_revision is _most likely_ (unless it is a very old branch in a
shared repository) located at the end of the index, so it should
probably be enough to start out by reading the last N bytes, and then
doubling N each time the entry is not found.
HTTP byte ranges support negative (suffix) offsets. Not sure our
readv() interface does, though.
>>> The other thing to watch out for with a zipfile, though, is that the
>>> proper way to add data is to write over the current TOC with the new
>>> file records, and then to regenerate the TOC when you are done. If
>>> something messes you up in the middle, you've messed up the file. You
>>> could fix it by re-scanning the file for "PK34" headers which indicate
>>> the start of a new record, and hope that none of the compressed data
>>> randomly generates a PK34 entry. Though you could also use the other
>>> integrity fields to disregard this as a real entry.
>>
>> If we make blobs immutable this is a non-problem since you will never
>> open an existing blob for writing. But that lives out signatures; if
>> a user want to sign a revision that is contained in a blob, should we
>> insert the signature into the blob; or should signature be stored in a
>> separate signature store?
>
> s/lives/leaves/ right?
Correct. Sorry.
> Either immutable or append only is fine with me.
Having been thinking about it for a while, we are probably better off
with our own format of some sort. It does not have to be very complex.
The blob starts out with "file" contents, followed by a section of
per-file headers (name, CRC, ...), and finally a general file header.
The general file header can contain a back-pointer to an earlier
general header in the file, to implement append. Example:
offset 0: <file 'foo' content>
offset 300: <file 'bar' content>
offset 600: <header 'foo', offset=0, size=300>
offset 6XX: <header 'bar', offset=300, size=300>
offset 6XX: <general header; index-offset=600>
(though, for the indexes, it might be better to use relative
offsets)
To get an index of the blob, read the last N bytes, verify the general
header, and start reading the per-file headers. If N is not enough, do
another access to get more data.
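A toy serialization of that layout (the exact field encoding is invented for illustration; a real format would also carry per-file CRCs and a magic number in the general header):

```python
import struct
from io import BytesIO

def append_files(blob, files, prev_index=0):
    """Append file contents, then per-file headers, then a 16-byte
    general header holding the index offset and a back-pointer to
    the previous section's index (0 meaning none)."""
    spans = []
    for name, content in files:
        spans.append((name, blob.tell(), len(content)))
        blob.write(content)
    index_offset = blob.tell()
    for name, off, size in spans:
        raw = name.encode('utf-8')
        blob.write(struct.pack('>H', len(raw)) + raw)
        blob.write(struct.pack('>QQ', off, size))
    blob.write(struct.pack('>QQ', index_offset, prev_index))
    return index_offset

def read_index(data):
    """Parse the trailing general header and the per-file headers of
    the most recent section; prev_index lets a caller walk back to
    earlier sections appended before it."""
    index_offset, prev_index = struct.unpack('>QQ', data[-16:])
    entries, pos, end = {}, index_offset, len(data) - 16
    while pos < end:
        (nlen,) = struct.unpack('>H', data[pos:pos + 2])
        name = data[pos + 2:pos + 2 + nlen].decode('utf-8')
        off, size = struct.unpack('>QQ', data[pos + 2 + nlen:pos + 18 + nlen])
        entries[name] = (off, size)
        pos += 18 + nlen
    return entries, prev_index

blob = BytesIO()
append_files(blob, [('foo', b'x' * 300), ('bar', b'y' * 300)])
entries, prev = read_index(blob.getvalue())
# entries == {'foo': (0, 300), 'bar': (300, 300)}, prev == 0
```

Because everything before the trailing general header is never rewritten, appends can only ever lose the newest section on a crash; the old back-pointer chain stays valid.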
> But I can't vouch for it at all. The nice thing about gzip/bzip2 is they
> are part of the Python standard library. But if we can get much better
> compression with LZMA, I think it is worth investigating.
We could use any compression algorithm with the scheme outlined above.
>>> Most likely, I would like to see us improve the Bundle format, so that
>>> it includes a bzip2 + base64 of all the knit deltas at the bottom, with
>>> the rollup at the top. And then the bzip2 hunk could become a basis for
>>> a blob storage format.
>>
>> Ideally, we could write a "blob serializer" that only includes the
>> deltas. I suppose this is a small matter of coding :)
>
> Well, at some point you need a full-text. Or your O(N) scaling starts to
> get a really big N.
Either that, or use skip-deltas.
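For reference, the usual skip-delta trick (as in Subversion's FS layer) picks revision N's delta base by clearing N's lowest set bit, which bounds any reconstruction chain at O(log N) deltas:

```python
def skip_delta_base(n):
    """Delta base for revision n: n with its lowest set bit cleared
    (revision 0 is stored as a full text)."""
    return n & (n - 1)

def chain(n):
    """Revisions touched when reconstructing revision n."""
    out = [n]
    while n:
        n = skip_delta_base(n)
        out.append(n)
    return out

# chain(6) == [6, 4, 0]: two deltas on top of the rev-0 full text.
# Reconstructing revision 1000000 touches only 8 revisions.
```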
Actually, regarding the "blob serializer": I was mostly thinking with
respect to the idea of having the blob format tightly coupled with
knits, and only using blobs as a "network format" with quick fetchers
to and from knits.
~j