[RFC] Blob format
Johan Rydberg
jrydberg at gnu.org
Wed Jun 28 16:30:32 BST 2006
John Arbash Meinel <john at arbash-meinel.com> writes:
> [...]
> And if we should be smart enough to not record a 16k comment string (I
> think that is the max length. IIRC, it uses a 16-bit integer to store
> the length, so it might be 32k or 65k if unsigned)
> We know the average size of a single TOC entry. And I think it would be
> reasonable to make the first request 8k or so. Maybe more, I'm not sure
> what would be optimal for round-trip versus bandwidth.
Sounds like zipfiles are borked. The upside with zipfiles, though, is
that almost everything can read them. No need for special tools.
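For what it is worth, the reason the comment field is a nuisance is
that the end-of-central-directory record has no fixed offset; a
reader has to scan backwards through up to 64k of trailing comment
to find it. A rough sketch of that, using nothing but the stdlib
(take it as an illustration, not tested against odd zipfiles):

    import struct

    EOCD_SIG = b'PK\x05\x06'   # end-of-central-directory signature
    EOCD_SIZE = 22             # fixed portion of the record
    MAX_COMMENT = 65535        # comment length is an unsigned 16-bit field

    def find_central_directory(f):
        # The variable-length comment means the record can sit anywhere
        # in the last ~64k of the file; scan backwards for the signature.
        f.seek(0, 2)
        size = f.tell()
        tail = min(size, EOCD_SIZE + MAX_COMMENT)
        f.seek(size - tail)
        data = f.read(tail)
        pos = data.rfind(EOCD_SIG)
        if pos < 0:
            raise ValueError('no end-of-central-directory record')
        # entry count (2 bytes), directory size (4), directory offset (4)
        entries, cd_size, cd_offset = struct.unpack(
            '<HII', data[pos + 10:pos + 20])
        return entries, cd_size, cd_offset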
But I do not really understand your round-trip vs bandwidth
reasoning; if you want to fetch and install a blob into your local
repository, you of course go ahead and fetch the whole file first,
before inspecting it. Either directly into the store, or into a
temporary directory.
My current code has two fetchers:
 1. a copying fetcher that simply retrieves blobs from the remote
    repository and installs them as-is.

 2. a merging fetcher that retrieves remote blobs, combines them
    into a new blob, and installs that into the blob store
    (sketched below).

    It does this by copying each retrieved blob into a local file
    before starting to inspect it. The reasoning behind this is
    that _all_ content of the blob will be inspected, so you may
    as well retrieve all of it to begin with.
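In sketch form, with made-up names for the transport and store
interfaces (the real code looks different; this is just the shape
of it):

    class CopyingFetcher:
        """Retrieve blobs from a remote repository, install as-is."""

        def __init__(self, remote, store):
            self.remote = remote    # hypothetical transport object
            self.store = store      # hypothetical local blob store

        def fetch(self, blob_names):
            for name in blob_names:
                self.store.add_blob(name, self.remote.get_blob(name))

    class MergingFetcher:
        """Retrieve remote blobs, combine into one new local blob."""

        def __init__(self, remote, store):
            self.remote = remote
            self.store = store

        def fetch(self, blob_names):
            entries = []
            for name in blob_names:
                # copy the whole blob to a local temporary file before
                # inspecting it; all of its content gets read anyway
                path = self.remote.copy_to_temp(name)
                entries.extend(self.store.parse_blob(path))
            self.store.add_new_blob(entries)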
> Well, if we named everything properly, the final TOC could contain the
> index of what revisions are present. Then we wouldn't have to maintain a
> separate index.
I was thinking of a small index file that contains revision->blob
mappings, so you do not have to iterate through all blobs to find
the revision you are looking for.
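Something as simple as a line-oriented text file would do. The
format below is made up, only to show the idea; with it, a lookup
is one dictionary access instead of a scan over every blob:

    def load_blob_index(path):
        # assumed format: one "revision-id blob-name" pair per line
        mapping = {}
        for line in open(path):
            revision_id, blob_name = line.split()
            mapping[revision_id] = blob_name
        return mapping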
> The other thing to watch out for with a zipfile, though, is that the
> proper way to add data is to write over the current TOC with the new
> file records, and then to regenerate the TOC when you are done. If
> something messes you up in the middle, you've messed up the file. You
> could fix it by re-scanning the file for "PK34" headers which indicate
> the start of a new record, and hope that none of the compressed data
> randomly generates a PK34 entry. Though you could also use the other
> integrity fields to disregard this as a real entry.
If we make blobs immutable this is a non-problem, since you will
never open an existing blob for writing. But that leaves out
signatures: if a user wants to sign a revision that is contained in
a blob, should we insert the signature into the blob, or should
signatures be stored in a separate signature store?
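If we go the separate-store route, it could be as small as one file
per signed revision. The layout below is only a guess at what such
a store might look like:

    import os

    class SignatureStore:
        """Keep revision signatures outside the immutable blobs."""

        def __init__(self, base_dir):
            self.base_dir = base_dir

        def _path(self, revision_id):
            # one file per revision; escaping of revision ids elided
            return os.path.join(self.base_dir, revision_id + '.sig')

        def add(self, revision_id, signature):
            with open(self._path(revision_id), 'wb') as f:
                f.write(signature)

        def get(self, revision_id):
            with open(self._path(revision_id), 'rb') as f:
                return f.read()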
> I would be more interested in a 'solid' format like 7z, which compresses
> multiple files per hunk, but keeps an index to the beginning of the
> hunk. With tar.gz you have to seek through all files to find the one you
> want. With 7z, you can bound how many files you have to seek through.
I did not know of 7z. I will look at it when I have a few minutes
to spare. Do you know if there is a Python port/SDK available?
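From your description, though, the "solid" trick itself is easy to
mimic with plain bz2, binding or no binding. A made-up sketch of a
hunked container with an index, just to illustrate the bounded-seek
property:

    import bz2

    def write_solid(out, files, per_hunk=16):
        # compress members in groups ("hunks") and record where each
        # hunk starts, so a read decompresses one hunk, not everything
        index = {}   # name -> (hunk number, offset inside hunk, length)
        hunks = []   # (offset in container, compressed length)
        for start in range(0, len(files), per_hunk):
            raw = b''
            for name, content in files[start:start + per_hunk]:
                index[name] = (len(hunks), len(raw), len(content))
                raw += content
            compressed = bz2.compress(raw)
            hunks.append((out.tell(), len(compressed)))
            out.write(compressed)
        return index, hunks

    def read_member(f, index, hunks, name):
        hunk_no, pos, length = index[name]
        offset, clen = hunks[hunk_no]
        f.seek(offset)
        raw = bz2.decompress(f.read(clen))
        return raw[pos:pos + length]

Reading a member then costs one seek and one hunk's worth of
decompression, which is the bound you mention.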
> Most likely, I would like to see us improve the Bundle format, so that
> it includes a bzip2 + base64 of all the knit deltas at the bottom, with
> the rollup at the top. And then the bzip2 hunk could become a basis for
> a blob storage format.
Ideally we could write a "blob serializer" that only includes the
deltas. I suppose this is a small matter of coding :)
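Assuming the deltas arrive as plain byte strings, the encoding end
could look something like this (the length-prefix framing is my own
invention, not anything the bundle format specifies):

    import base64
    import bz2
    import struct

    def encode_delta_hunk(deltas):
        # length-prefix each delta, bzip2 the lot, and base64 the
        # result so the hunk survives inclusion in a text bundle
        framed = b''.join(struct.pack('>I', len(d)) + d for d in deltas)
        return base64.encodebytes(bz2.compress(framed))

    def decode_delta_hunk(text):
        data = bz2.decompress(base64.decodebytes(text))
        deltas, pos = [], 0
        while pos < len(data):
            (length,) = struct.unpack_from('>I', data, pos)
            deltas.append(data[pos + 4:pos + 4 + length])
            pos += 4 + length
        return deltas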
~j