[RFC] Blob format
John Arbash Meinel
john at arbash-meinel.com
Wed Jun 28 16:41:28 BST 2006
Johan Rydberg wrote:
> John Arbash Meinel <john at arbash-meinel.com> writes:
>
>> [...]
>> And if we should be smart enough to not record a 16k comment string (I
>> think that is the max length. IIRC, it uses a 16-bit integer to store
>> the length, so it might be 32k or 65k if unsigned)
>> We know the average size of a single TOC entry. And I think it would be
>> reasonable to make the first request 8k or so. Maybe more, I'm not sure
>> what would be optimal for round-trip versus bandwidth.
>
> Sounds like zipfiles are borked. The upside with zipfiles, though, is
> that almost everything can read them. No need for special tools.
>
> But I do not really understand your round-trip vs bandwidth reasoning;
> if you want to fetch and install a blob into your local repository,
> you of course go ahead and fetch the whole file at first, before
> inspecting it. Either directly into the store, or to a temporary
> directory.
>
How do you know if anything is missing? If there is a 10MB blob sitting
there, you may have everything but 2MB of it.
The idea is that you can read the blob's TOC to see what is missing, and
then issue another readv() for just those hunks that you need.
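Something like this is what I'm picturing (rough sketch only; the
readv()/stat() calls and parse_toc() are placeholders for whatever
interface we actually end up with):

# Sketch only: transport.readv()/stat() and parse_toc() stand in for
# whatever interface we actually have.
def fetch_missing_hunks(transport, path, local_revisions, first_request=8192):
    # Grab the tail of the blob first; that should cover the TOC.
    blob_size = transport.stat(path).st_size
    toc_start = max(0, blob_size - first_request)
    toc_len = blob_size - toc_start
    [(_, toc_bytes)] = list(transport.readv(path, [(toc_start, toc_len)]))
    toc = parse_toc(toc_bytes)  # placeholder: {revision_id: (offset, length)}

    # Second readv() for just the hunks we don't already have locally.
    wanted = [toc[rev_id] for rev_id in toc if rev_id not in local_revisions]
    return transport.readv(path, wanted)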
> My current code has two fetchers:
>
> 1. a copying fetcher that simply retrieves blobs from the remote
> repository and installs them as is.
>
> 2. a merging fetcher that retrieves remote blobs and combine them
> into a new blob, and install that into the blob store.
>
> It does this by copying the retrieved blob into a local file
> before starting to inspect it. The reasoning behind this is
> that _all_ content of the blob will be inspected, so you may
> as well retrieve all content to begin with.
>
>> Well, if we named everything properly, the final TOC could contain the
>> index of what revisions are present. Then we wouldn't have to maintain a
>> separate index.
>
> I was thinking of a small index file that contains revision->blob
> mappings, so you do not have to iterate through all blobs to find
> the revision you are looking for.
>
I don't think this index file will end up 'small', depending on what you
want to index. I suppose if it is just a revision_id => blob filename
mapping, that might work.
One thing to think about, though, is whether it is possible to read only
part of this index file.
Right now with knit indexes we read the whole thing. But theoretically
we could just read the last X bytes, and figure out if we need to read
any more for what we want to do.
This is mostly for the revision.knit/inventory.knit.
I think we need to get the branch's last_revision, and then if we don't
have it locally, we need to read revision.knit and inventory.knit until
we find it, and then figure out what ancestry we are missing.
Most times it will be a win to just read the last few K of the kndx
files, since you will have the rest of the data locally.
I think this is a nice property, and we should try to get the blob index
to perform similarly.
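To make that concrete, here is the sort of thing I mean, assuming a
trivial append-only index of 'revision_id blob_name' lines (the format
is just made up for illustration):

def find_blob_for_revision(index_file, revision_id, tail_bytes=8192):
    # Sketch only: read the last tail_bytes first, then double the window
    # until we find the revision or have read back to the start of the file.
    index_file.seek(0, 2)
    size = index_file.tell()
    window = tail_bytes
    while True:
        start = max(0, size - window)
        index_file.seek(start)
        data = index_file.read(size - start)
        if start != 0:
            # Throw away the first (possibly partial) line.
            data = data.split('\n', 1)[-1]
        for line in data.splitlines():
            if not line:
                continue
            rev_id, blob_name = line.split(' ', 1)
            if rev_id == revision_id:
                return blob_name
        if start == 0:
            return None
        window *= 2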
>> The other thing to watch out for with a zipfile, though, is that the
>> proper way to add data is to write over the current TOC with the new
>> file records, and then to regenerate the TOC when you are done. If
>> something messes you up in the middle, you've messed up the file. You
>> could fix it by re-scanning the file for "PK34" headers which indicate
>> the start of a new record, and hope that none of the compressed data
>> randomly generates a PK34 entry. Though you could also use the other
>> integrity fields to disregard this as a real entry.
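The rescan I'm describing would look something like this (untested
sketch, just to show what I mean by using the other integrity fields):

import struct

LOCAL_HEADER = 'PK\x03\x04'

def scan_for_records(data):
    # Untested sketch: walk the raw file looking for local file header
    # signatures, and use the header fields to sanity-check each hit.
    records = []
    pos = data.find(LOCAL_HEADER)
    while pos != -1:
        header = data[pos:pos + 30]
        if len(header) < 30:
            break
        (sig, version, flags, method, mtime, mdate, crc,
         csize, usize, name_len, extra_len) = struct.unpack('<4s5H3L2H', header)
        name = data[pos + 30:pos + 30 + name_len]
        # A false hit inside compressed data will usually have garbage
        # sizes/CRC; real records can be verified against those fields.
        records.append((pos, name, csize, crc))
        pos = data.find(LOCAL_HEADER, pos + 4)
    return records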
>
> If we make blobs immutable this is a non-problem since you will never
> open an existing blob for writing. But that lives out signatures; if
> a user wants to sign a revision that is contained in a blob, should we
> insert the signature into the blob; or should signatures be stored in a
> separate signature store?
s/lives/leaves/ right?
Either immutable or append only is fine with me.
>
>> I would be more interested in a 'solid' format like 7z, which compresses
>> multiple files per hunk, but keeps an index to the beginning of the
>> hunk. With tar.gz you have to seek through all files to find the one you
>> want. With 7z, you can bound how many files you have to seek through.
>
> I did not know of 7z. I will look at it when I get a few minutes to spare.
> Do you know if there is a python port/SDK available?
I'm not sure. The website is http://www.7zip.org. There are some Linux
tools, though the site is more focused on win32.
Unfortunately, the Linux tools have one or two incompatible formats. I
believe it is because the original 7z format doesn't support streaming
(cat foo | gzip style). So one person implemented a streamable format,
but it isn't compatible with the other formats.
Most of the executable builders are switching to it. It gets better
compression than bzip2, and extracts faster. (Compression is quite a bit
slower.) My understanding is that LZMA is basically the gzip algorithm,
with a different lookback algorithm (to give a large effective window),
a large dictionary size, and maybe arithmetic encoding instead of Huffman.
Google turned up:
http://www.joachim-bauch.de/projects/python/pylzma/
But I can't vouch for it at all. The nice thing about gzip/bzip2 is they
are part of the Python standard library. But if we can get much better
compression with LZMA, I think it is worth investigating.
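If someone wants to play with it, a quick comparison is easy enough
(bz2/zlib are stdlib; pylzma's compress() is what its page advertises,
I haven't verified it):

import bz2, zlib

def compression_sizes(data):
    # Compare compressed sizes for one chunk of data.
    sizes = {'raw': len(data),
             'gzip(zlib)': len(zlib.compress(data, 9)),
             'bzip2': len(bz2.compress(data, 9))}
    try:
        import pylzma
        sizes['lzma'] = len(pylzma.compress(data))  # API assumed, not verified
    except ImportError:
        pass
    return sizes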
>
>> Most likely, I would like to see us improve the Bundle format, so that
>> it includes a bzip2 + base64 of all the knit deltas at the bottom, with
>> the rollup at the top. And then the bzip2 hunk could become a basis for
>> a blob storage format.
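Roughly what I have in mind there (sketch only; the hunk markers are
invented):

import bz2, base64

def write_bundle(out, rollup_text, knit_delta_bytes):
    # Human-readable rollup first, then one bzip2+base64 hunk of raw deltas.
    out.write(rollup_text)
    out.write('# BEGIN BZIP2 DELTA HUNK\n')   # marker names are made up
    out.write(base64.encodestring(bz2.compress(knit_delta_bytes)))
    out.write('# END BZIP2 DELTA HUNK\n')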
>
> Optimally we could write a "blob serializer" that only includes the
> deltas. I suppose this is a small matter of coding :)
>
> ~j
Well, at some point you need a full-text, or your O(N) scaling starts to
get a really big N.
Bundles get away with it because they expect you to have a base revision
on the other side.
Blob storage doesn't have that luxury.
And anyone who tried the Arch mirror of the kernel could probably tell
you that building 10k deltas against an initial revision is *really*
painful.
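Put another way, with a pure delta chain you pay for every delta back to
the base (toy illustration):

def extract(base_text, deltas, apply_delta):
    # Extracting revision N means applying N deltas in a row: with 10k
    # deltas against one initial revision, that's 10k passes.
    text = base_text
    for delta in deltas:
        text = apply_delta(text, delta)
    return text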
John
=:->