[RFC] Blob format

John Arbash Meinel john at arbash-meinel.com
Wed Jun 28 18:19:44 BST 2006



Johan Rydberg wrote:
> John Arbash Meinel <john at arbash-meinel.com> writes:
> 
>>> But I do not really understand your round-trip vs bandwidth reasoning;
>>> if you want to fetch and install a blob into your local repository,
>>> you of course go ahead and fetch the whole file first, before
>>> inspecting it.  Either directly into the store, or to a temporary
>>> directory.
>>>
>> How do you know if anything is missing? If there is a 10MB blob sitting
>> there, you may have everything but 2MB of it.
>> The idea is that you can read the blob's TOC to see what is missing, and
>> then issue another readv() for just those hunks that you need.
> 
> Good call.
> 
>>> My current code has two fetchers:
>>>
>>>  1. a copying fetcher that simply retrieves blobs from the remote
>>>     repository and installs them as is.
>>>
>>>  2. a merging fetcher that retrieves remote blobs, combines them
>>>     into a new blob, and installs that into the blob store.
>>>
>>>     It does this by copying the retrieved blob into a local file
>>>     before starting to inspect it.  The reasoning behind this is
>>>     that _all_ content of the blob will be inspected, so you may
>>>     as well retrieve all content to begin with.
>>>
>>>> Well, if we named everything properly, the final TOC could contain the
>>>> index of what revisions are present. Then we wouldn't have to maintain a
>>>> separate index.
>>> I was thinking of a small index file that contains revision->blob
>>> mappings, so you do not have to iterate through all blobs to find
>>> the revision you are looking for.
>>>
>> I don't think this index file will end up 'small'; it depends on what
>> you want to index. I suppose if it is just a revision_id => blob
>> filename mapping, that might work.
> 
> That was the original idea, but ..
> 
> The normal access pattern while doing a pull is to fetch the revision
> history file (to determine last_revision) and then to fetch the
> ancestry of the remote repository to calculate what revisions need to
> be fetched.  
> 
> So fast access to the ancestry is important.  Therefore I might want
> to include it in the index.
> 
>> One thing to think about, though, is whether it is possible to read
>> only part of this index file.
>> Right now with knit indexes we read the whole thing. But theoretically
>> we could just read the last X bytes, and figure out if we need to read
>> any more for what we want to do.
>> This is mostly for the revision.knit/inventory.knit.
>> I think we need to get the branch's last_revision, and then if we don't
>> have it locally, we need to read revision.knit and inventory.knit until
>> we find it, and then figure out what ancestry we are missing.
>> Most times it will be a win to just read the last few K of the kndx
>> files, since you will have the rest of the data locally.
>>
>> I think this is a nice property, and we should try to get the blob index
>> to perform similarly.
> 
> Absolutely.  We would be a lot better off if we had fixed-size file
> and revision identifiers, but it is a bit too late to change that.  So
> we probably need a format with back-pointers to earlier entries in the
> index.

That is possible, but realize it is much better to read 10k all at once,
rather than reading 200 bytes, seeking back 200 bytes, reading another
200 bytes, etc.

So while backpointers are good, often just knowing that the entry is
earlier than the current record is enough: you jump back and read
another 10k.

I don't know exactly where the tradeoff lies, but for local disk access
the granularity is at the very least a disk block, and more likely an
entire stripe. So you are talking 4k -> 100+k.
hg revlog files actually contain the data inside the index when the
whole file is < 128k. So it is potentially a single disk read to get
everything, rather than jumping around a lot.

For remote access to an HTTP file, it depends on latency versus
bandwidth, but my bandwidth is 160KB/s, and my latency to bazaar-vcs.org
bounces between 100ms and 400ms.
So at 100ms, I can download 16K, and at 400ms I can download 65K.

So unless your back pointer points further back than 16KB, it is better
to just leave it out and read a bigger chunk.

I think mine is a moderate-bandwidth system, and we could definitely
tune round-trip algorithms for around 20-30KB per read and do quite well.
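
To put rough numbers on that tradeoff, here is a back-of-the-envelope
sketch in Python. The bandwidth and latency figures are the ones above,
i.e. just my connection, not anything universal:

    # Rough break-even: a back-pointer that jumps less than
    # latency * bandwidth bytes costs more in round-trip time than
    # simply reading a bigger chunk up front.
    BANDWIDTH = 160 * 1024                  # bytes/second (160KB/s)

    for latency in (0.100, 0.400):          # seconds
        break_even = int(BANDWIDTH * latency)
        print("latency %3dms -> read at least %5d bytes per request"
              % (int(latency * 1000), break_even))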

> 
> last_revision is _most likely_ (unless it is a very old branch in a
> shared repository) located at the end of the index, so it should
> probably be enough to start out by reading -N bytes, and then
> doubling N each time the entry is not found.
> 
> HTTP byte ranges support negative offsets.  Not sure our readv()
> interface does, though.

Unfortunately, it doesn't. But I think we should make it. file.seek()
can do negative offsets, and I believe sftp can as well (I'm not
positive about that one, though).

And since we want to support reading just the end of index files,
readv() should be changed. You have my blessing to do so :)

I filed:
https://launchpad.net/products/bzr/+bug/51218

to make sure we don't forget.
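
For reference, this is roughly the sort of thing I have in mind. It is
only a sketch against a plain file object (read_tail and its parameters
are made-up names); the transport-level readv() change is what the bug
above is about:

    import os

    def read_tail(f, wanted, initial=8 * 1024):
        """Read the end of an index file, doubling the amount read
        until `wanted` (e.g. a revision id) is found or the whole
        file has been read."""
        f.seek(0, os.SEEK_END)
        size = f.tell()
        amount = initial
        while True:
            amount = min(amount, size)
            f.seek(-amount, os.SEEK_END)   # negative offset from the end
            data = f.read(amount)
            if wanted in data or amount == size:
                return data
            amount *= 2                    # not found yet: double and retry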

> 
>>>> The other thing to watch out for with a zipfile, though, is that the
>>>> proper way to add data is to write over the current TOC with the new
>>>> file records, and then to regenerate the TOC when you are done. If
>>>> something interrupts you in the middle, you've corrupted the file. You
>>>> could fix it by re-scanning the file for "PK34" headers, which indicate
>>>> the start of a new record, and hope that none of the compressed data
>>>> randomly generates a PK34 sequence. Though you could also use the other
>>>> integrity fields to rule out such a match as a real entry.
>>> If we make blobs immutable this is a non-problem since you will never
>>> open an existing blob for writing.  But that lives out signatures; if
>>> a user wants to sign a revision that is contained in a blob, should we
>>> insert the signature into the blob, or should signatures be stored in a
>>> separate signature store?
>> s/lives/leaves/ right?
> 
> Correct.  Sorry.
> 
>> Either immutable or append only is fine with me.
> 
> Having thought about it for a while, we are probably better off with
> our own format of some sort.  It does not have to be very complex.
> 
> The blob starts out with "file" contents; after that follows a section
> of per-file headers (name, crc, ...), and lastly a general file header.
> The general file header can contain a back-pointer to an earlier
> general header in the file, to implement append.  Example:
> 
>     offset 0:    <file 'foo' content>
>     offset 300:  <file 'bar' content>
>     offset 600:  <header 'foo', offset=0, size=300>
>     offset 6XX:  <header 'bar', offset=300, size=300>
>     offset 6XX:  <general header; index-offset=600>
> 
>     (though, for the indexes, it might be better to use relative
>      offsets)
> 
> To get an index of the blob, read -N bytes, verify the general header,
> and start reading the per-file headers.  If -N is not enough, do another
> access to get more data.

Sure. Just keep in mind what I mentioned: sometimes repeating yourself
is a large net gain, versus having to seek and read again.
If you prefer, we could do:

offset 6XX: <general headers: index-offset=300,600,800,900> etc.
So you have one 'read -N bytes' to get the first set of indexes, and
then a single readv() to get the rest.
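
To make that concrete, here is a rough sketch of the reading side. The
trailer marker, the function name, and the offset:length encoding are
all invented for illustration; recording a length next to each offset
is a small deviation from your example, but it makes the readv() ranges
exact:

    def parse_general_header(tail_bytes):
        """Pull (offset, length) pairs for every index section out of
        the last general header found in the tail of the blob.

        Sketch only: assumes a trailer line like
            BLOB-INDEXES: 600:80,900:120
        listing every per-file header section ever written, so one
        tail read plus one readv() recovers the whole TOC."""
        marker = b"BLOB-INDEXES: "
        pos = tail_bytes.rindex(marker)
        line = tail_bytes[pos + len(marker):].split(b"\n", 1)[0]
        pairs = [entry.split(b":") for entry in line.split(b",")]
        return [(int(off), int(length)) for off, length in pairs]

    # e.g. (assuming a transport-level readv() such as bzrlib's):
    # for offset, data in transport.readv('foo.blob',
    #                                     parse_general_header(tail)):
    #     ... parse the per-file headers in `data` ...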

> 
>> But I can't vouch for it at all. The nice thing about gzip/bzip2 is
>> that they are part of the Python standard library. But if we can get
>> much better compression with LZMA, I think it is worth investigating.
> 
> We could use any compression algorithm with the scheme outlined above.
>  
>>>> Most likely, I would like to see us improve the Bundle format, so that
>>>> it includes a bzip2 + base64 of all the knit deltas at the bottom, with
>>>> the rollup at the top. And then the bzip2 hunk could become a basis for
>>>> a blob storage format.
>>> Optimally we could write a "blob serializer" that only includes the
>>> deltas.  I suppose this is a small matter of coding :)
>> Well, at some point you need a full-text. Or your O(N) scaling starts to
>> get a really big N.
> 
> Either that, or use skip-deltas.  
> 
> Actually, regarding the "blob serializer": I was mostly thinking with
> respect to the idea of having the blob format tightly coupled with
> knits, and only using blobs as a "network format" with quick fetchers
> to and from knits.
> 
> ~j

Well, I don't know that the current knit implementation is going to be
the be-all and end-all. It would be nice if knits weren't quite so
line-oriented (more byte-oriented). And we might look into doing
hg-style 'index mixed with data' for small files.
I *think* that is a big win for kernel-sized trees, where you have lots
of small files that don't change much.
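
Something like the following is what I mean by 'index mixed with data'.
It is a sketch of the general layout only, not of hg's actual revlog
code; the struct format and the flag are invented:

    import struct

    # Hypothetical fixed-size index entry: (offset, length, flags).
    # If the INLINE flag is set, the record's data immediately follows
    # the entry in the same file, so one sequential read returns both
    # the index and the data; otherwise offset/length point into a
    # separate data file.
    ENTRY = struct.Struct(">QLL")
    INLINE = 1

    def read_entries(f):
        """Yield (offset, length, data-or-None) from a combined file."""
        while True:
            header = f.read(ENTRY.size)
            if len(header) < ENTRY.size:
                return
            offset, length, flags = ENTRY.unpack(header)
            data = f.read(length) if flags & INLINE else None
            yield offset, length, data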

A simple du -ka | sort -n tells me we have ~1000 files < 10k out of 1500
files.

That includes directories that group up files, so it isn't super
accurate, but it does tell you a large portion would fit very cheaply.
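
For a slightly more accurate count that skips directories, something
like this bit of Python would do (10k is just the cut-off I used above):

    import os

    def count_small_files(root, cutoff=10 * 1024):
        """Count regular files under `root` smaller than `cutoff` bytes."""
        small = total = 0
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                total += 1
                if os.path.getsize(os.path.join(dirpath, name)) < cutoff:
                    small += 1
        return small, total

    # e.g. count_small_files('.') at the root of a bzr.dev tree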

John
=:->