handling large files, draft 2

Aaron Bentley aaron at aaronbentley.com
Sun Nov 2 18:38:06 GMT 2008


Robert Collins wrote:
> I wrote a while back about handling very large files via fragmentation.
> 
> There seem to be two rough approaches on the table:
> 
>  1) Teach versioned file to break a large file into smaller chunks. This
>     would presumably involve having some facility in its index to 
>     put its chunks into.
>  2) Have the users of the versioned file objects know that a given file
>     has been fragmented. This needs some location for the users to go to
>     to request fragments.

Do we have to choose?  It seems easy enough to implement 1) as a
compatibility interface on top of 2).  That would let us optimize the
important cases without overcomplicating the others.
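As a sketch of what that layering could look like (all class and method
names here are hypothetical, not existing bzrlib API): a fragment store
that exposes per-fragment keys, with a thin whole-file shim joined on top:

```python
class FragmentStore:
    """Hypothetical approach-2 store: a file's text is kept as
    separately keyed fragments, with an index mapping the file key
    to its ordered list of fragment keys."""

    def __init__(self):
        self._fragments = {}  # fragment key -> bytes
        self._index = {}      # file key -> list of fragment keys

    def add_file(self, file_key, data, fragment_size=4096):
        keys = []
        for offset in range(0, len(data), fragment_size):
            frag_key = (file_key, offset)
            self._fragments[frag_key] = data[offset:offset + fragment_size]
            keys.append(frag_key)
        self._index[file_key] = keys

    def fragment_keys(self, file_key):
        return self._index[file_key]

    def get_fragment(self, frag_key):
        return self._fragments[frag_key]


class WholeFileView:
    """Hypothetical approach-1 compatibility shim: callers that want
    the whole text never see the fragmentation underneath."""

    def __init__(self, store):
        self._store = store

    def get_text(self, file_key):
        return b''.join(self._store.get_fragment(key)
                        for key in self._store.fragment_keys(file_key))
```

Fetch and diff could then work against the fragment keys directly, while
existing callers keep using the whole-file interface.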

> Whatever we choose, I'd suggest that it apply to all files - not just to
> 'big ones' - so that as a file grows and shrinks it stays consistent.

Fully agreed.

> * commit performance: with proposal 1) it seems like we might want to 
>   increase the requirements of the file objects, to allow optimising 
>   around the internal facilities. 

Could you give some examples?

> * fetching: We don't really want to copy a full iso when part of an iso 
>   changes; as fetch works on object keys, I think we really want to 
>   expose the keys for the individual fragments. To me that's a factor
>   towards choosing 2)

I find that a little confusing.  When an ISO changes, we should only
need to copy the delta anyway.  Perhaps deltas should span fragments?

It's easy enough to imagine having one groupcompress "delta" for each
ISO fragment, but using the whole ISO image as the corpus.  It's also
easy to imagine generating multiple ISO "fragments" from a single
groupcompress "delta".

This approach also retains locality of reference, which I think you
haven't mentioned elsewhere.
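To make the fetch argument concrete: if fragments are content-addressed,
unchanged fragments keep the same key across versions, and fetch only
needs to copy the new ones.  A toy sketch (fixed-offset fragmentation is
used purely for illustration; a real scheme would want content-defined
boundaries so an insertion doesn't shift every subsequent fragment):

```python
import hashlib

FRAGMENT_SIZE = 4  # tiny, for illustration; real fragments would be far larger


def fragment_keys(data):
    """Content-address each fragment: an unchanged fragment hashes to
    the same key, so a fetch can skip it entirely."""
    return [hashlib.sha1(data[i:i + FRAGMENT_SIZE]).hexdigest()
            for i in range(0, len(data), FRAGMENT_SIZE)]


old = fragment_keys(b'abcdefghijkl')
new = fragment_keys(b'abcdXfghijkl')  # one byte changed, middle fragment only
# only the fragment containing the change needs copying
to_fetch = [key for key in new if key not in set(old)]
```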

> * diff: - for merge, annotate operations, being able to identify 
>   common regions (because the fragments haven't changed) could be very
>   useful.
> 
> There are a couple of large outstanding questions for me:
> 
>  - do we need a canonical form (like we have for split inventories). 
>    Now the split inventory case says that given an in memory inventory
>    I, its disk form is always D. For a large file, we'd say that there
>    is always some canonical layout on disk.

I'm inclined to think yes: the fragments are intended to reassemble
into a single file, and that file is the canonical form.

>  - Do we want to cache (whether in the same data structure, or a
>    derived-after-fetch-index) region mapping data - e.g. rsync
>    signatures.

If we use the whole corpus as input for deltas, this would just be a
hint to the compressor, and perhaps not useful.  If the fragmentation is
deeper, this might make a lot of sense.
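For the deeper-fragmentation case, the cached region-mapping data could
look like rsync's per-block signatures: a cheap weak checksum to find
candidate matches, plus a strong hash to confirm them.  A sketch only;
adler32 stands in here as an assumption for rsync's own rolling checksum:

```python
import hashlib
import zlib

BLOCK_SIZE = 4  # illustration only; rsync defaults are much larger


def block_signature(data):
    """One (weak, strong) checksum pair per block, rsync-style: the
    weak sum is cheap enough to scan with, and the strong sum confirms
    that a candidate block really matches."""
    return [(zlib.adler32(data[i:i + BLOCK_SIZE]),
             hashlib.md5(data[i:i + BLOCK_SIZE]).hexdigest())
            for i in range(0, len(data), BLOCK_SIZE)]
```

Caching such signatures per fragment would let merge and annotate find
common regions without re-reading the full texts.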

Aaron



More information about the bazaar mailing list