handling large files, draft 2

Robert Collins robertc at robertcollins.net
Thu Oct 30 01:37:41 GMT 2008


I wrote a while back about handling very large files via fragmentation.

There seem to be two rough approaches on the table:

 1) Teach the versioned file layer to break a large file into smaller
    chunks. This would presumably involve having some facility in its
    index to record the chunks in.
 2) Have the users of the versioned file objects know that a given file
    has been fragmented. This needs some location that users can go to
    in order to request fragments (both shapes are sketched below).
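
To make the contrast concrete, here's a rough Python sketch of the two
shapes. All the names here are invented for illustration - none of
this is existing bzrlib API, and the fixed-size chunking is just a
placeholder:

    class InternalChunkingStore(object):
        """Proposal 1: the store fragments texts internally;
        callers only ever see whole texts."""

        def __init__(self, chunk_size=4096):
            self.chunk_size = chunk_size
            self.index = {}  # text key -> ordered list of fragment keys
            self.blobs = {}  # fragment key -> fragment bytes

        def add(self, key, text):
            fragment_keys = []
            for i in range(0, len(text), self.chunk_size):
                fragment_key = (key, len(fragment_keys))
                self.blobs[fragment_key] = text[i:i + self.chunk_size]
                fragment_keys.append(fragment_key)
            self.index[key] = fragment_keys

        def get(self, key):
            # Reassembly happens here, invisibly to the caller.
            return b''.join(self.blobs[f] for f in self.index[key])

    class ExplicitFragmentStore(InternalChunkingStore):
        """Proposal 2: the same storage, but callers can discover
        the fragments and request them individually."""

        def fragment_keys(self, key):
            # The 'location users go to' when requesting fragments.
            return list(self.index[key])

The storage underneath is identical; the proposals differ only in what
the API exposes.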

Choosing between these might be a matter of taste, unless we can get
some concrete benefits from one approach or another. Both involve a tree
structure under the hood.
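
And for scale, the fragment list itself probably becomes a tree, so
that a huge file doesn't need one enormous index record. A minimal
sketch of that shared underlying shape, again with invented names:

    def reassemble(key, nodes, blobs):
        """Yield a file's content in order from a fragment tree.

        blobs maps leaf keys to fragment bytes; nodes maps internal
        keys to ordered lists of child keys.
        """
        if key in blobs:
            yield blobs[key]
        else:
            for child_key in nodes[key]:
                for chunk in reassemble(child_key, nodes, blobs):
                    yield chunk

    # A three-fragment file stored as a two-level tree:
    blobs = {'f1': b'aaa', 'f2': b'bbb', 'f3': b'ccc'}
    nodes = {'root': ['inner', 'f3'], 'inner': ['f1', 'f2']}
    assert b''.join(reassemble('root', nodes, blobs)) == b'aaabbbccc'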

Whatever we choose, I'd suggest that it apply to all files - not just to
'big ones' - so that a file is handled consistently as it grows and
shrinks.

So things to consider then:
* hash cache validator: If we have a tree we can take the hash of the
  root of the tree. This can be better than hashing the content because
  a change to part of the content doesn't require reading the entire
  thing. OTOH having the full content hash can be useful for other
  tools, I guess. It can also be cheaper to have only one record for
  the content rather than needing both a tree pointer and a separate
  hash. Proposal 1) would need a new API to access the tree's hash;
  proposal 2) makes it explicit, but would need extra effort to get a
  plain hash of the contents. On the gripping hand, as the tree hash
  will depend on the exact tree structure, this probably precludes
  letting the stored layout depend on existing tree data - otherwise
  the validator wouldn't be reproducible. (There's a sketch of this
  after the list.)
* commit performance: with proposal 1) it seems like we might want to
  increase the requirements on the file objects, to allow optimising
  around the internal facilities. With 2), commit itself would know
  how to drive the changed structures.
* fetching: We don't really want to copy a full ISO when only part of
  it changes; as fetch works on object keys, I think we really want to
  expose the keys for the individual fragments. To me that's a factor
  in favour of choosing 2), but perhaps it's just a layering issue -
  we could have most of the code work with approach 1), while fetch
  and other repository-level operations deal with 2). Total transfer
  will be a factor too - currently we have a minimal diff to transfer
  (at the cost of having no way to work with part of the file), but
  with a fragmented approach we'll transfer all the fragments not in
  common - more on that below.
* diff: for merge and annotate operations, being able to identify
  common regions (because the fragments haven't changed) could be very
  useful.
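
To sketch how the validator and the fetch/diff points interact (sha1
here is just a stand-in for whichever hash we'd actually pick, and the
tree is flattened to one level for brevity):

    from hashlib import sha1

    def fragment_hashes(fragments):
        # Per-fragment hashes; these could double as the fragment keys
        # that fetch operates on.
        return [sha1(f).hexdigest() for f in fragments]

    def root_validator(frag_hashes):
        # The 'hash of the root of the tree': changing one fragment
        # means rehashing that fragment and this one small record,
        # not re-reading the whole content.
        return sha1(''.join(frag_hashes).encode('ascii')).hexdigest()

    def common_fragments(old_hashes, new_hashes):
        # Fragments in common between two versions: skip these when
        # fetching, and treat the regions they cover as known-equal
        # for diff/annotate/merge.
        return set(old_hashes) & set(new_hashes)

    old = fragment_hashes([b'part one', b'part two', b'part three'])
    new = fragment_hashes([b'part one', b'CHANGED!', b'part three'])
    assert root_validator(old) != root_validator(new)
    assert len(common_fragments(old, new)) == 2  # one fragment to send

Note that the validator changes whenever the fragment boundaries
change, even if the bytes don't - which is exactly why the canonical
form question below matters.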

There are a couple of large outstanding questions for me:

 - Do we need a canonical form (like we have for split inventories)?
   The split inventory case says that given an in-memory inventory I,
   its disk form is always D. For a large file, we'd likewise say that
   there is always some canonical layout on disk.

   I'm leaning towards 'yes' for this, as it seems to add a nice
   robustness to the system; it keeps the focus on snapshots. More
   thought needed though.
   
 - Do we want to cache region-mapping data - e.g. rsync signatures -
   whether in the same data structure or in an index derived after
   fetch?

   I think we can likely improve our sequence matching using ideas
   from rsync, and that this data would be cacheable (there's a sketch
   after this list).

 - When two files have quite different fragment lists, how do we
   fetch a minimal amount of data to construct the new file, given
   that we have one of the files and want the other?
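
On the rsync-signature idea, here is a sketch of rsync's weak rolling
checksum (the a/b pair from the rsync paper; the block size and the
packing of the result are illustrative, and data is assumed to be a
bytes object):

    def weak_checksums(data, block=2048):
        """Yield the weak checksum of every block-byte window.

        Sliding the window one byte costs O(1), which is what makes
        scanning a whole file for blocks the other side already has
        practical.
        """
        if len(data) < block:
            return
        a = sum(data[:block])
        b = sum((block - i) * data[i] for i in range(block))
        yield ((b & 0xFFFF) << 16) | (a & 0xFFFF)
        for k in range(len(data) - block):
            a = a - data[k] + data[k + block]
            b = b - block * data[k] + a
            yield ((b & 0xFFFF) << 16) | (a & 0xFFFF)

Caching one weak checksum (plus a strong hash) per stored fragment
would be the cacheable signature data; and for the last question, the
fetcher could send the signatures of the fragments it already has and
receive only the regions that don't match - rsync's trick exactly.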

-Rob

--
GPG key available at: <http://www.robertcollins.net/keys.txt>.