handling large files, draft 2

Robert Collins robertc at robertcollins.net
Sun Nov 2 21:20:28 GMT 2008


On Sun, 2008-11-02 at 13:38 -0500, Aaron Bentley wrote:
> 
> Robert Collins wrote:
> > I wrote a while back about handling very large files via fragmentation.
> > 
> > There seem to be two rough approaches on the table:
> > 
> >  1) Teach versioned file to break a large file into smaller chunks. This
> >     would presumably invoke having some facility in its index to 
> >     put its chunks into.
> >  2) Have the users of the versioned file objects know that a given file
> >     has been fragmented. This needs some location for the users to go to
> >     to request fragments.
> 
> Do we have to choose?  It seems like it would be easy enough to
> implement 1) as a compatibility interface on top of 2).  That would
> allow us to optimize important cases without overcomplicating other cases.

Indeed, perhaps we don't have to choose; we could just expose both
layers tastefully.

> > * commit performance: with proposal 1) it seems like we might want to 
> >   increase the requirements of the file objects, to allow optimising 
> >   around the internal facilities. 
> 
> Could you give some examples?

Currently the file objects used by commit just need a single forward
read of the entire file in binary mode; we don't need seeking. If we
were to use e.g. rsync logic to generate data during the commit, we'd
want to read the file twice (once to generate the recipe, once to copy
the identified missing data). So it's not a huge thing to ask for.
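
To make that concrete, the two passes would look roughly like this
(purely a sketch: fixed-size blocks stand in for real rsync rolling
checksums, and the set of checksums the target already has is just
handed in):

    import hashlib

    BLOCK_SIZE = 64 * 1024

    def generate_recipe(path):
        # First pass: forward-read the file and record a checksum per
        # block.  A real implementation would use rsync-style rolling
        # checksums; plain block hashes just show the shape.
        recipe = []
        with open(path, 'rb') as f:
            offset = 0
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                recipe.append(
                    (offset, len(block), hashlib.sha1(block).hexdigest()))
                offset += len(block)
        return recipe

    def copy_missing_blocks(path, recipe, have_checksums):
        # Second pass: re-read the file and emit only the blocks whose
        # checksums the target does not already have.
        with open(path, 'rb') as f:
            for offset, length, checksum in recipe:
                if checksum not in have_checksums:
                    f.seek(offset)
                    yield checksum, f.read(length)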

> > * fetching: We don't really want to copy a full iso when part of an iso 
> >   changes; as fetch works on object keys, I think we really want to 
> >   expose the keys for the individual fragments. To me thats a factor
> >   towards choosing 2)
> 
> I find that a little confusing.  When an ISO changes, we should only
> need to copy the delta anyway.  Perhaps deltas should span fragments?

I think there are several threads here.
Firstly, if we requested an ISO revision via a single top-level key, the
sender would not know whether we needed the full thing or some subset of
its fragments, unless the top-level object and all the fragments carry
their own delta fields that are guaranteed consistent with ancestry, as
knits are today. I think that poses some challenges.
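
For concreteness, the fragment-key alternative I have in mind is the
receiver asking the source for the fragment keys behind a top-level key
and then requesting only the ones it lacks - a sketch with invented
names, not an existing fetch API:

    def fragments_to_fetch(top_level_key, get_fragment_keys, have_keys):
        # get_fragment_keys: source-side callable mapping a top-level
        # key to the keys of its fragments (hypothetical).
        # have_keys: set of fragment keys already present locally.
        return [key for key in get_fragment_keys(top_level_key)
                if key not in have_keys]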

Secondly, one thing that git does for its packs which I quite like
(because of the robustness it gives) is that their deltas only ever
reference content within the same pack. This means you can block-copy a
pack and be sure it's all reconstructable. So I'm imagining how we can
achieve this (note that groupcompress does this intrinsically) without
sacrificing performance. Because groupcompress does this, we would end
up copying the full content of every fragment not in common with
whatever ancestor we consider the top-level key to have. And if the
alteration to the ISO is (say) 'remove the first byte and shift left',
then it's very easy for every single fragment to be altered; a
groupcompress run across both ISOs would compress fantastically, but a
run across one would just be the full size.
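
A toy demonstration of that failure mode with fixed-size fragments
(sketch only, with tiny sizes so it runs quickly):

    import hashlib
    import os

    FRAGMENT_SIZE = 4 * 1024

    def fragment_keys(data):
        # Key each fixed-size fragment by its SHA-1.
        return [hashlib.sha1(data[i:i + FRAGMENT_SIZE]).hexdigest()
                for i in range(0, len(data), FRAGMENT_SIZE)]

    old_iso = os.urandom(64 * 1024)  # stand-in for the original image
    new_iso = old_iso[1:]            # 'remove the first byte, shift left'

    # Every fragment boundary has moved, so the two versions share no
    # fragment keys at all, even though almost all the content is the
    # same.
    assert not set(fragment_keys(old_iso)) & set(fragment_keys(new_iso))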

> It's easy enough to imagine having one groupcompress "delta" for each
> ISO fragment, but using the whole ISO image as the corpus.  It's also
> easy to imagine generating multiple ISO "fragments" from a single
> groupcompress "delta".

This still requires the full ISO to be inside the groupcompress stream -
groupcompress is entirely self-referential.

> This approach also retains locality of reference, which I think you
> haven't mentioned elsewhere.

Sure.

> > * diff: - for merge, annotate operations, being able to identify 
> >   common regions (because the fragments haven't changed) could be very
> >   useful.
> > 
> > There are a couple of large outstanding questions for me:
> > 
> >  - do we need a canonical form (like we have for split inventories). 
> >    Now the split inventory case says that given an in memory inventory
> >    I, its disk form is always D. For a large file, we'd say that there
> >    is always some canonical layout on disk.
> 
> I'm inclined to think yes-- these fragments are intended to become a
> single file, and that is its canonical form.

If you mean the bytestream that we'd read from disk, then you don't
quite get what I meant. What I mean is that the exact set of fragments,
in their serialized form, comprises the canonical form. (By analogy with
split inventories: there are huge numbers of different trees that could
all represent one specific inventory; the 'canonical form' is the one
tree that we should always use, so that we get the same validator
everywhere.)
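
In code terms, something like this - illustrative only, since the
actual canonical rule, fragment size and hash are exactly what we'd
have to nail down: the fragmentation is a pure function of the content,
so everyone computes the same fragment set and therefore the same
validator.

    import hashlib

    FRAGMENT_SIZE = 4 * 1024 * 1024

    def canonical_fragments(data):
        # One fixed rule mapping content to fragments; any other split
        # of the same bytes would be a non-canonical form.
        return [data[i:i + FRAGMENT_SIZE]
                for i in range(0, len(data), FRAGMENT_SIZE)]

    def validator(data):
        # Hash over the ordered fragment keys - identical everywhere,
        # because the fragmentation rule is identical everywhere.
        keys = [hashlib.sha1(f).hexdigest()
                for f in canonical_fragments(data)]
        return hashlib.sha1(''.join(keys).encode('ascii')).hexdigest()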

> >  - Do we want to cache (whether in the same data structure, or a
> >    derived-after-fetch-index) region mapping data - e.g. rsync
> >    signatures.
> 
> If we use the whole corpus as input for deltas, this would just be a
> hint to the compressor, and perhaps not useful.  If the fragmentation is
> deeper, this might make a lot of sense.

I'd like to avoid holding anything that scales linearly with the whole
corpus in memory during diff (720MB of text reduced to line checksums is
still about 9 million checksums; even at 4 bytes per checksum, that's
36MB of memory). This of course may not be feasible to tackle in the
first iteration.
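
(The arithmetic behind those numbers, assuming an average line length
of about 80 bytes, which is what the 720MB / 9 million figures imply:)

    corpus = 720 * 1024 * 1024         # bytes of text in the corpus
    avg_line = 80                      # assumed average bytes per line
    lines = corpus // avg_line         # ~9.4 million lines/checksums
    print(lines * 4 // (1024 * 1024))  # 4-byte checksums -> 36 (MB)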

-Rob



-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.