RFC: handling large files via fragmenting

Robert Collins robertc at robertcollins.net
Mon Aug 25 23:43:59 BST 2008


On Mon, 2008-08-25 at 08:53 -0400, Aaron Bentley wrote:
> 
> Robert Collins wrote:
> > Either push some metadata about content objects out of the inventory
> > into a per-object header, or define a new inventory entry type
> > 'bigfile'. (Defining a new type is easiest). For a bigfile, the
> > referenced content is not the file content; instead it's the root of a
> > tree containing the file content in a number of nodes. Each node could
> > be quite large - say 50MB. For merge, a change from file->bigfile can
> > attempt to merge much like a file->file merge.
> 
> This sounds a lot like something I suggested in Istanbul, but I was
> proposing to do it at a lower level-- at the repository or
> versionedfiles level.

Yes, I recall that discussion. This differs primarily in the layer it is
exposed at. It's not new either - Fossil w/Venti does this for large
files.
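
Very roughly, and with made-up helper names rather than real bzrlib API,
the splitting side of a 'bigfile' might look like:

    import hashlib

    FRAGMENT_SIZE = 50 * 1024 * 1024  # ~50MB per fragment node

    def iter_fragments(path, fragment_size=FRAGMENT_SIZE):
        """Yield (sha1, bytes) for each fragment of the file at path."""
        with open(path, 'rb') as source:
            while True:
                chunk = source.read(fragment_size)
                if not chunk:
                    return
                yield hashlib.sha1(chunk).hexdigest(), chunk

    # A 'bigfile' inventory entry would then reference only the list of
    # fragment keys (the root of the tree of nodes), never the raw bytes.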

> I was trying to solve the problem that reads or writes of file texts
> cause the whole file to be held in memory, for delta compression or text
> reconstruction purposes.  So for example, we could have a 10 MB ceiling
> on the cost of reconstructing a text.

Indeed, it would be good to fix that bug.

> So my idea would not have helped merge or diff directly, but would have
> no effect on Repository-and-higher APIs.
> 
> Once we have efficient read and write, we can work on making diff and
> merge more memory-efficient.  We could apply the techniques you describe
> below, simply by reading 10M at a time, and handling them as your
> "fragments".

One difficulty that would arise is that we'd need to offer a 'seek'
idiom to move between parts of a file. That is, if we hide the
fragmentation from higher layers they will have lower memory use, but
will have to 'open' and 'seek' to get to a particular region of the file
without fully decompressing it. If we expose it to higher layers,
perhaps they can do more sophisticated things and reduce overall IO and
processing.
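
If we do hide it, the seek idiom could be a thin file-like wrapper over
the fragment index - a sketch only, where get_fragment stands in for
whatever accessor the repository actually exposes:

    class FragmentReader(object):
        """Read a region of fragmented content, touching only the
        fragments that overlap the requested range."""

        def __init__(self, fragment_keys, fragment_size, get_fragment):
            self._keys = fragment_keys
            self._size = fragment_size
            self._get = get_fragment  # key -> fragment bytes
            self._pos = 0

        def seek(self, offset):
            self._pos = offset

        def read(self, length):
            chunks = []
            while length > 0:
                index, skip = divmod(self._pos, self._size)
                if index >= len(self._keys):
                    break
                data = self._get(self._keys[index])[skip:skip + length]
                if not data:
                    break
                chunks.append(data)
                self._pos += len(data)
                length -= len(data)
            return b''.join(chunks)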

I don't have a firm answer in my head about this. I think there is some
elegance in not exposing 'how' files are stored, but on the other hand
ISOs and other files larger than 1/4 of main memory really are
exceptional circumstances to handle; so perhaps it's OK to handle them
exceptionally?

> > We could improve on that by generating a list of unique lines in all the
> > fragments without keeping the fragments in memory, using that to
> > prepare a patience diff, and then passing over that doing the actual
> > merge with a fragment-cache.
> 
> We could also approximate this by storing a cheap checksum of each line,
> and doing an initial match based on the checksums.

That's a good possibility. We'd need to look closely at the storage
cost/access methods for this. At 4 bytes each, for instance, checksums
would be getting pretty large compared to some lines. OTOH we could
store and index the checksums, and also only store checksums for unique
lines (which is all the patience sequence matcher needs).
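
Something like the following, streamed over the fragments one at a time
(iter_lines here is a placeholder for however we end up reading lines
out of the fragments):

    import zlib

    def unique_line_checksums(iter_lines):
        """Keep one cheap 4-byte checksum per line; return the
        checksums of lines that occur exactly once."""
        counts = {}
        for line in iter_lines:
            key = zlib.crc32(line) & 0xffffffff
            counts[key] = counts.get(key, 0) + 1
        return set(k for k, n in counts.items() if n == 1)

Candidate anchors found by equal checksums would still need checking
against the real line text before use, or a collision turns into a
false match.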

> Or another alternative would be to use the compression deltas to seed
> the diff.  This requires a line-based delta approach, but has the
> advantage that it cannot produce false matches, only false mismatches.

It requires more than that - it requires a compression delta that
matches the graph merge and sequence matcher being used by merge.
Arbitrary line-based deltas are not so useful (and groupcompress fits
in that category - it has line-based deltas, but they don't match up
with the sequence matcher logic merge wants).

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.