RFC: handling large files via fragmenting

John Arbash Meinel john at arbash-meinel.com
Tue Aug 26 00:15:22 BST 2008


Robert Collins wrote:

...

>>
>> Once we have efficient read and write, we can work on making diff and
>> merge more memory-efficient.  We could apply the techniques you describe
>> below, simply by reading 10M at a time, and handling them as your
>> "fragments".
> 
> One difficulty that would arise is that we'd need to offer a 'seek'
> idiom to move between parts of a file. That is, if we hide this from
> high layers they will have lower memory use, but have to 'open' and
> 'seek' to get to a particular region of the file without fully
> decompressing it. If we expose this to higher layers, perhaps they can
> do more sophisticated things and reduce overall IO and processing. 
> 
> I don't have a firm answer in my head about this. I think there is some
> elegance in not exposing 'how' files are stored, but on the other hand
> ISOs and other files larger than 1/4 of main memory really are
> exceptional circumstances to handle; so perhaps it's OK to handle them
> exceptionally?

It seems like a new object type might be a good way to expose this, or at
least to tell the rest of the code that it should fall back to a different
way of processing.

I suppose if you made the API work for any file, you could just trigger it on
an "IF.text_size > XXXX" check.

> 
>>> We could improve on that by generating a list of unique lines in all the
>>> fragments without keeping the fragments in memory, and use that to
>>> prepare a patience diff, then pass over that doing the actual merge with
>>> a fragment-cache.
>> We could also approximate this by storing a cheap checksum of each line,
>> and doing an initial match based on the checksums.
> 
> That's a good possibility. We'd need to look closely at the storage
> cost/access methods for this. At 4 bytes for instance, checksums would
> be getting pretty large compared to some lines. OTOH we could store and
> index the checksums, and also only store checksums for unique lines
> (which is all that the patience sequence matcher needs).

Consider that there are 26,000 "\n" lines in the history of
bzrlib/builtins.py; that does, indeed, cause some bloat. Then again, if this
were only triggered when file sizes were larger than some threshold (50MB?),
those kinds of files would be less likely to have lots of short duplicated
lines.
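
As a rough illustration (again untested, and the function name is just made
up), the index could map a 4-byte checksum to the position of each line that
occurs exactly once, which is all the patience matcher needs for its initial
pass:

    import zlib

    def unique_line_checksums(lines):
        """Map a 32-bit checksum to the index of each unique line.

        lines is an iterable of byte strings.  Lines whose checksum is
        seen more than once are dropped, so a repeated line (like a bare
        newline) costs one dictionary slot, not 26,000.
        """
        seen = {}
        for idx, line in enumerate(lines):
            key = zlib.crc32(line) & 0xffffffff  # cheap 4-byte checksum
            if key in seen:
                seen[key] = None  # duplicate (or a collision): not unique
            else:
                seen[key] = idx
        return dict((k, v) for k, v in seen.items() if v is not None)

The initial match would then intersect the two checksum-to-index maps, and
only the candidate matches need their real text pulled back out of the
fragments for verification (which also catches the rare crc32 collision).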

John
=:->