RFC: handling large files via fragmenting
Robert Collins
robertc at robertcollins.net
Tue Aug 26 01:43:06 BST 2008
On Mon, 2008-08-25 at 18:15 -0500, John Arbash Meinel wrote:
>
> Robert Collins wrote:
>
> ...
>
> >>
> >> Once we have efficient read and write, we can work on making diff and
> >> merge more memory-efficient. We could apply the techniques you describe
> >> below, simply by reading 10M at a time, and handling them as your
> >> "fragments".
> >
> > One difficulty that would arise is that we'd need to offer a 'seek'
> > idiom to move between parts of a file. That is, if we hide this from
> > high layers they will have lower memory use, but have to 'open' and
> > 'seek' to get to a particular region of the file without fully
> > decompressing it. If we expose this to higher layers, perhaps they can
> > do more sophisticated things and reduce overall IO and processing.
> >
> > I don't have a firm answer in my head about this. I think there is some
> > elegance in not exposing 'how' files are stored, but on the other hand
> > ISOs and other files larger than 1/4 of main memory really are
> > exceptional circumstances to handle; so perhaps it's OK to handle them
> > exceptionally?
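
Roughly what I mean by the 'open' and 'seek' idiom - hypothetical names,
just to show the shape of it - with each fragment compressed independently
so that reading a byte range only costs the fragments it overlaps:

    import zlib

    class FragmentedText(object):
        # Hypothetical wrapper around a text stored as independently
        # compressed fragments of fixed (uncompressed) size.

        def __init__(self, fragments, fragment_size):
            # fragments: ordered list of zlib-compressed byte strings
            self._fragments = fragments
            self._fragment_size = fragment_size

        def read(self, offset, length):
            # Return `length` bytes starting at `offset`, decompressing
            # only the fragments that overlap that range.
            first = offset // self._fragment_size
            last = (offset + length - 1) // self._fragment_size
            data = ''.join(zlib.decompress(self._fragments[i])
                           for i in range(first, last + 1))
            start = offset - first * self._fragment_size
            return data[start:start + length]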
>
> It seems like a new object type might be a good way to expose this, or at
> least a way to tell the rest of the code that it should fall back to a
> different way of processing.
>
> I suppose if you made the API work for any file, you could just trigger it
> on a check like "IF.text_size > XXXX".
Directories don't have a text_size per se; I'm still working on stuff
there - it might be nice to harmonise everything.
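
For file texts, though, the trigger could be as simple as the sketch below
(the threshold is picked out of the air, and 'inv_entry' just stands in for
whatever object we end up querying):

    FRAGMENT_THRESHOLD = 50 * 1024 * 1024   # ~50M; number is arbitrary

    def text_needs_fragmenting(inv_entry, threshold=FRAGMENT_THRESHOLD):
        # Only file entries carry a text size; directories (and symlinks)
        # always take the ordinary in-memory path here.
        if inv_entry.kind != 'file':
            return False
        return inv_entry.text_size > threshold
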
> >>> We could improve on that by generating a list of unique lines in all the
> >>> fragments without keeping the fragments in memory, and use that to
> >>> prepare a patience diff, then pass over that doing the actual merge with
> >>> a fragment-cache.
> >> We could also approximate this by storing a cheap checksum of each line,
> >> and doing an initial match based on the checksums.
> >
> > That's a good possibility. We'd need to look closely at the storage
> > cost/access methods for this. At 4 bytes, for instance, checksums would
> > be getting pretty large compared to some lines. OTOH we could store and
> > index the checksums, and also only store checksums for unique lines
> > (which is all that patience sequence matcher needs).
>
> Consider that there are 26,000 "\n" lines in the history of bzrlib/builtins.py.
> That does, indeed, cause some bloat. Though again, if it were only triggered
> when file sizes were larger than some threshold (50MB?) then the types of
> files would be less likely to actually have lots of short duplicated lines.
Well, that's why storing only the unique line hashes might be good ;)
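
Something like this sketch - lines are assumed to be byte strings streamed
out of the fragments one at a time, and PatienceSequenceMatcher would then
be fed only the surviving hashes:

    from hashlib import sha1

    def unique_line_hashes(line_iter):
        # Map line-hash -> line number, kept only for lines that occur
        # exactly once.  Streaming lines out of fragments means the whole
        # text is never in memory, and the 26,000 blank lines collapse to
        # one duplicated hash which gets dropped.
        seen = {}            # hash -> line number of first occurrence
        duplicated = set()
        for num, line in enumerate(line_iter):
            key = sha1(line).digest()
            if key in duplicated:
                continue
            if key in seen:
                del seen[key]
                duplicated.add(key)
            else:
                seen[key] = num
        return seen
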
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.