RFC: handlings large files via fragmenting

Mon Aug 25 14:41:17 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

>> We could improve on that by generating a list of unique lines in all the
>> fragments without keeping the fragments in memory, and use that to
>> prepare a patience diff, then pass over that doing the actual merge with
>> a fragment-cache.
> 
> We could also approximate this by storing a cheap checksum of each line,
> and doing an initial match based on the checksums.

I'm pretty sure that is what Robert was meaning. PatienceDiff keeps a
dictionary of hash(line) values. The Python version just shoves them in a
dictionary, but the pyrex version tracks the hash(line) => offset. (And then,
it uses the line list itself to find the chain of lines that match a given hash.)

The small problem I see is that you would still need N 'line' structs, but I
suppose if it is a ISO sort of file, the line data itself is probably very
long. So len(lines) is probably small while len(lines[X]) is large.

> 
> Or another alternative would be to use the compression deltas to seed
> the diff.  This requires a line-based delta approach, but has the
> advantage that it cannot produce false matches, only false mismatches.
> 
> Aaron

It also requires a format that has these readily available. For example,
groupcompress doesn't really have the deltas that you might be interested in.
(It could have deltas, but probably not against the texts you would like them
to be against.)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIsrZ8JdeBCYSNAAMRAn3KAKCyA8oq+y6qBwax5Y6MTaCAJ4pnTACgrMBJ
0e0IkxlMJxA9HhOkcuviOsw=
=Gauj
-----END PGP SIGNATURE-----