RFC: handling large files via fragmenting
Aaron Bentley
aaron at aaronbentley.com
Mon Aug 25 13:53:24 BST 2008
Robert Collins wrote:
> Either push some metadata about content objects out of the inventory
> into a per-object header, or define a new inventory entry type
> 'bigfile'. (Defining a new type is easiest). For a bigfile, the
> referenced content is not the file content, instead it's the root of a
> tree containing the file content in a number of nodes. Each node could
> be quite large - say 50MB. For merge, a change from file->bigfile can
> attempt to merge much like a file->file merge.
This sounds a lot like something I suggested in Istanbul, but I was
proposing to do it at a lower level: at the repository or
versionedfiles level.
I was trying to solve the problem that reads or writes of file texts
cause the whole file to be held in memory, for delta compression or text
reconstruction purposes. So for example, we could have a 10 MB ceiling
on the cost of reconstructing a text.
So my idea would not have helped merge or diff directly, but it would
also have had no effect on Repository-and-higher APIs.
Once we have efficient read and write, we can work on making diff and
merge more memory-efficient. We could apply the techniques you describe
below, simply by reading 10 MB at a time, and handling each such chunk
as one of your "fragments".
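
As a rough illustration of reading a text in bounded pieces, here is a
minimal sketch. The name iter_fragments and its parameters are
hypothetical, not an existing bzrlib API, and it assumes an ordinary
file-like object opened in binary mode. Fragments are cut on line
boundaries, so one can exceed the ceiling by up to one line.

```python
import io

FRAGMENT_SIZE = 10 * 1024 * 1024  # the 10 MB figure used above


def iter_fragments(fileobj, fragment_size=FRAGMENT_SIZE):
    """Yield lists of whole lines, each list roughly fragment_size bytes.

    Hypothetical helper: only one fragment is held in memory at a time.
    """
    fragment, size = [], 0
    for line in fileobj:
        fragment.append(line)
        size += len(line)
        if size >= fragment_size:
            # Ceiling reached: hand the fragment over and start a new one.
            yield fragment
            fragment, size = [], 0
    if fragment:
        # Final, possibly short, fragment.
        yield fragment


if __name__ == "__main__":
    data = io.BytesIO(b"a\nbb\nccc\ndddd\n")
    for frag in iter_fragments(data, fragment_size=4):
        print(frag)
```

A diff or merge built on this would consume one fragment, process it,
and drop it before asking for the next.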
> We could improve on that by generating a list of unique lines in all the
> fragments without keeping the fragments in memory, and use that to
> prepare a patience diff, then pass over that doing the actual merge with
> a fragment-cache.
We could also approximate this by storing a cheap checksum of each line,
and doing an initial match based on the checksums.
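
A sketch of what I mean, under assumptions: the function names are made
up for illustration, CRC32 stands in for "a cheap checksum", and the
fragments are lists of byte-string lines. It keeps one small integer per
line rather than the line itself, then pairs up lines whose checksum
occurs exactly once on each side, which is the kind of anchor a patience
diff starts from.

```python
import zlib
from collections import Counter


def line_checksums(fragments):
    """Return one CRC32 per line, visiting one fragment at a time."""
    sums = []
    for fragment in fragments:
        sums.extend(zlib.crc32(line) for line in fragment)
    return sums


def unique_candidate_matches(sums_a, sums_b):
    """Pair lines whose checksum is unique on both sides.

    Returns (index_a, index_b) pairs ordered by index_a. These are only
    candidates: two different lines can share a checksum, so each pair
    should be verified against the real text before it is trusted.
    """
    count_a = Counter(sums_a)
    count_b = Counter(sums_b)
    pos_b = {s: j for j, s in enumerate(sums_b) if count_b[s] == 1}
    return [(i, pos_b[s]) for i, s in enumerate(sums_a)
            if count_a[s] == 1 and s in pos_b]


if __name__ == "__main__":
    a = [[b"a\n", b"b\n"], [b"c\n", b"a\n"]]
    b = [[b"x\n", b"b\n", b"c\n"]]
    print(unique_candidate_matches(line_checksums(a), line_checksums(b)))
```

A patience diff would then take the longest increasing run of these
pairs as its anchors and recurse between them.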
Or another alternative would be to use the compression deltas to seed
the diff. This requires a line-based delta approach, but has the
advantage that it cannot produce false matches, only false mismatches.
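
To make that concrete, here is a sketch that assumes a simple line-based
delta made of ("copy", basis_start, length) and ("insert", lines)
operations. That format is invented for illustration and is not the
actual on-disk delta format. Every copy is a known-good match between
basis and target, so the result can seed a diff directly.

```python
def matching_blocks_from_delta(delta):
    """Turn a line-based delta into (basis_index, target_index, length).

    Assumed (hypothetical) delta format:
      ("copy", basis_start, length)  -- reuse lines from the basis text
      ("insert", lines)              -- new lines stored in the delta
    """
    blocks = []
    target_pos = 0
    for op in delta:
        if op[0] == "copy":
            _, basis_start, length = op
            # Copied lines are identical by construction.
            blocks.append((basis_start, target_pos, length))
            target_pos += length
        elif op[0] == "insert":
            # Inserted lines are treated as unmatched, even if an equal
            # line happens to exist somewhere in the basis.
            target_pos += len(op[1])
        else:
            raise ValueError("unknown delta op: %r" % (op[0],))
    return blocks


if __name__ == "__main__":
    delta = [("copy", 0, 2), ("insert", [b"x\n"]), ("copy", 3, 1)]
    print(matching_blocks_from_delta(delta))
```

The inserted lines are where the false mismatches come from: the
compressor may have stored a line as new even though the basis contains
an equal one, so a later diff pass over the gaps can still improve the
result.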
Aaron