RFC: handling large files via fragmenting

Robert Collins robertc at robertcollins.net
Sat Aug 23 10:21:52 BST 2008


I think we could get a good answer to the 'large files are hard on
memory and merge' problem by treating them as many smaller files:
merging text changes within individual fragments first, then looking
across fragments.

I'd like to know if this sounds sensible :) - some details follow.

Either push some metadata about content objects out of the inventory
into a per-object header, or define a new inventory entry type
'bigfile'. (Defining a new type is easiest.) For a bigfile, the
referenced content is not the file content; instead it's the root of a
tree containing the file content in a number of nodes. Each node could
be quite large - say 50MB. For merge, a change from file->bigfile can
attempt to merge much like a file->file merge.

Delta storage of a bigfile would delta each node of the tree with the
same algorithm file contents are delta'd today. We'd probably name the
components of the bigfile in the same namespace as content objects -
e.g. (fragment:N, FILEID, REVISIONID).
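
To make the naming concrete, here is a rough Python sketch of fixed-size
splitting with keys of that shape. The function names and the exact key
layout are my assumptions for illustration, not existing bzrlib API.

```python
import io

FRAGMENT_SIZE = 50 * 1024 * 1024  # 50MB, the node size suggested above


def iter_fragments(stream, fragment_size=FRAGMENT_SIZE):
    """Yield successive fixed-size chunks read from a binary stream."""
    while True:
        chunk = stream.read(fragment_size)
        if not chunk:
            break
        yield chunk


def fragment_keys(file_id, revision_id, stream, fragment_size=FRAGMENT_SIZE):
    """Return a list of (key, bytes) pairs, one per fragment.

    Each key is a hypothetical ("fragment:N", FILEID, REVISIONID) tuple.
    """
    return [
        (("fragment:%d" % n, file_id, revision_id), chunk)
        for n, chunk in enumerate(iter_fragments(stream, fragment_size))
    ]


# Tiny fragment size so the split is visible.
pairs = fragment_keys("file-id", "rev-1", io.BytesIO(b"abcdefghij"), 4)
```

Only one fragment is held in memory at a time while iterating; the list
in fragment_keys is just for the demonstration.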

Pull of a bigfile root would need a second pass to pull the bigfile
nodes, unless the inventory included the fragment count - and perhaps
by doing that we could skip the bigfile internal node and just make it
a list of fragments.
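
As a sketch of why the count helps: if the entry records how many
fragments there are, the fetch code can compute every fragment key up
front. BigFileEntry and its fields are hypothetical here, not a real
inventory entry class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BigFileEntry:
    """Hypothetical 'bigfile' inventory entry carrying a fragment count."""

    file_id: str
    revision_id: str
    fragment_count: int

    def fragment_keys(self):
        """Keys of all fragments, derivable without reading a root node."""
        return [
            ("fragment:%d" % n, self.file_id, self.revision_id)
            for n in range(self.fragment_count)
        ]


entry = BigFileEntry("file-id", "rev-1", 3)
keys = entry.fragment_keys()
```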

For merge, I think we could start off crudely and just merge fragment <->
fragment - we'd probably get more conflicts than strictly needed, but it
would work, and I don't think we'd get false positive no-conflict
results.
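
A minimal sketch of that crude pass, in Python. The names
merge_fragments and pick_whole_fragment are hypothetical; merge_one
stands in for whatever per-fragment text merge we would really use. The
default here only picks whichever side changed a fragment and flags a
conflict when both sides changed it differently.

```python
from itertools import zip_longest


def pick_whole_fragment(base, this, other):
    """Trivial three-way merge of one fragment.

    Returns (content, conflicted). No line-level merging is attempted.
    """
    if this == other:
        return this, False
    if base == this:
        return other, False  # only OTHER changed
    if base == other:
        return this, False  # only THIS changed
    return this, True  # both sides changed, differently


def merge_fragments(base, this, other, merge_one=pick_whole_fragment):
    """Merge fragment N of each side only against fragment N of the others.

    Missing fragments (sides of different length) are treated as empty.
    Returns (merged_fragments, indexes_of_conflicted_fragments).
    """
    merged, conflicts = [], []
    triples = zip_longest(base, this, other, fillvalue=b"")
    for n, (b, t, o) in enumerate(triples):
        content, conflicted = merge_one(b, t, o)
        merged.append(content)
        if conflicted:
            conflicts.append(n)
    return merged, conflicts


base = [b"a\n", b"b\n", b"c\n"]
clean = merge_fragments(base, [b"a\n", b"B\n", b"c\n"], [b"a\n", b"b\n", b"C\n"])
clash = merge_fragments(base, [b"a\n", b"X\n", b"c\n"], [b"a\n", b"Y\n", b"c\n"])
```

Because a fragment is only ever compared with the same-numbered fragment
on the other sides, this errs towards reporting conflicts rather than
silently combining unrelated text.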

We could improve on that by generating a list of unique lines in all the
fragments without keeping the fragments in memory, using that to
prepare a patience diff, then passing over that doing the actual merge
with a fragment-cache. Fragments unaltered on both sides could even be
skipped completely - though I would expect that with a fixed-size
fragment splitting algorithm, all fragments would tend to alter
somewhat on any change.
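
For the unique-lines pass, something like the following Python sketch
might do. It streams fragments one at a time, joins lines that straddle
a fragment boundary, and remembers only a SHA-1 digest per distinct
line. The function names and the choice of SHA-1 are my assumptions;
feeding the result into a patience diff is not shown.

```python
import hashlib
from collections import Counter


def iter_lines(fragments):
    """Yield complete lines from an iterable of byte fragments.

    Only the current fragment plus any partial trailing line is held.
    """
    tail = b""
    for fragment in fragments:
        pieces = (tail + fragment).split(b"\n")
        tail = pieces.pop()  # text after the last newline, maybe empty
        for piece in pieces:
            yield piece + b"\n"
    if tail:
        yield tail  # final line without a trailing newline


def unique_line_numbers(fragments):
    """Return sorted 0-based numbers of lines occurring exactly once."""
    counts = Counter()
    first_seen = {}
    for lineno, line in enumerate(iter_lines(fragments)):
        digest = hashlib.sha1(line).digest()
        counts[digest] += 1
        first_seen.setdefault(digest, lineno)
    return sorted(first_seen[d] for d, n in counts.items() if n == 1)


# The line "cc\n" straddles the boundary between the two fragments.
lines = list(iter_lines([b"a\nb\nc", b"c\na\nd\n"]))
unique = unique_line_numbers([b"a\nb\nc", b"c\na\nd\n"])
```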

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.