[Bulk] Re: Large Binary Files
jbowtie at amathaine.com
Fri Oct 15 03:04:38 BST 2010
I've been thinking about how we could approach this better over the
last few days, and have the beginnings of an idea in my head. Ideally
we want to build on the techniques used for deduplication and
single-instance storage, at least for large files.
My thought experiment looks a little like this:
File big.iso (any file over some cut-off size) gets broken into 4MB
blocks (block size chosen arbitrarily for illustration purposes). Each
block is assigned a GUID and stored in .bzr/repository/blocks.
A pseudo-file is generated, consisting of a list of blocks and their
corresponding hashes. This gets stored in the CHK Map instead of the
actual file:
MAGIC_PSEUDOFILE
a.block = 123
b.block = 456
c.block = 789
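
To make that concrete, here is a rough Python sketch of the
chunk-and-manifest step (Python being the language bzr itself is written
in). The 4MB block size, the .bzr/repository/blocks directory and the
MAGIC_PSEUDOFILE marker come from the description above; the function
names, the choice of SHA-1 and the GUID-per-block file naming are just
assumptions for illustration, not anything in bzrlib today.

# Illustrative sketch only -- block size, paths and the pseudo-file
# format are assumptions from this thread, not existing bzrlib behaviour.
import hashlib
import os
import uuid

BLOCK_SIZE = 4 * 1024 * 1024  # 4MB, as in the example above

def store_blocks(path, block_dir='.bzr/repository/blocks'):
    """Split `path` into blocks, store each block, return [(name, hash), ...]."""
    os.makedirs(block_dir, exist_ok=True)
    manifest = []
    with open(path, 'rb') as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha1(block).hexdigest()
            name = uuid.uuid4().hex + '.block'
            with open(os.path.join(block_dir, name), 'wb') as out:
                out.write(block)
            manifest.append((name, digest))
    return manifest

def write_pseudofile(manifest, pseudo_path):
    """Serialise the manifest in the MAGIC_PSEUDOFILE format sketched above."""
    with open(pseudo_path, 'w') as f:
        f.write('MAGIC_PSEUDOFILE\n')
        for name, digest in manifest:
            f.write('%s = %s\n' % (name, digest))
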
When a big file changes, we use the hashes to figure out which blocks
have changed. We store only the changed blocks and update the
pseudo-file. This can also be used to speed up network operations like
pull where large files are concerned (pull the usual stuff plus any
referenced blocks):
MAGIC_PSEUDOFILE
a.block = 123
d.block = 654 //this is the block that changed
c.block = 789
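
Change detection could then be sketched along these lines: re-chunk the
file and compare hashes position by position against the previous
manifest, writing out only the blocks whose hash differs. Again the
names and the positional comparison are assumptions for illustration; a
real implementation would probably want some content-defined chunking
so that an insertion near the start doesn't invalidate every following
block.

# Illustrative sketch only -- assumes the same manifest format and block
# directory as the previous example; not existing bzrlib behaviour.
import hashlib
import os
import uuid

BLOCK_SIZE = 4 * 1024 * 1024

def update_blocks(path, old_manifest, block_dir='.bzr/repository/blocks'):
    """Re-chunk `path`, reusing any block whose hash matches the old manifest."""
    new_manifest = []
    with open(path, 'rb') as f:
        index = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha1(block).hexdigest()
            if index < len(old_manifest) and old_manifest[index][1] == digest:
                # Unchanged block: keep the existing entry, write nothing new.
                new_manifest.append(old_manifest[index])
            else:
                # Changed (or new) block: store it under a fresh name.
                name = uuid.uuid4().hex + '.block'
                with open(os.path.join(block_dir, name), 'wb') as out:
                    out.write(block)
                new_manifest.append((name, digest))
            index += 1
    return new_manifest
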
Generating a working copy for revision X means you just grab the
pseudo-file and stream the list of blocks into a file (or potentially
overwrite only the blocks that changed).
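
Reconstruction is the simple part; under the same assumed layout it
could look something like this: parse the pseudo-file, then stream each
referenced block into the target file in order, so only one block is
ever held in memory.

# Illustrative sketch only -- assumes the pseudo-file format and block
# directory used in the earlier examples.
import os

def read_pseudofile(pseudo_path):
    """Parse the MAGIC_PSEUDOFILE format into an ordered [(name, hash), ...]."""
    entries = []
    with open(pseudo_path) as f:
        if f.readline().strip() != 'MAGIC_PSEUDOFILE':
            raise ValueError('%s is not a pseudo-file' % pseudo_path)
        for line in f:
            if not line.strip():
                continue
            name, digest = (part.strip() for part in line.split('=', 1))
            entries.append((name, digest))
    return entries

def rebuild_file(pseudo_path, target, block_dir='.bzr/repository/blocks'):
    """Stream the listed blocks into `target`, one block in memory at a time."""
    with open(target, 'wb') as out:
        for name, _digest in read_pseudofile(pseudo_path):
            with open(os.path.join(block_dir, name), 'rb') as blk:
                out.write(blk.read())
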
Crucially, this reduces your memory usage to a couple of blocks at any
given time, plus the overhead of the pseudo-file itself, so you should
be able to handle binary files of arbitrary size on a 32-bit platform
just fine. At the same time it should require fairly minimal changes to
the core on-disk format; it just needs to recognise the magic
pseudo-file during certain operations.