[Bulk] Re: Large Binary Files

Stephen J. Turnbull stephen at xemacs.org
Fri Oct 15 09:13:46 BST 2010


jbowtie at amathaine.com writes:

 > File big.iso (any file over some cut-off size) gets broken into 4MB
 > blocks (block size chosen at random for illustration purposes). Blocks
 > are assigned a guid and stored in .bzr/repository/blocks

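A potential problem with this scheme is that either it doesn't give
very good compression, or it could be very expensive in computation,
since a change in the size of the first block of the working file would
change the offsets of all later blocks.  Either every shifted block gets
stored again, or you have to search for the old content at its new
offset.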
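
To make the offset problem concrete, here is a minimal Python sketch,
not bzr code.  It assumes blocks are identified by a SHA-1 of their
content (the proposal said guid) and uses a 4-byte block size so the
effect is visible.  The names fixed_blocks, block_ids and shared_blocks
are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose; the proposal used 4MB for illustration


def fixed_blocks(data, size=BLOCK_SIZE):
    """Cut data into consecutive blocks of `size` bytes (last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def block_ids(data, size=BLOCK_SIZE):
    """Identify each block by a content hash (an assumption, not bzr's scheme)."""
    return [hashlib.sha1(b).hexdigest() for b in fixed_blocks(data, size)]


def shared_blocks(old, new, size=BLOCK_SIZE):
    """Count blocks of `new` whose content already exists among blocks of `old`."""
    known = set(block_ids(old, size))
    return sum(1 for h in block_ids(new, size) if h in known)


old = b"AAAABBBBCCCCDDDD"
overwrite = b"AxAABBBBCCCCDDDD"  # same length: only the first block changes
insert = b"xAAAABBBBCCCCDDDD"    # one byte longer: every later block shifts

print(shared_blocks(old, overwrite))
print(shared_blocks(old, insert))
```

With these inputs the same-length overwrite leaves three of the four
blocks reusable, while the one-byte insertion leaves none.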

 > Crucially this just reduces your memory usage to a couple of blocks at
 > a given time plus the overhead of the pseudo-file itself. So you
 > should be able to handle binary files of an arbitrary size on your
 > 32-bit platform just fine.

True.
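
A rough Python sketch of why memory stays bounded, under assumptions:
the stream is read one block at a time, each block is hashed and handed
to a store, and only the ordered list of block ids (standing in for the
pseudo-file) is kept.  The dict is a stand-in for an on-disk block
directory, and iter_blocks and store_blocks are hypothetical names.

```python
import hashlib
import io


def iter_blocks(stream, block_size):
    """Yield successive blocks, holding only one block in memory at a time."""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        yield block


def store_blocks(stream, block_size, store):
    """Store each block under its hash; return the ordered list of block ids."""
    manifest = []
    for block in iter_blocks(stream, block_size):
        block_id = hashlib.sha1(block).hexdigest()
        store[block_id] = block  # a real store would write the block to disk
        manifest.append(block_id)
    return manifest


store = {}
manifest = store_blocks(io.BytesIO(b"abcdefghij"), 4, store)
rebuilt = b"".join(store[block_id] for block_id in manifest)
```

The manifest is enough to reassemble the file in order, so nothing here
ever needs the whole file in memory at once.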

 > At the same time this should actually
 > require fairly minimal changes to the core on-disk format, it just
 > needs to recognise the magic pseudo-file during certain operations.

If "certain operations" is only a few, then this would be a useful
feature for people who have very large files.  However, I worry that
this might apply to many operations and get embedded into the whole
system.  That wouldn't be so great for bzr as a whole, and it doesn't
much help the people who have issues with storage because of largish
files that change more or less frequently.

Which leads to the main point: I wonder if it might not be reasonably
easy to generalize the idea to sequences of arbitrary blocks.  Then
for random binaries you could just cut them up into uniformly-sized
blocks (figuring that there's no sharable structure).  But for
structured files you could make the cuts at structure points (which
could also serve as sync points for diff algorithms, as newlines do
for text diffs).  Examples are images with header and data, or
MPEG-derived formats with sequences of frames.  With an image, if you
just edit a comment or add a copyright, you'd get big savings since
the binary blob doesn't change.
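
A minimal Python sketch of cutting at structure points, using a made-up
container format rather than any real image or video format: each chunk
is a 4-byte big-endian length followed by its payload.  The format and
the names pack_chunks, split_at_structure and block_ids are all
assumptions for illustration.

```python
import hashlib
import struct


def pack_chunks(payloads):
    """Build a toy container: 4-byte big-endian length, then the payload."""
    return b"".join(struct.pack(">I", len(p)) + p for p in payloads)


def split_at_structure(data):
    """Cut the toy container at chunk boundaries, one block per chunk."""
    blocks, pos = [], 0
    while pos < len(data):
        (length,) = struct.unpack_from(">I", data, pos)
        end = pos + 4 + length
        blocks.append(data[pos:end])
        pos = end
    return blocks


def block_ids(blocks):
    """Identify each block by a content hash (an assumption for this sketch)."""
    return [hashlib.sha1(b).hexdigest() for b in blocks]


blob = bytes(range(256)) * 16  # stands in for the large, unchanged image data
v1 = pack_chunks([b"header", b"comment: draft", blob])
v2 = pack_chunks([b"header", b"comment: copyright 2010 added", blob])

ids1 = block_ids(split_at_structure(v1))
ids2 = block_ids(split_at_structure(v2))
known = set(ids1)
reused = [h for h in ids2 if h in known]
```

Here the header and blob blocks are shared between the two versions and
only the comment block is new, even though the comment changed length
and moved the blob to a different offset.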


