Large Binary Files

John Arbash Meinel john at arbash-meinel.com
Thu Oct 14 15:30:48 BST 2010



On 10/14/2010 7:22 AM, Gordon Tyler wrote:
> On 10/14/2010 1:10 AM, Chris Hecker wrote:
>>> Considering all the above, along with the 2GB limit (at least on
>>> 32-bit workstations), it seems like DVCS is not quite ready for prime
>>> time here.
> 
>> The 2GB limit for a single file is not usually a problem, since it's
>> rare that a single file will be that big.  Unless the 2GB limit is on
>> the total history of a file, and then that's a problem...
> 
> Actually, I believe the 2GB limit is on the process memory in a 32-bit
> operating system. So the amount of memory bzr consumes in a single
> operation cannot exceed 2GB on a 32-bit operating system.
> 
> Unfortunately, bzr has a few problems where although the size of your
> file(s) may not be that large, the methods that bzr uses to process them
> in memory while committing, pulling, pushing, etc. can use more memory
> than the files would occupy normally, thus reaching the 2GB limit when
> you might not expect it.
> 
> Work is being done on creating a 64-bit installer for Windows, which is
> where this problem is seen mostly.
> 
> Ciao,
> Gordon

I'm also currently working on improving the peak memory for repacking.
Right now commit peaks at 1 fulltext + 2 compressed texts, which is
pretty good (ideally we'd be at 1+1, or change things dramatically and
just work on streamed chunks of the file content).

Repacking is a very different affair, since you are now computing deltas
between content. Some people have asked for a knob that says "just don't
delta large content". That is pretty easy to implement, but I don't
think they realize how badly it will bloat their disk. (It does depend
on the specific content, but flagging individual texts as
delta/don't-delta is much more involved than the simple knob.)
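For illustration, the knob could be as simple as a size cutoff (the
names and threshold below are made up, this is not what bzr does today):

MAX_DELTA_SIZE = 64 * 1024 * 1024  # arbitrary example threshold

def should_delta(text_size, max_delta_size=MAX_DELTA_SIZE):
    # Skipping deltas for large texts caps peak memory during repack,
    # but every revision of a large file then gets stored (roughly) in
    # full, which is what bloats the repository on disk.
    return text_size <= max_delta_size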

I think we can get the delta compression memory down to at least 3x
content size (1x source, 1x target, 1x in-memory data structures),
which at least drops the current 5-6x down to 3x and gives some
breathing room. I'd like to be at 2x + small structures, but the
structures just require a fair amount of space. We could trade off
delta accuracy for memory consumption pretty easily, though (only track
every 1000th byte, rather than every byte, etc.).
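To make the "every 1000th byte" idea concrete, here is an illustrative
sketch (not the actual groupcompress code) of a match index that is
sampled rather than exhaustive:

BLOCK = 16  # bytes fingerprinted per index entry

def build_sampled_index(source, step=1000):
    # One entry per `step` bytes of source: a larger step means a
    # smaller index (less memory) but more missed matches, i.e. larger
    # deltas on disk.
    index = {}
    for offset in range(0, max(len(source) - BLOCK, 0) + 1, step):
        index.setdefault(source[offset:offset + BLOCK], offset)
    return index

def find_match(index, source, target, pos):
    # Return (source_offset, match_length) for target[pos:], or None if
    # the sampled index has no entry for that block.
    offset = index.get(target[pos:pos + BLOCK])
    if offset is None:
        return None
    length = 0
    while (offset + length < len(source) and pos + length < len(target)
           and source[offset + length] == target[pos + length]):
        length += 1
    return offset, length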

"streaming" delta compression is kind of hard to do, especially given
our compression design. You could potentially use a temp file and mmap,
but that doesn't solve the 'size-of-vm-space' issue.
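If we did go the temp-file-plus-mmap route, it would look roughly like
this (again just a sketch, with made-up names; the address-space caveat
above still applies):

import mmap
import tempfile

def mmap_source(chunks):
    # Spill a large source text to a temp file and mmap it, so the
    # bytes are backed by the page cache rather than the Python heap.
    # The mapping still consumes virtual address space, so it does
    # nothing for the 2GB limit of a 32-bit process.
    tmp = tempfile.TemporaryFile()
    for chunk in chunks:
        tmp.write(chunk)
    tmp.flush()
    mapped = mmap.mmap(tmp.fileno(), tmp.tell(), access=mmap.ACCESS_READ)
    return mapped, tmp  # keep the file object alive with the mapping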

John
=:->


