large files and storage formats

John Arbash Meinel john at arbash-meinel.com
Fri Jul 9 06:46:25 BST 2010


Chad Dombrova wrote:
> hi all, 
> i've got some questions regarding bzr and large binary files.  
> 
> first of all, i've read about bzr's long-standing issues with large
> files (https://bugs.launchpad.net/bzr/+bug/109114).  while fixing this
> issue would be a worthy and noble cause, i have a fairly specific use
> case, and based on a lot of recent experience i know there's a *very*
> high probability that once this issue is fixed i'll run into other
> roadblocks with the current storage format. 
> 
> what interests me about bazaar is what the docs tout as its flexible
> architecture: that it "is cleanly layered to support multiple file
> formats".  that got me thinking: could i implement a more git-like loose
> object storage format into bazaar?
> 
> for those who aren't familiar with git's loose object model, it works
> something like this:  blobs represent data, trees represent the location
> of data, a commit represents a change, and every object, regardless of
> type, is stored as a separate loose file in the store.
> 

I'll note that if all you want is for content objects that are greater
than some threshold to not be delta-compressed, you can do:

=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py     2010-05-20 02:57:52 +0000
+++ bzrlib/groupcompress.py     2010-07-09 05:41:30 +0000
@@ -1721,12 +1721,7 @@
                                                 nostore_sha=nostore_sha)
             # delta_ratio = float(len(bytes)) / (end_point - start_point)
             # Check if we want to continue to include that text
-            if (prefix == max_fulltext_prefix
-                and end_point < 2 * max_fulltext_len):
-                # As long as we are on the same file_id, we will fill at least
-                # 2 * max_fulltext_len
-                start_new_block = False
-            elif end_point > 4*1024*1024:
+            if end_point > 4*1024*1024:
                 start_new_block = True
             elif (prefix is not None and prefix != last_prefix
                   and end_point > 2*1024*1024):


That will leave you with repositories that are considered valid 2a
format repositories, just not as 'packed' as we would normally make them.

I would guess there are other places where our memory use will be larger
than you might like. But at least for the 'compressing 2 large blobs
together takes too much memory' case, it sidesteps the problem.

'large' in this case is >4MB.

You could probably do even a little better by checking the length of the
content before calling 'self._compressor.compress()' and choosing to start
a new block right away, along the lines of the sketch below.
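
A rough sketch of that idea (everything here other than
self._compressor.compress() is illustrative: the method name, the
_flush_block() helper, and the threshold constant are made up, and the real
compress() call in bzrlib/groupcompress.py takes additional arguments):

LARGE_TEXT_THRESHOLD = 4 * 1024 * 1024   # the same 4MB cut-off as above

def _add_text(self, key, bytes, nostore_sha=None):
    if len(bytes) >= LARGE_TEXT_THRESHOLD:
        # The incoming text is large: close the current block *before*
        # compressing, so it is never delta-compressed against (or grouped
        # with) the texts already in the block.
        self._flush_block()               # hypothetical helper
    # This only shows where the size check would sit; the real call takes
    # more arguments than are spelled out here.
    return self._compressor.compress(key, bytes, nostore_sha=nostore_sha)

That way a large text never gets appended to a half-full block only to
trigger a new block afterwards.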

We don't currently abstract that logic as much as would be nice. So if
you want to play with that code a bit, and potentially make it more
flexible, we'd probably be interested in patches.

You could, for example, create a configuration variable that would
indicate various thresholds to use in the compression algorithm. This
would allow people to set it to whatever they wanted in a given
repository/branch/etc.
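
For instance, something along these lines could feed the threshold into the
code above (the option name 'bzr.groupcompress.max_block_size' is invented;
get_user_option() is the usual way a bzrlib Config object exposes user-set
options, and a real patch would read the value once when the insert logic is
set up rather than per text):

DEFAULT_MAX_BLOCK_SIZE = 4 * 1024 * 1024   # today's hard-coded 4MB

def get_max_block_size(a_config):
    """Return the block-size threshold, falling back to the 4MB default."""
    value = a_config.get_user_option('bzr.groupcompress.max_block_size')
    if value is None:
        return DEFAULT_MAX_BLOCK_SIZE
    try:
        return int(value)
    except (TypeError, ValueError):
        return DEFAULT_MAX_BLOCK_SIZE

Users could then set the option in bazaar.conf, locations.conf, or a
branch's branch.conf to tune how aggressively a given repository groups
texts.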

The best part is that it stays 2a compatible, so you don't have to worry
about whether versions of bzr without your changes can read your disk format.

John
=:->
