large files and storage formats
Maritza Mendez
martitzam at gmail.com
Sat Jul 10 05:04:46 BST 2010
This is very interesting. Having a per-branch configurable threshold
for compression would be much appreciated. I expect it would speed up
operations on our versioned binaries by eliminating a lot of thrashing.
~M
On 7/8/10, John Arbash Meinel <john at arbash-meinel.com> wrote:
>
> Chad Dombrova wrote:
>> hi all,
>> i've got some questions regarding bzr and large binary files.
>>
>> first of all, i've read about bzr's long-standing issues with large
>> files (https://bugs.launchpad.net/bzr/+bug/109114). while fixing this
>> issue would be a worthy and noble cause, i have a fairly specific use
>> case, and based on a lot of recent experience i know there's a *very*
>> high probability that once this issue is fixed i'll run into other
>> roadblocks with the current storage format.
>>
>> what interests me about bazaar is what the docs tout as its flexible
>> architecture: that it "is cleanly layered to support multiple file
>> formats". that got me thinking: could i implement a more git-like loose
>> object storage format into bazaar?
>>
>> for those who aren't familiar with git's loose object model, it works
>> something like this: blobs represent data, trees represent the location
>> of data, a commit represents a change, and every object, regardless of
>> type, is stored as a separate loose file in the store.
>>
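
[A minimal sketch of what writing one such loose blob looks like on disk.
This is illustrative only: real git also packs objects and stores trees and
commits the same way, and the function name here is made up.]

    import hashlib
    import os
    import zlib

    def write_loose_blob(store_dir, data):
        # A loose object is "<type> <size>\0" + payload, addressed by the
        # SHA-1 of that whole byte string.
        blob = "blob %d\0" % len(data) + data
        sha = hashlib.sha1(blob).hexdigest()
        # Objects live at <store>/<first 2 hex chars>/<remaining 38 chars>.
        obj_dir = os.path.join(store_dir, sha[:2])
        if not os.path.isdir(obj_dir):
            os.makedirs(obj_dir)
        path = os.path.join(obj_dir, sha[2:])
        if not os.path.exists(path):
            f = open(path, 'wb')
            try:
                # Each object is compressed on its own; nothing is deltaed
                # against other objects at this layer.
                f.write(zlib.compress(blob))
            finally:
                f.close()
        return sha
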
>
> I'll note that if all you want is for content objects that are greater
> than some threshold to not be delta-compressed, you can do:
>
> === modified file 'bzrlib/groupcompress.py'
> --- bzrlib/groupcompress.py 2010-05-20 02:57:52 +0000
> +++ bzrlib/groupcompress.py 2010-07-09 05:41:30 +0000
> @@ -1721,12 +1721,7 @@
>                                                   nostore_sha=nostore_sha)
>              # delta_ratio = float(len(bytes)) / (end_point - start_point)
>              # Check if we want to continue to include that text
> -            if (prefix == max_fulltext_prefix
> -                and end_point < 2 * max_fulltext_len):
> -                # As long as we are on the same file_id, we will fill at least
> -                # 2 * max_fulltext_len
> -                start_new_block = False
> -            elif end_point > 4*1024*1024:
> +            if end_point > 4*1024*1024:
>                  start_new_block = True
>              elif (prefix is not None and prefix != last_prefix
>                    and end_point > 2*1024*1024):
>
>
> That will leave you with repositories that are considered valid 2a
> format repositories, just not as 'packed' as we would normally make them.
>
> I would guess there will be other places where our memory use will be
> higher than you might like. But at least for the 'compressing 2 large blobs
> together takes too much memory' case, it would sidestep the problem.
>
> 'large' in this case is >4MB.
>
> You could probably even do a little bit better, by checking the length
> of the content before calling 'self._compressor.compress()', and
> choosing to start a new block right away.
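
[In code, that early check might look something like this. A rough, untested
sketch: 'flush_current_block' is a stand-in name for however the surrounding
code in _insert_record_stream finishes the block it is building, and the 4MB
figure simply mirrors the patch above.]

    BIG_TEXT = 4 * 1024 * 1024   # hypothetical threshold

    if len(bytes) > BIG_TEXT:
        # Close out the block under construction before this large text is
        # compressed, rather than discovering afterwards that it pushed
        # end_point past the limit.
        flush_current_block()
    # ... then hand the text to self._compressor.compress() as before.
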
>
> We don't currently abstract that logic as much as would be nice. So if
> you want to play with that code a bit, and potentially make it more
> flexible, we'd probably be interested in patches.
>
> You could, for example, create a configuration variable that would
> indicate various thresholds to use in the compression algorithm. This
> would allow people to set it to whatever they wanted in a given
> repository/branch/etc.
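
[One way that lookup might look. Sketch only: the option name
'groupcompress_block_threshold' is invented here, and where the lookup should
actually live in the code is an open question.]

    from bzrlib import config

    def get_block_threshold(branch=None, default=4*1024*1024):
        # Read a (hypothetical) per-branch or per-user option, falling back
        # to the current hard-coded 4MB when it isn't set.
        if branch is not None:
            cfg = branch.get_config()
        else:
            cfg = config.GlobalConfig()
        value = cfg.get_user_option('groupcompress_block_threshold')
        if value is None:
            return default
        try:
            return int(value)
        except ValueError:
            return default
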
>
> The best part is that it stays 2a compatible, so you don't have to worry
> about whether versions of bzr without your changes can read your disk format.
>
> John
> =:->
>
More information about the bazaar mailing list