"bzr branch" downloads 10x the whole repo size through "dumb" http server (format 2a)?

Tue May 7 14:05:52 UTC 2013

This cache holds the actual content of the files, etc. in your
repository. Usually the cached objects max out at just 4MB. If you are
getting one that is 60MB, that means you have some content which is at
least ~30MB in size: if one hunk that we are storing is larger than
4MB, then we allow the block to grow to 2x the size of that hunk.
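
As a rough check of that arithmetic, here is a minimal sketch using the
two numbers from the log line quoted below. The helper name
min_fulltext_len is made up for illustration and is not part of bzrlib;
it only assumes the "block stays below 2x the largest fulltext" rule
described above.

```python
MIB = 1024 * 1024

# Numbers taken from the LRUSizeCache log line quoted further down.
block_size = 63238385       # size of the block that failed to cache
cache_max_size = 52428800   # max size reported for the cache (50 MiB)


def min_fulltext_len(block_size):
    """Smallest fulltext length L (in bytes) with block_size < 2 * L.

    Hypothetical helper: it just inverts the 2x growth rule.
    """
    return block_size // 2 + 1


print("block size      : %.1f MiB" % (block_size / MIB))
print("min fulltext len: %.1f MiB" % (min_fulltext_len(block_size) / MIB))
print("fits in cache   : %s" % (block_size <= cache_max_size))
```

So a block of that size implies at least one fulltext of a bit over 30MiB,
and the block itself is larger than the cache's max size.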

Unfortunately, there really isn't anything we can do to make that
block smaller, because that is already as small as we could compress it.

One option would be to poke at line 1808 in groupcompress.py and then
repack the repository. So instead of:
    if (prefix == max_fulltext_prefix
        and end_point < 2 * max_fulltext_len):
        # As long as we are on the same file_id, we will fill at least
        # 2 * max_fulltext_len
        start_new_block = False

You could do:
    if (prefix == max_fulltext_prefix
        and end_point < 2 * max_fulltext_len):
        # As long as we are on the same file_id, we will fill at least
        # 2 * max_fulltext_len
        if end_point > 45*1024*1024:
            # The max cache size is 50MB, so don't create bigger blocks
            start_new_block = True
        else:
            start_new_block = False
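
To make the effect of that change easier to see, here is a standalone
sketch of just this one branch. The function name keep_filling_block
and the cap parameter are made up for illustration; the real code in
groupcompress.py has other conditions around it that are not modelled
here.

```python
MIB = 1024 * 1024


def keep_filling_block(prefix, max_fulltext_prefix, end_point,
                       max_fulltext_len, cap=45 * MIB):
    """Return True if the current block should keep growing.

    Hypothetical model of the patched branch only: the block may keep
    growing while we are on the same prefix (file_id), it is still
    below 2 * max_fulltext_len, and it has not passed the cap.
    """
    same_file = (prefix == max_fulltext_prefix)
    under_2x = end_point < 2 * max_fulltext_len
    return same_file and under_2x and end_point <= cap


# A 32 MiB fulltext: the unpatched rule would allow up to 64 MiB,
# the cap stops the block once it passes 45 MiB.
print(keep_filling_block(("f",), ("f",), 40 * MIB, 32 * MIB))  # below cap
print(keep_filling_block(("f",), ("f",), 46 * MIB, 32 * MIB))  # above cap
```

The 45MiB cap is the value from the snippet above, chosen to stay under
the 50MB cache limit.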

This will mean that the on-disk size might grow significantly if you
have a lot of nearly identical versions of large content. (We know
that this is similar content because the "prefix" of the identifier
matches.)

Anyway, making the change above and recompressing will still leave you
with a valid repository, so you can try it if you want. It is likely
that the final size of the repository will go up, but if you don't
have a lot of large content, then it shouldn't matter too much.

It is also possible that you just have a single revision of a very
large file that is >50MB all by itself. However, I don't think we
would download it repeatedly in that case.

John
=:->

On 5/4/2013 3:39 AM, Marcin Wojdyr wrote:
> On 4 May 2013 05:41, John Arbash Meinel <john at arbash-meinel.com>
> wrote:
> 
>> http://pastebin.com/KktitEte
>> 
>> Line 3097: 74.903  Adding the key
>> (<bzrlib.btree_index.BTreeGraphIndex object at 0x2898a10>, 13957,
>> 11581132) to an LRUSizeCache failed. value 63238385 is too big to
>> fit in a the cache with size 41943040 52428800
>> 
>> You have a single item which is about 63MB (63238385 bytes). And
>> the cache is sized with a max size of about 52MB (52428800 bytes).
>> So it downloads the item, but tries not to use up all of your
>> memory, so it drops it, and then the same thing happens again.
>> 
>> The max is set in bzrlib/groupcompress.py line 1225.
>> 
>> You can try setting that to 1024*1024*1024 and see how the http 
>> download works.
> 
> It helped! Now it is as fast as ssh+bzr.
> 
> Can I somehow repack the repository to avoid this problem without 
> changing bzr clients? What sort of items is stored in this cache?
> 
> thanks Marcin
> 
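
The repeated download described in the quoted message can be imitated
with a toy size-limited cache. This is only a sketch under assumptions:
ToySizeCache and count_downloads are made-up names, and this is not the
LRUSizeCache implementation from bzrlib. It just shows that an item
larger than the max size is never kept, so every read fetches it again.

```python
from collections import OrderedDict


class ToySizeCache:
    """Toy size-bounded cache (not bzrlib's LRUSizeCache)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = OrderedDict()   # key -> size, oldest first
        self._total = 0

    def add(self, key, value_size):
        """Store the item; return False if it can never fit."""
        if value_size > self.max_size:
            return False
        if key in self._items:
            self._total -= self._items.pop(key)
        # Evict the oldest entries until the new item fits.
        while self._items and self._total + value_size > self.max_size:
            _, evicted = self._items.popitem(last=False)
            self._total -= evicted
        self._items[key] = value_size
        self._total += value_size
        return True

    def __contains__(self, key):
        return key in self._items


def count_downloads(cache, key, size, reads):
    """Count how often the item must be fetched over `reads` accesses."""
    downloads = 0
    for _ in range(reads):
        if key not in cache:
            downloads += 1            # pretend to fetch over http
            cache.add(key, size)
    return downloads


# Sizes from the log line in this thread.
print(count_downloads(ToySizeCache(52428800), "block", 63238385, 10))
# With the max raised to 1024*1024*1024, as suggested above.
print(count_downloads(ToySizeCache(1024 * 1024 * 1024), "block", 63238385, 10))
```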
