Optimising branching and merging big repositories between far away locations...

John Arbash Meinel john at arbash-meinel.com
Tue Oct 28 20:09:42 GMT 2008


Asmodehn Shade wrote:
> Hi *,
> 
> Alright I recently updated to bzr 1.8.
> 
> I have a repository a few gigabytes in size (because of the size of the
> files in it), with usually around 100 revisions in each branch.
> 
> I need to branch from one place to another (usually quite far away =
> high latency) or merge differences.
> However, it appears to be quite slow.
> 
> Despite the bandwidth limitation and the time needed to transfer big
> files anyway, for some reason it's much slower than scp, for example.
> Also, despite the large bandwidth available (a few Mbps), the transfer
> rate can go down to 1 Kbps and stay there for quite a long time...
> 
> I am using bzr+ssh (the fastest protocol I could find...); the
> repository format is whatever the default was (on 1.5) when you ran
> "bzr init-repo".
> 
> So I was wondering if anyone here had advice on how I can make the
> overall branching / pulling / merging operations faster, if possible...
> using more bandwidth or something else...
> 
> Thanks for your advice ;-)
> 
> --
> Alex

So there are a few possibilities for what could be happening. If you
can help debug further, running with "-Dhpss" will add extra debug
information to ".bzr.log" (use "bzr --version" to find this file). That
will record what commands we are issuing, along with some timing
information.

As a guess, I would say we are likely to be slow during index
operations, where we are probing for more information to see what we
need to do next.

I know Andrew Bennetts has a patch out that should help some cases. (For
push/pull we need to find out what one side has that the other doesn't.
We were doing it one revision at a time, and Andrew updated that to
make several requests per round trip.)

That landed in bzr.dev as:
3795 Canonical.com Patch Queue Manager 2008-10-27 [merge]
     Reduce round-trips when pushing to an existing repo by using the
       get_parent_map RPC more,
       and batching calls to it in _walk_to_common_revisions. (Andrew
       Bennetts)
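To make the round-trip savings concrete, here is a toy model (the
function name and the linear-ancestry assumption are mine, not actual
bzrlib code): walking an ancestry of a given depth and counting server
round trips when asking about one revision per request versus batching
several per request.

```python
# Toy model of the round-trip savings from batching (hypothetical
# names; not actual bzrlib code). We walk an ancestry of `depth`
# revisions and count server round trips.

def round_trips(depth, batch_size):
    """Round trips needed to ask about `depth` revisions,
    `batch_size` revisions per request (ceiling division)."""
    return -(-depth // batch_size)

# One revision per request (the old behaviour) vs. batched requests.
depth = 100
print(round_trips(depth, 1))    # 100 round trips
print(round_trips(depth, 10))   # 10 round trips
```

On a high-latency link the round-trip count, not the bandwidth, tends
to dominate, which is why batching helps so much there.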

There are other possibilities...

1) You may consider issuing "bzr pack" on the repository. This will
collapse all of the history (so far) into a single pack file + index.
This can make things faster (in general, looking something up in an
index is O(log N), so having M indexes costs M * log N rather than
log(M*N)).

We do a certain amount of packing automatically (we check after every
commit/push/pull). The automatic algorithm isn't very aggressive, as you
don't really want to redo your whole repository every commit.
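As a rough illustration of that arithmetic (the specific numbers are
made up, and this is not bzr's actual index code), compare the
comparison counts for probing M separate sorted indexes versus one
packed index holding the same keys:

```python
import math

# Rough cost illustration (hypothetical sizes, not bzr's index code):
# probing M separate sorted indexes costs about M * log2(N)
# comparisons, while one packed index of the same M*N keys costs
# only log2(M*N).

M, N = 20, 10_000          # e.g. 20 pack files, 10k keys each
separate = M * math.log2(N)
packed = math.log2(M * N)
print(round(separate))      # 266 comparisons
print(round(packed))        # 18 comparisons
```

With remote access, each comparison can mean another ranged read of an
index file, so the difference is felt directly in round trips.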

2) We have a better index format written, if you want to test it. I
would make a copy of your existing repository, and then do "bzr upgrade
--development2". I believe all clients will need to be bzr 1.7 or greater.

The index format is stable, and we are mostly tuning the code before we
make it a public & stable repository format. For instance, the old index
code had logic that let it prefetch extra data, and I just landed the
code to do the same for the new index code. In many cases the new format
is sufficiently better that, even without prefetching, it was faster
(often significantly so).

If you try it, we would certainly be interested in feedback on how well
it performs for you.

3) Speaking of 'prefetch', you could tune the prefetch algorithm a
little bit. Probably the value in question comes from:
bzrlib/transport/remote.py

Around line 304 there should be:

def recommended_page_size(self):
    """Return the recommended page size for this transport."""
    return 64 * 1024

You could play around with that value and see if larger values work
better for you.

For example, you could set it to 64MB (64 * 1024 * 1024) instead of
64kB. That would likely cause the prefetch code to just always read the
whole index with every request, rather than just reading a little bit at
a time.
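A back-of-envelope way to see the effect (assuming, simplistically,
that the client fetches an index in page-sized reads, one round trip
each; the index size here is hypothetical):

```python
# Back-of-envelope effect of recommended_page_size (simplifying
# assumption: the index is fetched in page-sized reads, one round
# trip each; the 10 MB index size is hypothetical).

def reads_needed(index_bytes, page_size):
    """Number of page-sized reads to cover the index (ceiling)."""
    return -(-index_bytes // page_size)

index_bytes = 10 * 1024 * 1024                       # 10 MB index
print(reads_needed(index_bytes, 64 * 1024))          # 160 reads
print(reads_needed(index_bytes, 64 * 1024 * 1024))   # 1 read
```

Of course, reading the whole index when you only needed a few keys
wastes bandwidth, so the right value depends on your latency/bandwidth
trade-off.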

4) It is possible that 'sftp://' might be faster than 'bzr+ssh://' for
some operations, mostly because of the prefetch code, which is what
Andrew was working on. For push specifically, we would be issuing a
bunch of "do you have revision X" requests, to which the remote responds
"no". When using sftp:// we read the remote index directly, so we
actually get back "no, but I have these 50 random revisions".
(Interestingly, if the remote does have the revision, then it responds
with "yes, *and* I have these 50 ancestors as well".)

5) I think someone commented that you can actually use
"nosmart+bzr+ssh://", which turns off the smart-protocol requests but
retains the better file-access behavior of bzr+ssh. (sftp has the
problem that to read a little bit from a file, you have to issue an
'open + read + close', while bzr+ssh can do the whole thing with a
single 'read' request.)

So you *might* try doing the same action with "nosmart+bzr+ssh://" and
see if that changes things.
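A crude tally of that per-read overhead (a simplification: it counts
protocol operations, not wire round trips, and ignores pipelining):

```python
# Crude tally of protocol operations per ranged read (simplified:
# counts operations, not round trips, and ignores pipelining).

def sftp_ops(num_reads):
    return 3 * num_reads     # open + read + close each time

def smart_ops(num_reads):
    return num_reads         # one 'read' request each time

print(sftp_ops(50))          # 150 operations
print(smart_ops(50))         # 50 operations
```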

John
=:->



More information about the bazaar mailing list