[MERGE] Re: Accelerating push via bzr:// and friends

Martin Pool mbp at canonical.com
Wed May 21 04:33:15 BST 2008

On Wed, May 21, 2008 at 12:55 PM, Andrew Bennetts <andrew at canonical.com> wrote:
> Martin Pool wrote:
>> On Tue, May 20, 2008 at 8:10 PM, Robert Collins
>> <robertc at robertcollins.net> wrote:
> [...]
>> >
>> > NODELAY stops the kernel buffering, so you get lots of packets on the
>> > wire; otoh it means when we are being chatty we don't have to wait.
>> > Perhaps we don't want NODELAY?
>> SO_NODELAY does not afaik change any behaviour to do with waiting for
>> acks, but only as Robert indicates changes transmission buffering.  I
>> don't think we would want it on in our case.
> We originally turned it on because it made things faster.  The test suite was
> noticeably slower without it, back when we were developing the initial HPSS
> code.  IIRC, without it we'd write a request to a socket, but then find that the
> kernel would hang on to it for some number of milliseconds before sending it,
> even though we were just waiting for a response.  Our first request is often
> quite small (e.g. the initial BzrDir.open request, even with headers etc, is
> still only 82 bytes long in v3), so it's understandable that the kernel might
> think it's worth waiting for more.  This situation is what the NODELAY flag is
> for, AIUI.
> We aren't ever as pathological as e.g. a user typing in a telnet session.  We
> sometimes wrote requests/responses in multiple calls because that was the most
> convenient thing to do.  In theory the network overhead of 3 TCP packets vs 1
> packet should be trivial, so the there's no downside to us using NODELAY.  So I
> think we should keep setting NODELAY.
> The problem I've seen with v3 when testing over a 500ms latency connection is
> that the very first write to the socket is sent immediately, but then later
> writes to send the rest of the request are delayed by 1000ms, because the kernel
> waits to see the first ACK before sending any more packets.  There's no flag I
> can see to change this (and even if there were, I doubt it would be portable).
> So I guess the only portable way to perform optimally here is to only call
> send(2) once, which at least has the advantage of being a simple to explain
> strategy. :)

Snader's "Effective TCP/IP Programming" says

  The Nagle algorithm is usually implemented by requiring that a small
segment not be sent if there is any unacknowledged data outstanding.
(...RFC 1122)  BSD-derived implementations, as well as many others,
bend this rule a little to allow sending a final small segment for a
large write on an idle connection.

This seems like a good match for the situation you describe above.
But setting NODELAY is just kind of hiding the problem, and not
totally effectively: we will send a stream of small packets when what
we really want is just to send the whole thing in one go.  And it
seems like the reliable way to do that is to do just a large send, as
you said.  If we've done that I wonder if nodelay will still have any

>> I think we are using Python file objects to send and receive, so
>> rather than buffering ourself we should make sure their buffering is
>> turned on, and then just flush them when we have finished sending the
>> message.
> We are actually using socket objects directly when talking over TCP.

We should try to make sure that we also do large writes to the pipe to
the child process when using ssh to give it the best chance at also
sending large packets.

> Also, just flushing at the end of a message isn't sufficient if there's a body
> stream.  So we need to be flushing after each chunk of a body stream as well as
> at the end of a message.


> The attached patch implements appropriate buffering, and adds tests for that
> buffering.  In my testing this brings a push of one revision of bzr.dev over
> 500ms latency TCP down from 3m 19s to 1m 23s.

Wow, nice result!

This seems to do unboundedly large buffering, which I think is too
much.  In extreme cases we may run out of memory, but even at smaller
layers send() may refuse to accept messages which are too big to
atomically pass to the stack.  I think buffering say 64 or 128k will
be enough to avoid small packets.  Also, (though this may not be a
problem at present) if we have a large body we do want to start
transmitting as soon as we can.  Ideally you would also test for this.

+class StubMediumRequest(object):
+    """A stub medium request that tracks the number of times accept_bytes is
+    called.
+    """
+    def __init__(self):
+        self.calls = []
+    def accept_bytes(self, bytes):
+        self.calls.append('accept_bytes')
+    def finished_writing(self):
+        self.calls.append('finished_writing')

PEP8 :)

Thanks for adding tests for this, I know it would have been tempting
to just rely on your own measurements.

Martin <http://launchpad.net/~mbp/>

More information about the bazaar mailing list