[MERGE] Implement chunked body encoding for the smart protocol.

Fri Nov 9 20:43:22 GMT 2007

John Arbash Meinel wrote:
> John Arbash Meinel has voted tweak.
> Status is now: Conditionally approved
> Comment:
> In general, it looks decent, and the tests seem good. (They seem to be
> checking the on-the-wire values, which is very good.)
> I'm wondering if we don't want to annotate the on-the-wire tests somehow
> so that people realize these are important to never change. Certainly,
> people can work it out for themselves, but in general with TDD you are
> expected to refactor tests, and it is always unfortunate when the someone
> misses that the tests are updated to make X compatible with itself,
> breaking compatibility with everything else.

I've added a note about protocol versioning to the docstring.  We could really
do with a more comprehensive way to ensure cross-version compatibility, but this
will do for now.

> I sort of feel like chunks should end in a newline, rather than starting
> the next length prefix right away. I realize it is a bit arbitrary, but
> looking at the bytes on the wire, you get:
>
>   3
>   foo4
>   bing5
>   chunkEND
>
> This would also allow a bit more protocol synchronization, so that if
> there were any bugs in the stream, they would be caught right away as bad
> data. Otherwise if the data has ascii digits at the end, you could easily
> read the wrong number of bytes. Sure this is unlikely, but coupled with
> being nicer to humans, I think it is worthwhile. Oh, it also means that
> when there are digits at the end, it it still human parsable, rather than:
>
>   4
>   foo14
>   bingEND
>
> Where it is actually "foo1" and "bing".

I've thought about this, but I'm not actually convinced it's worth worrying
about.  The benefits seem pretty slight in practice.  If you really are trying
to debug the data stream it's not too hard to detangle this sort of stuff, and
you'll have to cope with non-human-friendly stuff like non-ascii bytes anyhow.

> +The request protocol is::
> +
> +  REQUEST_V2 := "bzr request 2" NEWLINE MESSAGE_V2
> +
> +The response protocol is::
> +
> +  RESPONSE_V2 := "bzr response 2" NEWLINE MESSAGE_V2
>
> ^- Didn't we want to start sending the client and server bzrlib versions?
> Something like:
>
> "bzr request 3" NEWLINE BZR_CLIENT_VERSION NEWLINE MESSAGE_V3
> "bzr response 3" NEWLINE BZR_SERVER_VERSION NEWLINE MESSAGE_V3
>
> I realize this isn't relevant for your specific patch, but as it is a
> format bump, maybe we want to get it sooner rather than later.

We do want to include this in future version strings, but this actually already
the current version string, so I can't change it in this patch.  Future versions
should be like you say, but this part of the document is just describing what we
have.

I agree that we should do that format bump sooner rather than later, ideally
before 1.0.  It shouldn't be particularly hard; patches welcome!

> ...
> +For compatibility will all versions (past and future) of bzr clients,
> +servers that receive a request in an unknown protocol version should
> +respond with a single-line error terminated with 0x0a (NEWLINE), rather
> +than structured response prefixed with a version string.
>
> ^- Since we are asserting that this should be the official way of handling
> unknown protocol failures, I would like us to get this right, rather than
> restricting ourselves to how early versions of the protocol worked.
>
> I'm fine with breaking compatibility with 0.11 clients as long as there is
> a clear win in error clarity. (I'd probably rather avoid breaking 0.92
> clients, but we should consider a migration plan if possible.)

Well, what improvement in clarity do you want that this restriction prevents?

I'm happy to loosen this if we think it's worth it, but a new-line terminated
string for this case seems adequate, and provides an ok result for even ancient
bzr versions.

We can always change this specification later if we decide to change this
policy.  In the meantime it's good to have the rationale for the current state
of the protocol clear.

> ...
> +The end of the body is signaled by 'END\\n', or by 'ERR\\n' followed by
> +error args, one per chunk.  Note that this allows an 8-bit clean error
> +response.
>
> ^- Why are you using '\\' here? (It may be required by ReST, but it looks
> a little funny to me.).

Yep, required by ReST.  Take a look at the HTML version produced by “make docs”.

> Also, these might be better as: ``'END\n'``, so that they get formatted in
> fixed-width font.

Done.

> Also, "The end of the body is signlaed by ... followed by error args, one
> per chunk."
> The "one per chunk" part doesn't seem to fit. Is each chunk ended with
> 'END', do you have to give an ERR for each chunk?

It's as the more formal description above said:

  ERROR_TERMINATOR := 'ERR' NEWLINE CHUNKS SUCCESS_TERMINATOR

So an error can be decoded as e.g. a tuple of strings.  It's an alternative
encoding to the ARGS encoding used for the request args, with the advantage of
being safe for transmitting any 8-bit data.

> When you say "8-bit clean" does that mean you can send an 8-bit response,
> or does it mean you have to clean the response of 8th bits. I *think* you
> are trying to say you can send 8-bit error responses. And I think the
> 'clean' is just confusing.

I am trying to say you can send 8-bit error responses (in contrast to the ARGS
encoding, which cannot safely transmit certain bytes and provides no way to
escape them).  I've tweaked the text to make this clearer.

>
> ...
> +**XXX**: ideally the request methods should be documented here.
> +Contributions welcome!
>
> ^- It seems like we should try to make the list of requests self
> documenting (through doc strings, etc) and have that doc auto-generated.
> Otherwise there is always going to be skew as someone implements a new
> request.

Possibly, although I think a network protocol should not just be specified by
its implementation, so I'm not sure that auto-generation is good enough.

> At a minimum, I think the doc should be somewhere other than in this file,
> as there are going to be lots of requests, which makes this document a bit
> unweildy.

We can worry about the unweildiness in this file once we actually have this
section, rather than just an XXX ;)

>
> ...
>
> +    def _send_response(self, response):
> +        """Send a smart server response down the output stream."""
> +        assert not self._finished, 'response already sent'
> +        self._finished = True
> +        self._write_protocol_version()
> +        self._write_success_or_failure_prefix(response)
> +        self._write_func(_encode_tuple(response.args))
> +        if response.body is not None:
> +            assert isinstance(response.body, str), 'body must be a str'
> +            bytes = self._encode_bulk_data(response.body)
> +            self._write_func(bytes)
> +        elif response.body_stream is not None:
> +            _send_stream(response.body_stream, self._write_func)
>
> ^- We probably want some asserts that if body is not None, then
> body_stream is None.

Good idea, done.

>
> Why is "_send_stream" broken up into a _send_chunks helper function. At
> least from what I see, _send_chunks is only trivially different.
>
> I guess this is why:
> +        elif isinstance(chunk, request.FailedSmartServerResponse):
> +            write_func('ERR\n')
> +            _send_chunks(chunk.args, write_func)
> +            return
> +        else:
>
> ^- But why would you send a series of chunks for an error? Isn't it just a
> standard error command, like you would see for any other request?

Because this way errors can be safely transmitted even if they have 0x01 or 0x0a
bytes, which is a flaw in the the request arg encoding.

I could have reused the existing encoding, but seeing as this is a new protocol
extension I took the opportunity to avoid the 8-bit unsafeness flaw for this
part rather than perpetuating it.

> Oh, and why is "chunk.args" a stream?

Why not? ;)

It just seemed convenient.  Errors on the smart protocol are already represented
by tuples, and tuples are “streams”; they are an iterable of strings.

> Also, your design document says that the response needs to start with
> "chunked\n" to distinguish it, but I don't see that string being written
> anywhere. It might make sense to either put it at the start of
> "_send_chunks" or at the start of "_send_stream" depending on what you
> intend.
> I certainly didn't see 'chunked' in the test suite either. At a minimum
> the doc needs to be updated, but I would probably actually like to see
> 'chunked' in the output stream (unless it makes the parser much more
> complicated.)

Oops, I missed this!  Thanks for noticing.

I've fixed the code to do this.

> ...
>
> +    def next_read_size(self):
> +        # Note: the shortest possible chunk is 2 bytes: '0\n', and the
> +        # end-of-body marker is 4 bytes: 'END\n'.
> +        if self.state_accept == self._state_accept_reading_chunk:
>
> ^- I realize these are the bare minimums of what *could* be coming next.
> But I feel like you are really pessimising things. I suppose you are just
> trying to not consume too many bytes from the socket, but I would rather
> see some sort of peek-ahead buffering layer. It could be pretty simple,
> and means that you don't have a sys-call just because you aren't sure how
> many characters they are going to use for the length. In fact, what if we
> just made the length string fixed-size, especially for chunked data.
> There probably isn't a good reason to have chunks > 100k, so why not just
> make the length string 5 characters long?
[...]

Arbitrary limits tend to cause headaches down the track, so I'd rather not do
that.

Also, “next_read_size” is a hint for media that need it.  The
SmartTCPClientMedium for instance doesn't use it.  It's there so that for e.g.
pipes (which block if you try to read too much in a naïve way) we can do the
right thing.

It would be nice to never need next_read_size, but this isn't a new problem in
this branch; it's an existing issue with the smart server code.  The
accept_bytes API is perfectly happy to receive as many bytes as you give it, so
if we want to improve e.g. reading from pipes to read optimistically, we don't
need to change this decoder (except to delete next_read_size ;).

> ...
> +    def test_body_stream_error_serialistion(self):
> +        stream = ['first chunk',
> +                  request.FailedSmartServerResponse(
> +                      ('FailureName', 'failure arg'))]
> +        expected_bytes = (
> +            'b\nfirst chunk' +
> +            'ERR\n' + 'b\nFailureName' + 'b\nfailure arg' +
> +            'END\n')
> +        self.assertBodyStreamSerialisation(expected_bytes, stream)
> +        self.assertBodyStreamRoundTrips(stream)
>
> ^- How are these 'b\n' rather than '11\nfirst chunk'...
> Is there something I'm missing? (Certainly the number could be in hex,
> though I don't quite understand why we want to do that.)

It's in hex.

It could be in decimal instead, but I don't quite understand why we'd want to do
that ;)

-Andrew.