[MERGE] Implement chunked body encoding for the smart protocol.

John Arbash Meinel john at arbash-meinel.com
Mon Oct 22 18:43:03 BST 2007


John Arbash Meinel has voted tweak.
Status is now: Conditionally approved
Comment:
In general, it looks decent, and the tests seem good. (They seem to be
checking the on-the-wire values, which is very good.)
I'm wondering if we don't want to annotate the on-the-wire tests somehow
so that people realize these are important to never change. Certainly,
people can work it out for themselves, but in general with TDD you are
expected to refactor tests, and it is always unfortunate when the 
someone
misses that the tests are updated to make X compatible with itself,
breaking compatibility with everything else.

I sort of feel like chunks should end in a newline, rather than starting
the next length prefix right away. I realize it is a bit arbitrary, but
looking at the bytes on the wire, you get:

   3
   foo4
   bing5
   chunkEND

This would also allow a bit more protocol synchronization, so that if
there were any bugs in the stream, they would be caught right away as 
bad
data. Otherwise if the data has ascii digits at the end, you could 
easily
read the wrong number of bytes. Sure this is unlikely, but coupled with
being nicer to humans, I think it is worthwhile. Oh, it also means that
when there are digits at the end, it it still human parsable, rather 
than:

   4
   foo14
   bingEND

Where it is actually "foo1" and "bing".


+The request protocol is::
+
+  REQUEST_V2 := "bzr request 2" NEWLINE MESSAGE_V2
+
+The response protocol is::
+
+  RESPONSE_V2 := "bzr response 2" NEWLINE MESSAGE_V2

^- Didn't we want to start sending the client and server bzrlib 
versions?
Something like:

"bzr request 3" NEWLINE BZR_CLIENT_VERSION NEWLINE MESSAGE_V3
"bzr response 3" NEWLINE BZR_SERVER_VERSION NEWLINE MESSAGE_V3

I realize this isn't relevant for your specific patch, but as it is a
format bump, maybe we want to get it sooner rather than later.

...
+For compatibility will all versions (past and future) of bzr clients,
+servers that receive a request in an unknown protocol version should
+respond with a single-line error terminated with 0x0a (NEWLINE), rather
+than structured response prefixed with a version string.

^- Since we are asserting that this should be the official way of 
handling
unknown protocol failures, I would like us to get this right, rather 
than
restricting ourselves to how early versions of the protocol worked.

I'm fine with breaking compatibility with 0.11 clients as long as there 
is
a clear win in error clarity. (I'd probably rather avoid breaking 0.92
clients, but we should consider a migration plan if possible.)

...
+The end of the body is signaled by 'END\\n', or by 'ERR\\n' followed by
+error args, one per chunk.  Note that this allows an 8-bit clean error
+response.

^- Why are you using '\\' here? (It may be required by ReST, but it 
looks
a little funny to me.).
Also, these might be better as: ``'END\n'``, so that they get formatted 
in
fixed-width font.

Also, "The end of the body is signlaed by ... followed by error args, 
one
per chunk."
The "one per chunk" part doesn't seem to fit. Is each chunk ended with
'END', do you have to give an ERR for each chunk?

When you say "8-bit clean" does that mean you can send an 8-bit 
response,
or does it mean you have to clean the response of 8th bits. I *think* 
you
are trying to say you can send 8-bit error responses. And I think the
'clean' is just confusing.

...
+**XXX**: ideally the request methods should be documented here.
+Contributions welcome!

^- It seems like we should try to make the list of requests self
documenting (through doc strings, etc) and have that doc auto-generated.
Otherwise there is always going to be skew as someone implements a new
request.
At a minimum, I think the doc should be somewhere other than in this 
file,
as there are going to be lots of requests, which makes this document a 
bit
unweildy.

...

+    def _send_response(self, response):
+        """Send a smart server response down the output stream."""
+        assert not self._finished, 'response already sent'
+        self._finished = True
+        self._write_protocol_version()
+        self._write_success_or_failure_prefix(response)
+        self._write_func(_encode_tuple(response.args))
+        if response.body is not None:
+            assert isinstance(response.body, str), 'body must be a str'
+            bytes = self._encode_bulk_data(response.body)
+            self._write_func(bytes)
+        elif response.body_stream is not None:
+            _send_stream(response.body_stream, self._write_func)

^- We probably want some asserts that if body is not None, then
body_stream is None.

   assert response.body_stream is None, "You can only set one of body or
   body_stream."

I think this is appropriate for an assert, since it is helping a
developer.

...

Why is "_send_stream" broken up into a _send_chunks helper function. At
least from what I see, _send_chunks is only trivially different.

I guess this is why:
+        elif isinstance(chunk, request.FailedSmartServerResponse):
+            write_func('ERR\n')
+            _send_chunks(chunk.args, write_func)
+            return
+        else:

^- But why would you send a series of chunks for an error? Isn't it just 
a
standard error command, like you would see for any other request?

Oh, and why is "chunk.args" a stream?

Also, your design document says that the response needs to start with
"chunked\n" to distinguish it, but I don't see that string being written
anywhere. It might make sense to either put it at the start of
"_send_chunks" or at the start of "_send_stream" depending on what you
intend.
I certainly didn't see 'chunked' in the test suite either. At a minimum
the doc needs to be updated, but I would probably actually like to see
'chunked' in the output stream (unless it makes the parser much more
complicated.)

...

+    def next_read_size(self):
+        # Note: the shortest possible chunk is 2 bytes: '0\n', and the
+        # end-of-body marker is 4 bytes: 'END\n'.
+        if self.state_accept == self._state_accept_reading_chunk:

^- I realize these are the bare minimums of what *could* be coming next.
But I feel like you are really pessimising things. I suppose you are 
just
trying to not consume too many bytes from the socket, but I would rather
see some sort of peek-ahead buffering layer. It could be pretty simple,
and means that you don't have a sys-call just because you aren't sure 
how
many characters they are going to use for the length. In fact, what if 
we
just made the length string fixed-size, especially for chunked data.
There probably isn't a good reason to have chunks > 100k, so why not 
just
make the length string 5 characters long?

At least that way you can read a few bytes and get your stuff in one
attempt, rather than reading 1 byte at a time. (I guess you cheat a bit 
by
buffering an extra 4 characters for 'END\n')

What you have seems okay, but a lot of extra syscalls rather than just
buffering and reading in ~1k chunks.
Then again, will 'read(100)' block if there are only 10 bytes to read?
(And if you make it non-blocking, how does that interact with retrying
until you get enough bytes.)

...
+    def test_body_stream_error_serialistion(self):
+        stream = ['first chunk',
+                  request.FailedSmartServerResponse(
+                      ('FailureName', 'failure arg'))]
+        expected_bytes = (
+            'b\nfirst chunk' +
+            'ERR\n' + 'b\nFailureName' + 'b\nfailure arg' +
+            'END\n')
+        self.assertBodyStreamSerialisation(expected_bytes, stream)
+        self.assertBodyStreamRoundTrips(stream)

^- How are these 'b\n' rather than '11\nfirst chunk'...
Is there something I'm missing? (Certainly the number could be in hex,
though I don't quite understand why we want to do that.)



For details, see: 
http://bundlebuggy.aaronbentley.com/request/%3C20071022071933.GC13260%40steerpike.home.puzzling.org%3E



More information about the bazaar mailing list