[MERGE] HTTP smart server
Andrew Bennetts
andrew at canonical.com
Mon Oct 9 07:49:51 BST 2006
John Arbash Meinel wrote:
>
> Andrew Bennetts wrote:
> > The attached bundle implements the "smart" protocol over HTTP. It includes the
> > client, and a testing server. It won't actually be used by "http://" URLs yet.
> >
> > The basic mechanism is that smart requests are POSTed to base_url +
> > '.bzr/smart', and replies read from the body of the HTTP response.
> >
> > Part of this work has been to separate the protocol from the "medium" (I would
> > have used the term transport, much like Twisted does, but this term already has
> > a quite different meaning in bzrlib). The resulting code in smart.py is
> > generally cleaner and more flexible.
> >
> > -Andrew.
> >
>
> ...
>
> > === modified file BRANCH.TODO
> > --- BRANCH.TODO
> > +++ BRANCH.TODO
> > @@ -1,3 +1,4 @@
> > # This file is for listing TODOs for branches that are being worked on.
> > # It should ALWAYS be empty in the mainline or in integration branches.
> > #
> > +
>
> ^- obviously you used BRANCH.TODO, but then finished everything that was
> listed. It is usually good to do a once-over of your diff before you
> submit to catch things like this.
Fixed.
> > === modified file bzrlib/errors.py // last-changed:andrew.bennetts at canonical.co
> > ... m-20060928040135-c2c58bbb690acc08
> > --- bzrlib/errors.py
> > +++ bzrlib/errors.py
> > @@ -270,6 +270,14 @@
> > """Directory not empty: %(path)r%(extra)s"""
> >
> >
>
> v- It seems that ReadingCompleted, WritingCompleted, and
> WritingNotCompleted might be reasonable to encapsulate in a hierarchy,
> rather than being independent. Also, and more importantly, which of
> these are actually "user" errors that should only create a single line
> error message, and which are programming bugs, which should generate a
> traceback?
I think they're programming bugs, not errors that users should ever see.
I guess there might be a small readability advantage to putting these in a
hierarchy, to make it clearer that they're related. It feels a bit odd, because
I can't think of any reason why you'd want to write "except
MediumRequestMisused:" (or whatever it would be called). So I'm leaning towards
leaving it as is, on the basis that anything further would be a YAGNI.
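That said, if we did want one, I imagine it would look something like this (a
sketch only; MediumRequestMisused is just the placeholder name from above, and
I've rooted it at Exception rather than whatever base class bzrlib/errors.py
would really use):

    class MediumRequestMisused(Exception):
        """Base for programming errors that misuse a medium request.

        These indicate bugs in calling code, not user errors, so a
        traceback is the right way to present them.
        """

    class ReadingCompleted(MediumRequestMisused):
        pass

    class WritingCompleted(MediumRequestMisused):
        pass

    class WritingNotCompleted(MediumRequestMisused):
        pass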
> v- Your default '_expect_body_tail' is None, but then later on you use a
> plain:
>
> while not self.received_bytes.endswith(self._expect_body_tail):
>
> So if _expect_body_tail is None, this will fail with a weird error.
>
> This seems to mean that _expect_body_tail should either be strictly
> required, or default to '' or some other string object.
>
> It might even be reasonable to have 'assert isinstance(expect_body_tail,
> str)', since you explicitly expect that later...
Ideally, I'd make expect_body_tail a required keyword argument, but there's no
syntax for that in Python, so I settled for making it a keyword argument with an
invalid default (None). (I prefer a keyword argument here because this is a test
helper specific to this file: casual readers are unlikely to be familiar with
it, so repeating the "expect_body_tail" parameter in every test that uses the
helper makes those tests more readable.)
I'm not sure that the assertion would really buy us much. "TypeError: expected
a character buffer object" isn't great, but it's still clear enough. "assert
isinstance" makes me uncomfortable in Python; it's generally better not to fight
the duck typing.
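To illustrate the idiom (RecordingServer is a made-up name here; only the
signature and the call site matter):

    class RecordingServer(object):
        """Hypothetical test helper, named for illustration only."""

        def __init__(self, expect_body_tail=None):
            # None is a deliberately invalid default: the argument is
            # effectively required, but callers have to spell it out:
            #     server = RecordingServer(expect_body_tail='done\n')
            # which keeps each test that uses the helper self-describing.
            self._expect_body_tail = expect_body_tail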
> > + def test_get_smart_medium(self):
> > + # For HTTP, get_smart_medium should return the transport object.
> > + server = self.get_readonly_server()
> > + http_transport = self._transport(server.get_url())
> > + medium = http_transport.get_smart_medium()
> > + self.assertTrue(medium is http_transport)
> > +
>
> ^- We have a helper function 'assertIsInstance' or 'assertIs', which
> should be better than assertTrue(... is ...)
Thanks, used assertIs.
> > + # Doing a disconnect on a new (and thus unconnected) SSH medium
> > + # does nothing.
> > + medium = smart.SmartSSHClientMedium(None)
> > + medium.disconnect()
>
> ^- Is this actually testing that it does nothing, or just that it
> doesn't fail?
I guess the latter.
I've changed the comment to:
# Doing a disconnect on a new (and thus unconnected) SSH medium
# does not fail. It's ok to disconnect an unconnected medium.
> > + def test_ssh_client_raises_on_read_when_not_connected(self):
> > + # Doing a read on a new (and thus unconnected) SSH medium raises
> > + # MediumNotConnected.
> > + medium = smart.SmartSSHClientMedium(None)
> > + self.assertRaises(errors.MediumNotConnected, medium.read_bytes, 0)
> > + self.assertRaises(errors.MediumNotConnected, medium.read_bytes, 1)
> > +
>
> ^- Are we wanting to assert that read_bytes(0) raises, or is that just
> an implementation detail?
I think it's reasonable. It's simply not valid to call read_bytes on an
unconnected medium.
read_bytes(0) is an edge case, but edge cases are still worth testing.
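The guard these tests pin down is roughly this (a sketch; _connected and
_read_bytes are stand-ins for whatever the implementation really uses):

    def read_bytes(self, count):
        # The check happens before any I/O, so even read_bytes(0)
        # raises if the medium has never been connected.
        if not self._connected:
            raise errors.MediumNotConnected(self)
        return self._read_bytes(count)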
> > + def test_ssh_client_supports__flush(self):
> > + # invoking _flush on a SSHClientMedium should flush the output
> > + # pipe. We test this by creating an output pipe that records
> > + # flush calls made to it.
> > + from StringIO import StringIO # get regular StringIO
> > + input = StringIO()
> > + output = StringIO()
> > + flush_calls = []
> > + def _flush(): flush_calls.append('flush')
> > + output.flush = _flush
>
> ^- this seems a little bit evil, but not too bad. I would have thought
> you need 'def _flush(self): ...', but you don't because the functions
> should already be bound? Also, there is some confusion about using
> output.flush = _flush, and then calling medium._flush().
> Maybe calling it 'def logging_flush()'
Right, output.flush should be a callable that takes no arguments. Normally it's
a bound method (of a function that takes one argument, "self"), but overriding
it like this, it's just a plain old function.
Renamed to logging_flush, thanks.
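For the record, the mechanism in isolation (plain Python, not code from the
patch):

    from StringIO import StringIO  # the pure-Python class: it allows
                                   # attribute assignment, cStringIO doesn't

    output = StringIO()
    flush_calls = []
    def logging_flush():
        # A plain function, not a method: assigning it to the instance
        # shadows the class's flush, and instance attributes aren't
        # bound on lookup, so no 'self' is passed.
        flush_calls.append('flush')
    output.flush = logging_flush
    output.flush()
    assert flush_calls == ['flush']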
> > + vendor = StringIOSSHVendor(input, output)
> > + medium = smart.SmartSSHClientMedium('a hostname', vendor=vendor)
> > + # this call is here to ensure we only flush once, not on every
> > + # _accept_bytes call.
> > + medium._accept_bytes('abc')
> > + medium._flush()
> > + medium.disconnect()
> > + self.assertEqual(['flush'], flush_calls)
> > +
> > + def test_construct_smart_tcp_client_medium(self):
> > + # the TCP client medium takes a host and a port. Constructing it won't
> > + # connect to anything.
> > + sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> > + sock.bind(('127.0.0.1', 0))
> > + unopen_port = sock.getsockname()[1]
> > + medium = smart.SmartTCPClientMedium('127.0.0.1', unopen_port)
> > + sock.close()
>
> ^- s/unopen_port/not_open_port/
I've taken the third way: "unopened_port". "unopened" is actually a word, and I
think is slightly nicer than negating with "not".
> > + def test_tcp_client_connects_on_first_use(self):
> > + # The only thing that initiates a connection from the medium is giving
> > + # it bytes.
> > + sock, medium = self.make_loopsocket_and_medium()
> > + bytes = []
> > + t = self.receive_bytes_on_server(sock, bytes)
> > + medium._accept_bytes('abc')
> > + t.join()
> > + sock.close()
> > + self.assertEqual(['abc'], bytes)
> > +
> > + def test_tcp_client_disconnect_does_so(self):
> > + # calling disconnect on the client terminates the connection.
> > + # we test this by forcing a short read during a socket.MSG_WAITALL
> > + # call : write 2 bytes, try to read 3, and then the client disconnects.
> > + sock, medium = self.make_loopsocket_and_medium()
> > + bytes = []
> > + t = self.receive_bytes_on_server(sock, bytes)
> > + medium._accept_bytes('ab')
> > + medium.disconnect()
> > + t.join()
> > + sock.close()
> > + self.assertEqual(['ab'], bytes)
> > + # now disconnect again : this should not do anything, if disconnection
> > + # really did disconnect.
> > + medium.disconnect()
>
> ^- Why is _accept_bytes() the function which sends bytes to the other
> side? Is it possible to do something like this with public functions only?
I've changed it to use accept_bytes.
> And is there a reason you need to disconnect a second time? You make a
> comment that 'it should not do anything', but I don't see how you are
> testing that disconnect() is a no-op, versus it being an actual
> disconnect. (You have no return value, or visible side-effects)
>
> It seems like it would be a bug for the second disconnect to be needed.
It's not needed, but at the moment, with __del__ methods and the like, there's
no clear single place responsible for doing disconnection. So it's likely that a
medium will have its disconnect called twice, and until we tidy this part of
the design up, we want that to "work" (i.e. not raise an error if already
disconnected).
> v- These seem to be begging for an interface test, since you want to
> test that all 'Medium' implementations support this. (How many Medium
> implementations do you expect?)
> I can see that there are small differences, but is it possible to factor
> that out?
Yeah, it would be nice to structure it like that, but I think the differences
are just slightly too great to make it worth it. Maybe I'm wrong; someone
(possibly me) should spend the time to try it out.
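If someone does, I'd expect the shape to be roughly this (make_medium is a
hypothetical hook, and this glosses over keeping the base class itself out of
the test loader):

    class SmartClientMediumInterfaceTests(tests.TestCase):
        """Shared medium tests, run once per implementation via subclassing."""

        def make_medium(self):
            raise NotImplementedError(self.make_medium)

        def test_disconnect_when_unconnected(self):
            # Common to every medium: disconnecting an unconnected
            # medium is a no-op, not an error.
            medium = self.make_medium()
            medium.disconnect()

    class TestSSHClientMediumInterface(SmartClientMediumInterfaceTests):

        def make_medium(self):
            return smart.SmartSSHClientMedium(None)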
> > +class TestSmartClientStreamMediumRequest(tests.TestCase):
> > + """Tests the for SmartClientStreamMediumRequest.
> > +
> > + SmartClientStreamMediumRequest is a helper for the three stream based
> > + mediums: TCP, SSH, SimplePipes, so we only test it once, and then test that
> > + those three mediums implement the interface it expects.
> > + """
> > +
> > + def test_accept_bytes_after_finished_writing_errors(self):
> > + # calling accept_bytes after calling finished_writing raises
> > + # WritingCompleted to prevent bad assumptions on stream environments
> > + # breaking the needs of message-based environments.
> > + output = StringIO()
> > + medium = smart.SmartSimplePipesClientMedium(None, output)
> > + request = smart.SmartClientStreamMediumRequest(medium)
> > + request.finished_writing()
> > + self.assertRaises(errors.WritingCompleted, request.accept_bytes, None)
> > +
> > + def test_accept_bytes(self):
> > + # accept bytes should invoke _accept_bytes on the stream medium.
> > + # we test this by using the SimplePipes medium - the most trivial one
> > + # and checking that the pipes get the data.
> > + input = StringIO()
> > + output = StringIO()
> > + medium = smart.SmartSimplePipesClientMedium(input, output)
> > + request = smart.SmartClientStreamMediumRequest(medium)
> > + request.accept_bytes('123')
> > + request.finished_writing()
> > + request.finished_reading()
> > + self.assertEqual('', input.getvalue())
> > + self.assertEqual('123', output.getvalue())
>
> 'accept_bytes()' really sounds to me like you are pulling new bytes to
> yourself, not pushing bytes to other.
In Twisted this would be dataReceived. accept_bytes is a "push" interface: it
is called when there are bytes to process (i.e. it doesn't "pull" by calling
read or whatever itself). So e.g. _serve_one_request_unguarded will call
request.accept_bytes as it pulls bytes off the connection, and the request will
react accordingly. What would you call this?
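To make the push/pull distinction concrete, schematically (simplified, not the
patch verbatim):

    def serve_one_request(protocol, connection):
        # Push: this loop owns the socket and feeds the protocol; that
        # hand-off is what accept_bytes (like Twisted's dataReceived) is.
        while protocol.next_read_size() > 0:
            protocol.accept_bytes(connection.recv(4096))

    # Pull would be the opposite: the protocol itself calling
    # connection.recv() whenever it decided it wanted more bytes.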
> > + def test_construct_sets_stream_request(self):
> > + # constructing a SmartClientStreamMediumRequest on a StreamMedium sets
> > + # the current request to the new SmartClientStreamMediumRequest
> > + output = StringIO()
> > + medium = smart.SmartSimplePipesClientMedium(None, output)
> > + request = smart.SmartClientStreamMediumRequest(medium)
> > + self.assertTrue(medium._current_request is request)
>
> ^- assertIs
Done.
> > + def test_construct_while_another_request_active_throws(self):
> > + # constructing a SmartClientStreamMediumRequest on a StreamMedium with
> > + # a non-None _current_request raises TooManyConcurrentRequests.
> > + output = StringIO()
> > + medium = smart.SmartSimplePipesClientMedium(None, output)
> > + medium._current_request = "a"
> > + self.assertRaises(errors.TooManyConcurrentRequests,
> > + smart.SmartClientStreamMediumRequest, medium)
>
> ^- Can't you just construct 2 requests? It doesn't seem like you want to
> really assert the behavior based on a private member (_current_request)
It's doing white-box testing, testing the implementation directly. Combined
with the previous test (test_construct_sets_stream_request) and the next
(test_finished_read_clears_current_request), it shows that precisely what we
want to happen happens: there can only be one current request at a time.
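In the implementation, that invariant amounts to something like this (a sketch,
not the patch verbatim):

    from bzrlib import errors

    class SmartClientStreamMediumRequest(object):

        def __init__(self, medium):
            if medium._current_request is not None:
                raise errors.TooManyConcurrentRequests(medium)
            medium._current_request = self
            self._medium = medium

        def finished_reading(self):
            # Releasing the medium here is what allows the next
            # request to be constructed.
            self._medium._current_request = None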
> v- Are the buffers meant to be public?
>
> > +
> > + def test_construct_version_one_server_protocol(self):
> > + protocol = smart.SmartServerRequestProtocolOne(None, None)
> > + self.assertEqual('', protocol.excess_buffer)
> > + self.assertEqual('', protocol.in_buffer)
> > + self.assertFalse(protocol.has_dispatched)
> > + # Once refactoring is complete, we don't need these assertions
> > + self.assertFalse(hasattr(protocol, '_in'))
> > + self.assertFalse(hasattr(protocol, '_out'))
> > + self.assertEqual(1, protocol.next_read_size())
>
> ^- ***avoid the evil hasattr()***
> Perhaps:
> marker = object()
> self.assertIs(marker, getattr(protocol, '_in', marker),
> "Protocol should not have a member '_in'")
Actually, as the comment hints, I don't need these assertions anymore :)
They were just ensuring that a refactoring of the class hierarchy was going in
the right direction.
[...]
> > + def test_bulk_data_pycurl(self):
> > + self._test_bulk_data('http+pycurl')
> > +
> > + def test_bulk_data_urllib(self):
> > + self._test_bulk_data('http+urllib')
>
> ^- what if pycurl is not installed?
Then it probably goes "boom!" ;)
Fixed to raise TestSkipped.
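i.e. roughly this (a sketch; the actual patch may spell the guard differently):

    def test_bulk_data_pycurl(self):
        try:
            import pycurl
        except ImportError:
            raise tests.TestSkipped('pycurl not installed')
        self._test_bulk_data('http+pycurl')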
> Why does Post need its own curl object? Vincent Ladeuil had worked out
> how to get _base and _range to share a curl object (just manually set
> the Ranges header).
Those changes hadn't (haven't?) landed at the time I wrote this code, so I
followed the existing idiom. I'm not sure what's involved in sharing curl
objects.
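If I follow the description, it would mean reusing one handle and setting the
Range header per request, something like this (pycurl API from memory,
untested):

    import pycurl

    curl = pycurl.Curl()
    # Instead of a dedicated self._range_curl (or self._post_curl), reuse
    # the base handle and set the Range header explicitly for ranged GETs:
    curl.setopt(pycurl.HTTPHEADER, ['Range: bytes=100-199'])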
> > === modified file bzrlib/transport/http/_pycurl.py // last-changed:andrew.benne
> > ... tts at canonical.com-20060926054631-7d63ee5c801fb098
> > --- bzrlib/transport/http/_pycurl.py
> > +++ bzrlib/transport/http/_pycurl.py
> > @@ -77,10 +77,12 @@
> > if from_transport is not None:
> > self._base_curl = from_transport._base_curl
> > self._range_curl = from_transport._range_curl
> > + self._post_curl = from_transport._post_curl
> > else:
> > mutter('using pycurl %s' % pycurl.version)
> > self._base_curl = pycurl.Curl()
> > self._range_curl = pycurl.Curl()
> > + self._post_curl = pycurl.Curl()
> >
>
> v- There are quite a few typos here, can you clean it up so it is easier
> to read?
Fixed, thanks.
> > @@ -137,14 +197,13 @@
> > # the socket, and it assumes everything is UTF8 sections separated
> > # by \001. Which means a request like '\002' Will abort the connection
> > # because of a UnicodeDecodeError. It does look like invalid data will
> > -# kill the SmartStreamServer, but only with an abort + exception, and
> > +# kill the SmartServerStreamMedium, but only with an abort + exception, and
> > # the overall server shouldn't die.
>
> ^- Shouldn't this kill the Protocol layer, not the Medium layer?
What happens currently (glancing at the code) is that the request will catch
that error, and send an error reply (and not disconnect). That's probably a
reasonable thing to do, but it's not directly tested...
...so I've just added a test, and deleted this comment. Here's the new test:
    def test_accept_bytes_with_invalid_utf8_to_protocol(self):
        out_stream = StringIO()
        protocol = smart.SmartServerRequestProtocolOne(None, out_stream.write)
        # the byte 0xdd is not a valid UTF-8 string.
        protocol.accept_bytes('\xdd\n')
        self.assertEqual(
            "error\x01Generic bzr smart protocol error: "
            "one or more arguments of request '\\xdd\\n' are not valid UTF-8\n",
            out_stream.getvalue())
        self.assertTrue(protocol.has_dispatched)
        self.assertEqual(1, protocol.next_read_size())
> Admittedly, I didn't go over the rest as closely, but I think the rest
> looks okay.
>
> And I sort-of understand why you use 'accept_bytes', but I can say that
> it wasn't directly obvious when I read it. Maybe 'process_bytes()' is
> better, or maybe I was just reading it weird and not noticing that you
> were telling the server process to 'accept_bytes' rather than the client
> process.
Perhaps process_bytes is better. I think I avoided "process" because in code it
tends to be too vague and general (like saying "do something"). It's probably
slightly better than accept_bytes, though.
> It also isn't clear why you have the 'get_smart_client()' and
> 'get_smart_medium()' calls, when they all need to be passed into a
> 'SmartTransport' object. Doesn't it make sense to have
> Transport.get_smart_transport() or something like that?
I think part of the problem here is that the get_smart_* API isn't actually used
yet; we've been intending for it to be used (and useful) for quite some time
now, and it's finally almost there, but at the moment nothing outside the test
suite exercises it.
I expect what we want from this will become clear quite rapidly once we do the
work to e.g. automatically use a smart transport instead of a vanilla HTTP one
if we detect that the server supports it.
-Andrew.