[MERGE] Implement chunked body encoding for the smart protocol.

Vincent Ladeuil v.ladeuil+lp at free.fr
Thu Oct 25 10:43:11 BST 2007


>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:

<snip/>

    john> ^- I realize these are the bare minimums of what
    john> *could* be coming next.  But I feel like you are really
    john> pessimising things. I suppose you are just trying to
    john> not consume too many bytes from the socket, but I would
    john> rather see some sort of peek-ahead buffering layer. It
    john> could be pretty simple, and means that you don't have a
    john> sys-call just because you aren't sure how many
    john> characters they are going to use for the length. In
    john> fact, what if we just made the length string
    john> fixed-size, especially for chunked data.  There
    john> probably isn't a good reason to have chunks > 100k, so
    john> why not just make the length string 5 characters long?

Having played a bit (a big bit) with binary encoding of various
messages (say, around 400 for 10 or 20 different protocols), my
summary is: the only reliable way to ensure a safe decoding is to
always and only use length-prefixed formats.

That addresses the following problems:
- avoid reading ahead too much,
- detect corruptions.

All other variations fail to meet these two criteria, either
because they use some delimiter-suffixed part (like for example a
'\n' at the end of a string) or embed some dynamic part into the
format (like, for example, some layout of the format depending on
one specific value in a field).
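
To make the length-prefixed idea concrete, here is a minimal sketch.
The 4-byte length and the helper names write_frame, read_exactly and
read_frame are assumptions made for this illustration; they are not
taken from bzr or from transportstats. The reader always knows how
many bytes to ask for next, so it never consumes bytes belonging to
the following message, and a short read is reported as an error.

```python
import io
import struct


def write_frame(stream, payload):
    # 4-byte unsigned length in network byte order, then the payload.
    stream.write(struct.pack("!L", len(payload)))
    stream.write(payload)


def read_exactly(stream, n):
    # Simplification: a real socket may return fewer bytes per call,
    # so production code would loop until n bytes have arrived.
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("truncated: wanted %d bytes, got %d" % (n, len(data)))
    return data


def read_frame(stream):
    # Read the fixed-size length first, then exactly that many bytes.
    (length,) = struct.unpack("!L", read_exactly(stream, 4))
    return read_exactly(stream, length)


buf = io.BytesIO()
write_frame(buf, b"hello\nworld")  # an embedded '\n' is harmless here
write_frame(buf, b"second")
buf.seek(0)
first = read_frame(buf)
second = read_frame(buf)
```

Note that the payload may contain any byte, including the delimiter a
line-oriented format would have to escape.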

There are limitations I won't go into here, but basically you
encode everything as TLV (Type, Length, Value). For each field you
specify a format (which defines how to read the value, which may
be prefixed by a length). You can generally define several fields
inside one enclosing format; fields themselves can define their
own format recursively if needed (but these are exceptions rather
than the rule).

As long as the Ts are constant, only the Ls and Vs have to be
encoded on the wire, provided that encoder and decoder agree on
the Ts.
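
A small sketch of plain TLV may help here. The one-byte tags, the
two-byte length and the names T_UINT32, T_STRING, encode_tlv and
decode_tlvs are all hypothetical, picked only for this example.

```python
import struct

# Hypothetical one-byte type tags, chosen for this example only.
T_UINT32 = 1
T_STRING = 2


def encode_tlv(tag, value):
    if tag == T_UINT32:
        body = struct.pack("!L", value)
    elif tag == T_STRING:
        body = value.encode("utf-8")
    else:
        raise ValueError("unknown tag %r" % (tag,))
    # Type (1 byte), Length (2 bytes), then the Value itself.
    return struct.pack("!BH", tag, len(body)) + body


def decode_tlvs(data):
    pos = 0
    fields = []
    while pos < len(data):
        tag, length = struct.unpack_from("!BH", data, pos)
        pos += 3
        body = data[pos:pos + length]
        if len(body) != length:
            raise ValueError("truncated value")
        pos += length
        if tag == T_UINT32:
            if length != 4:
                raise ValueError("bad length for uint32")
            fields.append((tag, struct.unpack("!L", body)[0]))
        elif tag == T_STRING:
            fields.append((tag, body.decode("utf-8")))
        else:
            raise ValueError("unknown tag %r" % (tag,))
    return fields


wire = encode_tlv(T_STRING, "a/b") + encode_tlv(T_UINT32, 1024)
fields = decode_tlvs(wire)
```

When both sides already agree on the sequence of types, the tag byte
carries no information and can be left off the wire, which is the
point made above.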

That was a big part of the transportstats experiment
(https://code.launchpad.net/~v-ladeuil/bzr/transportstats).

The design is such that you define a format like
'%(base)s%(relpath)s%(bytes_read)L%(latency)H' for the
Transport.get method (for example), 's' (string), 'L' (unsigned
long 32 bits) and 'H' (unsigned short 16 bits) being the types
here.

This format is "compiled" only *once*, generating the encode and
decode methods which receive or produce a tuple: (base, relpath,
bytes_read, latency). That means that the format is user-friendly
while the implementation is not performance-hostile (you don't
scan the format each time you want to decode or encode).
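
The following is only a sketch of what "compiling once" could look
like, not the transportstats code. The name compile_format is
hypothetical, and encoding 's' as a two-byte length followed by UTF-8
bytes is an assumption made for the example. The format string is
parsed a single time; the returned encode and decode functions only
walk a precomputed list of types.

```python
import re
import struct

_FIELD = re.compile(r"%\((\w+)\)([sLH])")


def compile_format(fmt):
    """Parse fmt once and return (names, encode, decode)."""
    fields = _FIELD.findall(fmt)
    # Reject anything the sketch does not understand.
    if "".join("%%(%s)%s" % f for f in fields) != fmt:
        raise ValueError("unsupported format: %r" % (fmt,))
    names = tuple(name for name, _ in fields)
    types = [t for _, t in fields]

    def encode(values):
        if len(values) != len(types):
            raise ValueError("expected %d values" % len(types))
        out = []
        for t, v in zip(types, values):
            if t == "s":
                raw = v.encode("utf-8")
                # Assumption: strings carry a 2-byte length prefix.
                out.append(struct.pack("!H", len(raw)) + raw)
            else:
                out.append(struct.pack("!" + t, v))
        return b"".join(out)

    def decode(data):
        pos = 0
        values = []
        for t in types:
            if t == "s":
                (n,) = struct.unpack_from("!H", data, pos)
                pos += 2
                raw = data[pos:pos + n]
                if len(raw) != n:
                    raise ValueError("truncated string")
                values.append(raw.decode("utf-8"))
                pos += n
            else:
                (v,) = struct.unpack_from("!" + t, data, pos)
                values.append(v)
                pos += struct.calcsize("!" + t)
        if pos != len(data):
            raise ValueError("trailing bytes")
        return tuple(values)

    return names, encode, decode


names, encode, decode = compile_format(
    "%(base)s%(relpath)s%(bytes_read)L%(latency)H")
record = ("http://example.com/", "a/b", 1024, 37)
wire = encode(record)
```

No type information appears in wire: both ends derive it from the
shared format string.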

This is currently done in Python only, but a C version will be
easy to implement if performance becomes a problem.

An interesting result is that by using a few tricks I obtained a
10:1 ratio between the text representation and the binary
representation of these tuples, i.e. roughly what I expected from
gzip compression.

Also note that although the binary format is not human-readable*,
*displaying* it in a human-readable format *is* trivial.
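
For instance, a hex dump takes one line. The helper name to_hex and
the sample values are made up for this illustration.

```python
import struct


def to_hex(data):
    # One two-digit hex group per byte, separated by spaces.
    return " ".join("%02x" % b for b in data)


# A 32-bit and a 16-bit unsigned value in network byte order.
wire = struct.pack("!LH", 1024, 37)
dump = to_hex(wire)  # "00 00 04 00 00 25"
```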

Endianness support (which is generally addressed by using text,
as in HTTP) is hidden behind the format layer, which always
encodes in network byte order.
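
As a quick illustration using Python's struct module (the value is
arbitrary), the "!" prefix selects network byte order with standard
sizes regardless of the host CPU, which is what lets the format layer
hide the issue:

```python
import struct

value = 1024  # 0x00000400

network = struct.pack("!L", value)  # network order (big-endian)
little = struct.pack("<L", value)   # explicit little-endian, for contrast

# Decoding with "!" gives the same result on any host.
decoded = struct.unpack("!L", network)[0]
```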

Final note: the transportstats plugin is a work in progress. The
current implementation already demonstrates the potential; I'll
complete it as time permits.

  Vincent

*: After a few years of practice I can read it fluently in hex,
 but I'm not human anymore in that area :)



