BundleReader, Containers, and file IO

Thu Oct 25 15:29:36 BST 2007

Aaron Bentley wrote:
> Andrew Bennetts wrote:
> > However, I'm not sure that for bundles we actually care about the requirement
> > that we never over-read; as far as I can tell, that code is just used with real
> > files that simply return '' when there are no more bytes available.
> 
> > So, is this approach reasonable?
> 
> I want to agree, but I do think it's reasonable to accept bundles from a
> pipe.  Can we support that still?

We can, but it requires effort.

The slow way is to avoid over-reading by only reading one byte at a time: i.e.
read one byte, push it into the parser, check if the parser is in a terminal
state, if not repeat...  I think realistically this isn't an option; the
overhead is just too huge.
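
As a rough sketch of that byte-at-a-time loop: the ToyPushParser class and its
"<length>\n<body>" record format below are invented purely for illustration and
are not the real ContainerPushParser API.

```python
import io


class ToyPushParser:
    """Hypothetical push parser for a single b"<length>\\n<body>" record."""

    def __init__(self):
        self._header = b""
        self._remaining = None  # body bytes still expected; None until the header is parsed
        self.body = b""

    def accept_bytes(self, data):
        # Feed the bytes through the state machine one at a time.
        for offset in range(len(data)):
            byte = data[offset:offset + 1]
            if self._remaining is None:
                if byte == b"\n":
                    self._remaining = int(self._header)
                else:
                    self._header += byte
            else:
                self.body += byte
                self._remaining -= 1

    def finished(self):
        # Terminal state: header parsed and the whole body received.
        return self._remaining == 0


def read_record_bytewise(source):
    """Never over-reads, at the cost of one read() call per byte."""
    parser = ToyPushParser()
    while not parser.finished():
        byte = source.read(1)
        if not byte:
            raise EOFError("stream ended mid-record")
        parser.accept_bytes(byte)
    return parser.body


stream = io.BytesIO(b"5\nhelloTRAILING")
record = read_record_bytewise(stream)
leftover = stream.read()  # bytes after the record are still in the stream
```

The trailing bytes are left untouched for whoever reads next, but every byte of
the record costs a separate read() call, which is where the overhead comes from.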

One possibility is to have a bit of extra complexity so that we can be
conservative in how much we read if we are reading from a pipe (some fairly
simple logic along the lines of “if the parser is in state S, then there must be
at least X bytes left...”, plus some logic to track if we need to be
conservative for a particular file-like object).  This isn't great, but it's not
too bad.
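
A minimal sketch of the conservative-read idea, assuming a toy
"<length>\n<body>" record format.  The ToyPushParser class and its
min_bytes_needed() method are hypothetical names made up for this example, not
existing bzrlib code.

```python
import io


class ToyPushParser:
    """Hypothetical push parser for a single b"<length>\\n<body>" record."""

    def __init__(self):
        self._header = b""
        self._remaining = None  # None while still reading the header
        self.body = b""

    def accept_bytes(self, data):
        # The caller is expected never to pass more than min_bytes_needed().
        if self._remaining is None:
            head, newline, data = data.partition(b"\n")
            self._header += head
            if not newline:
                return
            self._remaining = int(self._header)
        self.body += data
        self._remaining -= len(data)

    def finished(self):
        return self._remaining == 0

    def min_bytes_needed(self):
        """A lower bound on the bytes that must still arrive for this record."""
        if self._remaining is None:
            return 1  # in the header state, at least one more byte must follow
        return self._remaining  # in the body state, the exact count is known


def read_record_conservatively(source):
    """Read one record without ever asking for more than is guaranteed to exist."""
    parser = ToyPushParser()
    reads = 0
    while not parser.finished():
        data = source.read(parser.min_bytes_needed())
        if not data:
            raise EOFError("stream ended mid-record")
        parser.accept_bytes(data)
        reads += 1
    return parser.body, reads


stream = io.BytesIO(b"11\nhello worldNEXT")
body, reads = read_record_conservatively(stream)
leftover = stream.read()
```

In this toy format the header still has to be read a byte at a time, but the
body, which is normally the bulk of the data, arrives in one read.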

Another is to start going down the non-blocking I/O rabbit hole.  This isn't too
bad on POSIX systems, but pipes and non-blocking I/O are far from
straightforward on Windows IIRC.  We only need fairly simple non-blocking
capabilities (just a way to do “block if there's zero bytes available, but as
soon as there are any please return them even if it's less than I requested”) so
perhaps this is possible without going crazy.  I'm pretty pessimistic though;
there's a reason why frameworks like Twisted exist :)
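
For what it's worth, on POSIX a plain os.read() on a pipe file descriptor
already behaves roughly this way: it blocks while the pipe is empty, then
returns whatever is available up to the requested size.  The sketch below
assumes a POSIX system and a raw file descriptor; the read_available() helper
is a made-up name, and this does not address the Windows side at all.

```python
import os


def read_available(fd, max_bytes=65536):
    """Block until at least one byte is available, then return what is there.

    Returns b"" once the write end is closed and the pipe is drained.
    """
    return os.read(fd, max_bytes)


read_fd, write_fd = os.pipe()
os.write(write_fd, b"abc")

# Only 3 bytes are in the pipe; the call returns them rather than
# waiting for the full 65536.
chunk = read_available(read_fd)

os.close(write_fd)
eof = read_available(read_fd)  # empty bytes signal end-of-file
os.close(read_fd)
```

This only avoids blocking on bytes that never arrive.  Bytes belonging to
whatever follows the container can still be pulled in, so the caller would
need somewhere to put them back.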

And of course there's the option to just maintain two largely independent
implementations: the existing blocking “pull-style” ContainerReader, and the
“push-style” ContainerPushParser.  This perhaps makes good sense; pull-style is
more convenient for certain tasks and certain optimisations (e.g. implementing a
reader that only reads record headers and fseeks over the bodies).
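
To illustrate the header-only optimisation, here is a sketch of a pull-style
reader over a made-up format of repeated "<length>\n<body>" records.  The
function name and the format are assumptions for the example, not the real
container format, and the source must be seekable (so not a pipe).

```python
import io


def iter_record_lengths(source):
    """Yield each record's body length, seeking past the body itself."""
    while True:
        header = source.readline()
        if not header:
            return  # clean end of file
        length = int(header)
        # Skip the body without reading it into memory.
        source.seek(length, io.SEEK_CUR)
        yield length


container = io.BytesIO(b"5\nhello3\nabc")
lengths = list(iter_record_lengths(container))
```

A push-style parser cannot do this, since it has no control over which bytes
it is handed.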

The smart server code is managing OK with the “extra logic to support
conservative reads when necessary” approach, so I think that'd be my preferred
option if we want this.  Maybe we can even reuse some of the code.  I don't have
a strong preference.

Probably the main concern I'd have isn't so much duplication of the
implementations, but duplication in the test suite.  If we can make sure the two
implementations are behaving consistently when facing the same data, then we can
probably live with the burden of two implementations.
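
One way to keep the test duplication down is to write the cases once and run
them against both implementations.  The sketch below uses toy stand-ins
(pull_parse, push_parse, and a "<length>\n<body>" format, all invented for the
example) rather than the real ContainerReader and ContainerPushParser.

```python
import io
import unittest


def pull_parse(data):
    """Toy pull-style reader: asks the file for exactly what it wants."""
    source = io.BytesIO(data)
    records = []
    while True:
        header = source.readline()
        if not header:
            return records
        records.append(source.read(int(header)))


def push_parse(data, chunk_size=3):
    """Toy push-style parser: is handed arbitrary chunks of bytes."""
    records, header, body, remaining = [], b"", b"", None
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        for offset in range(len(chunk)):
            byte = chunk[offset:offset + 1]
            if remaining is None:
                if byte == b"\n":
                    remaining, header = int(header), b""
                else:
                    header += byte
            else:
                body += byte
                remaining -= 1
            if remaining == 0:
                records.append(body)
                body, remaining = b"", None
    return records


# Shared cases: (input bytes, expected records).
CASES = [
    (b"", []),
    (b"5\nhello", [b"hello"]),
    (b"0\n3\nabc", [b"", b"abc"]),
]


class ConsistencyMixin:
    """Test methods defined once; subclasses supply parse()."""

    def test_shared_cases(self):
        for data, expected in CASES:
            self.assertEqual(expected, self.parse(data))


class TestPullReader(ConsistencyMixin, unittest.TestCase):
    def parse(self, data):
        return pull_parse(data)


class TestPushParser(ConsistencyMixin, unittest.TestCase):
    def parse(self, data):
        return push_parse(data)
```

Adding a case to CASES then exercises both implementations, so any divergence
between them shows up as a failure in one subclass only.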

-Andrew.