[MERGE] Updated sftp_readv

Vincent Ladeuil v.ladeuil+lp at free.fr
Fri Dec 21 08:22:55 GMT 2007


>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:

    john> Vincent Ladeuil wrote:
    >>>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:

<snip/>

    >> But why are you trying to do that ?
    >> 
    >> Because your coalesced offsets are so big that you don't want to
    >> totally buffer them ?

    john> Yes. But not just that.

What else?

    >> 
    >> Why not make them smaller then ?

    john> Because then we also lose the ability to combine ranges.

No. I'm not talking about making coalesced offsets smaller than
32K, I'm talking about limiting their size when presented with a
lot of contiguous offsets. I.e. instead of having a single 64MB
coalesced offset, I want 64 coalesced offsets of 1MB each, built
so that each requested offset always falls within a single
coalesced offset.
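
To make that concrete, here is a minimal sketch of such a
coalescing function with a max_size cap. The names
(coalesce_offsets, CoalescedOffset), the signature and the
details are illustrative assumptions, not the actual bzrlib API:

    from collections import namedtuple

    # start/length describe the combined read; ranges holds
    # (relative_start, length) for each requested offset inside it.
    CoalescedOffset = namedtuple('CoalescedOffset', 'start length ranges')

    def coalesce_offsets(offsets, fudge=0, max_size=1024 * 1024):
        """Combine sorted (start, length) offsets into larger reads.

        A new coalesced offset is started whenever extending the
        current one to cover the next requested offset would exceed
        max_size, so a requested offset is never split. A single
        offset bigger than max_size still becomes its own coalesced
        offset: max_size is ignored in that case.
        """
        result = []
        cur = None
        for start, length in offsets:
            if cur is not None:
                end = cur.start + cur.length
                new_len = start + length - cur.start
                if start <= end + fudge and new_len <= max_size:
                    # Extend the current coalesced offset.
                    cur.ranges.append((start - cur.start, length))
                    cur = cur._replace(length=max(cur.length, new_len))
                    continue
                result.append(cur)
            # Start a new coalesced offset (possibly bigger than
            # max_size if this single request is).
            cur = CoalescedOffset(start, length, [(0, length)])
        if cur is not None:
            result.append(cur)
        return result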

    john> Specifically, we have 2 possible breakpoints.

    john> 1) The length of continuous data
    john> 2) The 32KB limit for a single SFTP request.

Fine. So the max_size should not be smaller than 32K, but I was
thinking in the MB range. The 32K limit is sftp specific; add a
check for it there.

    john> So we might get:
    john> |------|-------|--------|
    john>    r1      r2      r3
    john> |------------------------|
    john>          coalesced
    john> |------------|-----------|
    john>      32KB        32KB

    john> If we changed the coalesce function so that it could
    john> only request a range of <32KB, then we would end up
    john> with

No, no, no. We don't want to do that. And there is nothing for
you to do anyway, since the coalesce function already has a
max_size parameter. And if a single offset is bigger than
max_size, max_size is ignored.

    john> a) Files that are longer than 32KB would be
    john> unrequestable (ok, we could work around this one.)

Taken into account (max_size ignored in that case).
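
For illustration, with the hypothetical coalesce_offsets sketch
above:

    MB = 1024 * 1024

    # 64MB of contiguous 1MB requests -> 64 coalesced offsets of
    # 1MB each, no requested offset being split.
    reqs = [(i * MB, MB) for i in range(64)]
    print(len(coalesce_offsets(reqs, max_size=MB)))
    # => 64

    # A single 5MB offset is bigger than max_size, so max_size is
    # ignored and it becomes one (oversized) coalesced offset.
    print(len(coalesce_offsets([(0, 5 * MB)], max_size=MB)))
    # => 1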

    john> b) In the above scenario, we would have to make 3 requests, for R1 R2 and R3.
    john> If we pretend that all of them are 20KB, then they fit into 2 32KB requests,
    john> (20*3=60) but we could not fit 2 into a single 32KB request. (2*20=40).

Right, coalescing never splits a requested offset, but 'max_size'
helps reduce the buffering.

    >> 
    >> I think the biggest offset a readv can be required to yield can't
    >> be bigger than a full text revision for a given file, users
    >> should have machines configured to handle that (they versioned
    >> the file in the first place don't they ?) ?

    john> Yes. We have to buffer the 10MB for the file we are
    john> requesting, as the readv() API doesn't give us a way to
    john> do anything else. (There are still issues of someone
    john> trying to add a 4GB ISO to their repo, but we are a
    john> long way from supporting that.)

Agreed. But you raise a very fine point here that may help me
explain mine.

How do we want to address reading a 4GB file?

I'm not trying to address the whole problem here, only the transport level.

Since we don't want to buffer the 4GB, we should give the upper
layer access to the file handle.

Instead of returning (offset, data) tuples, we can return
(offset, file_handle) tuples with file_handle already seeked to
offset so that the caller can just read (or something like that;
forget the details, the point is to provide the file handle
without buffering).

If you keep that idea, you must provide an implementation where
you don't read the file yourself.

In fact that's the idea I had in mind when implementing the http
readv, and it helps avoid intermediate buffering.
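
As a toy illustration of the shape of that API, over a plain
local file (readv_handles is a hypothetical name, not the real
transport interface):

    def readv_handles(path, offsets):
        # Yield (offset, file_handle) with the handle already
        # seeked to offset; the caller knows length and does
        # f.read(length) itself, so nothing is buffered here.
        # Caveat: the handle is shared, so each yielded offset
        # must be fully consumed before asking for the next one.
        f = open(path, 'rb')
        for start, length in offsets:
            f.seek(start)
            yield start, f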

    >> 
    >> So I'll go the same way than for http: limit the size of the
    >> coalesced offset, as long as you buffer the requests, that should
    >> not make any difference in terms of latency.
    >> 
    >> In fact doing:
    >> 
    >> cur_coalesced.ranges = new_ranges
    >> 
    >> is nothing more than doing that after the fact.
    >> 
    >> Or did I miss something ?
    >> 
    >> Vincent
    >> 

    john> No, we have already placed requests for all the
    john> data. The "new_ranges" is just saying "I've processed
    john> this sub-range, stop trying to process it in future
    john> passes".

Exactly, so there is no real difference between:

 coal(0, 1000, [(0, 100), (101, 400), (501, 500)])

and
 coal(0, 500, [(0, 100), (101, 400)])
 coal(501, 500, [(0, 500)])

(possible off-by-one errors are left as an exercise for the
reader ;-)
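
With the hypothetical coalesce_offsets sketch standing in for
'coal', that equivalence is easy to check: the same requested
offsets are covered either way, only the grouping changes.

    reqs = [(0, 100), (101, 400), (501, 500)]

    # One big coalesced offset covering everything...
    print(coalesce_offsets(reqs, fudge=1, max_size=1001))

    # ...or, with max_size capped, two smaller ones holding the
    # same ranges.
    print(coalesce_offsets(reqs, fudge=1, max_size=501))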

You pointed out in another mail that the http implementation
looked simpler because the RangeFile was hiding part of the
complexity.

You were right.

Do the same ;-)

Provide an SFTPBufferedFile that will hide the buffering by
issuing the requests while respecting some max buffer size. That
limit will be expressed as a size and translated into (requests
received, requests in the pipe, requests pending).

Then that class can issue enough requests to fill its buffers
and, as these buffers get consumed, issue more requests to fill
them again. You'll have to find the right balance to avoid
consuming too fast or buffering too much, but I think the latency
should become nearly transparent once the first requests have
been received.
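
A rough sketch of what such a class could look like, with
send_request/recv_reply standing in for the real paramiko
plumbing (all the names here are assumptions, not existing code):

    from collections import deque

    class SFTPBufferedFile(object):

        def __init__(self, send_request, recv_reply, offsets,
                     max_buffered=4 * 1024 * 1024):
            self._send = send_request    # send_request(start, length)
            self._recv = recv_reply      # recv_reply() -> bytes, in order
            self._todo = deque(offsets)  # (start, length) not yet requested
            self._pending = deque()      # lengths of requests in the pipe
            self._buffered = 0           # bytes requested, not yet consumed
            self._max = max_buffered

        def _fill_pipe(self):
            # Issue requests until the buffer budget is spent.
            while self._todo and self._buffered < self._max:
                start, length = self._todo.popleft()
                self._send(start, length)
                self._pending.append(length)
                self._buffered += length

        def __iter__(self):
            while True:
                # Refill the pipe as the buffers get consumed.
                self._fill_pipe()
                if not self._pending:
                    return
                data = self._recv()
                self._buffered -= self._pending.popleft()
                yield data

The budget here is a single byte count; splitting it into
(received, in the pipe, pending) as suggested above is a
refinement of the same idea.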

Is that clearer?

Another way to think about it is to look at the comments: there
are two big ones, one explaining an algorithm in 7 points and
another explaining the indirection layers in 4 points.

That's far too much for a single method.

Don't get me wrong on that point though! The comments themselves
are a gift! Understanding the code without them would be quite a
challenge. But quite probably they contain the key concepts
needed to produce a clearer, simpler design/implementation.

         Vincent

P.S.: I hope you don't take my comments as harsh criticism,
that's not the intent. To be honest, when I implemented the http
readv I tried to change the sftp implementation so that I could
reuse it, and I stopped because that was too hard given its
current state. But even if you can't reach that goal, keep in
mind that we are trying to implement a custom buffering scheme
for a file-like object, because the ftp implementation may need
it one day and it would be nice to be able to reuse some code.


