prefetch still broken with readv and paramiko 1.6.1

John Arbash Meinel john at arbash-meinel.com
Tue Jul 25 17:28:19 BST 2006


Robey Pointer wrote:
> 
> On 24 Jul 2006, at 12:13, John Arbash Meinel wrote:
> 
>> I don't know what the specific problem is, but if you use prefetch and
>> readv with paramiko 1.6.1 it is returning bogus data.
> 
> It is probably returning the same data twice, or something even
> weirder.  They're not really meant to be used concurrently since they do
> basically the same thing, and you'll get no advantage from one if you're
> already using the other.
> 
> Looks like I have a FIXME for this case, in fact.
> 
> Even if it worked, prefetching would destroy any performance you'd get
> from readv, so probably not worth it.  readv was basically added so bzr
> could avoid prefetching the entire file. :)
> 
> robey

Well, I've found a few more bugs in the readv/prefetching logic.

The first concern is that installing paramiko 1.6.1 alongside
bzr-0.8.2 will cause bzr to try to use prefetch, which breaks. So we
either need a 0.8.3 release that *never* uses prefetch, or an explicit
dependency on paramiko <1.6 (technically < 1.5.5).
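
Something like this guard is what I mean (just a sketch, keying off
paramiko's version string; the exact cutoff is debatable):

    import paramiko

    # Only use prefetch on paramiko versions that predate readv
    # (< 1.5.5), until prefetch and readv can coexist safely.
    _version = tuple(int(x) for x in paramiko.__version__.split('.')[:3])
    _use_prefetch = _version < (1, 5, 5)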

Further, I found a bug in 'readv()' when requesting large ranges. It
seems that 'sftp.read()' can only return a maximum of 64KiB (65536
bytes) per request.
If we just pass in the plain offsets as given by the user, that isn't
usually a problem, because most of our knit hunks are small (at least
so far).
However, if we try to combine ranges (which I think should give
slightly better performance, since we make fewer requests of the
server), it fails. When downloading inventory.knit I currently want
5.5MB of data.

I don't know what the correct fix for paramiko is: whether it should
cap the maximum length of an async request at 64KiB and just make
multiple requests, or whether it should queue up the partial results
until it has all the data that was requested.
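
If it went the first way, the fix would look roughly like this loop
(my sketch only; 'read_range' and 'fp' are stand-ins, not paramiko's
real internals):

    def read_range(fp, offset, length, max_request=65536):
        # Issue capped requests in a loop until we have the full
        # range, rather than assuming one read returns everything.
        fp.seek(offset)
        chunks = []
        remaining = length
        while remaining > 0:
            data = fp.read(min(remaining, max_request))
            if not data:
                raise EOFError('short read at offset %d'
                               % (offset + length - remaining))
            chunks.append(data)
            remaining -= len(data)
        return ''.join(chunks)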

Looking at the _read_prefetch code, you make a lot of assumptions
that the data will come back exactly as it was requested.
For example, you do:
        index = max(pos_list)
        prefetch = self._prefetch_data[index]
        del self._prefetch_data[index]
This means you expect the position to match exactly, and the chunk to
contain all the data you need, and nothing that another request would
need.
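
A more robust lookup would treat the prefetched chunks as intervals
and serve any request that falls entirely inside one. Very roughly (my
sketch, not paramiko's code; _prefetch_data maps chunk offset to chunk
bytes):

    def _find_prefetched(self, offset, length):
        # Serve the request from any prefetched chunk that fully
        # contains it, instead of demanding an exact offset match.
        for start in sorted(self._prefetch_data):
            data = self._prefetch_data[start]
            if start <= offset and offset + length <= start + len(data):
                rel = offset - start
                return data[rel:rel + length]
        return None  # not prefetched; fall back to a normal read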

I could write SFTPTransport.readv() so that it doesn't combine ranges,
but I'm concerned that for big trees we might actually have a single
hunk that crosses the 64KiB boundary on its own. Imagine a single
large file, especially an image of some sort: a single fulltext could
easily be >64K.
So the other possibility is to have bzr do the workaround: combine
requests, but then break them back into 64KiB hunks, so that it only
ever makes 64KiB readv requests.
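
Sketched out, that bzr-side workaround might look like this
(hypothetical helper; 'offsets' is the list of (start, length) pairs
that SFTPTransport.readv() receives):

    def coalesce_and_split(offsets, max_size=65536):
        # First merge adjacent or overlapping ranges into larger
        # spans, stored as [start, end) pairs...
        spans = []
        for start, length in sorted(offsets):
            if spans and start <= spans[-1][1]:
                spans[-1][1] = max(spans[-1][1], start + length)
            else:
                spans.append([start, start + length])
        # ...then re-split each span so that no single request
        # exceeds the 64KiB limit the server seems to impose.
        requests = []
        for start, end in spans:
            while start < end:
                size = min(max_size, end - start)
                requests.append((start, size))
                start += size
        return requests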

Generally, it seems this should be fixed in paramiko; it's just that I
can also think of workarounds at the bzr layer.

The real reason I'm interested in this is that for a complete 'bzr
branch' of a project, using plain prefetch saves us about 14% of the
download time (492s versus 422s), though a full prefetch() hurts a lot
when pulling only part of the data (90s versus 128s).

In general, these results suggest that using paramiko.readv() could
save quite a bit of time, since it makes async requests (and can thus
request only the exact bytes we need). But we would need to solve the
64KiB boundary problem before we could enable it.
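
For what it's worth, the eventual usage I have in mind is something
like this (a sketch: it assumes the coalesce_and_split() helper above,
an open SFTPClient 'sftp', and that SFTPFile.readv() keeps its
(offset, length) interface; 'wanted_offsets' and 'process' are
placeholders):

    fp = sftp.open('inventory.knit', 'rb')
    requests = coalesce_and_split(wanted_offsets)
    # readv issues the requests asynchronously and yields each block
    # as it arrives; we would still need to stitch blocks back
    # together to answer the caller's original (pre-split) ranges.
    for data in fp.readv(requests):
        process(data)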

John
=:->