[merge] readv out of order

John Arbash Meinel john at arbash-meinel.com
Sat Jul 15 00:22:29 BST 2006



Since I've been looking into the readv() stuff for http, I went ahead
and looked closer into it for other transports.

It turns out that rather frequently we actually try to readv() out of
order. One place where this happens a lot is when combining the
signatures.knit. I think we are trying to read in ancestry order, which
signatures.knit definitely doesn't follow.

But I think it also happens when we are reading other files that aren't
in 'merge_sort' topologically sorted order. They are probably in some
sort of topological order, just not the one we are looking for.

Also, for sftp it is almost certainly better to combine more than 50
sub-groups into one bigger group. In fact, because sftp is pread-style,
it is better to collapse even nearby (not strictly contiguous) ranges,
depending on your latency versus bandwidth numbers.
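
Something like the following is what I have in mind (just a sketch to
illustrate the idea, not the code in the attached patch; the function
and argument names are made up):

def coalesce_offsets(offsets, max_chunks=50, fudge=0):
    """Collapse (start, length) read requests into bigger reads.

    Sketch only.  max_chunks limits how many requests get combined
    into one read (0 means unlimited); fudge is how many unwanted
    bytes we are willing to download to bridge a gap and save a
    round trip.
    """
    combined = []  # each entry is [start, end, chunk count]
    for start, length in sorted(offsets):
        end = start + length
        if combined:
            cur = combined[-1]
            gap = start - cur[1]
            if gap <= fudge and (max_chunks == 0 or cur[2] < max_chunks):
                cur[1] = max(cur[1], end)
                cur[2] += 1
                continue
        combined.append([start, end, 1])
    return [(start, end - start) for start, end, _ in combined]

With fudge=0 only exactly contiguous requests collapse; with a larger
fudge we trade a few wasted bytes for fewer round trips.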

For me, latency to bazaar-vcs.org is 107ms, and my bandwidth is 160KB/s.
So it is break-even to download about 17KB versus paying for another
round trip. On my local network, this number is very different: my local
server is 240us away, and has a bandwidth of about 6MB/s (max CPU
throughput for ssh traffic). So there, the break-even point is closer to
1.4KB. Still much more than 0, though.
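
In other words, the break-even gap size is just latency times bandwidth.
A quick sanity check of those numbers (rough figures, decimal units):

def break_even_bytes(latency_seconds, bandwidth_bytes_per_sec):
    # Bytes we could download in the time one extra round trip costs.
    return latency_seconds * bandwidth_bytes_per_sec

print(break_even_bytes(0.107, 160 * 1000))          # bazaar-vcs.org: ~17KB
print(break_even_bytes(0.000240, 6 * 1000 * 1000))  # local server: ~1.4KB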

I also realize that for really big reads, we don't want to cache *all*
of the read before we start returning some data up the stack.
So I implemented an algorithm which sorts and combines the offsets, and
starts reading into a dict. However, as it does each read, it checks to
see if it can satisfy any of the requests, and if it can, it starts
yielding the data.

That way you still have forward-only reading of the file, but you can
yield data early when possible. The worst case for this is if the
first request is at the end of the file and all other requests are
earlier; then it does have to cache the whole file before it can yield
anything. Most likely in that case it would be better to read out of
order, to prevent caching so much data.
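
Roughly, the loop looks like this (again only a sketch, not the code in
the patch; offsets is a list of (start, length) tuples,
read_range(start, length) stands in for a single pread-style read on the
transport, and it reuses the coalesce_offsets() sketch above):

def readv_sorted(read_range, offsets, max_chunks=0, fudge=0):
    """Yield (offset, data) in the requested order, reading forward only.

    Data we have read but cannot hand back yet is buffered in a dict
    until the earlier requests have been satisfied.
    """
    offsets = list(offsets)
    combined = coalesce_offsets(offsets, max_chunks=max_chunks, fudge=fudge)
    data_map = {}            # (start, length) -> bytes
    pending = list(offsets)  # still in the caller's original order
    for c_start, c_length in combined:
        block = read_range(c_start, c_length)
        # carve the combined read back into the original requests it covers
        for start, length in offsets:
            if c_start <= start and start + length <= c_start + c_length:
                rel = start - c_start
                data_map[(start, length)] = block[rel:rel + length]
        # hand back everything we can satisfy so far, in request order
        while pending and pending[0] in data_map:
            start, length = pending.pop(0)
            yield start, data_map.pop((start, length))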

I realize Michael had a patch which allowed readv() to return things out
of order, which also helps with the caching issue, and we might want to
go that route. But it would be pretty easy to do on top of what I've done.

On to the benchmarks. These are done without caching inventory.knit, so
it is being read twice (and the second time is out of order, why??).

Time to branch bzr.dev from my server to my workstation. My bzr.dev
repository is rather cluttered (it includes mirrors of lots of other
bzr repos). Each entry is 'max chunks, max extra bytes'. The original
bzr does not sort.

 84s	original, 50, 0
 58s	sorted, 50, 0

I also looked at how big the queue was getting (I have a mutter that
puts out a lot of info). It does grow a little bit, depending on how
sorted particular files are: my builtins.py knit had 50+ queued records,
knit.py had 59, inventory.knit peaked around 290, signatures.knit hit
1467, and revisions.knit was about the same as inventory, around 290.
The final local reads to build the working tree are always satisfied
immediately (but they are <50 records and probably in sorted order, so
they pretty much have to be).

I also realized I was still using 'prefetch', which is good for big
copies like this, but bad for small ones (which are ultimately more
common: you pull/merge more often than you do a full remote branch).

Disabling prefetch caused a small setback, but I think I'll make it up
in the pull tests.
 60s	sorted, 50, 0

Bumping up the number of chunks to combine had a small effect.
Everything is pretty much requested in order. And only signatures.knit
keeps a queue. Files still take multiple round trips, because the data
isn't contiguous.
 57s	sorted, 500, 0

Setting it to 0 allows as many chunks as possible to be combined. Oddly,
this was slower, but probably only within the noise. We still have quite
a few entries which require multiple round trips, because we only
collapse exactly contiguous regions.
 58s	sorted, sftp 500, 0

Going back to 500 chunks, but allowing nearby regions to be collapsed:
 56s	sorted, sftp 500, 1000

Allowing even more to be collapsed:
 54s	sorted, sftp 500, 4192

This seems to pass my latency versus bandwidth break-even point:
 55s	sorted, sftp 500, 8192
 55s	sorted, sftp 100, 8192

Looking at the log output, the gaps are around 20k+ bytes, so let's just
grab the whole thing. At this point, pretty much every request is
satisfied in a single readv().
 54s	sorted, sftp 500, 30000

But let's force everything to be combined. This obviously leaves us with wasted effort.
 58s	sorted, sftp 0, 3000000

So for my setup, 500 contiguous regions and 4192 extra bytes seems like
a good tradeoff. The buffer doesn't get super huge, but it still gets to
do nice long reads.
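
In terms of the coalesce_offsets() sketch above (again, made-up names,
not the option names in the patch), that preferred setting is roughly:

combined = coalesce_offsets(offsets, max_chunks=500, fudge=4192)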

Now for testing pull latency, as opposed to a full branch. First, I
branch at revno 1500, then pull over the top of it (this pulls 3775
revisions):

 67.6s	original, sftp 50, 0
 44.1s	sorted, sftp 50, 0
 43.6s	sorted, sftp 50, 0 # w/ prefetch
 45.5s	sorted, sftp 500, 0
 46.1s	sorted, sftp 500, 0 # w/ prefetch
 43.0s	sorted, sftp 100, 1024
 43.6s	sorted, sftp 100, 1024 # w/ prefetch
 43.0s	sorted, sftp 100, 4192
 43.0s	sorted, sftp 100, 8192
 42.0s	sorted, sftp 100, 8192

And finally, branching at revno 1800, which still leaves 390 revisions to pull:

 17.2s	original, sftp 50, 0
 14.7s	sorted, sftp 50, 0
 13.2s	sorted, sftp 50, 0 # w/ prefetch
 14.0s	sorted, sftp 500, 0
 14.4s	sorted, sftp 500, 0 # w/ prefetch
 14.7s	sorted, sftp 100, 1024
 14.0s	sorted, sftp 100, 1024 # w/ prefetch
 14.1s	sorted, sftp 100, 4192
 14.1s	sorted, sftp 100, 8192
 13.8s	sorted, sftp 100, 8192 # w/ prefetch


There are lots of permutations to test, and it isn't 100% clear what the
optimum is going to be (especially if you change the latency and
bandwidth factors). And all of these were pretty much only run once,
which isn't very accurate; in testing, I found about 1s of slop.
But I can say that just sorting the requests into order brings us from
67s => 44s, or about 50% faster.

But in general, I can say that in-order reading and combining nearby
regions do indeed improve our performance by a noticeable amount (the
sorting being the most important, giving as much as 50%, with the nearby
collapsing adding an extra 10-20% improvement).

There is also some work that could be done to tune the local branch
operations.

The branch is available at:
http://bzr.arbash-meinel.com/branches/bzr/features/readv-combining/

John
=:->