1.6 fetch regression

Robert Collins robertc at robertcollins.net
Thu Aug 28 00:29:37 BST 2008


On Wed, 2008-08-27 at 17:17 -0500, John Arbash Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> So I'm trying to put together some discussion about ways forward to improve
> branch/pull/etc (fetching in general). I did some testing over the loopback
> with 100ms delay (200ms ping).
> 
> $ time bzr1.5 branch
> $ time bzr.dev branch

What were the results? :)

> For bzr.dev it is a lot harder to analyze, because things are spread out more.
>   3s  3s open an ssh connection and issue the "hello" request
>   8s  5s opening all the format objects, etc. Now ready to "fetch"
>  26s 18s Make 70 requests for .rix files. Most of these are 'readv()' [1]
>  39s 13s Make 13 .pack requests (I have 18 pack files here).
> 	 At least we don't seem to request anything 2x from the same file
>  48s  9s Make 18 readv && 18 get requests for .iix [2]
>  80s 32s Make 13 readv requests on .pack files for the inventories
>  94s 14s 18 readv && 18 get requests for .tix [same as 2]
> 134s 40s 13 readv requests on .pack files for texts
> 142s  8s 18 readv && 18 get requests for .six files
> 149s  7s 12 readv requests on .pack files for signatures
> 161s 12s Open the branch a couple of times just to make sure that we really
>          have the format we thought we had when we started 3 minutes ago
> 
> [1] One concern is that there are 32 requests for ...6aac8e.rix, with the
> final one being a '.get()' request. Almost all of them return 64KB, (one
> returns 128KB). The get returns 1.5MB.
> In the readv() requests alone, we've read 2.0MB of data. Or enough to have
> read the file 1.4 times.

Please try btrees - they are designed to fix these behaviours. The final
get() is being triggered by a very wide iter request, I bet.
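To sanity-check the tally in [1] - a rough sketch, assuming 30 of the 31 readv()s returned 64KB and one returned 128KB, as the post describes:

```python
# Rough arithmetic for [1]: 32 requests against one .rix file --
# 31 readv()s (assumed: 30 returning 64KB, one returning 128KB),
# then a final get() of the whole 1.5MB file.
KB = 1024
readv_bytes = 30 * 64 * KB + 1 * 128 * KB    # 2.0MB, matching the post
get_bytes = 1536 * KB                         # the 1.5MB final get

print(readv_bytes / (1024 * KB))              # MB read via readv alone
print(readv_bytes / get_bytes)                # ~1.3x the file size
print((readv_bytes + get_bytes) / get_bytes)  # ~2.3x the file transferred in total
```

So before the get() even fires, the readv()s have already pulled down more bytes than the file contains.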

> [2] This seems to be a readv to read the header and a page of data. Once it
> has that it realizes that it really wants the whole thing, and issues a "get"
> request. For most of the get requests it *already has* the whole thing in memory.

Yes, that makes sense - the heuristic for 'read the entire thing' is
predicated on the key count, which is embedded in the index.
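A minimal sketch of that heuristic - hypothetical, not bzrlib's actual code, and the 50% threshold is an invented parameter: the first readv() fetches the header, which embeds the key count, and a sufficiently wide request tips the reader into a single whole-file get():

```python
PAGE_SIZE = 64 * 1024  # bytes returned per readv() page

def plan_reads(key_count, keys_wanted, threshold=0.5):
    """Choose between page-at-a-time readv() and one whole-file get().

    key_count comes from the index header, which the first readv()
    already fetched; keys_wanted is how many keys the caller asked for.
    The 0.5 threshold is illustrative, not bzrlib's real cutoff.
    """
    # If we'd touch most of the index anyway, one get() beats many
    # round trips -- but issuing it late means re-fetching pages the
    # earlier readv()s already returned, as observed in [2].
    if keys_wanted >= key_count * threshold:
        return "get"
    return "readv"

print(plan_reads(1000, 600))  # wide request -> whole-file get
print(plan_reads(1000, 10))   # narrow request -> paged readv
```

The pathology in [2] follows directly: the decision can only be made *after* the first readv() has returned the header, so the get() always arrives a round trip late.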


> Now, for 'bzr branch' just on the local filesystem, the new bzr.dev is *much*
> better. "time bzr branch" is 32s versus 1m15s (75s user).
> Which is a bit of a mystery why it is so much slower over a remote connection.

Latency :) I'd try SFTP too, which will use the packer logic instead.
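Latency alone accounts for much of the gap. A back-of-the-envelope tally, assuming every request in the timeline above is serialized and pays one full 200ms round trip (real requests may overlap somewhat):

```python
# Round-trip cost for the request counts in the timeline above,
# at the stated 200ms ping (100ms each way over the loopback).
PING = 0.2  # seconds per round trip

requests = {
    ".rix readv/get":      70,
    ".pack (revisions)":   13,
    ".iix readv + get":    18 + 18,
    ".pack (inventories)": 13,
    ".tix readv + get":    18 + 18,
    ".pack (texts)":       13,
    ".six readv + get":    18 + 18,
    ".pack (signatures)":  12,
}

total = sum(requests.values())
print(f"{total} requests -> ~{total * PING:.0f}s of pure latency")
```

That is roughly 46 of the 161 seconds spent doing nothing but waiting on the wire, which is why the same operation looks so much better on the local filesystem.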

> So for whatever reason, adding 'bzr+ssh' into the stream adds 23 User seconds,
> and 54 total seconds. Though if parsing the stream is 23s, it might be 23s to
> *generate* the stream. And 23+23 is 46/53s (plus a few seconds for the ssh
> handshake, etc.)

Different code paths:
pack->pack is Packer.pack()
remote->pack is fetch()

I've been working on making remote->pack better, but it's obviously not
there yet.

> Over the local network I also see:
> $ time bzr.dev branch http://
>   35.41s user 3.40s system 69% cpu 55.926 total
> $ time bzr1.5 branch http://
>   38.61s user 3.30s system 69% cpu 1:00.54 total
> 
> (note that this repo is slightly different, but still a packed repo) I'm
> actually quite surprised to see that bzr-1.5 branching over http:// is
> *faster* than branching locally. (1m versus 1m15s).

async behaviour in pycurl, I'd guess.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.

