[UGLY HACK] Proof of concept multipart/byteranges support and connection sharing

Michael Ellerman michael at ellerman.id.au
Fri May 19 10:15:42 BST 2006


Hi guys,

There's been a bit of whinging .. er, comment, about the speed of bzr
lately, as well as comparisons to mercurial etc. In particular I've been
interested in the branch performance over HTTP, as I think it's an
important feature to support plain HTTP as a first class transport.

The obvious optimisation for bzr at the moment is to make readv() do a
proper[1] HTTP range request. So last night I hacked up a _really_
horrible implementation, just to see what sort of speed improvement that
might get us.

While I was there I thought I'd add keep-alive support for HTTP
connections, which is pretty basic with libcurl. There were a few
wrinkles[2], but this pretty much works.

The following numbers are for a branch of bzr.dev from bazaar-vcs.org
(which is in the UK?) to a server in Australia. They're on a reasonably
busy multiuser system, so there could very well be noise here.

I didn't really believe it made this much difference, but I've run these
a few times, and I think I'm not going crazy. (This is just "time bzr
branch foo bar".)

Current bzr with urllib:
-----------------------
real    49m11.575s
real    50m39.638s
real    48m41.222s

Current bzr with pycurl:
-----------------------
real    50m8.651s
real    48m44.026s
real    50m18.859s

Hacked up multipart + connection sharing:
----------------------------------------
real    18m42.352s
real    20m26.648s
real    19m20.530s
real    17m56.541s
real    17m45.153s
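
As a rough summary of the timings above, the runs can be averaged per
configuration. This is only a sketch: the helper names (seconds, mean) and
the variable names are made up for illustration, and with so few runs on a
busy machine the result is indicative at best.

```python
def seconds(stamp):
    """Convert a `time`-style stamp such as '49m11.575s' to seconds."""
    minutes, rest = stamp.split("m")
    return int(minutes) * 60 + float(rest.rstrip("s"))


def mean(stamps):
    """Average a list of `time`-style stamps, in seconds."""
    values = [seconds(s) for s in stamps]
    return sum(values) / len(values)


# The "real" times reported above, copied verbatim.
urllib_runs = ["49m11.575s", "50m39.638s", "48m41.222s"]
pycurl_runs = ["50m8.651s", "48m44.026s", "50m18.859s"]
patched_runs = ["18m42.352s", "20m26.648s", "19m20.530s",
                "17m56.541s", "17m45.153s"]

speedup_vs_urllib = mean(urllib_runs) / mean(patched_runs)
speedup_vs_pycurl = mean(pycurl_runs) / mean(patched_runs)

for name, runs in [("urllib", urllib_runs),
                   ("pycurl", pycurl_runs),
                   ("patched", patched_runs)]:
    print(f"{name}: {mean(runs) / 60:.1f} min average")
print(f"speedup vs urllib: {speedup_vs_urllib:.2f}x")
print(f"speedup vs pycurl: {speedup_vs_pycurl:.2f}x")
```

On these numbers the patched branch averages a bit under 19 minutes against
roughly 49.5 minutes for either stock transport, i.e. about 2.6 times faster.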


As far as I can tell, the end result is mostly identical:

michael at ozlabs:~/bzrtmp$ bzr check bzr_stock
checked branch /home/michael/bzrtmp/bzr_stock format Bazaar-NG Metadir
branch format 5
  4917 revisions
 11663 unique file texts
1090256 repeated file texts
   493 weaves
     2 ghost revisions
     2 revisions missing parents in ancestry

michael at ozlabs:~/bzrtmp$ bzr check bzr_patched/
checked branch /home/michael/bzrtmp/bzr_patched format Bazaar-NG Metadir
branch format 5
  4917 revisions
 11663 unique file texts
1090256 repeated file texts
   493 weaves
     2 ghost revisions
     2 revisions missing parents in ancestry

And here's a diff of 'find * -type f | xargs md5sum'. Obviously the
branch-name is different. Is it ok that the inventory.knit differs?

--- stock.sums        2006-05-19 18:31:06.166187966 +1000
+++ patched.sums        2006-05-19 18:31:24.544297888 +1000
@@ -989,13 +989,13 @@
 f72288870ce2419023d3b8a1c86190d0  /.bzr/repository/inventory.kndx
 d16203c04616ebdfb75f2ae6e987aa85  /.bzr/repository/format
 96e17deec04ed7e0d5c53c35d8138444  /.bzr/repository/revisions.kndx
-dd3e6c7c5e9c791978f87f3aa7bb848b  /.bzr/repository/inventory.knit
+c33e2a2d08d017868836941633e93e43  /.bzr/repository/inventory.knit
 c0344a5f735734a581874f738a9e90c9  /.bzr/repository/signatures.kndx
 0581a8f896482ba910f729b62bae0672  /.bzr/repository/revisions.knit
 0042103592c95dc4041e9d4eff8fb852  /.bzr/repository/signatures.knit
 a03237414e4bda1877f9783b816316f4  /.bzr/branch/revision-history
 3985e65daf770f58069e7d12734b77b5  /.bzr/branch/format
-15b4714de60d9b1d68c9dea88025bced  /.bzr/branch/branch-name
+5aae591edf42d75626dca1b9d2574214  /.bzr/branch/branch-name
 7978815a521840d60c0c7ad88a508ecc  /.bzr/branch/parent
 320126a2be85c99c74f3366fb978f177  /.bzr/checkout/stat-cache
 9b0cd2ae56ff1abc16d4de4175af2933  /.bzr/checkout/format


The code is a long way from being ready for inclusion; this will not be
in your next dist-upgrade. It's fragile, doesn't cope with errors
properly, is full of hacks and is generally badly written. Any help
cleaning it up will be appreciated :D

I've attached the patch for the adventurous, and it's also sitting on
top of this branch:

http://michael.ellerman.id.au/files/bzr/repo/http/


cheers


[1] Currently we only ever request contiguous ranges, i.e. if we're asked
for 10-20,20-30 we'll do one request for 10-30, but if we're asked for
10-20,30-40 we do two requests. This sucks: in some cases we do > 500
requests on one file.
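
To illustrate the difference, here is a minimal sketch of how touching
ranges can be merged and the remainder sent as a single multi-range request.
It is not the code from the patch: coalesce() and range_header() are
hypothetical names, and the sketch assumes the "10-20" notation above means
a half-open (start, end) pair. HTTP byte ranges are inclusive at both ends,
so each end is reduced by one when the header is built.

```python
def coalesce(ranges):
    """Merge touching or overlapping half-open (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # This range touches or overlaps the previous one: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def range_header(ranges):
    """Build one Range header value covering every merged range."""
    # Half-open (start, end) becomes the inclusive byte range start-(end-1).
    parts = [f"{start}-{end - 1}" for start, end in coalesce(ranges)]
    return "bytes=" + ",".join(parts)


print(range_header([(10, 20), (20, 30)]))  # contiguous: one range
print(range_header([(10, 20), (30, 40)]))  # gap: two ranges, one request
```

A server that honours a multi-range request like the second one replies with
a multipart/byteranges body, which is what lets the many small reads on one
file collapse into a single round trip.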

[2] Unfortunately we're creating about 15 PyCurlTransport() objects, so
to see much improvement we have to share the Curl() object globally.
Yuck. Also it seems (??) you can't unset pycurl.RANGE/NOBODY, so we have
to have three Curl() objects, one for GET, one for HEAD and one for GET
+ Range.

-- 
Michael Ellerman
IBM OzLabs

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pycurl-hack.patch
Type: text/x-patch
Size: 12551 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060519/4210246f/attachment.bin 