[RFC] Alternative to current push/pull semantics [use weave.join]

John Arbash Meinel john at arbash-meinel.com
Fri Dec 16 17:59:45 GMT 2005


I wanted to bring up an idea about how push & pull could be improved.
I've been trying to measure the overhead of the new chmod code
which sets permissions. On the local filesystem, with bzr.dev and
another project with 1600 files, I saw no measurable overhead.

A plain copy locally takes < 1s (after it is cached).  A bzr branch
takes 14-15s, and bzr-permissions takes exactly the same.

Then I looked at copying it over a local network ssh connection.
rsync wins hands down at 7-9s.
scp is quite slow, turning in a time of 2m10s.
lftp is much faster: 1m20s, and 30s with parallel push turned on. Pretty
good.

So then I did "bzr push sftp://newproject", and lo and behold, it has
been running for around 1 hour, and the progress bar expects it to take
another 24m.

At least one problem is that we switched from 'bzr branch', which can
use clone, to actually recreating each revision on the remote side. I
see a download rate as high as my upload rate (approx 400kB/s, on a link
that can download as fast as 10MB/s). In other words, push appears to be
downloading roughly as much as it uploads.

But I've seen many times how slow bzr pull is. It may be smart enough to
only download each file once, but I doubt it.
I believe the current workflow, repeated for every changed text in every
revision, is:
	read both weave files
	extract the pristine text
	add this text to the target weave
	write the weave out
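
In rough code, with illustrative names (not the actual bzrlib API),
that loop looks something like:

    # Hypothetical sketch of the current per-text copy; every name here
    # is an illustrative stand-in for the real bzrlib interfaces.
    def pull_file_texts(source_branch, target_branch, file_id, revisions):
        source_weave = source_branch.get_weave(file_id)   # read remote weave
        target_weave = target_branch.get_weave(file_id)   # read local weave
        for rev_id in revisions:
            if rev_id in target_weave:
                continue
            # extract the pristine full text, then re-add it, which
            # redoes all the delta work on the target side
            text = source_weave.get_text(rev_id)
            parents = source_weave.get_parents(rev_id)
            target_weave.add(rev_id, parents, text)
        target_branch.put_weave(file_id, target_weave)    # write it out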

What I was thinking is that we could probably cheat, and instead of
adding each text one at a time, just do a weave.join().
This means we will be overzealous in which revisions we add, because the
join pulls in every version present in either weave. And I still think
we should check that all of the texts are present for a revision before
we add that revision to the revision-store.
But it would change the current semantics to (sketched in code after
this list):
	download the remote weave header
	see that it is missing the revision we want
	download the full remote weave
	read the local weave
	weave.join()
	save the remote weave
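
Sketched in code (again with made-up names; read_weave_header,
read_weave, write_weave and join() just stand for whatever the real
interfaces end up being):

    # Hypothetical sketch of the proposed push of a single weave file.
    def push_weave(local_transport, remote_transport, path, rev_id):
        header = read_weave_header(remote_transport, path)
        if rev_id in header.versions:
            return                       # already there, nothing to send
        remote_weave = read_weave(remote_transport, path)  # full download
        local_weave = read_weave(local_transport, path)
        remote_weave.join(local_weave)   # union of versions in both weaves
        write_weave(remote_transport, path, remote_weave)  # save it back

The early return is exactly the fast path described next.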

After that, even if we don't cache which weaves contain which revisions,
future steps would just be:
	download the remote weave header
	see it has the revision we care about
	no upload needed

Now, right now we don't download just the header. I think we really
could, but since our buffer size (32k) is greater than the average file
size (8k), it wouldn't gain us much. I was thinking the same thing about
pipelining. Right now paramiko forces a sync on file.close(), which
means that if we do "put(a), put(b)" and call close() before the second
put(), we lose any sort of pipelining.
That was what put_multi() was all about, but it seems most people don't
like it. (At least in general I don't think it was used, and I know that
in Jelmer's 'cleanups' for the knit stuff he removed the *_multi
functions, probably at Robert's request.)
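
For what it's worth, grabbing just the front of a weave file over sftp
is a single partial read with paramiko (a sketch: the host, login and
path are made up, and it assumes the version index sits at the front of
the file):

    import paramiko

    # Sketch: read only the first 8k of a remote weave, enough to see
    # which versions it contains, in one round trip.
    t = paramiko.Transport(('remote-host', 22))
    t.connect(username='me', password='secret')
    sftp = paramiko.SFTPClient.from_transport(t)
    f = sftp.open('.bzr/weaves/some-file.weave', 'rb')
    try:
        header = f.read(8192)     # not the whole file
    finally:
        f.close()
    sftp.close()
    t.close()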

Anyway, it wouldn't be too hard to say "if we retain the lock, we can
cache which revisions each weave contains". Heck, we could cache the
weave contents. Though we have an issue of where to put it, since we
might be memory constrained (as in "don't be a pig", not that we are
truly constrained). Local temporary files might be sufficient, but I
know we also stopped using the CachedStore for some reason (probably a
desire to do the caching based on Transactions).
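
For the "which revisions does each weave contain" part, something this
simple would do (a sketch; the class and its hooks are made up):

    # Hypothetical sketch: cache weave version lists for the lifetime
    # of a branch lock; throw everything away when the lock is dropped.
    class WeaveVersionCache(object):
        def __init__(self):
            self._versions = {}    # weave path -> set of revision ids

        def known_versions(self, path):
            # returns None if we have not seen this weave yet
            return self._versions.get(path)

        def remember(self, path, version_names):
            self._versions[path] = set(version_names)

        def invalidate(self):
            # call this when the lock is released
            self._versions.clear()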

I think this could give us some massive performance improvements. And
with knits, knit.join() would be even faster (it doesn't require
recreating the diffs, just fixing up some identifiers and appending).

If I'm missing where this sort of optimization was intended to go, let
me know. I'm not stuck on doing it this way; I just think it could help
a lot.

Also, another small side issue: if we just put some checks in the
current "clone.py" code to handle targets without working trees, I think
we could use "bzr branch local sftp://remote", and then "bzr push"
wouldn't need the ability to create a remote branch.
Ultimately, I would really like to see "bzr branch sftp://remote/foo1
sftp://remote/foo2" use a remote copy. But the sftp protocol doesn't
seem to support copying a to b on the remote side, so we are stuck
pulling the contents locally and then copying them back.
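
To make that concrete, here is what a "remote copy" of even one file
degenerates to over sftp (a sketch assuming an already-connected
paramiko SFTPClient named sftp; the paths are made up):

    # Sketch: SFTPv3 has no server-side copy operation, so copying
    # foo1 -> foo2 means round-tripping every byte through the client.
    src = sftp.open('foo1/.bzr/revision-history', 'rb')
    dst = sftp.open('foo2/.bzr/revision-history', 'wb')
    try:
        while True:
            chunk = src.read(32768)    # download...
            if not chunk:
                break
            dst.write(chunk)           # ...and upload straight back
    finally:
        src.close()
        dst.close()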

So what do people think about doing a weave.join() rather than manually
copying one revision at a time? Coupled with making Transaction actually
cache the remote weave texts, I think we could get a push/pull that
doesn't horribly suck.

John
=:->