Analysis: Time spent in committing in a checkout

John Arbash Meinel john at arbash-meinel.com
Sat Sep 9 02:35:31 BST 2006


I'm trying out a checkout of a Launchpad branch. My ping time is
approx 100ms, so 'bzr commit' is rather slow. I thought I would
investigate what steps are involved, which might shed some light on
why that is. I did a --lsprof run of a simple commit which only changes
one file, and then tried to break down what was going on.


To start with, --lsprof says that the total time spent is 24s.


From the top down, we spend all that time in commit(), with this being
the basic breakdown:

  8.65s bzrlib.repository:243(fetch)
  6.77s bzrlib.commit:349(_check_bound_branch)
  3.31s bzrlib.repository:2028(commit)
  1.80s bzrlib.commit:391(_cleanup)
  1.18s bzrlib.commit:444(_gather_parents)
  0.69s bzrlib.branch:310(last_revision)

There is also:
  1.00s bzrlib.decorators:48(write_locked)
But unfortunately, our @needs_write_lock wrappers make it difficult to
follow certain code paths.
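
For reference, the decorator in bzrlib/decorators.py has roughly the
following shape (a paraphrase from memory, not the exact source), which
is why every decorated method shows up under the same write_locked frame
in the profile:

  def needs_write_lock(unbound):
      """Wrap 'unbound' so the object is write-locked around the call."""
      def write_locked(self, *args, **kwargs):
          self.lock_write()
          try:
              return unbound(self, *args, **kwargs)
          finally:
              self.unlock()
      return write_locked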


We can also look at it from the bottom up.
It turns out that 16s is spent in 132 calls to
paramiko.sftp:164(_read_packet).

1 call, which takes almost 1.5s, is paramiko.sftp:105(_send_version).

I assume that 1.5s is the time it takes to instantiate an sftp connection
with the server. I don't think we can do much about that.

The rest of the 16s is spent in paramiko.sftp_client:600(_read_response)
(112 calls, 14.5s). This is called basically any time we do an action in
paramiko; it is the function paramiko uses to interpret server response
information.

The time here is relatively stable, with each _read_response taking
approx 120ms. (They vary from 100 -> 200ms for some reason; I don't know
why some take so much longer. 100ms is very much on par with plain
latency, while 200ms looks like 2x latency.)


Of those 112 responses, 11 are write requests, 88 are
paramiko.sftp_client:571(_request), which is a generic request for all
sorts of operations that don't expect data back, and 13 are
'_read_prefetch', which indicates that we did a plain get() on a file.


So what are we doing that takes 112 requests to the server? Well, let's
see...

1) We open a Branch and a Repository, and lock them both. (A
   paramiko-level sketch of the whole lock/unlock cycle follows this
   list.) This requires:

   430ms
   a) One stat() call each, so that we can get the desired file and
      directory modes, for a total of 2 stats. For some reason, these
      take 200ms each, instead of 100ms.
       2   0.4332  paramiko.sftp_client:293(stat)

   2480ms
   b) We create 2 lock directories (1 branch, 1 repository). This
      actually takes quite a few steps.

      440ms
      b1) 2 mkdir() calls (which seem to take 200ms instead of the
          expected 100ms).
           2   0.4363  paramiko.sftp_client:265(mkdir)

      670ms
      b2) 2 append_file() calls (one per lock), which should be changed
          to put_file_non_atomic(), since append() must do a stat and
          then a write.
           +2   0.6690  +bzrlib.transport.sftp:564(append_file)

      440ms
      b3) 2 rename() calls to put the new directories into place
           +2   0.4379  +bzrlib.transport.sftp:580(rename)

      900ms
      b4) 2 calls to LockDir.confirm() to make sure that we have
          actually succeeded in locking the directories. This requires
          doing a get() of the contents of the file in the held/* dir.
          All of the time in confirm() is actually spent in peek(),
          which calls _read_info_file(), which spends some time in
          _parse_info() and some time in sftp.get().

          get() turns out to be rather more expensive than it should be:

          1) one round trip to open a file handle (~100ms)
          2) a stat() call, so that it knows how many bytes to prefetch
             (~100ms)
          3) the actual prefetch request, which seems to be
             asynchronous (good)
          4) one more round trip when we actually go to read() the
             contents. I assume some of this is because the prefetch()
             hasn't had any time to work yet, and this is a small file.

          So a get().read() is 3 round trips to the server. Now, I seem
          to recall profiling w/ and w/o prefetch, and for large files
          prefetch does a lot better. I believe this is because we can
          only request 32KB at a time. So without prefetch, we have to
          wait a whole round trip for each 32KB chunk, while with
          prefetch we just make a stat(), then some async requests, and
          then let the data stream back.
          (openssh supports 64KB requests, but the spec only guarantees
          32KB.)

          But I'm wondering if we could give hints to get() to indicate
          that we know this is a small file, so don't bother
          prefetching. Alternatively, we could implement get_bytes()
          such that it does everything in as few round trips as
          possible, making all of the requests asynchronously. However,
          I think it will take a minimum of 2 round trips: 1 to open
          the file handle, since you need that handle back before you
          can place a read request on it.

   1810ms
   c) We unlock both directories. This also involves several steps:

      911ms
      c1) 2 more calls to LockDir.confirm(). These act the same as (b4)

      450ms
      c2) 2 more calls to rename() to move them out of the way

      220ms
      c3) 2 calls to delete() to get rid of the old files

      220ms
      c4) 2 calls to rmdir() to get rid of the old directories
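
Putting steps (b) and (c) together, here is a minimal paramiko-level
sketch of one lock/unlock cycle. The directory layout and the nonce are
made up for illustration, and every sftp call below is at least one
network round trip:

  def take_and_release_lock(sftp, lock_dir, nonce):
      """Sketch of the LockDir traffic above; paths are hypothetical."""
      pending = '%s/pending.%s.tmp' % (lock_dir, nonce)
      held = '%s/held' % lock_dir
      # -- taking the lock --
      sftp.mkdir(pending)                     # b1: create the pending dir
      f = sftp.open(pending + '/info', 'ab')  # b2: append_file: open,
      f.write('lock info\n')                  #     implicit stat, write
      f.close()
      sftp.rename(pending, held)              # b3: atomically take it
      f = sftp.open(held + '/info')           # b4/confirm: open a handle,
      f.stat()                                #     stat for prefetch size,
      f.read()                                #     read the contents back
      f.close()
      # -- releasing the lock (c1, a second confirm, repeats b4) --
      sftp.rename(held, pending)              # c2: move it out of the way
      sftp.remove(pending + '/info')          # c3: delete the info file
      sftp.rmdir(pending)                     # c4: remove the directory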


So just the lock and unlock steps take up 4.3s (17%) of the entire time
to push a single change which modified a single file.
I just submitted a patch to use put_file_non_atomic(), which should
remove 200ms of that time. I also wonder why some commands take 200ms to
complete, and others only take 100ms (like rename() versus delete() or
rmdir()). This may just be random fluctuation.

I think the biggest gain here would be if we could redefine get_bytes()
so that it needs only 1 or 2 round trips, rather than 3. That would save
us time in 4 calls to LockDir.confirm(), for something between
800 -> 1200ms.
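
To make those round trips concrete, this is roughly what get().read()
does at the paramiko level today (the path is just an example, and with
the paramiko of the day, prefetch() performs the stat() itself):

  f = sftp.open('.bzr/branch/lock/held/info')  # round trip 1: file handle
  f.prefetch()     # round trip 2: prefetch stat()s for the size, then
                   # fires its read requests asynchronously
  data = f.read()  # round trip 3: wait for the data to arrive; prefetch
                   # knows the size, so no extra request to detect EOF
  f.close()
  # A get_bytes() that trusted a small-file hint could skip the
  # stat/prefetch step and issue a read right after the open, reaching
  # the 2-round-trip minimum described above.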

Also, this shows us where a smart server could be much more efficient:
a 'Lock This Branch' request could automatically lock both the branch
and the repository in a single round trip (100-200ms) rather than
taking 4s.

I also wonder if we could get Paramiko to allow some of this stuff to be
pipelined. When taking a lock, a whole lot of that could be one big
pipeline that can be stalled at any time, since we don't really need to
sync until we read the contents of the final lock directory. I realize
opening a file in sftp is a required sync point, since you need the file
handle before you can read or write anything.

Now, to make a change to a single file, we actually update 8 revision
control files: 4 index files and 4 data files (text.*, inventory.*,
revisions.*, and signatures.*).

This is 8 append_bytes() calls, which take 3.1s (less than the time it
takes to lock and unlock the remote directory). Each one takes *almost*
400ms, which in round trips would be (sketched below):
  1) Open a file handle
  2) Stat the remote file to determine its length
  3) Write out the contents and close
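
At the paramiko level, each append works out to something like this
sketch (bzrlib's real append_file differs in its details; path and data
are placeholders, and the file is assumed to already exist):

  f = sftp.open(path, 'r+b')   # round trip 1: open the existing file
  size = f.stat().st_size      # round trip 2: find the current length
  f.seek(size)                 # local bookkeeping only, no round trip
  f.write(data)                # round trip 3: write the new bytes
  f.close()                    # (the close acknowledgement is one more)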


Some other times are rather surprising considering the branch involved
is very small, and doesn't have much history. I think some of this is
--lsprof penalizing function calls, though.

We have to do 4 knit joins (as described above). The interesting stats are:
     4  5.6062  0.1274   bzrlib.knit:1560(join)
+21453  0.2398  0.1607   +bzrlib.knit:509(has_version)
    +4  0.4071  0.1089   +bzrlib.versionedfile:307(get_graph)
    +4  1.6765  0.0048   +bzrlib.tsort:27(topo_sort)
  +152  0.0017  0.0011   +bzrlib.knit:872(get_parents_with_ghosts)
    +8  0.0036  0.0002   +bzrlib.knit:1454(read_records_iter_raw)
    +4  3.1369  0.0002   +bzrlib.knit:384(_add_raw_records)


I already sent a patch for this, since I think we are accessing too many
inventory records, which causes topo_sort to take too long.
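
For reference, bzrlib.tsort.topo_sort() just takes (node, parents) pairs
and costs time linear in the size of the graph handed to it, so feeding
it the whole inventory ancestry instead of the few new records is what
hurts. A toy example of the interface:

  from bzrlib.tsort import topo_sort

  # Three revisions: A is the root, B and C both descend from it.
  graph = [('A', []), ('B', ['A']), ('C', ['A', 'B'])]
  print topo_sort(graph)   # ['A', 'B', 'C']: parents sort before children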

So, I didn't finish the analysis before I needed to stop for the night.
But I have the lsprof results in case anyone else is interested.

Another place that I think could benefit is the _KnitIndex.__init__
function. We do fp = transport.get() to get the file contents, and then
a fp.readline(), followed by fp.readlines().

I have the feeling that this is causing 2 round trips to the server:
looking at the paramiko code, readline() defaults to using an 8K buffer
for its read requests.
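
The pattern in question looks like this sketch (the index name is made
up). With an 8K buffer, readline() triggers one buffered read request,
and readlines() then continues from wherever that buffer ended:

  fp = transport.get('inventory.kndx')  # a hypothetical index file
  header = fp.readline()   # fills an ~8K buffer: the suspected round trip
  rest = fp.readlines()    # continues past that buffer: possibly one more
  # Reading everything at once would make the common (valid-index) case a
  # single fetch:
  #   contents = transport.get('inventory.kndx').read()
  #   header, rest = contents.split('\n', 1)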

The benefit of reading the header first is that if the index is corrupt,
we don't have to read the entire file. (Which would matter if we somehow
got a 600MB bad index file because someone copied it into place.)

However, I think the common case is that index files are just fine, and
it seems a shame to punish them so strongly. (Because of the potential
worst-case behavior of a bad file put in place of a .kndx, I'm not 100%
sure about this, but I'm 95%.)

John
=:->
