wheezy inhaling

vila v.ladeuil+lp at free.fr
Tue Feb 15 09:07:27 UTC 2011


Hi guys,

This week-end, I monitored (and helped a bit) the package importer
handling the backlog that followed the wheezy release.

This is a bit long, but many things happen when a new Debian release
is imported, so thanks for your patience ;)

For the impatient among you, some high-level, imprecise (TM) numbers:

- the 15,000 branches were imported in less than 3 days,

- the longest import was gnumeric, taking more than 2 hours (wall time),

- the average time to import a package was 5 minutes (wall time).

This produced a big peak in the graph below:

   https://lpstats.canonical.com/graphs/BranchesPerDayNew/

Roughly, wheezy required creating ~15,000 new branches, plus a couple
of pending failed attempts to clear. The package importer had been
turned off while Launchpad was being prepared for the rush and was
restarted on Friday 11 Feb circa 11:00 UTC (all dates below are in UTC
too).

http://webnumbr.com/outstanding-package-imports tells the complete story
from the package import point of view.

At 3 packages/min, the first estimate was that it would finish circa
Monday 14 Feb 12:00 UTC at best, and more probably Wed 16 Feb.
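
For those who like the arithmetic spelled out, a rough version of that
estimate (counting from the Friday restart and assuming the 3
packages/min rate holds):

  from datetime import datetime, timedelta

  packages = 15000                      # branches left to import
  rate = 3                              # packages per minute
  start = datetime(2011, 2, 11, 11, 0)  # Friday 11 Feb, 11:00 UTC restart
  eta = start + timedelta(minutes=packages / float(rate))
  print(eta)  # lands late on Monday 14 Feb if the rate never drops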

I thought it was a good time to run some load experiments since we
won't have another Debian release to watch until... some unknown date ;)

So I went to jubany and progressively raised the number of threads (by
doing things like 'echo 12 >max_threads' in the 'new' directory).

 2011-02-12 00:17:38.185000 - Concurrency is now: 8
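
For those who haven't seen that knob: the driver re-reads max_threads
between scheduling passes, so the concurrency can be retuned at runtime
with a plain echo. A minimal sketch of the idea (not the actual importer
code; the helper names are made up):

  def read_max_threads(path="max_threads", default=8):
      # Re-read the knob on every pass; fall back to a default if the
      # file is missing or contains junk.
      try:
          with open(path) as f:
              return max(1, int(f.read().strip()))
      except (IOError, ValueError):
          return default

  def schedule(queue, running, path="max_threads"):
      # Start queued imports until the freshly re-read limit is reached.
      limit = read_max_threads(path)
      started = []
      while queue and len(running) + len(started) < limit:
          started.append(queue.pop(0))
      return started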

Seeing that things went well, I kept adding 4 threads at a time,
watching https://code.launchpad.net/+recently-registered-branches to
make sure the branch scanner wasn't blowing up, and finally reached 32
threads.

 2011-02-12 10:53:53.084000 - Concurrency is now: 12
 2011-02-12 11:04:57.763000 - Concurrency is now: 16
 2011-02-12 11:09:54.381000 - Concurrency is now: 20
 2011-02-12 11:23:32.988000 - Concurrency is now: 24
 2011-02-12 12:25:38.213000 - Concurrency is now: 28
 2011-02-12 12:58:42.076000 - Concurrency is now: 32

Branch scans are queued (thanks to wgrant for clarifying this one), so
we can't DDoS the Launchpad branch scanner, which is good. Nevertheless,
I saw the number of branches waiting to be scanned rise as I raised the
number of threads.

So far, so good. The progression wasn't linear and the branch scanner
"lag" was quite erratic anyway, so no real correlation there. Just a
couple of data points: at 3 packages/min the number of branches showing
'This branch has not been scanned yet' was ~10 and the message
disappeared (the so-called lag) in less than 2 minutes. At 6
packages/min, I've seen the lag go up to 10 minutes and the number of
branches waiting to be scanned as high as ~40. But the max values were
reached only rarely; on average the branch scanner was around ~2 minutes
behind with ~15-20 branches left to process.
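
The 'watching' above was done by eye, but the same crude check can be
scripted by counting the not-scanned-yet message on that page (an
approximation only, the real page markup may not lend itself to a bare
substring count):

  try:
      from urllib2 import urlopen           # Python 2
  except ImportError:
      from urllib.request import urlopen    # Python 3

  URL = "https://code.launchpad.net/+recently-registered-branches"
  MSG = b"This branch has not been scanned yet"

  # Count how many recently registered branches are still waiting for
  # the branch scanner; this is the number I was eyeballing.
  html = urlopen(URL).read()
  print("%d branches waiting to be scanned" % html.count(MSG))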

No direct correlation either between the number of threads and the
number of packages per minute.

With 8 threads we reached ~3 packages/min, with 32 threads ~6
packages/min. jubany wasn't swapping; I was watching to keep at least
2GB free out of the 32GB. Free memory went down to ~1GB at one point
before stabilizing at 8GB (i.e. more than the initial ~6GB with 8
threads, which is weird, but I didn't investigate that either; the
buffers reported by top stayed at ~10GB the whole time and no swap was
ever used).

There is probably some bottleneck here but I didn't search for it; I
mostly wanted to collect a set of data points to establish a comparison
basis for the future. So nothing was changed except the number of
threads, and given the observed behaviour I didn't overload any part of
the tool chain, so the numbers should be valid.

A package import starts at ~45MB and generally ends after less than 1
minute of CPU time, having grown to somewhere under ~100MB. Some imports
go up to ~150MB, I saw a couple reach 200MB, and at least one hit 500MB
(gnumeric).

32 * 500MB = 16GB, so I thought I was safe leaving the p-i running with
32 threads.
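
Spelled out, the margin I was relying on (the per-import figure is the
worst case observed above, not a guarantee):

  ram_gb = 32          # jubany's RAM
  buffers_gb = 10      # roughly constant, as reported by top
  headroom_gb = 2      # free memory I didn't want to dip below
  worst_case_gb = 0.5  # a gnumeric-sized import

  # 40 worst-case imports would still fit, so 32 threads (at most
  # 32 * 0.5GB = 16GB) leaves a comfortable margin.
  print(int((ram_gb - buffers_gb - headroom_gb) / worst_case_gb))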

There were some transient librarian failures:

  Invalid http response for https://launchpadlibrarian.net/24134878/oneisenough_0.40-2.dsc: Unable to handle http code 503

The first one occurred in isolation and wgrant looked around on the
Launchpad side of things without finding real evidence (apart from spm
mentioning that restarting the librarian takes 10+ minutes :-/).

There have been ~100 failures related to the librarian returning
transient 503 http errors; I observed a couple of them occurring before
the librarian came back to normal in less than 10 minutes (i.e. too
small a window to attempt a more precise diagnosis, but we have logs so
we will be able to investigate and cross-check them against the
librarian ones).

A couple of examples below for those with access to the librarian logs:

  2011-02-13 00:13:08,126 - __main__ - WARNING - Importing octave-communications failed:
  bzrlib.errors.InvalidHttpResponse: Invalid http response for https://launchpadlibrarian.net/40039776/octave-communications_1.0.10-2.dsc: Unable to handle http code 503: expected 200 or 404 for full response.: 


  2011-02-13 00:13:28,751 - __main__ - WARNING - Importing jester failed:
  bzrlib.errors.InvalidHttpResponse: Invalid http response for https://launchpadlibrarian.net/18993704/jester_1.0-8.dsc: Unable to handle http code 503: expected 200 or 404 for full response.: 


  2011-02-13 00:13:28,785 - __main__ - WARNING - Importing monodevelop-database failed:
  bzrlib.errors.InvalidHttpResponse: Invalid http response for https://launchpadlibrarian.net/55583192/monodevelop-database_2.4%2Bdfsg-2.dsc: Unable to handle http code 503: expected 200 or 404 for full response.: 


  2011-02-13 00:13:28,789 - __main__ - WARNING - Importing bzr-rewrite failed:
  bzrlib.errors.InvalidHttpResponse: Invalid http response for https://launchpadlibrarian.net/58283982/bzr-rewrite_0.6.1%2Bbzr225-1.dsc: Unable to handle http code 503: expected 200 or 404 for full response.: 

  2011-02-13 00:16:55,411 - __main__ - WARNING - Importing potool failed:
  bzrlib.errors.InvalidHttpResponse: Invalid http response for https://launchpadlibrarian.net/56057698/potool_0.11-2.dsc: Unable to handle http code 503: expected 200 or 404 for full response.: 

I could provide more if needed; the log files mention ~100 of them.
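
Since these 503s clear up on their own, an obvious mitigation (not
something the importer does today; the fetch function below is
hypothetical) would be to retry the .dsc download a few times with a
pause instead of failing the whole import:

  import time

  def fetch_with_retries(fetch_dsc, url, attempts=5, delay=60):
      # fetch_dsc is whatever callable downloads the .dsc; the real
      # importer would only retry the InvalidHttpResponse/503 case seen
      # in the logs above, not every exception.
      for attempt in range(attempts):
          try:
              return fetch_dsc(url)
          except Exception:
              if attempt == attempts - 1:
                  raise
              # The librarian came back in well under 10 minutes each
              # time, so a few one-minute waits should cover it.
              time.sleep(delay)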

Since I was becoming impatient and things were going well, I went ahead
and pushed the importer some more for a short period:

 2011-02-13 00:33:27.476000 - Concurrency is now: 32
 2011-02-13 18:24:47.175000 - Concurrency is now: 40
 2011-02-13 18:33:55.523000 - Concurrency is now: 64
 2011-02-13 18:34:45.963000 - Concurrency is now: 62
 2011-02-13 18:35:26.124000 - Concurrency is now: 60
 2011-02-13 18:35:56.346000 - Concurrency is now: 57
 2011-02-13 18:36:36.756000 - Concurrency is now: 40
 2011-02-13 18:37:47.845000 - Concurrency is now: 39
 2011-02-13 18:38:18.547000 - Concurrency is now: 32

No trouble.

Later on, we hit an obscure failure:

  2011-02-13 19:56:59,671 - __main__ - CRITICAL - Driver hit database is locked

This is probably due to a race condition when starting the imports: the
current code doesn't wait for one import to start before starting the
next one (https://bugs.launchpad.net/udd/+bug/719196). It's even more
surprising to have encountered it *after* I pushed to 64 threads and
went back to 32...
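
Until that bug is fixed, one cheap way to defuse this particular failure
(assuming the status database is a plain sqlite file, which is what the
error wording suggests; the path below is made up) is to give every
connection a busy timeout so a second writer waits instead of bailing
out:

  import sqlite3

  def open_status_db(path="status.db"):
      # With a 30s timeout, a connection finding the database locked by
      # another import starting up retries quietly for up to 30 seconds
      # instead of raising 'database is locked' immediately.
      return sqlite3.connect(path, timeout=30.0)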

To be safe, I lowered the concurrency to 8 again to let the whole
wheezy import finish (thanks to spm for promptly restarting the
importer).

 2011-02-13 23:06:49.215000 - Concurrency is now: 8

Overall, I'm impressed by the robustness of the *whole* system[1]
(i.e. Launchpad + the package importer).

An important point is that we want to be able to temporarily raise the
number of threads in the future to handle such huge imports. We need
more monitoring and more data collection[2] to refine what resources we
really need between these big imports.

In parallel, my package-import.local vm has now reached '0' outstanding
jobs too. Of course it runs in read-only mode with package_import.py
being called with '--no-push', and it currently has 16403 failures due
to some access misconfiguration ;) But it runs, and that's a nice first
step.

Congratulations for staying with me this long!

       Vincent

[1] And while I'm working on making the p-i more robust, I'd like to be
clear, for the record, that James left us an already robust system and I
thank him for that. So, in the future, if you hear me say or see me
write: "the p-i sucks", I'll be talking about the remaining defects and
about the bugs we will add ;-)

[2] The obvious candidates are peak memory and CPU/elapsed times (we
can (and I will) extract the latter from the existing logs, see
bzr+ssh://bazaar.launchpad.net/~vila/udd/analyze_log/).
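
As a taste of what can be pulled out of those logs with very little
code, here is a sketch that extracts the timestamp and package name from
the 'Importing ... failed' warnings quoted above (the fuller
elapsed-time analysis is what the analyze_log branch is for):

  import re
  import sys

  # Matches lines like:
  # 2011-02-13 00:13:28,751 - __main__ - WARNING - Importing jester failed:
  WARNING = re.compile(
      r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+)"
      r" - __main__ - WARNING - Importing (?P<package>\S+) failed:")

  for line in sys.stdin:
      m = WARNING.search(line)
      if m:
          print("%s %s" % (m.group("ts"), m.group("package")))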


