New overview page for the Bazaar importer
James Westby
jw+debian at jameswestby.net
Fri Jan 8 00:52:39 GMT 2010
Hi,
After a few days spent doing some infrastructure work I'm now about to
present a more coherent interface to the current status of the bzr
importer. You can find it at
http://package-import.ubuntu.com/failures/.bzr/failures/index.html
(Sorry for the odd URL)
This is not intended to be a page like MoM's, but an internal page, and
one that will become less and less important as we iron out the kinks.
Interpreting the page isn't easy, so I'll give a brief explanation here;
please ask if something isn't clear.
The page starts out listing the currently running packages (or at least
an approximation of them; if the service stops suddenly it will show the
packages that were last running). You can see how long each job has been
running for.
Next is the queue: the packages that will be run next. Entries here are
generally triggered by a new upload, but the queue is currently swelled
by retries of some spurious failures. New uploads will jump the queue
ahead of the retries. Before I kicked off the retries the queue was
empty, which shows that machine usage isn't really an issue in keeping
up to date.
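To make the ordering concrete, here is a minimal sketch of how new
uploads can jump ahead of queued retries; the priority values and
package names are made up for illustration, this isn't the importer's
actual code:

    import heapq
    import itertools

    PRIORITY_UPLOAD = 0   # new uploads run first
    PRIORITY_RETRY = 1    # retries of spurious failures wait behind them

    _counter = itertools.count()   # tie-breaker keeps FIFO order within a priority
    queue = []

    def enqueue(package, priority):
        heapq.heappush(queue, (priority, next(_counter), package))

    def next_package():
        return heapq.heappop(queue)[2]

    enqueue("retried-pkg-1", PRIORITY_RETRY)
    enqueue("retried-pkg-2", PRIORITY_RETRY)
    enqueue("freshly-uploaded-pkg", PRIORITY_UPLOAD)
    print(next_package())   # -> freshly-uploaded-pkg, despite being queued last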
You may notice that it always seems to be running something, even when
the queue is apparently empty. This is because it uses dwell time to
check its own work. As soon as anything enters the queue it will be the
next package to run.
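In other words the main loop looks roughly like this (a sketch only,
with made-up function names, not the importer's real code):

    def run_forever(queue, import_package, pick_imported_package, verify_package):
        # Queued work always wins; the dwell time between uploads is spent
        # re-checking packages that have already been imported.
        while True:
            if queue:
                import_package(queue.pop(0))
            else:
                verify_package(pick_imported_package())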
Next is just a count of the failures. The intent is obviously to drive
this to zero.
To help with that it next displays the 100 most recent failures. This
means that new failures can be triaged.
To help with triaging, a "signature" is generated for each failure, and
the packages are grouped by these signatures. You can see these
groupings at the bottom, sorted by the number of packages affected.
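I haven't described exactly how a signature is computed, but as an
illustration, imagine it's taken from the last line of the failure's
traceback; grouping and sorting is then straightforward:

    from collections import defaultdict

    def signature(traceback_text):
        # e.g. "HTTPError: HTTP Error 503: Service Unavailable"
        return traceback_text.strip().splitlines()[-1]

    def group_failures(failures):
        # failures: iterable of (package, traceback_text) pairs
        groups = defaultdict(list)
        for package, traceback_text in failures:
            groups[signature(traceback_text)].append(package)
        # biggest groups first, as on the status page
        return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)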
Triaging the new failures is important because there are so many
spurious failures. Launchpad will often not respond kindly to the API
calls that are made, and so retrying the package is needed.
To help with this there are now layers of retry going on. The first
layer catches certain responses from LP and retries the request after a
backoff period, which helps with some of the one-off issues. Sometimes,
though, LP is having more trouble than that and can't respond to
anything for a while; that breaks out of the retry loop and records a
failure. I've just added the ability for certain signatures to be marked
as spurious: those failures will be retried after three hours, and if
they fail in the same way a second time they are marked for human
inspection, so that if the data itself is causing the bad response we
don't keep retrying indefinitely.
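Roughly, the two layers work like this (the names and the exception
class are just for illustration, not the real code):

    import time

    class TransientLPError(Exception):
        """Stand-in for the LP responses the first layer knows how to retry."""

    def call_with_backoff(request, attempts=3, delay=5.0):
        # First layer: retry the request after an increasing backoff period.
        for attempt in range(attempts):
            try:
                return request()
            except TransientLPError:
                if attempt == attempts - 1:
                    raise                 # give up; the caller records a failure
                time.sleep(delay * (2 ** attempt))

    def handle_failure(package, sig, spurious_signatures, failure_counts,
                       schedule_retry, mark_for_human_inspection, record_failure):
        # Second layer: a failure whose signature is marked spurious gets one
        # retry three hours later; if the same signature comes up a second
        # time the package is marked for human inspection instead.
        failure_counts[(package, sig)] = failure_counts.get((package, sig), 0) + 1
        if sig in spurious_signatures:
            if failure_counts[(package, sig)] == 1:
                schedule_retry(package, delay_seconds=3 * 60 * 60)
            else:
                mark_for_human_inspection(package, sig)
        else:
            record_failure(package, sig)   # shows up on the failures page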
This of course is designing for failure, which is vital. However, it's
the scale of the issue that has taken some getting used to. It may just
be a matter of magnitude: a lot of API calls are being made, so even a
small failure rate translates into a lot of issues, but it still seems
like a lot. I have of course been filing bugs on LP about issues that I
can identify, and spent some time today provoking bad responses and
digging into the reasons to file some more bugs. It seems that a lot of
the problem now is the appservers refusing to communicate, though, and
I'm not sure there's much I can do on my end to debug that.
If you look at the list of the last 100 failures you will probably see
some of this: clusters of the same signature, usually pointing to
network communication in some manner.
Thanks,
James