Retrospective on today's britney outage (10:42-20:55 UTC)

Julian Andres Klode julian.klode at canonical.com
Tue May 4 21:26:08 UTC 2021


britney outage from 10:42 to 20:55 (UTC)

At 17:47 (all times UTC) today I was informed by RikMills
that britney had not been running successfully since about
11:00, because autopkgtest.com/results was eventually 503-ing.

By 18:15, I had mistakenly identified this as a swift issue
and asked IS to investigate.

By 18:40 I heard back from IS that they did not see errors,
and looked at the autopkgtest-web logs a second time. Investigation
showed that the issue was haproxy disabling our cloud workers,
which both failed at the same time (well, regularly) due to
"database disk image is malformed" errors from sqlite.

By 19:00 Laney started working on moving the /results proxying
out of the Apache servers and directly into haproxy, and released
that at 20:34.

Meanwhile I started working on fixing the error by replacing
our simple file copying code with the SQLite online backup
API, as waveform had suggested earlier. That work was finished
at 20:56, after we finally figured out how to give me charm
store access (grant --channel unpublished did the trick).
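
For illustration, here is a minimal sketch of the online backup approach
in Python, whose sqlite3 module exposes the backup API as
Connection.backup() (Python 3.7 and later). This is not the actual
change that was deployed; the function name and the paths are
hypothetical.

```python
import sqlite3


def backup_database(src_path, dst_path):
    """Copy a (possibly live) SQLite database via the online backup API.

    Hypothetical helper for illustration; not the deployed code.
    """
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # Unlike a plain file copy, backup() goes through SQLite's own
        # locking, so the destination ends up as a consistent snapshot
        # even if another process writes to the source during the copy.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

A plain file copy can capture the database file mid-write, which is one
way to end up with a "database disk image is malformed" copy; the backup
API avoids that at the cost of going through SQLite for the copy.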

I identified some work items for the future:

* Some alerts for britney failing, as relying on community
  members to report 7 hours after the first failure is not super helpful.

* We probably also need some monitoring that alerts us to the high
  failure rate we had on the web servers.
-- 
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer                              i speak de, en


