State of the Ubuntu Error Tracker

Sat Aug 18 16:15:29 UTC 2012

This is a periodic update of the status of the Ubuntu Error Tracker project
(whoopsie-daisy).

*Get involved*

We're looking for people to help with the Ubuntu Error Tracker project.
There are lots of interesting tasks that need to be done, spanning a wide
range of technologies and skills.

This document will get you started:
https://wiki.ubuntu.com/ErrorTracker#How_you_can_help

Equally, do feel free to get in touch if you would like to help out, but
are confused or need more information.

*What we're working on*

Matthew is working on formulating a better algorithm for the "if all
updates were installed" line in the average errors per day graph:
http://people.canonical.com/~evand/tmp/IMG_5330.JPG

Brian discovered the source of a number of corrupt reports that were being
sent to daisy.ubuntu.com. If you hit the close button when apport is
collecting information, the report is still sent, but without the rest of
the data:
https://bugs.launchpad.net/errors/+bug/1020994

*Handling multiple errors at once*

We need to coalesce multiple application error reports into a single
dialog and do the same for system error reports. I am actively working on a
branch to implement this functionality in apport.

https://wiki.ubuntu.com/ErrorTracker#When_there_are_multiple_simultaneous_errors
https://code.launchpad.net/~ev/apport/multiple-simultaneous-errors

*Redesigning the debconf dialogs to optionally send an error report when
shown*

The redesign work for the GTK debconf dialogs is complete and well covered
with a newly added test suite. Sending an error report when the box is
checked still needs to be implemented. I've had to put this down for now to
focus on improvements to errors.ubuntu.com needed by the release team and
the "handling multiple errors" implementation. Feel free to pick up from
the linked branch below. Just let me know you're working on it so that I
don't step on your toes.

https://wiki.ubuntu.com/ErrorTracker#When_there_is_a_debconf_prompt
https://code.launchpad.net/~ev/debconf/error-reports

*Optionally send an error report when an application hangs*

In the future, compiz will pop up an apport dialog when an application is
hanging, instead of only giving you the option to terminate the process.
The support for this in apport landed in 2.3 (r2423) via the --hanging
option. Sam pointed out that the solution as presented was not going to
work well. Matthew then reworked the UI and updated the specification
linked to below. These changes still need to be made to both the compiz
branch linked below and to apport. I haven't had time to make these changes
myself, so do feel free to pick them up. Just let me know if you do.

https://wiki.ubuntu.com/ErrorTracker#app-hang
https://code.launchpad.net/~ev/compiz/call-apport-on-hangs
https://code.launchpad.net/~ev/compiz/call-apport-on-hangs/+merge/113436/comments/243748
https://code.launchpad.net/~ev/compiz/call-apport-on-hangs/+merge/113436/comments/246738

*Laying the groundwork for creating bug reports from errors.ubuntu.com*

Right now crash-digger, the service that retraces error reports on
Launchpad, and daisy's own retracers run entirely independently of one
another. Daisy then builds a mapping of crash signatures it's seen to bug
numbers that crash-digger has found. This means that right now we cannot
create bugs from the daisy retracers or http://errors.ubuntu.com.

The initial plan was to have crash-digger talk to a new daisy backend,
which would use the daisy database as a shared brain between crash-digger
and the daisy retracers. This requires some rethinking of how the backend
would behave compared to the existing Launchpad one, as daisy keys on crash
signature and crash-digger keys on bugs. There's also a lot of logic around
using new bug numbers when a problem is reintroduced, rather than reusing
the existing one, that dates back to before we had fine-grained
notification controls on Launchpad and would not work in the daisy backend,
anyway.

Martin and I came up with an ideal workflow for this back in June:
http://paste.ubuntu.com/1152707/

However, I've thought about this some more and it might be reasonable to
leave crash-digger well alone and just mark the crash-digger bug as a
duplicate of the daisy-created bug at the point when daisy is importing
those links between crash signature and bug number.

Do feel free to investigate this one, but please do so as part of an email
discussion with Martin Pitt and me.

https://code.launchpad.net/~ev/apport/daisy-duplicates-db
https://code.launchpad.net/~ev/errors/create-bug

*Charm the Error Tracker infrastructure*

There is a set of scripts in
lp:daisy/setup <http://bazaar.launchpad.net/~ev/daisy/trunk/files/head:/setup/> which
will let you set up some of the Error Tracker infrastructure in an
OpenStack cloud. However, this is hackish at best and has been abandoned
for charming the components instead. I've made some progress on charming
daisy, which should get you enough infrastructure to start reporting local
crashes into your Error Tracker instance. We still need improvements to
this charm and a charm for the errors.ubuntu.com Django site
(lp:errors <https://code.launchpad.net/~ev/errors/trunk>).

This is a fairly easy one to tackle if you are comfortable with shell
programming.

https://code.launchpad.net/~ev/charms/precise/daisy/trunk
http://paste.ubuntu.com/1152635/

*Recoverable errors*

As of Ubuntu 12.10, you can programmatically generate an error report in
your application. Just feed nul-separated key-value pairs to
/usr/share/apport/recoverable_problem. If you supply a DialogBody key, its
value will be used as the short description in the apport dialog that
appears. Do make sure you provide a DuplicateSignature key - a value that
uniquely groups a set of instances into a problem.
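
As a minimal sketch of the above, assuming the helper reads the pairs from
standard input as alternating key and value fields separated by NUL bytes
(that framing, the function names, and the example values are assumptions
for illustration, not something taken from the apport documentation), a
Python application might do something like this:

```python
import os
import subprocess

# Path given above; everything else in this sketch is illustrative.
HELPER = "/usr/share/apport/recoverable_problem"


def build_payload(fields):
    """Join key/value pairs into one NUL-separated byte string.

    Assumes keys and values alternate (key, value, key, value, ...)
    with no trailing NUL.
    """
    if "DuplicateSignature" not in fields:
        raise ValueError("DuplicateSignature is needed to group instances")
    parts = []
    for key, value in fields.items():
        if "\0" in key or "\0" in value:
            raise ValueError("keys and values must not contain NUL")
        parts.extend([key, value])
    return "\0".join(parts).encode("utf-8")


def report_recoverable_problem(fields):
    """Send the fields to the helper; return False if it is not installed."""
    payload = build_payload(fields)
    if not os.access(HELPER, os.X_OK):
        return False
    subprocess.run([HELPER], input=payload, check=False)
    return True
```

Calling report_recoverable_problem() with a dict such as
{"DuplicateSignature": "myapp:sync-failed", "DialogBody": "Sync failed."}
(hypothetical values) would then hand the report to apport on systems
where the helper exists, and do nothing elsewhere.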

*Continued development of the errors.ubuntu.com website*

We've landed a number of changes to http://errors.ubuntu.com lately:

   - The graph now shows the average errors per day for both Ubuntu
   12.04 and Ubuntu 12.10. However, the calculation we're using for this is
   incorrect. It divides the number of errors seen in the day by the number of
   unique systems seen in that same day. A more accurate measure will be to
   divide the number of errors seen in the day by the number of unique users
   seen in the past 90 days: https://bugs.launchpad.net/daisy/+bug/1033913.
   I've fixed this, but given a datacenter move it will have to wait until
   next week to be deployed.
   - In the most common problems table, if a linked bug is marked as
   completed, the entire line will be greyed out. If the "Last seen" version
   is not the latest version, then the "Last seen" version will be greyed out.
   This implies that the issue is no longer present in the more recent
   version. If the "Last seen" version is the latest version, but the linked
   bug is marked as completed, then the "Last seen" version will be marked red
   to indicate a possible regression. The code for this talks to Launchpad,
   slowly. We think this may be what's causing some timeouts to appear when
   loading the table. I have a deployment in process to allow us to turn this
   functionality on and off as part of the URL, which should help us get to
   the bottom of the problem.
   - You can now select a date range for the most common problems table.
   - The individual problem pages now show a graph of the number of
   instances over time. They also show a breakdown of the number of instances
   by version of that application.
   - The individual instance pages have been redesigned. They now look more
   like apport reports, with expanders hiding the fields you're unlikely to
   care about in the majority of cases.
   - There is an outstanding deployment request to move us to a better
   system for managing authentication. Soon you will no longer have to log in
   every time you view a problem or instance page. Nor should you get the
   errors that some people were seeing when attempting to authenticate.
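
The highlighting rules for the most common problems table described in the
list above can be sketched roughly as follows. This is not the code from
lp:errors; the function name and the style names are hypothetical, and
version comparison is reduced to simple equality:

```python
def highlight(bug_completed, last_seen, latest):
    """Return (row_style, last_seen_style) for one table row.

    Style names are hypothetical; None means default styling.
    """
    # A completed linked bug greys out the entire line.
    row_style = "grey" if bug_completed else None

    if last_seen != latest:
        # Not seen in the latest version: probably no longer present.
        last_seen_style = "grey"
    elif bug_completed:
        # Seen in the latest version despite a completed bug:
        # possible regression.
        last_seen_style = "red"
    else:
        last_seen_style = None
    return row_style, last_seen_style
```

For example, a problem whose bug is marked completed but which was last
seen in the latest version gets a greyed-out line with a red "Last seen"
cell.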

https://code.launchpad.net/~ev/errors/trunk
https://code.launchpad.net/~ev/daisy/trunk
https://code.launchpad.net/~ev/oops-repository/whoopsie-daisy

*In the future*

Of course there is still plenty of work to be done. Feel free to grab
something and help out. Just let us know if you do.

https://bugs.launchpad.net/errors
https://bugs.launchpad.net/daisy

https://blueprints.launchpad.net/ubuntu/+spec/foundations-q-crash-database
https://blueprints.launchpad.net/ubuntu/+spec/foundations-q-metrics
https://blueprints.launchpad.net/ubuntu/+spec/foundations-q-updates-from-crash-reports
https://blueprints.launchpad.net/ubuntu/+spec/foundations-q-fix-ddebs
https://blueprints.launchpad.net/ubuntu/+spec/foundations-q-bucketing-improvements
https://blueprints.launchpad.net/ubuntu/+spec/foundations-q-phased-updates

*Interesting numbers*

There is not much variance in the number of unique systems that reported
errors in a single day. On the first day we recorded it, there were 67,565
unique systems. That dipped into the upper 40,000s and has since fluctuated
between the upper 40,000s and upper 50,000s. This does not at all imply that the
same systems are reporting every day, and indeed the data does not bear
that out.

Given a formula of the total number of errors reported in a day divided by
the number of unique systems reporting errors over a 90-day period, the
average system is experiencing 0.05 errors per day.
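
To make the difference between the two denominators concrete, here is a
small sketch. The figures in it are invented purely for illustration and
are not taken from the crash database; the function name is also
hypothetical:

```python
def errors_per_system(errors_in_day, unique_systems):
    """Average number of errors per system for one day."""
    if unique_systems <= 0:
        raise ValueError("need at least one system")
    return errors_in_day / unique_systems


# Invented figures, for illustration only.
errors_in_day = 1000
systems_seen_that_day = 400
systems_seen_in_90_days = 2000

# Dividing by the systems seen that same day overstates the rate,
# because only systems that hit an error show up in that count.
same_day_rate = errors_per_system(errors_in_day, systems_seen_that_day)

# Dividing by the systems seen over 90 days approximates the whole
# reporting population, including systems that had a quiet day.
ninety_day_rate = errors_per_system(errors_in_day, systems_seen_in_90_days)
```

With these invented inputs the same-day calculation gives 2.5 errors per
system, while the 90-day calculation gives 0.5.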

*Open tickets*

55342 <https://rt.admin.canonical.com/Ticket/Display.html?id=55342> - *Please
deploy lp:errors r146*

This is being worked on now, but likely won't land until after the
weekend. It provides http://errors.ubuntu.com/?launchpad=false, which lets
us turn off the launchpad integration and see if that has much of an effect
on the recent timeouts in the most common problems table. It also fixes the
"my error reports" URL (https://errors.ubuntu.com/user/sha512-of-system-uuid
).

55322 <https://rt.admin.canonical.com/Ticket/Display.html?id=55322> - *Setup
django-openid-auth backed by a database for errors.ubuntu.com*

This fixes the "OpenID from two Apache frontends" problems and also
caches logins.

It also opens the door to having the default view of
http://errors.ubuntu.com be "errors that I am responsible for" as we can
match the group data from OpenID against the spreadsheet of package to team
mappings that Kate and the QA engineers created. Finally, it means that we
can further restrict and provide an audit trail when the server-side
package hooks get implemented.

53325 <https://rt.admin.canonical.com/Ticket/Display.html?id=53325> - *Need
an instance of jmxtrans talking to the crash database cassandra ring and
outputting to a (new?) graphite server*

This finally, finally gives us something for monitoring the health of the
Cassandra cluster by the moment, and lets us get the big picture view that
nodetool (Cassandra's console-based stats program) does not provide. It
feeds the JMX data from Cassandra into Graphite. Included will be data like
the current and average read/write speed, the state of compaction, etc.

This also covers setting up statsd in order to get graphs of API calls and
failures from errors.ubuntu.com, as well as graphs of other non-persistent
data from daisy.ubuntu.com.
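
For anyone unfamiliar with statsd, counting an API call amounts to sending
a small UDP datagram in its plain-text line format. The sketch below shows
the idea; the metric name, host and port are assumptions for illustration,
and this is not code from lp:errors or lp:daisy:

```python
import socket


def statsd_counter(name, value=1):
    """Format a counter increment in the statsd line protocol."""
    return "{}:{}|c".format(name, value).encode("ascii")


def send_metric(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumptions."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet, (host, port))
    finally:
        sock.close()
```

A view handling an API request could then call
send_metric(statsd_counter("errors.api.calls")) on each request and a
second, similarly named counter on each failure.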

52506 <https://rt.admin.canonical.com/Ticket/Display.html?id=52506> - *Staging
setup for crash database*

This is complicated by the fact that we do not yet have a strategy for
feeding data from the production ring into the staging ring when the latter
is smaller, which is delaying the ticket.

55339 <https://rt.admin.canonical.com/Ticket/Display.html?id=55339> - *Investigate
crashdb OOPS column diskspace tuning*

We're hitting some growing pains around the column family that contains the
actual error reports. We can ease compaction (housekeeping for performance)
by doubling the amount of I/O we do. Given that the OOPS column family is
heavily weighted towards writes (I imagine people mostly get what they need
from the problem pages; I could be wrong), I suspect we'll have to look a
bit further for alternatives.