Crash database (Paper on Windows Error Reporting design and experience)

Fri Aug 12 17:49:12 UTC 2011

There was some discussion a few months ago about crash databases on the
ubuntu-metrics@ list.  Attaching the report and Evan's commentary.

----- Forwarded message from Evan Dandrea <ev at ubuntu.com> -----

Date: Tue, 7 Jun 2011 16:35:06 +0100
From: Evan Dandrea <ev at ubuntu.com>
To: ubuntu-metrics at lists.launchpad.net
Subject: Re: [Ubuntu-metrics] Paper on Windows Error Reporting (crash database) design and experience

On Fri, Jun 3, 2011 at 4:30 PM, Elliot Murphy <elliot at canonical.com> wrote:
> This paper: Debugging in the (Very) Large: Ten Years of Implementation
> and Experience

This was an excellent read; thanks Elliot.

Windows Error Reporting (WER) provides two-way communication between
the server and the client, allowing the server to request additional
information from the client if it's already seen this crash before and
the developer needs more before they can proceed (page 2 and page 7,
section 5).

This was a common theme throughout the paper.  They've automated the
user-communication portion of the debugging cycle to an extent that
you really have to marvel at.  This lets them do things like leak
detection in a subsequent run on a client (page 9, section 5.4),
without requiring any additional interaction with a user.  Our
process would require us to
ask people who are subscribed to the bug to try a
specially-instrumented build, with a traditionally very long feedback
loop between the developer and the bug subscribers.  Theirs is
entirely automatic.  Just wait for the next user who sees the bug to
click one "yes, I'd like to help make this product better" button.

We should replicate this.  I think in our design of an automatic crash
database, having the ability to set a flag so that the next user who
hits this stack trace, or whatever somewhat-uniquely identifiable
information for a bug we're using, automatically submits additional
data is a must.  Obviously apport does some of this, and while we
should always reuse code where possible, I don't necessarily think
bolting the existing python scripts onto this process by separating
them out of the packages and putting them in a more frequently
updated pocket is the answer.
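
A rough sketch of the "set a flag for the next user who hits this
trace" idea, in Python.  Everything here is hypothetical (the
CrashServer class, its method names, and the "core_dump" item are
illustrative assumptions, not apport's or WER's actual API); it only
shows the shape of the exchange:

```python
from dataclasses import dataclass, field


@dataclass
class CrashServer:
    """Toy in-memory crash database (hypothetical, for illustration only)."""
    # signature -> set of extra data items a developer has asked for
    wanted: dict = field(default_factory=dict)
    # signature -> number of reports received so far
    seen: dict = field(default_factory=dict)

    def request_more(self, signature, items):
        """Developer flags a signature: the next reporter should send `items`."""
        self.wanted.setdefault(signature, set()).update(items)

    def submit(self, signature):
        """Client reports a crash; the reply lists any extra data to upload."""
        self.seen[signature] = self.seen.get(signature, 0) + 1
        # pop() clears the flag, so only the next reporter is asked
        return sorted(self.wanted.pop(signature, set()))


server = CrashServer()
server.request_more("g_free|foo|main", ["core_dump"])
print(server.submit("g_free|foo|main"))  # first reporter is asked for more
print(server.submit("g_free|foo|main"))  # flag already consumed
```

The client would only need to act on the reply after the user has
answered the single yes/no question.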

We should set a goal of looking at our existing bug feedback process,
which should remain in parallel to these efforts, and see just how
much of it we can automate, as they have done.  As mentioned
elsewhere, we should not require a Launchpad login or any other form
of massive user interaction to get what we need.  This should all boil
down to a simple yes or no question.

Another really nice feature that we should ensure we add to any system
we create is the ability to direct the user to a solution if the error
they're automatically reporting is fixed in a later version or through
a series of steps.  WER seems to largely point the user at web pages,
but I imagine we can do a bit better by hooking into the
update-manager or upgrade-manager and presenting those options as a
single mouse click, if they will in fact fix the bug in question.  We
can obviously still do the web page (not a wiki page, please) thing if
a workaround exists, but no fix is available yet.

Bucketing is the process by which crash reports are associated with
bugs.  Microsoft uses a set of heuristics to place a crash in the
right bucket, or to separate crashes in a single bucket into multiple
buckets.  The problem of bucketing requires constant refinement and no
perfect algorithm exists.  However, as they improve their heuristics,
Microsoft runs them back over the existing data (page 12, section
6.3).  This struck me as one of the most complex parts of the system,
and it is surely worth studying in greater detail what they have done,
as well as what Google, Mozilla, and other open source groups have
managed to come up with.  I'm assuming our existing system through
apport just matches on stack traces, but I could be very wrong.
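
To make the bucketing problem concrete, here is a minimal sketch in
Python that assumes the simplest possible heuristic: match on the top
few stack frames.  This is not Microsoft's algorithm and may not be
what apport does; the function names, the ignore list, and the depth
parameter are all assumptions for illustration.  The point is that
the heuristic is a parameter, so a refined one can be run back over
the stored reports:

```python
from collections import defaultdict

# Frames that say little about the real fault (illustrative list only)
NOISE = ("abort", "raise", "__assert_fail")


def signature(frames, depth=3, ignore=NOISE):
    """Build a bucket key from the first `depth` non-noise frames."""
    useful = [f for f in frames if f not in ignore]
    return "|".join(useful[:depth])


def bucket(reports, heuristic=signature):
    """Group report ids by the key the heuristic assigns to their frames."""
    buckets = defaultdict(list)
    for report_id, frames in reports.items():
        buckets[heuristic(frames)].append(report_id)
    return dict(buckets)


reports = {
    1: ["raise", "abort", "g_free", "foo", "main"],
    2: ["g_free", "foo", "main"],
    3: ["memcpy", "bar", "main"],
}
first_pass = bucket(reports)
# Re-bucket the same stored data with a changed heuristic
second_pass = bucket(reports, heuristic=lambda f: signature(f, depth=1))
```

Keeping the raw frames around, rather than only the computed key, is
what makes the second pass possible.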

Of course, WER also provides an API for manually creating reports
(page 7, figure 7), and we should most certainly ensure this is
implemented in any crash database system we create, as it is in
apport, to accommodate third-party application developers.

One feature definitely missing from our stack, but present in theirs,
is the ability to consider application hangs as a bug (page 5, section
3.1).  WER also, through the Windows Shell, triggers a report if a
program fails to respond for five seconds (page 7, section 5.1).
Given that compiz already has the ability to grey an unresponsive UI,
I'm presuming we could easily wire this up.
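
A minimal sketch of the hang check in Python, assuming something
(compiz or otherwise) can tell us each time the application responds.
The HangWatchdog class and its method names are hypothetical; only
the five-second threshold comes from the paper:

```python
import time

HANG_THRESHOLD = 5.0  # seconds without a response, per the paper


class HangWatchdog:
    """Tracks the last time an application responded (illustrative only)."""

    def __init__(self, threshold=HANG_THRESHOLD, clock=time.monotonic):
        self.threshold = threshold
        self.clock = clock  # injectable so the logic can be tested
        self.last_response = clock()

    def responded(self):
        """Call whenever the application answers a ping or handles an event."""
        self.last_response = self.clock()

    def is_hung(self):
        """True once the application has been silent for the threshold."""
        return self.clock() - self.last_response >= self.threshold
```

A monotonic clock is used so that wall-clock adjustments do not
produce false hang reports.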

WER has the ability to flag crashes as regressions, working off the
existing data, and we should ensure that this gets built into the
automatic crash database.  Apport, to my knowledge, reopens the bug.
WER can also detect rootkits and hardware failures, such as corrupt
memory.
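
As a sketch of the regression check in Python: if a bucket is marked
as fixed in some version and a crash with the same signature arrives
from that version or later, flag it.  The function names are
hypothetical, and the dotted-integer version parsing is a deliberate
simplification (real Debian package versions need proper comparison
rules):

```python
def parse_version(text):
    """Turn '2.10.1' into (2, 10, 1).  Simplified: dotted integers only."""
    return tuple(int(part) for part in text.split("."))


def is_regression(fixed_in, crash_version):
    """True if a crash shows up in a version that should contain the fix.

    fixed_in is None while the bucket has no known fix.
    """
    if fixed_in is None:
        return False
    return parse_version(crash_version) >= parse_version(fixed_in)
```

Comparing parsed tuples rather than strings avoids treating "2.9" as
newer than "2.10".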

Most importantly, all of this is built at a sufficiently low level.
When a piece of malware causes the Windows shell to constantly restart
in a loop, WER and Windows Update continue to run, allowing the user
to not only provide data on the crash over the Internet, but also
receive the fix as soon as it's available without resorting to a
recovery console (page 10).  Presumably we'd have to fix Network
Manager to connect without an active session, or teach the client side
portion of this to use the Network Manager D-Bus API to connect to the
last-known AP or wired network.  Ideally, we'd find a way for this all
to work very early on in the boot process, allowing us to catch early
stage failures.

Finally, I was really encouraged by the statement that "WER has
changed the development process at Microsoft.  Development has become
more empirical, more immediate, and more user-focused" (page 9,
section 9).  I hope we can bring the same kind of change to Ubuntu through
learning from their experience and that of Mozilla, Google, and
others, and leveraging that to build an automatic crash database of
our own, presumably on the back of existing technology.

-- 
Mailing list: https://launchpad.net/~ubuntu-metrics
Post to     : ubuntu-metrics at lists.launchpad.net
Unsubscribe : https://launchpad.net/~ubuntu-metrics
More help   : https://help.launchpad.net/ListHelp

----- End forwarded message -----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: glerum-sosp09.pdf
Type: application/pdf
Size: 961366 bytes
Desc: not available
URL: <https://lists.ubuntu.com/archives/ubuntu-devel/attachments/20110812/41895d46/attachment-0001.pdf>