Missing crash reports for long-running processes

Thu Aug 20 21:01:58 BST 2009

On Thu, Aug 20, 2009 at 08:37:53AM +0100, Matt Zimmerman wrote:
> Because the running X server is generally not restarted when it is upgraded
> (because this would destroy the user's session), this seems like a common
> case for users running the development branch.

It shouldn't be restarted automatically, but the user probably should be
strongly encouraged to restart after upgrading any of the core X bits.
In theory, the system should run fine with one set of things installed
and another, older version actually running; however, for QA purposes
such systems are going to be a PITA to debug, and I'm sure apport and
other QA tools assume that $thing_running == $thing_installed.
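
As a rough illustration of how a tool could test that assumption on
Linux (a sketch only, not something apport is known to do): when a
running binary is replaced on disk, the kernel reports the process's
/proc/<pid>/exe link with " (deleted)" appended.  The function name and
the proc_root parameter below are made up for the example.

```python
import os


def exe_replaced_on_disk(pid, proc_root="/proc"):
    """Guess whether a process is running a binary no longer on disk.

    Hypothetical helper: returns True if the exe link target ends with
    " (deleted)", False if it does not, and None if the link cannot be
    read (process gone, or insufficient permissions).
    """
    link = os.path.join(proc_root, str(pid), "exe")
    try:
        target = os.readlink(link)
    except OSError:
        return None
    # Heuristic: a file whose real name ends in " (deleted)" would be
    # misreported, so treat the result as a hint rather than proof.
    return target.endswith(" (deleted)")
```

Reading another user's /proc/<pid>/exe (such as the X server's) usually
needs root, so in practice the None case has to be handled.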

> Of course, this particular bug report would have been suppressed anyway, because the
> package was out of date, and furthermore, according to the backtrace in the
> log, my crash was a duplicate of
> https://bugs.edge.launchpad.net/bugs/343528.

Out of curiosity, did updating to a version with my patch solve the
crash, or no?

> This got me thinking, though, about the general case here.  This type of
> long-running process will generally *not* leave a crash report behind when
> it crashes, unless it happens very early in its lifetime (before the next
> package upgrade).  It seems like we may be missing out on the benefits of
> apport in this case.
> 
> Is this worth fixing?  I could imagine some simple changes to save the
> running Xorg binary for use by apport, if this would be useful.

Quite possibly, although we certainly have no shortage of apport X crash
reports that need to be gone through already, so I wouldn't attach a
very high priority to developing such a capability.  There are probably
a lot of other apport changes that could provide higher bang for the
buck.

Also, I've been noticing that X crashes fall into two categories:
trivial ones (null pointer derefs, out of bounds errors, etc.) and
really hard ones.  The first group I can usually address in Ubuntu by
just adding better checks.  (Upstream prefers instead to dig down to the
root cause of the problem, and considers checks to be "papering over"
the problem, but I figure if papering over means less crashing, that's
fine for us.)

The really hard ones usually require a depth of analysis beyond what
just a backtrace provides.  Upstream often requires a detailed test case
before they'll even look into it, even if the issue is widely reported.
Maybe there's more apport could do in such cases, but currently the
value apport gives for these bugs is limited.

Anyway, my guess is that crashes that occur when the installed and
running versions are inconsistent are more likely to fall into the
latter category.  There is also a high risk of the issue just "magically
going away" after rebooting, which makes isolating a test case
especially hard and tends to make the user's interest in helping us
debug the problem vanish.  Of course, I could be wrong, and if it could
be shown that an appreciable number of trivial (first-category) crashes
are captured I'd certainly change my mind.  But at this point, unless we
clear out the X bug queue or improve apport significantly, I suspect we
would not add much value by capturing these crashes automatically.
Certainly it wouldn't hurt (other than the increased volume), but there
are probably other apport features which would give more bang for the
buck.

Bryce