Desktop Team meeting, 2008-08-14

Fri Aug 15 21:36:07 BST 2008

On Thu, Aug 14, 2008 at 03:19:28PM +0100, Matt Zimmerman wrote:
> On Thu, Aug 14, 2008 at 03:05:11PM +0100, Scott James Remnant wrote:
> > == Automatic installation reports ==
> > 
> > mvo asked how useful the installation failure reports filed by apt are.
> > 
> > Should we not file them against the failed package (e.g. gedit) but
> > against a central virtual component (like install-failures) because most
> > likely the problem is not with gedit, but with e.g. scrollkeeper that is
> > run in the gedit postinst? This way triagers can pinpoint the problem
> > and assign to the right packages.  I got some complaints that the apt
> > bugreports clutter the buglists too much.
> 
> I'm curious about this, because I (as a tester) find this facility very
> useful in reducing the time it takes to collect the relevant information,
> check for duplicates and file a bug.
>
> What is the clutter?  If it's duplicate bugs, apport could be more
> aggressive/smarter about suppressing duplicates.  If it's bugs reported
> against the wrong package, maybe we could improve the duplicate detection in
> Launchpad.

I personally like this feature. I think that each maintainer script
failure/install failure is a problem and that we should try harder to
prevent those. Ideally we would have a system in place that
automatically rolls back to the original state in case of errors. But
because this is currently not possible, I think we need to try to make
the maintainer scripts as robust as possible.

Let me summarize the discussion during the meeting. It was argued that
the reports clutter the bug list because:

1. the information is often not useful ("postinst script failed with
   exit code 1" without further clues in the log)
2. the bug gets assigned to the wrong package. Most of the time it's
   something in the postinst failing (e.g. scrollkeeper segfaulting), so
   the bug should really be on scrollkeeper, but it is first assigned
   to e.g. gedit because it happens to run scrollkeeper in its postinst
3. there is a lot of "noise" in the reports, e.g. hardware problems
   (bad RAM) that cause some failure, customized systems that have
   random files removed (e.g. /etc/init.d/foo got removed and
   start-stop-daemon fails with an exit code) or simply a disk that
   is full
4. those are not bug reports but incident reports that can be turned
   into bugreports with manual labor (extract the right information
   etc) 

We discussed the following solutions:
- better client side filtering (e.g. don't report disk full errors)
- better server side filtering from the apport duplicate detector
- pushing all reports to a new pseudo package called
  "install-failures" and let the QA team triage those

I implemented some improvements to the client side filtering
(disk-full is filtered out now, better detection of follow-up errors
directly in libapt) and I plan to discuss the "install-failures"
component with the QA team.
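
As a rough illustration of the client side filtering idea, here is a
minimal Python sketch. It is not the actual libapt change; the function
name should_report() and the message patterns are assumptions made up
for this example. It suppresses reports for a full disk and for
follow-up errors caused by an earlier failure:

```python
import errno

# Hypothetical patterns, chosen for illustration only. The real checks
# in libapt/apport are not shown in this mail.
DISK_FULL_MARKERS = ("no space left on device", "disk full")
FOLLOWUP_MARKERS = ("dependency problems - leaving unconfigured",)


def should_report(error_message, err_no=None):
    """Return True if an install failure looks worth filing as a report."""
    msg = error_message.lower()
    # A full disk is a local condition, not a packaging bug.
    if err_no == errno.ENOSPC:
        return False
    if any(marker in msg for marker in DISK_FULL_MARKERS):
        return False
    # Follow-up errors only repeat an earlier failure in another package.
    if any(marker in msg for marker in FOLLOWUP_MARKERS):
        return False
    return True
```

Matching on message text is fragile (wording and locale can differ),
so a real filter would prefer error codes where they are available.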

To get a better idea of the problem, I gathered some data from
Launchpad and the apport-package tagged bugs there:

We currently have 2012 bugs with the tag "apport-package" in
Launchpad. Of these, 963 are open (not invalid, won't fix or fix
released). I looked at the most recent bugs to get an idea what
those are about and found:
(1) maintainer script failure because of syntax error/incorrect use of
  programs/diverts: #258353, #257522, #257490, #257213, #257162,
  #257131, #257003, #256930, #254969
(2) file overwrite issues: #257736, #257299, #257244, #257133, #256743,
  #256239
(3) packages/programs with defaults that cause the maintainer scripts to fail
  when they install: #258042, #257989, #257832, #257737, #257527,
  #257142, #257040, #256737, #256423
(4) local customization/hardware that causes breakage: #257989, #257745,
  #257375, #256987, #256968, #235164, #256461, #256276
(5) unknown: #257418, #257386, #256766, #256653, #256454, #256204, #256184
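
To see how the sample splits up, the bug numbers listed above can be
tallied with a few lines of Python. This is only a convenience sketch;
the numbers are copied from the list, and the variable names are made
up for this example. Note that #257989 is listed under both (3) and
(4), so there are fewer distinct bugs than list entries:

```python
# Bug numbers copied from the list above, keyed by category number.
categories = {
    1: [258353, 257522, 257490, 257213, 257162, 257131, 257003, 256930,
        254969],
    2: [257736, 257299, 257244, 257133, 256743, 256239],
    3: [258042, 257989, 257832, 257737, 257527, 257142, 257040, 256737,
        256423],
    4: [257989, 257745, 257375, 256987, 256968, 235164, 256461, 256276],
    5: [257418, 257386, 256766, 256653, 256454, 256204, 256184],
}

counts = {cat: len(bugs) for cat, bugs in categories.items()}
total = sum(counts.values())
distinct = len({bug for bugs in categories.values() for bug in bugs})

for cat, n in counts.items():
    print(f"category {cat}: {n} bugs ({100 * n / total:.0f}% of entries)")
print(f"{total} entries, {distinct} distinct bugs")
```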

I made the classification up on the spot and ordered it by "usefulness"
for us. Categories 1 and 2 are bugs that can be fixed or at least
diagnosed with the information in the bugreports.

Category 3 is bad defaults or bad recovery from problematic
conditions. E.g. a package that needs to connect to a database fails
when it can't for some reason (wrong password) instead of retrying or
writing a debconf note that some more configuration is required. Or a
package like java-doc that prompts the user to download some files
into a location and will fail if those files are not available.

Category 4 is problematic. Some stuff like pycentral not overwriting
local python files could be improved (or we could provide a way to
override this restriction on dist-upgrades). Others, like "fork()"
failures because of low memory or filesystem corruption, are much
harder to deal with.

Category 5 are the ones that I found no useful information in. Stuff
like "capplet-data postinst failed with error code 1" without any
further indication of what went wrong.