+1 maintenance report

Tue Feb 16 21:53:14 UTC 2021

On Mon, Feb 15, 2021 at 05:36:15PM +0100, Jan Ceuleers wrote:
> On 13/02/2021 04:49, Seth Arnold wrote:
> > Could we build a retriggerbot that smashes the retry button three times
> > before bothering any humans about failed tests?

Actually, we can be a lot more precise than that; see below.

> > Hitting retry is often the first troubleshooting step people take; I've
> > heard tests may be retried something like ten times by different people,
> > each of whom was taking a reasonable enough "first debugging step"
> > without noticing that other people have also done the same.
> 
> Not an Ubuntu developer but I do work as a quality manager. Not sure
> whether my list post will be accepted, so I'm copying you.
> 
> The assumption underlying your suggestion is that tests that
> intermittently fail do so because of intermittent failures in the test
> environment rather than due to actual bugs that manifest themselves only
> intermittently (such as race conditions).
> 
> This is fine if you have evidence that the assumption holds in a
> sufficiently large majority of cases.

It's certainly true that "randomly retry tests" has proven to be an
effective way to get things unblocked.  No denying that.

In this particular situation, though, what I did was scan build logs
for certain phrases such as 'Unable to connect to ftpmaster', 'Temporary
failure resolving', etc. that tend to be strong indicators of
environment problems rather than test problems.  There are a few other
good heuristics like tests that fail on only one architecture and pass
on all the others, or that are FTBFS on just one arch and haven't been
rebuilt in >15 days.  I also suspect some arches may be more likely to
see environmental failures than others, but I don't have conclusive data
there yet.  And yes, obviously if someone's already retriggered the test
within the past few days and it still failed, there's little need to
retry that specific set of retriggers on that migration item again.
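
A minimal Python sketch of the heuristics described above might look
like the following.  The function names, the three-day retry window, and
the shape of the inputs are all assumptions made for illustration; only
the two log phrases come from this message.

```python
import re
from datetime import datetime, timedelta, timezone

# Phrases quoted in the message; a real list would likely be longer.
ENVIRONMENT_PATTERNS = [
    re.compile(r"Unable to connect to ftpmaster"),
    re.compile(r"Temporary failure resolving"),
]

# Hypothetical cutoff standing in for "within the past few days".
RECENT_RETRY_WINDOW = timedelta(days=3)


def looks_environmental(log_text):
    """Return True if the log contains a known environment-failure phrase."""
    return any(p.search(log_text) for p in ENVIRONMENT_PATTERNS)


def fails_on_single_arch(results):
    """Return True if exactly one of several architectures failed.

    `results` maps an architecture name to True (pass) or False (fail).
    """
    failed = [arch for arch, passed in results.items() if not passed]
    return len(failed) == 1 and len(results) > 1


def should_retrigger(log_text, results, last_retry, now=None):
    """Decide whether an automatic retry is worth the resources."""
    now = now or datetime.now(timezone.utc)
    # Someone already retried recently and it still failed: skip it.
    if last_retry is not None and now - last_retry < RECENT_RETRY_WINDOW:
        return False
    return looks_environmental(log_text) or fails_on_single_arch(results)
```

A caller would pass the log text, the per-architecture results, and the
time of the last retrigger (or None if there has been none), and only
queue a retry when should_retrigger() returns True.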

So, I strongly agree with Seth that there's some good automation
potential in retriggering things; plus, I think we can be even more
precise in how this is done by looking at what's causing these kinds of
failures, and then hopefully use resources more efficiently.

Bryce