+1 maintenance report 2023-06-[19..23]

Adrien Nader adrien at notk.org
Fri Jun 23 21:42:28 UTC 2023


Hello,

I have been on +1 for the week. This has been my second shift.

This is a long e-mail. I decided to approach this +1 shift differently. My
goal was "merely" to leave things better for the next +1 person(s), so I
decided to work on fewer tasks, but ones whose new status is easy to
understand (i.e. not half-done).

I could have made this shorter, but I thought it could be interesting for
people new to +1, and it might also get me feedback from those very used
to package migration (but I definitely won't complain about anyone not
reading everything). I also saw value in understanding and
explaining/sharing info about some of the migrations since we're right
after the Debian thaw and some of the migrations will not be done for
at least several weeks.

It's also long because I wouldn't really be able to mention the people who
helped otherwise. Even though probably not many more packages migrated,
I think things have improved a fair bit and that's also thanks to more
than a dozen people across Ubuntu and Debian. Thanks!

# Global stuff

I've decided to finish on Thursday evening / Friday morning and dedicate
Friday to closing the last tasks, making sure all the relevant bugs are
created and linked, and writing my report so that I'm sure it's sent by the
end of the week. Last time I was still working on the packages themselves on
Friday and the report was severely delayed.

I focused on clusters this time. That difference wasn't purely by choice: this
week there were very clear clusters, while my previous shift happened when the
test runners were in very bad shape and there was nothing that obviously stood
out to look at first.

I've used Matthieu's visual-excuses tool for this; it doesn't display linked
bugs at the moment, which can be misleading, but this should be added soon. In
the meantime, keep reading update_excuses.html (not that its UX is perfect
either: linked bugs are at the end of the list while they should be near the
top). I've kept the corresponding browser tab open throughout my shift.
Getting rid of a cluster is also quite motivating.

  visual-excuses: https://github.com/mclemenceau/visual-excuses/

I've also used a tool that I started writing on Monday in order to
analyze logs. It's very, very young at the moment but I'll soon be able to
start collecting logs. I talk about it a bit more in the last section of
this message.

# The shift itself

## Infrastructure issues

I started looking for transient failures. These can be low effort, yet
rewarding. More importantly, it's very frustrating to be waiting for easy
targets at the end of the week when they could have been spotted and retried
at the beginning of the week instead.

I loaded the Autopkgtest Grafana dashboard and noticed the large queue of
armhf tests. I've seen large queues associated with transient failures before,
and many logs were indeed showing websocket and network errors. The crux is
that a heavily loaded infrastructure can result in timeouts, which prevent
packages from migrating until they're retried when the infrastructure is less
loaded.

  autopkgtest grafana: https://ubuntu-release.kpi.ubuntu.com/d/76Oe_0-Gz/autopkgtest?orgId=1&from=now-10d&to=now

On Monday I worked on my tool to list packages with such errors; in parallel,
Graham independently retried many packages that I later wanted to see
retried. At least the goal of my tool seemed sound.
The tests still failed though. After bringing up the topic on MM on Wednesday,
Paride investigated and raised the issue with IS. At some point after this,
the tests were passing normally again.

Things would have been better if this had been spotted and reported earlier
on. Looking at the Grafana board, I can see several hints but I don't know if
any single one is enough to conclude something very fishy was going on.

- there was a large increase in queue size for all arches late Wednesday last
  week but armhf was the only arch with a non-empty _and_ _growing_ queue
  immediately before that jump
- the job throughput for armhf was 0 for 12 hours soon after the queue size
  increase
- the jobs abnormal failure rate reached 100% and had no data point for 12
  hours soon after the queue size increase; rates for other arches also
  reached 100% several times but kept being logged at a sensible rate
- the jobs testbed failure count is related to the failure rate and doesn't
  seem to yield hints (also, I don't know if it's testbed failures or test
  failures)
- the graph of jobs finished per cloud is difficult for me to read/understand
- the "cloud workers in active / error" panel seems interesting for this event
  since there's a large increase in errors that seems correlated with the
  issue here; could someone add a graph of the "error/(active+error)" ratio,
  since that might be easier to parse visually?

Overall I think it's worth looking at these graphs when starting a maintenance
shift and maybe throughout it, but a cursory glance for queue size was not
enough.

## Clusters 

Clusters galore!

As I hinted before, the visual-excuses tool was very useful here. I could
easily spot clusters for coq, R, Ruby and gnustep, a couple for rust, and a
cluster for perl (which I happily ignored since it's under Foundations'
umbrella), rather than jumping around the .html file trying to keep the
relationship graph in my head.

NB: I'll be listing update-excuse bugs below even when I haven't touched them

### Coq

update-excuse bug for coq-unimath: https://launchpad.net/bugs/2024463

There were three issues. A quick look at https://tracker.debian.org showed
that two packages had already been updated; they got synced within a day and
migrated soon after.

The problematic package was coq-unimath, which FTBFS on armhf with an error
like 'Exception: Invalid_argument "String.create".'; OCaml rarely goes foobar,
so that really meant a string couldn't be created with the required size
(armhf is the only 32-bit arch here). The Debian maintainer dropped support
for armhf in reaction to this. I noticed an upstream change to the file that
was failing to build, and that change mentions "32 bit".

I asked on the Debian bug tracker whether there was interest in restoring
support, since coq is demanding and there are probably very few users on
armhf. Julien Puydt stated he was happy to keep supporting diverse platforms
and was overall very quick to react. After I built the package (~27 hours on
emulated armhf) and confirmed the commit fixed the FTBFS, the fix was
integrated in Debian. Since then, the patch has trickled down to Ubuntu and is
currently queued for testing on s390x. Unfortunately the s390x queue has been
partly stuck, but hopefully it will finish before Monday and the cluster
should migrate right after.

LP: https://bugs.launchpad.net/ubuntu/+source/coq-unimath/+bug/2024463

### Ruby

update-excuse bug for ruby-rackup: https://launchpad.net/bugs/2024463

I didn't notice the update-excuse bug at first.

That's an ongoing transition which is not ready yet. Lucas Kanashiro is
working on the transition but also stated on IRC that some apps still do not
support ruby-rack version 3 upstream. Lucas said he welcomes patches for this
transition; I didn't have time to dig deeper into this topic, however.

### R

update-excuse bug for r-base: https://bugs.launchpad.net/ubuntu/+source/r-base/+bug/2020799

There's also an R transition ongoing. As far as I understand, it "started"
when a package meant for Debian experimental was instead pushed to unstable.

There's a breaking change in R (well, at least one) where C bindings need to
explicitly "enable registration", whatever that means and implies. I'm
counting 41 affected packages and I don't think that number is going down (you
will see "DLL requires the use of native symbols" in the logs). The migration
will probably take a long time, especially since at least some of these
projects haven't seen any commit in several years.

Since Steve had worked on these packages, I asked him why we wouldn't just
drop these changes instead. His answer was that it would require creating a
list of packages to drop and restore, and that it's easier to make the topic
clear in an update-excuse bug.

The cluster is fairly well isolated from the rest of the packages except for
one link to hunspell through r-cran-hunspell. Graham mentioned that if needed,
r-cran-hunspell could be rebuilt in the release pocket to avoid the dependency
on the new R version.

### GNUstep-base

update-excuse bug: none; I'm not sure I would know what to say in it...

There's an issue with NSURL or its tests. I quickly found an upstream commit
about using "example.com" for tests rather than "httpbin.org". I assumed that
httpbin.org had gone down fairly recently and that the tests had been failing
because of that. Since it was now up, I confidently asked to retry the tests.
They failed again. I dug a bit more and didn't find anything obvious, and I
was not able to reproduce the failure locally.

At first I wasn't able to run the tests locally at all: while d/test/testsuite
says to use the same "configure" invocation as in d/rules, it doesn't actually
do so. After changing that, I was able to run autopkgtest but I was still
unable to reproduce the failure, even with a completely made-up host. The
tests also use a locally-spawned server on port 1234, which shouldn't cause
specific issues (I'm now wondering if something else might be using port
1234).
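
If someone wants to check that hunch on a testbed, a quick probe along these
lines (a minimal sketch, nothing gnustep-specific) shows whether something
already owns the port:

    import socket

    def port_in_use(port, host="127.0.0.1"):
        # Try to bind; if the bind fails, something is already using the port.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
            except OSError:
                return True
            return False

    print("port 1234 in use:", port_in_use(1234))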

The logs show stuff like "Failed: [...] OK" which is 110% unhelpful. I gave
up. The gnustep-base cluster is entirely isolated and low impact.

### eccodes

update-excuse bug for eccodes: https://bugs.launchpad.net/ubuntu/+source/eccodes/+bug/2024934

This is meteorology software. There were missing builds on several (all?)
arches due to trying to pass a pointer to float where a pointer to double is
expected. There are errors on Debian too.

I guess the issue was introduced in https://github.com/ecmwf/eccodes/commit/a8ddefceaf6090f17cff3be7b8bd46fd117b77d4 .

The bug tracker for the project is not public. Since it is difficult to
discuss with the upstream maintainer and there is no way to know the reason
for the commit above from its commit message, I decided not to spend more
time on this.

Looking again at the package a couple of days later, I can see there has been
a new upload to unstable on the 18th and the changelog seems quite related to
these issues:

    * ecCodes now only builds on 64-bit archs.
    * Don't use FAST_BIGENDIAN code; it breaks on s390x due to const double
      constraints

It still doesn't build on s390x, but this shows the Debian maintainer is
working on it. I think it's probably better to wait and see in a few days or
weeks whether anything more is needed.

### Rust

#### rust-gio

update-excuse bug for rust-gio: https://bugs.launchpad.net/ubuntu/+source/rust-gio/+bug/2021531

On armhf, the rust runtime was unable to kill itself the way it wanted to, so
it still killed itself, just in a different way (I'm great at summaries!). I
asked Simon about this and he suspected a toolchain issue on armhf. I left
this to him.

Zixing then investigated, traced it to glib, and found an upstream patch which
he backported; Jeremy Bicha merged it in Debian and Simon synced it. It is
unfortunately stuck in what can maybe be called "debcargo dependency hell"
(see the rust-several-other-things section below).

#### rust-sequoia-openpgp

update-excuse bug: none, since the issue with the package itself has been
solved, even though it cannot migrate due to rust-rustls (more on that two
sections below)

SIGKILL during tests. I quickly asked Simon for his gut feelings; for such
errors, he typically suspects running out of memory. I reproduced similar
errors on amd64 by limiting the memory available to autopkgtest (qemu runner).

I wasted some time because I misread the errors and thought this was about
armhf when it was actually arm64. The solution was to mark this as
big_package, and Graham quickly merged the change. I wouldn't be surprised if
this started failing on armhf in a few months or years, but maybe by that
point those runners will have more than 1.5 GB of memory.

There were other issues around finding the right set of triggers, which is
what I will be discussing in the next section. This led to rust-rustls, which
I talk about two sections below.

#### rust-several-other-things

This isn't specific to a single package.

I discovered a failure mode for rust packages which was common this week and
quite annoying. I don't know if it was completely unknown before or merely not
well identified and publicized.

The setup is: package rust-x is triggered for package rust-y, and its tests
pull in rust-z, which in turn pulls librust-z-dev. That -dev package always
matches the most recent version of rust-z, but the version actually installed
can be lower since the trigger doesn't include rust-z. The solution is to
include rust-z in the trigger, using the version of librust-z-dev that apt
installs (just search for librust-z-dev in the test log).

You can easily spot this for rust packages because the log will contain

    crate directory not found: /usr/share/cargo/registry/
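
To make the trigger fix above concrete, here is roughly how such a retry URL
ends up looking. The package names and versions are made up, and I'm assuming
the usual release/arch/package/trigger parameters of the retry form:

    from urllib.parse import urlencode

    # Hypothetical example: rust-x failed when triggered by rust-y, and the
    # test also needs the rust-z version whose librust-z-dev apt installed.
    params = [
        ("release", "mantic"),
        ("arch", "armhf"),
        ("package", "rust-x"),
        ("trigger", "rust-y/0.2.0-1"),
        # version taken from the librust-z-dev line in the test log
        ("trigger", "rust-z/1.4.0-2"),
    ]
    print("https://autopkgtest.ubuntu.com/request.cgi?" + urlencode(params))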

This sounds simple but in practice it's pretty annoying. First, there isn't
always only one package to add to the triggers. Second, by the time you look
at the issue, add the trigger, and the tests finish, it's possible that some
related packages have arrived and added new constraints (and I'm not counting
the time needed for someone to load my trigger links, because I never had to
wait for that this week! :) ).

That sounds an awful lot like "dependency hell".

Sebastien mentioned on IRC that it is also possible to use the all-proposed
hammer. I'd like to look at rust packages for this a few more times because
I'm wondering if acting early can prevent the cluster from growing, which
could avoid requiring more complex triggers. However, if this happens more
often, some improvement will be needed. Obviously, by the time I proofread my
report, all tests for rust-glib were failing due to this, so we have something
to experiment on.

#### rust-rustls

update-excuse for rust-rustls-pemfile: https://bugs.launchpad.net/debian/+source/rust-rustls-pemfile/+bug/2024936

This package was blocking the migration around
rust-sequoia-openpgp/rustls/rustls-pemfile/reqwest/...

I thought this was due to the aforementioned issue with triggers but that
didn't fully make sense and after a test retry, I asked Simon if a no-change
rebuild made sense. He immediately answered that no-change rebuilds are very
rarely needed for rust packages, and he started digging deeper.

After seeing that the package is not maintained by the Debian Rust Maintainers
team and is built differently, Simon found a report explaining the issue on
the Debian BTS (linked from the update-excuse bug).

The plan at the moment is to wait for the fix to land in Debian (at some point
next week probably).

#### rust-syn

Graham investigated this along with me: in debcargo-conf, the package version
number is hard-coded in d/tests/control. For rust-syn it doesn't match the
one in d/changelog.
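
A crude way to spot this class of mismatch in a checkout (a sketch only; it
assumes the upstream version string appears literally in d/tests/control, and
the paths are the usual debian/ ones):

    import re
    import sys
    from pathlib import Path

    def check(pkg_dir):
        debian = Path(pkg_dir) / "debian"
        # First changelog line looks like: rust-syn (2.0.18-1) unstable; ...
        first_line = (debian / "changelog").read_text().splitlines()[0]
        match = re.match(r"^\S+ \(([^)]+)\)", first_line)
        if not match:
            return
        upstream = match.group(1).split("-")[0]
        tests_control = (debian / "tests" / "control").read_text()
        if upstream not in tests_control:
            print(f"{pkg_dir}: {upstream} not found in d/tests/control")

    if __name__ == "__main__":
        for directory in sys.argv[1:]:
            check(directory)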

I opened an issue upstream yesterday and it has since been fixed. I think the
fix was already in a branch by then but the branch hadn't been merged; I
wasn't able to easily retrace the commit and merge history. The rust packages
are changing really quickly at the moment (I rewrote this section three times
today, but at least the comment section on GitLab updates automatically, and
does so very quickly).

Rust-syn is now not migrating but that's due to rust-versionize-derive.

## "magic"

Unlike the coq people, who have plenty of puns up their sleeves (like "flock"
or "quickchick"), I'm really bad at names. The goal is to make something that
will magically pinpoint issues, so I went with that name in order not to waste
time.

This is the project I started on Monday to get some insight into logs. It's
far from generally usable at the moment, but I've already found it very
valuable to be able to grep through logs and count failures. I've been storing
logs uncompressed in a key-value database, but a better fit would be an SQLite
database with the ability to select by package, test, arch, version and maybe
timeframe. Think: I want to know how many packages had testbed failures on
armhf compared to all other arches. In the future, I'd like to experiment with
automatic extraction of errors by comparing logs across their history.
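
To give an idea of the direction, here is roughly the shape I have in mind for
the SQLite side (a sketch, not the actual tool; the schema and column names
are just what I'm imagining right now):

    import sqlite3

    db = sqlite3.connect("logs.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS results (
            package   TEXT,
            test      TEXT,
            arch      TEXT,
            version   TEXT,
            timestamp TEXT,
            outcome   TEXT,  -- e.g. 'pass', 'fail', 'testbed-failure'
            log       TEXT   -- full log text, to grep through with LIKE
        )
    """)

    # The kind of question I want to answer: how many packages had testbed
    # failures on armhf compared to all other arches?
    query = """
        SELECT arch = 'armhf' AS is_armhf, COUNT(DISTINCT package)
        FROM results
        WHERE outcome = 'testbed-failure'
        GROUP BY is_armhf
    """
    for is_armhf, count in db.execute(query):
        print("armhf" if is_armhf else "other arches", count)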

-- 
Adrien


