Hirsute +1 Duty Report

Christian Ehrhardt christian.ehrhardt at canonical.com
Thu Mar 18 07:52:52 UTC 2021


Hi,
I'm done with my +1 week (started later, ended later - still a week) and wanted
to summarize what happened.
As usual I'm not talking about all the minor trigger-here/trigger-there,
that would be a waste of virtual ink. But some cases were more interesting
than that, and I wanted to mention those here, especially in case others
have hit the same issues or will follow up on these cases later.

TL;DR:
- unblocked libzip transition
  - fix openscad test fail
  - fix mysql FTBFS @riscv64
  - fix php-easyrdf test race
- unblocked spyder transition
  - fixed spyder-memory-* tests
  - had spyder-requests removed
- unblocked the opencascade transition
  - fix freecad OOM @ s390x
  - analyzed netgen fail in depth
    - reported the issue upstream
    - based on the analysis Locutus uploaded a test skip at armhf
- resolved iperf 2.0.14a breaking tests
  - analyzed the case affecting mininet/openvswitch
  - eventually had the new version removed from hirsute-proposed
  - upstream bugs filed to resolve this for 21.10
- resolved mininet 2.3.0 breaking tests
  - analyzed the case and uploaded a new vswitch
  - new vswitch and mininet migrated into hirsute-release now
- resolved some more uncommon build/tests issues
  - psmisc done (lost test result) and now migrated into -release
  - ddd done (toolchain issue that now is resolved on rebuild) and migrated

There are 5 more issues left from the "rebuild for fixed permission" set:
2 FTBFS and 3 test fails; someone should continue on those.
Maybe a good candidate for a further +1?

Many more details on the same below ...


#1 openscad

openscad/2021.01-1 autopkgtests never worked.
It has been bad since the first of February and up to now.
    1072 - pdfexporttest_centered (Failed)
All good tests are with the old version 2019.05-5

This package is entangled with libzip which blocks quite a bunch of others.
So unblocking this would help proposed more than just for this package.

Works fine in Debian CI:
https://ci.debian.net/data/autopkgtest/testing/amd64/o/openscad/10926296/log.gz

Checked a local VM based repro as-is and all-proposed.
Both failed.

Then followed a long strange trip which eventually led to pkgstripfiles/optipng
breaking the test data files.
See https://bugs.launchpad.net/ubuntu/+source/openscad/+bug/1918445 for more
details.

I uploaded a fix to hirsute and submitted it to Debian (a no-op for them)
to later on be able to sync it.


#2 I was pinged that bug 1915275 FTBFSing mysql on riscv64 also blocks libzip

Indeed since glibc was forced into release without resolving this issue it
made things worse.
Contrary to what is stated/assumed on the bug, it isn't just breaking the tests:
the existing mysql-8.0 on riscv64 now (with the new glibc) fails to even install
the package. Thereby essentially all of `reverse-depends --release hirsute
--build-depends src:mysql-8.0` are FTBFS on riscv64 now, and php7.4 is just one
of many - the only difference is that it now blocks the libzip transition.

After a longer debug session I found that the return value of sysconf wasn't
handled properly, so a -1 return ended up as an allocation of
unsigned long(-1) size, which then failed.
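
A minimal sketch of this failure mode, not the actual mysql patch: POSIX
sysconf() returns -1 when a value is indeterminate or on error, and converting
that unchecked to an unsigned long turns it into the largest possible size.
The names `as_unsigned_long` and `safe_sysconf`, the fallback value and the
use of SC_PAGE_SIZE in the usage lines are assumptions for illustration only;
the report does not say which sysconf value was affected.

```python
import ctypes
import os


def as_unsigned_long(value):
    """Reinterpret a Python int the way a C cast to unsigned long would."""
    return ctypes.c_ulong(value).value


def safe_sysconf(name, fallback, query=os.sysconf):
    """Return the sysconf value for name, or fallback if it is not usable.

    safe_sysconf and fallback are hypothetical names for this sketch;
    query is only injectable so the error path can be exercised.
    """
    value = query(name)
    if value < 0:
        # -1 means "indeterminate or error"; never use it as a size.
        return fallback
    return value


if __name__ == "__main__":
    # Unchecked: -1 becomes ULONG_MAX, an impossible allocation size.
    print(as_unsigned_long(-1))
    # Checked: fall back to an assumed default instead.
    print(safe_sysconf("SC_PAGE_SIZE", 4096))
```

The point is only the ordering: check for a negative return first, convert to
an unsigned size afterwards.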

I've filed a bug and submitted a fix upstream, as well as an MP for the
packaging to resolve this in Hirsute asap.

Over the weekend I verified that these builds worked and rebuilt PHP; that
aspect no longer blocks php->libzip.


#3 php-easyrdf

This is a universe package and blocks on a rebuild not triggered by our Team.
So the chances anyone looks for it without a ping were rather low.

I did a triage and saw that the new version 1.0.0-2 slipped through with a test
that aborted but was considered ok - actually it never worked.

A check with bryce showed that it wasn't a known case from the recent phpunit
activities, but also nothing that the team currently looks at.

So to unblock libzip I've taken a look at it.

It turned out to be a race, which I fixed and submitted to Debian, but also
something else that is still unclear (not reproducible in s390x canonistack,
but failing in autopkgtest).

For now a retry-frenzy seems to resolve the issue sometimes, but since all
tests are marked as "superficial" the state "neutral" is the best one can
achieve.

The underlying issue seems to be a race in the php server init and the test
trying to use it. I have added some hardening against that case and debugged it
with s390x runs of autopkgtest-infra against the PPA.
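
The general idea behind hardening against such a startup race can be sketched
as follows. This is not the change that was uploaded (that lives in the
package's test scripts); it only illustrates polling until the server accepts
connections before the first test request is sent. The function name
`wait_for_port` and its default timeouts are assumptions.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.1):
    """Poll until host:port accepts TCP connections or timeout expires.

    Returns True once a connection succeeds, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the server is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not up yet (or refused); retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A test runner would call this right after starting the server and only
proceed (or fail with a clear message) depending on the result, instead of
relying on a fixed sleep.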

I have submitted this to Debian, but since the problem isn't present there
(but could happen at any time) it isn't urgent to them and we can't wait.
So I uploaded an ubuntu1 version to unblock things in hirsute.

To have this visible for others looking at excuses I also filed:
https://bugs.launchpad.net/ubuntu/+source/php-easyrdf/+bug/1919125


#5 request-tracker5

I have seen many packages B-D on request-tracker5 but found that this would
actually be a transition and therefore it is good to be held back in -proposed
until we open for 21.10 (Then there is a build time test fail that needs to
be resolved). Do we want/need to do more to hold it back and not be fixed by
accident? Should we remove that from -proposed to avoid that and clear the
view a bit - or would this make later re-syncing harder?

I didn't reach a full conclusion on this one as other items made more progress.
If an AA has a strong opinion on "yeah remove it" then please feel
free to do so.


#6 Spyder

spyder has a new major version 4.x which causes autopkgtest fails
=>  https://launchpad.net/ubuntu/+source/spyder/4.2.1+dfsg1-3
One dependency of these packages has a new version in proposed and we need to
test against that - I've done it and it resolved.
The other one is incompatible with 4.x and removed in testing
=> https://tracker.debian.org/news/1235339/spyder-reports-removed-from-testing/
IMHO we'd want to do the same, so I pinged AAs to help me with that and after
removal the rest migrated fine.


#7 Freecad

freecad fails on the autopkgtests on s390x
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/s390x/f/freecad/20210310_103548_24da1@/log.gz
This is reproducible on canonistack.
This fails on Debian CI @s390x just as much
https://ci.debian.net/data/autopkgtest/testing/s390x/f/freecad/10997677/log.gz
Seems that the new version just isn't working fine on s390x, needs some
debugging to decide between a fix or resetting the tests (TBH cad @ s390x isn't
a really important thing).
I found Brian has assessed the same:
https://bugs.launchpad.net/ubuntu/+source/freecad/+bug/1918474
The obvious thought is "mark it as big test" but I at least wanted
the confirmation that it then would work, so I spawned a few s390x Hirsute
guests of different sizes.

This is also tied into the opencascade migration which I looked at next.

A day later Debian already accepted my changes and uploaded this together with
an upstream fix release - this LGTM and didn't need an FFe so we can sync this
to unblock the issue at hand.


#8 Opencascade / netgen

After unblocking freecad I found that it was entangled with opencascade.
And other than freecad it was also blocked on netgen that had a build fail
on armhf.

There already was a bug report about it under
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=984439
But not much progress on a real solution to it.

Also I'm not 100% sure that the Debian report I've found is indeed
the very same issue we face in our builds atm:
   test_pickling.py Fatal Python error: Bus error
   ...
   Bus error (core dumped)

In Debian the error does not happen
  https://buildd.debian.org/status/package.php?p=netgen
So maybe this is a new case caused by glibc/gcc/... being newer?
The last successful build was in November 2020, there the toolchain was
quite different than today.
OTOH the most recent upload has a change like "[5426125] Fix running tests"
and tests are what breaks, so maybe it wasn't run before at all.

In arm64 canonistack + armhf LXD this reproduces just fine.
$ export PYTHONPATH="$PYTHONPATH:/root/netgen-6.2.2006+really6.2.1905+dfsg/debian/tmp/usr/lib/python3/dist-packages"
$ apt install python3-tk python3-numpy
$ cd ~/netgen-6.2.2006+really6.2.1905+dfsg/tests/pytest
$ LD_LIBRARY_PATH=/root/netgen-6.2.2006+really6.2.1905+dfsg/debian/tmp/usr/lib/$DEB_HOST_MULTIARCH \
    python3 -m pytest -k test_pickling -s
...
test_pickling.py Bus error (core dumped)

The other tests pass
test_pickling.py::test_pickle_stl PASSED
test_pickling.py::test_pickle_occ PASSED
test_pickling.py::test_pickle_geom2d PASSED
test_pickling.py::test_pickle_mesh PASSED

Just test_pickle_csg fails.
And in this test the failing line is:
  geo_dump = pickle.dumps(geo)
With geo being
  <netgen.libngpy._csg.CSGeometry object at 0xf6da99b0>

Running that with python3-dbg and loading the core file into gdb shows the
crash during pickling deep in netgen's code (which is better than a generic
pickling issue I guess)

#0  0xf659c99e in ngcore::BinaryOutArchive::Write<double>
(x=10000000000, this=0xffa90cc4) at
./libsrc/stlgeom/../general/../core/archive.hpp:732
#1  ngcore::BinaryOutArchive::operator& (this=0xffa90cc4,
d=@0x26aa6d8: 10000000000) at
./libsrc/stlgeom/../general/../core/archive.hpp:681
#2  0xf641d4de in netgen::Surface::DoArchive (archive=...,
this=0x26aa6d0) at ./libsrc/csg/surface.hpp:68
#3  netgen::OneSurfacePrimitive::DoArchive (archive=...,
this=0x26aa6d0) at ./libsrc/csg/surface.hpp:344
#4  netgen::QuadraticSurface::DoArchive (this=0x26aa6d0, ar=...) at
./libsrc/csg/algprim.hpp:52
#5  0xf641dc00 in netgen::Sphere::DoArchive (this=0x26aa6d0, ar=...)
at ./libsrc/csg/algprim.hpp:151
#6  0xf6434c28 in ngcore::Archive::operator&<netgen::Surface, void>
(val=..., this=0xffa90cc4) at
./libsrc/csg/../general/../core/archive.hpp:307
#7  ngcore::Archive::operator&<netgen::Surface>
(this=this@entry=0xffa90cc4, p=@0x2727718: 0x26aa6d0) at
./libsrc/csg/../general/../core/archive.hpp:490
#8  0xf6430dca in ngcore::Archive::Do<netgen::Surface*, void>
(n=<optimized out>, data=<optimized out>, this=0xffa90cc4) at
./libsrc/csg/../general/../core/archive.hpp:280
#9  ngcore::Archive::operator&<netgen::Surface*> (v=std::vector of
length 32, capacity 32 = {...}, this=0xffa90cc4) at
./libsrc/csg/../general/../core/archive.hpp:209
#10 ngcore::SymbolTable<netgen::Surface*>::DoArchive<netgen::Surface*>
(ar=..., this=0x2843c64) at
./libsrc/csg/../general/../core/symboltable.hpp:44
#11 ngcore::Archive::operator&<ngcore::SymbolTable<netgen::Surface*>,
void> (val=..., this=0xffa90cc4) at
./libsrc/csg/../general/../core/archive.hpp:307
#12 netgen::CSGeometry::DoArchive (this=0x2843c60, archive=...) at
./libsrc/csg/csgeom.cpp:329
#13 0xf648a958 in ngcore::Archive::operator&<netgen::CSGeometry, void>
(val=..., this=0xffa90cc4) at
./libsrc/csg/../general/../core/archive.hpp:305
#14 ngcore::Archive::operator&<netgen::CSGeometry>
(this=this@entry=0xffa90cc4, p=@0xffa90ba4: 0x2843c60) at
./libsrc/csg/../general/../core/archive.hpp:518
#15 0xf64a4218 in ngcore::NGSPickle<netgen::CSGeometry,
ngcore::BinaryOutArchive,
ngcore::BinaryInArchive>()::{lambda(netgen::CSGeometry*)#1}::operator()(netgen::CSGeometry*)
const (
    self=<optimized out>, this=<optimized out>) at
/usr/include/pybind11/pytypes.h:199
...

./libsrc/stlgeom/../general/../core/archive.hpp:732 is
    *reinterpret_cast<T*>(&buffer[ptr]) = x; // NOLINT

With:
(gdb) p &buffer
$5 = (std::array<char, 1024> *) 0xffa90d40
(gdb) p ptr
$3 = 1

With ptr = 1 the store goes to &buffer[1], i.e. 0xffa90d41 (that is what gdb
computes), which is not aligned for a double. Depending on how the real code
(not gdb) performs this access, that unaligned write might explain the SIGBUS.
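
The address arithmetic can be checked with a few lines, using only the gdb
values shown above. This is a sketch of the reasoning, not a diagnosis:
whether the CPU really traps depends on the store instruction the compiler
chose. The second half shows a byte-wise copy into the buffer, which is the
Python analogue of a memcpy-based write; that such a change would be the
right fix for netgen is an assumption, not something confirmed upstream.

```python
import struct

BUFFER_ADDR = 0xFFA90D40  # address of `buffer` as printed by gdb above
PTR = 1                   # value of `ptr` as printed by gdb above
DOUBLE_SIZE = struct.calcsize("=d")  # 8 bytes


def is_aligned(addr, alignment):
    """True if addr is a multiple of alignment."""
    return addr % alignment == 0


target = BUFFER_ADDR + PTR
# prints: 0xffa90d41 False False
print(hex(target), is_aligned(target, DOUBLE_SIZE), is_aligned(target, 4))

# Byte-wise copy into the buffer works at any offset; this mirrors what a
# memcpy-based write would do in C++ instead of the reinterpret_cast store.
buffer = bytearray(1024)
struct.pack_into("=d", buffer, PTR, 10000000000.0)
restored = struct.unpack_from("=d", buffer, PTR)[0]
```

So the target address is neither 8-byte nor 4-byte aligned, while the buffer
itself starts at an aligned address.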

Debugging this deeper without context knowledge will be messy.
Maybe I can identify a related toolchain issue or workaround.
So I built it with gcc-9 and gcc-11 (as it worked in November),
but both builds behaved the same way.
I checked the older builds, they just worked because they didn't run the tests.
So it was broken all along but now is an FTBFS.

I'm a bit lost here, I doubt I'll be very effective going deeper into the
hpp code that defines this. Instead I think I have collected quite some logs
and insights and filed an upstream bug (to discuss/resolve this) as well as
a launchpad bug so that it is visible as an update-excuse.
=> https://bugs.launchpad.net/ubuntu/+source/netgen/+bug/1919335
=> https://github.com/NGSolve/netgen/issues/89

There was no response on this in the days that I was actively on +1 duty,
anyone looking at the same case later is recommended to take a look at the
current state of these bugs/discussions.

A day later Locutus (Thanks!) also looked at the same and agreed; to resolve
the issue for now he changed the armhf build to run "dh_auto_test || true".
So opencascade will resolve once all that is complete.


#9 openvswitch test fails

I'm a bit familiar with this but OTOH it is nothing I'd usually look after
(unless I did an upload) as this is mostly at home with the openstack team,
but seeing things block on it for 28 and 136 days indicates this will stay
broken unless someone takes a look.

Uploads of a new `mininet` as well as a new `iperf` fail to test against this
on all architectures.

It seems a few people already hit retry on this, but there was no bug or
documentation about it yet. It has three sub-tests of which only one called
"vanilla" breaks. The new mininet makes it "fail" and iperf makes it time out.

I was retrying this on autopkgtest-infra (no queues atm) and in a local VM
once with and once without the new packages for further debugging.

Issues:
- with all-proposed there are python2 issues (the tests call python2)
- with the new iperf
- with the new mininet
- as-is it fails with RTNETLINK answers: File existsrface pair (s1-eth1,h1-eth0)

Solution:
- mininet 2.3 switched from py2 only to py3 only
  - adapt test dependencies and python calls in d/t/*
  - that also resolved the "existsrface pair" issues
- iperf 2.0.14 is actually an alpha of 2.1 and has massive changes
  - It has been in proposed long enough for the FF, but 2.1.1 would maybe
    be better
  - neither the new nor the old mininet is compatible with this yet
  - I'd hate to have the new iperf enter hirsute now and break all kinds
    of automated tests where it is often used.
  - IMHO this iperf build should be removed; in 21.10 it gets a new chance,
    even better as 2.1 or 2.1.1 then

I debugged the iperf/mininet incompatibility a bit and filed a bug to resolve
that mid-term.
=> https://github.com/mininet/mininet/issues/1060
For Hirsute I filed a removal bug for the new iperf version
=> https://bugs.launchpad.net/ubuntu/+source/iperf/+bug/1919432
And for openvswitch/mininet I opened an MP that will resolve it
=> https://code.launchpad.net/~paelzer/ubuntu/+source/openvswitch/+git/openvswitch/+merge/399771
   James Page reviewed, merged and uploaded that

Some builds and test re-triggers later all those were resolved.
The new openvswitch tested fine and migrated into hirsute-release, then a test
retrigger later also mininet was working and ready.

Furthermore there were some follow ups in the upstream discussion. It seems
that in 21.10 the (then) newer mininet should be compatible with the new iperf.


#10 psmisc

Since this is such a core package I was surprised that it had been stuck in
excuses for 28 days already. I found that it had "lost" a test on armhf. It
hadn't failed or passed, the result just did not exist, and from britney's POV
it was still waiting for the test result.
I guess we'd want that new (minor) upstream release in Hirsute so I had a look.
After re-triggering, the test succeeded and this became ready to migrate.


#11 ddd

This had build errors, but actually is an important rebuild from the hiccup
that had created wrong permissions in built packages. Luckily this was a
toolchain issue back then and is now resolved.
=> https://launchpad.net/ubuntu/+source/ddd/1:3.3.12-5.3build1
It might be worth noting that there are a few others left from that
rebuild-burst that are still stuck in one way or another:
FTBFS:
- https://launchpad.net/ubuntu/+source/clisp/1:2.49.20180218+really2.49.92-3build5
- https://launchpad.net/ubuntu/+source/nng/1.4.0-1build1
Test fails:
- https://launchpad.net/ubuntu/+source/gnome-activity-journal/1.0.0-3build1
- https://launchpad.net/ubuntu/+source/php-apcu/5.1.19+4.0.11-3build1
- https://launchpad.net/ubuntu/+source/ruby-httpclient/2.8.3-2build1
I was out of time, but I'd guess that someone should also look after those?


-- 
Christian Ehrhardt
Staff Engineer, Ubuntu Server
Canonical Ltd


