adsys SRU

Wed Jun 14 07:31:42 UTC 2023

Hey Chris, let me chime in.

On 14/06/2023 at 08:26, Christopher James Halse Rogers wrote:
> There's a Jammy/Lunar adsys SRU¹ in the queue at the moment, and I 
> think it needs bringing up to the list for discussion.
>
> The changelog looks like approximately 9 months of normal feature 
> development. The diff against Jammy is >3MB in size (due largely to 
> significant vendored-dependency churn it seems). The relevant part of 
> SRU policy - “Other safe cases”² - allowing feature addition, says “If 
> existing software needs to be modified to make use of the new feature, 
> it must be demonstrated that these changes are unintrusive, have a 
> minimal regression potential, and have been tested properly”. It looks 
> like adsys is well tested, but I'm not sure about these being minimal 
> changes or with minimal regression potential ☺.
>
> It's true that we've done a wholesale backport of adsys 0.9.2³ to 
> Jammy in the past; however, in that case the changes were mostly 
> listed bugfixes or FTBFS fixes, and the feature addition was shipping 
> a *Windows* binary.
>
> I'm writing this to ubuntu-release@ for two main reasons:
>
> 1. It seems valuable to include adsys updates in LTS releases; 
> however, I'm not sure that the scope of changes (and seeming 
> criticality of the system - “failures might prevent users from logging 
> in” seems pretty bad) falls under the existing delegation of power 
> from the Tech Board to the SRU team.

Unfortunately, like many projects, we face a constant tension between 
requests to backport new features (adsys, being an enterprise product, 
only really makes sense in an LTS context) and bug fixes. Most of the 
new features are driven by industry requirements, such as:
- the evolution of their own security practices (for instance, 
certificate support)
- requests to support other platforms (winbind in addition to the 
already-existing sssd)

Our team capacity is very limited, already maxed out and split between 
many projects on different themes. The only way for us to support adsys 
well while meeting the two requirements above is to maintain a single 
code base, meaning we ship the same code base in all supported 
releases. As most of the dependencies are vendored (apart from some 
limited dynamic C linking or dependencies on samba/sssd, for instance), 
we are in control of what we ship and know exactly what our quality 
baseline is (more details on that in the next paragraphs).

> 2. There's a *lot* of vendored code churn, and from the SRU 
> perspective I have no information as to whether that's appropriate. I 
> understand that the Go ecosystem does not follow our ideas of stable 
> releases and there's a real tension here - it's a huge amount of work 
> to vet dependency updates, and such updates are *likely* to include 
> bug fixes. I don't think “we just update all our vendored dependencies 
> each SRU to whatever upstream is most recently shipping” is an 
> appropriate standard, though. I'm not sure what *is* the right 
> balance, though.

Right, but also, you need to take into consideration the following:

- as we are vendoring dependencies, which was accepted as part of the 
MIR process, we, as upstream, take responsibility towards the security 
team for handling security fixes inside those dependencies. Most of the 
security fixes in the various dependencies come only with a new 
upstream "release" (even if, in the Go ecosystem, this is mostly a tag). 
FYI, the Rust ecosystem follows the same pattern and the vendoring 
exception is allowed for it too.
- as we took on that responsibility of vendoring dependencies and 
updating them, we need to do that work as part of the SRU process too.
- however, due to the very, very limited team capacity mentioned above, 
we need to pick our battles, and supporting a "single code base" 
(including vendored dependencies) is the only way we can go.

So, with a diff of that size, how do we ensure that we ship something 
we trust and that is not impacted by any kind of regression?

1. This can only be done by automated tests.
As of today, I count 1557 automated tests on the adsys repository alone. 
Those are unit/package/integration tests, using golden files to record 
exactly the expected output of each test on the file system: 
https://github.com/ubuntu/adsys/tree/main/cmd/adsysd/integration_tests/testdata/TestPolicyUpdate/golden/current_user%2C_first_time.
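
For readers not familiar with golden files, here is a minimal sketch of 
the general technique in Go. To be clear, this is not the adsys code: 
compareWithGolden, demo and the file names are made up for 
illustration, and the real test helpers may well be structured 
differently.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// compareWithGolden is a hypothetical helper (not the adsys one).
// It compares got with the content stored at goldenPath. When update is
// true, it (re)writes the golden file instead, which is how golden files
// are usually refreshed after an intended behaviour change.
func compareWithGolden(goldenPath string, got []byte, update bool) error {
	if update {
		if err := os.MkdirAll(filepath.Dir(goldenPath), 0o755); err != nil {
			return err
		}
		return os.WriteFile(goldenPath, got, 0o644)
	}
	want, err := os.ReadFile(goldenPath)
	if err != nil {
		return err
	}
	if !bytes.Equal(want, got) {
		return fmt.Errorf("golden mismatch for %s:\nwant: %q\ngot:  %q", goldenPath, want, got)
	}
	return nil
}

// demo records a golden file in a temporary directory, then checks that
// identical output matches and that modified output is reported.
func demo() error {
	dir, err := os.MkdirTemp("", "golden-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	golden := filepath.Join(dir, "testdata", "golden", "policy")
	// A first run with update=true records the expected output.
	if err := compareWithGolden(golden, []byte("key=value\n"), true); err != nil {
		return err
	}
	// The same output must match the recorded golden file.
	if err := compareWithGolden(golden, []byte("key=value\n"), false); err != nil {
		return err
	}
	// Any change in output must be reported as a mismatch.
	if err := compareWithGolden(golden, []byte("key=other\n"), false); err == nil {
		return errors.New("expected a mismatch to be reported")
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("demo failed:", err)
		os.Exit(1)
	}
	fmt.Println("golden demo passed")
}
```

The point of the pattern is that any behaviour change, including one 
caused by a vendored dependency update, shows up as a reviewable diff 
in the golden tree.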

All of those are run on our CI against the exact same versions of the 
vendored dependencies and of Go that the package will be built against 
in the distro, even when we automatically update one of the vendored 
dependencies: https://github.com/ubuntu/adsys/actions/runs/5257398861

We run those tests with **and** without the built-in Go race detector. 
Also, we are testing untrusted inputs (like the Windows Active Directory 
GPO UTF-16 little-endian input) with fuzz testing, and we have already 
fixed some crashes found this way, like 
https://github.com/ubuntu/adsys/pull/333.
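
To illustrate what fuzzing this kind of input checks, here is a rough 
sketch, assuming a simple stand-alone decoder. decodeUTF16LE and 
checkDecode are hypothetical names, not the adsys parser; the sketch 
only shows the sort of invariant a fuzz target would assert on 
arbitrary bytes: no panic, and valid UTF-8 whenever decoding succeeds.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// decodeUTF16LE is a hypothetical decoder for untrusted UTF-16
// little-endian bytes. Odd-length input is rejected; invalid surrogates
// are replaced with U+FFFD by utf16.Decode.
func decodeUTF16LE(b []byte) (string, error) {
	if len(b)%2 != 0 {
		return "", errors.New("odd number of bytes in UTF-16 input")
	}
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i < len(b); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(b[i:i+2]))
	}
	return string(utf16.Decode(units)), nil
}

// checkDecode is the property a fuzz target would verify for any input.
// Returning an error for malformed input is acceptable; panicking or
// producing invalid UTF-8 is not.
func checkDecode(data []byte) error {
	s, err := decodeUTF16LE(data)
	if err != nil {
		return nil
	}
	if !utf8.ValidString(s) {
		return fmt.Errorf("decoded string is not valid UTF-8: %q", s)
	}
	return nil
}

// In a _test.go file, the property could be wired into Go's native
// fuzzing roughly like this (run with `go test -fuzz=FuzzDecode`):
//
//	func FuzzDecode(f *testing.F) {
//		f.Add([]byte{'h', 0, 'i', 0})
//		f.Fuzz(func(t *testing.T, data []byte) {
//			if err := checkDecode(data); err != nil {
//				t.Fatal(err)
//			}
//		})
//	}

func main() {
	s, err := decodeUTF16LE([]byte{'h', 0, 'i', 0})
	fmt.Println(s, err)
	// A lone high surrogate must not break the invariant.
	fmt.Println(checkDecode([]byte{0x00, 0xD8}))
}
```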

2. All the changes are reviewed by a peer (or developed in pair 
programming sessions), which ensures that everything that goes in is 
carefully tested and reviewed.

The only gap I can identify right now is in the end-to-end tests:
- The Windows AD controller may change in a way that impacts us (quite 
unlikely, as Active Directory is decades old and doesn’t seem to 
undergo major changes anymore).
- Samba/sssd/kerberos can change from one version of Ubuntu to another 
and impact us, as we reuse part of their output as fixtures.

We are covering this with - unfortunately - manual end-to-end testing 
for every SRU or upload to the current development version. We are 
aiming (and have a Jira Epic we drafted this cycle) to automate that. 
It’s a complex environment: we need some Windows servers alongside our 
Ubuntu machines, and those end-to-end tests need to reboot our machines 
multiple times, change some configuration on the Windows side and check 
that it is reflected on the Ubuntu one, and so on.

This is why we cover that part with manual testing as a stopgap 
solution, which ensures that third-party, non-vendorable components of 
the system are still functioning correctly. However, it doesn't protect 
against the opposite: an upload of samba breaking us. This happened in 
the development version, for instance, when 10-year-old vendored 
heimdal code in samba was updated in one shot in the lunar development 
release. Good luck finding the regression among thousands of commits! 
We lost hours on this. So, IMHO, updating vendored dependencies as early 
as possible, as we do in adsys, helps reduce this issue rather than 
amplify it, as in the samba case. This is why we need automated 
end-to-end tests, to ship with even more confidence and less manual 
intervention, but this also requires networking between multiple OSes 
and machines, and we need autopkgtest enhancements for this.

I think that should shed some light on how we ensure a high quality 
level. This project is shipped and used in different enterprise 
environments, including by big names, and I can say that, compared to 
that volume of usage, the number of bugs reported is low (most of them 
are either feature requests or gardening work opened by us to keep our 
code base modern: https://github.com/ubuntu/adsys/issues and 
https://bugs.launchpad.net/ubuntu/+source/adsys). Even after major SRUs 
like the one you mentioned, we haven’t had to do emergency fixes. This 
gives us trust and confidence that our coding practices and processes 
support us in delivering high-quality software despite all the 
constraints I mentioned above.

As a more general topic, I don’t think the SRU team (like the MIR team) 
is in a position, in terms of time (not being a full-time team) or even 
knowledge, to really understand every diff entering the distribution. 
(I have the same opinion about the distro freeze, when the release team 
reviews each diff.) So, I see those teams' roles as being more about 
assessing the impact/risk of a change and how much trust there is in 
upstream to be proactive in terms of quality and reactive in terms of 
any issue that arises.

> So, in summary: I have two questions - does this exceed SRU authority, 
> and need Tech Board approval, and what level of justification is there 
> for wide ranging vendored code updates in the SRU?.

I think one way forward would be for adsys to be filed under the 
documented Special Cases, with all the information above, and thus join 
the list of packages for which upstream is trusted and held accountable 
for the SRU? 
https://wiki.ubuntu.com/StableReleaseUpdates#Documentation_for_Special_Cases

Thanks for considering it,
Didier