Ongoing autopkgtest-cloud armhf maintenance

Julian Andres Klode julian.klode at canonical.com
Wed Jan 12 16:48:04 UTC 2022


Hi all,

We discovered some issues that we are currently addressing. This
is mostly a mental note for me to keep track of where I am leaving
things today, but it might be interesting for others to read.

First, we found that, of the 11 lxd armhf workers:

- 1 corrupted its lxd database after running out of disk space
  (lxd-armhf10)
- 1 has disk I/O errors (lxd-armhf9)
- several have timeouts during stop
- several ran out of space

We should be running at 27/33 workers now.

# We have identified that

- While the lxd storage pool is on a btrfs file system, it was
  configured with the 'dir' driver, meaning that deleting instances
  involves a costly recursive file tree deletion rather than a fast
  subvolume deletion (see the first sketch after this list)

- Several instances were stuck; there were even quite a few stopped
  ephemeral ones, which should have been deleted automatically on stop

- While trying to rebuild the node with the broken db, we noticed
  that armhf default images are not available for jammy and impish
  on the images: lxd remote (see the second sketch after this list)

- The storage pool used by the containers and /var/snap/lxd/common
  are allocated on the same file system, allowing tests to DoS the
  infrastructure by filling up that file system (which can lead to
  corruption of the lxd sqlite database)

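For reference, the per-worker conversion is roughly the following
(only a sketch; the pool name, the source path under /srv and the use
of the 'default' profile are assumptions about our setup, and the
exact lxc syntax can vary between lxd versions):

    # create a btrfs-backed pool on the existing btrfs file system and
    # point the profile's root disk at it for new instances
    $ lxc storage create autopkgtest btrfs source=/srv/lxd-pool
    $ lxc profile device set default root pool=autopkgtest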

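And the quick check that shows the image gap, roughly (a sketch; the
filter syntax is from memory):

    # armhf entries for jammy and impish are missing from this output
    $ lxc image list images: ubuntu/jammy architecture=armhf
    $ lxc image list images: ubuntu/impish architecture=armhf
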
# Steps undertaken so far

- I have replaced lxc delete -f with lxc stop --force --timeout -1
  in the hope that it will not hit a timeout; that might be futile,
  though (see the sketch after this list)

- I changed the default storage pool for new instances to use the btrfs
  driver instead of the dir one.

- I shut down the entire lxd cloud for a bit and then:
  + cleaned up all leftover instances
  + converted all workers to use btrfs storage backends instead of 'dir'
    ones
  + rebooted them all

- stgraber is investigating the missing images on
  images.linuxcontainers.org
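For completeness, the stop/delete sequence now looks roughly like this
(a sketch; the instance name is made up, and the trailing plain delete
is my assumption of how the cleanup finishes):

    # force-stop with an unlimited timeout instead of the default 30s,
    # then delete the stopped instance
    $ lxc stop --force --timeout -1 autopkgtest-jammy-armhf-1
    $ lxc delete autopkgtest-jammy-armhf-1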

# Pending work

- Move /var/snap/lxd/common out of /srv (where the lxd storage pool
  lives); this will likely require slightly increasing the '/' disk
  size.

- Investigate further where the 30s timeout in lxd comes from and how
  to prevent it (or just ignore it, but see the next item)

- Investigate where the stuck instances came from and why they were
  not cleaned up. Is it possible for us to check which instances
  should be running and then remove all other ones from the workers?
  Right now we just do a basic time-based check (see the sketch after
  this list)

- The node lxd-armhf10 needs to finish its redeployment once the
  lxd images exist again

- The node lxd-armhf9 needs to be redeployed to solve the disk I/O
  issue

- Both lxd-armhf10 and lxd-armhf9 will have to be re-enabled with
  the new IPs in the mojo service bundle

- We should really redeploy all the lxd workers to have clean workers
  again
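One possible shape for that reconciliation (just a sketch; the
expected-instances file is a made-up stand-in for whatever record of
started instances the workers can provide):

    # delete every instance lxd knows about that no worker claims to
    # have started
    $ lxc list --format csv -c n \
        | grep -vxF -f /run/autopkgtest/expected-instances \
        | xargs -r -n1 lxc delete --force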

# Other notes

It would probably be nicer to use the ubuntu-daily: remote instead of
images:, so that we use official images. However, we only really need
a system with ubuntu-minimal installed, and we would need to disable
and remove anything else, like cloud-init and snapd, on those images.
It would be nice to have something like ubuntu-daily:focal/minimal
which just has ubuntu-minimal installed.
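The manual equivalent would be roughly the following (a sketch; I have
not checked that the armhf alias exists on the ubuntu-daily: remote,
and the purge list is probably incomplete):

    # launch an official daily image and strip what we do not need
    $ lxc launch ubuntu-daily:jammy/armhf jammy-armhf-test
    $ lxc exec jammy-armhf-test -- apt-get purge -y snapd cloud-init
    $ lxc exec jammy-armhf-test -- apt-get autoremove --purge -y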

We should investigate resource limits for individual test containers;
it does not make much sense that they can use all resources and hence
compete strongly with each other (I guess we can't limit disk space
usage, but RAM and cores would be a start).
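For example, something along these lines on the profile the test
containers use (a sketch; the profile name and the concrete numbers
are made up):

    # cap each test container at 2 CPUs and 4 GiB of RAM
    $ lxc profile set autopkgtest limits.cpu 2
    $ lxc profile set autopkgtest limits.memory 4GiB
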
-- 
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer                              i speak de, en


