Ongoing autopkgtest-cloud armhf maintenance

Julian Andres Klode julian.klode at canonical.com
Thu Jan 13 11:19:56 UTC 2022


We are now operating at full capacity again. It also turns out we
have 12 workers rather than 11, so anywhere I said 33 is
actually 36 :)

Some items remain TBD, but the rest is done and has gotten us back
on our feet again:

On Wed, Jan 12, 2022 at 05:48:04PM +0100, Julian Andres Klode wrote:
> 
> # Pending work
> 
> - Move /var/snap/lxd/common out of /srv (where the lxd storage pool lives);
>   this will likely require slightly increasing the '/' disk size.
> 
> - Investigate further where the 30s timeout in lxd comes from and how
>   to prevent it (or just ignore it, but see the next item)

2x TBD

> 
> - Investigate where the stuck instances came from and why they were not
>   cleaned up. Is it possible for us to check which instances should be
>   running and then remove all other ones from the workers? Right now
>   we just do a basic time check

There were no errors logged. I saw mentions of exit code -15 (SIGTERM),
but nothing concrete.

But we now have a new cleanup step that keeps only as many containers
as needed and deletes everything else that is older than 1 hour.
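For the curious, the cleanup is conceptually something like the sketch
below. This is only an illustration of the idea, not the actual
autopkgtest-cloud code: it assumes the worker drives LXD through the
`lxc` CLI, and the names MAX_CONTAINERS and MAX_AGE are made up for the
example.

#!/usr/bin/env python3
# Illustrative sketch only -- not the actual autopkgtest-cloud code.
# Assumes LXD is driven via the `lxc` CLI; MAX_CONTAINERS and MAX_AGE
# are hypothetical names invented for this example.
import json
import subprocess
from datetime import datetime, timedelta, timezone

MAX_CONTAINERS = 3            # hypothetical "as many as needed" limit
MAX_AGE = timedelta(hours=1)  # beyond the limit, anything older than this goes

def list_instances():
    # `lxc list --format json` prints one record per instance,
    # including its name and creation timestamp.
    out = subprocess.check_output(["lxc", "list", "--format", "json"])
    return json.loads(out)

def cleanup():
    now = datetime.now(timezone.utc)
    # Timestamps in a consistent ISO format sort chronologically as
    # strings; newest instances end up last.
    instances = sorted(list_instances(), key=lambda i: i["created_at"])
    # Keep the newest MAX_CONTAINERS instances, consider the rest.
    candidates = instances[:-MAX_CONTAINERS] if MAX_CONTAINERS else instances
    for inst in candidates:
        # Truncate to second precision and treat as UTC to keep the
        # sketch simple; good enough for an "older than 1 hour" check.
        created = datetime.strptime(
            inst["created_at"][:19], "%Y-%m-%dT%H:%M:%S"
        ).replace(tzinfo=timezone.utc)
        if now - created > MAX_AGE:
            subprocess.run(["lxc", "delete", "--force", inst["name"]],
                           check=True)

if __name__ == "__main__":
    cleanup()

The real limit is "as many containers as the worker needs", as described
above; the point here is just the keep-the-needed-ones,
delete-anything-else-older-than-an-hour logic.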

> 
> - The node lxd-armhf10 needs to finish its redeployment once the
>   lxd images exist again
> 
> - The node lxd-armhf9 needs to be redeployed to solve the disk I/O
>   issue
> 
> - Both lxd-armhf10 and lxd-armhf9 will have to be re-enabled with
>   the new IPs in the mojo service bundle

Those three redeployments have happened.

> 
> - We should really redeploy all the lxd workers to have clean workers
>   again

TBD; we still need to figure out the partitioning for
/var/snap/lxd/common, but it does not seem urgent right now.
-- 
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer                              i speak de, en


