[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Jason Hobbs
jason.hobbs at canonical.com
Thu Feb 1 20:49:42 UTC 2018
I also collected iotop output from the same run:
http://paste.ubuntu.com/26502363/
The storage setup on these nodes is writethrough bcache with a 400 GB
NVMe device in front of a 1 TB spinning disk. Since it's writethrough,
writes have to reach the spinning disk before they count as synced.
The write numbers look high for random I/O on a spinning disk. It seems
possible that the slow MAAS performance is due to PostgreSQL waiting for
writes to disk to complete, with MAAS threads blocking on those commits,
so that servicing DB reads has to wait for the commits to finish first.
The VMs running on the machine are using this same bcache setup for
their storage pool. It looks like most of the disk write traffic is
coming from the VMs.
Based on this data we'll make two changes to our setup, which I think
should help alleviate this problem:
- move the VMs' storage pool to a separate disk.
- change the storage setup to use writeback bcache.
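
For reference, here is a rough sketch of how the second change could be
checked and applied through the bcache sysfs interface. It assumes the
backing device shows up as bcache0, which is a guess on my part and not
taken from these nodes; the helper names are made up for illustration.
Reading cache_mode lists all modes with the active one in brackets, and
writing a mode name to the same file switches to it (needs root).

```python
from pathlib import Path

# Assumption: the bcache device is bcache0; the real device name on
# these nodes is not stated anywhere in this bug.
CACHE_MODE_PATH = Path("/sys/block/bcache0/bcache/cache_mode")


def active_cache_mode(sysfs_text: str) -> str:
    """Return the active mode, which sysfs marks with square brackets.

    Example input: "[writethrough] writeback writearound none"
    """
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode marked in {sysfs_text!r}")


def set_writeback(path: Path = CACHE_MODE_PATH) -> str:
    """Switch to writeback if needed and return the resulting active mode."""
    if active_cache_mode(path.read_text()) != "writeback":
        # Writing the bare mode name is how the mode is changed via sysfs.
        path.write_text("writeback\n")
    return active_cache_mode(path.read_text())
```

Note that writeback means a write is acknowledged once it is on the
cache device, so dirty data lives only on the NVMe until it is flushed
to the spinning disk.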
** Attachment added: "iotop.txt.gz"
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047065/+files/iotop.txt.gz
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1743249
Title:
Failed Deployment after timeout trying to retrieve grub cfg
Status in MAAS:
Incomplete
Status in grub2 package in Ubuntu:
New
Bug description:
A node failed to deploy after it failed to retrieve a grub.cfg from
MAAS due to a timeout. In the logs, it's clear that the server tried
to retrieve the grub cfg many times, over about 30 seconds:
http://paste.ubuntu.com/26387256/
We see the same thing for other hosts around the same time:
http://paste.ubuntu.com/26387262/
It seems like MAAS is taking way too long to respond to these
requests.
This is very similar to bug 1724677, which was happening
pre-Meltdown/Spectre. The only difference is we don't see "[critical] TFTP
back-end failed" in the logs anymore.
I connected to the console on this system and it had errors about
timing out retrieving the grub.cfg, then it had an error message along
the lines of "error not an ip" and then "double free". After I
connected, but before I could get a screenshot, the system rebooted and
was directed by MAAS to power off, which it did successfully after
booting to Linux.
Full logs are available here:
https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions