[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

Andres Rodriguez andreserl at ubuntu-pe.org
Tue Feb 6 16:40:30 UTC 2018


On Tue, Feb 6, 2018 at 11:24 AM, Jason Hobbs <jason.hobbs at canonical.com>
wrote:

> On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez
> <andreserl at ubuntu-pe.org> wrote:
> > I think there's a misunderstanding about how the network boot process
> > happens:
> > Let's look at pxelinux first. pxelinux does this:
> >
> > 1. Tries UUID first # if no answer, it moves on
> > 2. Tries MAC # if no answer, it moves on
> > 3. Tries full IP address # if no answer, it moves on
> > 4. Tries partial IP address # if no answer, it moves on
> > 5. Repeats step 4 with a shorter partial IP address
> > 6. Repeats step 4 again
> > [...]
> > 7. Boots default.
> >
> > This can be seen in here:
> >
> > /mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
> > /mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
> > /mybootdir/pxelinux.cfg/C0A8025B
> > /mybootdir/pxelinux.cfg/C0A8025
> > /mybootdir/pxelinux.cfg/C0A802
> > /mybootdir/pxelinux.cfg/C0A80
> > /mybootdir/pxelinux.cfg/C0A8
> > /mybootdir/pxelinux.cfg/C0A
> > /mybootdir/pxelinux.cfg/C0
> > /mybootdir/pxelinux.cfg/C
> > /mybootdir/pxelinux.cfg/default
> >
> >
> > That said, in the case of grub, the behavior is similar. You
> > described it in comment #16. So what is happening is:
> >
> > 1. grub tries grub.cfg-<mac> multiple times, but since it doesn't
> > get a response, it gives up.
> > 2. Once it gives up, grub.cfg-default-amd64 is tried instead.
> >
> > That said, the two requests are handled completely differently. The
> > -<mac> request actually accesses the *node* object in the database,
> > looking it up by the MAC address the request came from. With this node
> > object, we generate the config file.
> >
> > In comparison, the -default-amd64 request does *not* access the node
> > object. It just reads two config settings, and the db query is *much*
> > cheaper. Also, we have to keep in mind that after grub has done many
> > retries, this request returns rather fast in comparison, not only
> > because it is cheaper, but because at that point MAAS may have far
> > fewer queued DB requests. Either way, grub giving up means that it no
> > longer expects a response to the initial request, but it does expect a
> > response for the new file it asked for.
> >
> > That said, this is working *exactly* as expected, because this
> > effectively tells grub "if config for your MAC address was not
> > returned, you can safely assume you are an unknown machine to MAAS",
> > hence grub requests a different config file to start the enlistment
> > process.
>
> Except it's not an unknown machine, and MAAS treating it like one is
> bad behavior and a bug.


> This is not "working exactly as expected".  "Working exactly as
> expected" would be my machine being deployed when I asked for it to
> be.
>

Yes, it is not an unknown machine, but that doesn't change the fact that
this is working as designed. If the client doesn't get a response to the
request it makes, decides to move on, and makes a different request, then
it is working as designed. Again, the bug here is not in the client's
behavior; the bug is that the response is not sent in a timely manner.
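
To make the client side concrete, here is a minimal sketch of that
fallback. It is a toy model, not grub's code: the retry count, the
timeout, the latency numbers and every function name are assumptions
chosen for illustration, and "grub.cfg-<mac>" is kept as a literal
placeholder name.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Getter = Callable[[str, float], Optional[str]]


def make_server(latency: Dict[str, float]) -> Getter:
    """Build a fake file server with a fixed response latency per file.

    Hypothetical model: a request succeeds only if the server's latency
    for that file is within the client's timeout.
    """
    def get(name: str, timeout: float) -> Optional[str]:
        delay = latency.get(name)
        if delay is None or delay > timeout:
            return None  # the client sees no answer before its timeout
        return f"contents of {name}"
    return get


def fetch_first_available(
    candidates: Sequence[str],
    get: Getter,
    retries: int = 3,      # assumed value, not grub's real retry count
    timeout: float = 5.0,  # assumed value, in seconds
) -> Optional[Tuple[str, str]]:
    """Try each candidate in order; give up on one after `retries` misses."""
    for name in candidates:
        for _ in range(retries):
            body = get(name, timeout)
            if body is not None:
                return name, body
        # No answer for this file: move on to the next candidate.
    return None


CANDIDATES = ["grub.cfg-<mac>", "grub.cfg-default-amd64"]

# Slow server: the MAC-specific file takes longer than the client waits.
slow = make_server({"grub.cfg-<mac>": 60.0, "grub.cfg-default-amd64": 0.1})
# Timely server: the same file is answered within the timeout.
timely = make_server({"grub.cfg-<mac>": 1.0, "grub.cfg-default-amd64": 0.1})

print(fetch_first_available(CANDIDATES, slow))
print(fetch_first_available(CANDIDATES, timely))
```

With the slow server the client ends up with the default config even
though the machine is known; with the timely one it gets the
MAC-specific config. The client logic is identical in both runs, and
only the server's response time differs.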


>
> > So this is *not* a race condition in MAAS. This is working as designed
> > and is expected. The problem here is that MAAS takes too long to answer
> > the initial request, which causes grub to time out and move on to
> > request a different config file.
>
> Yes, because there is a race condition in the design - the MAC
> specific file has to be generated before grub times out.  It could
> instead be generated before the node ever starts booting, allowing it
> to be served just as fast as the -default-amd64 file is, eliminating
> that race condition.
>

It is not a race condition. It is doing exactly what it was told to do. It
requested X, didn't get a response, then requested Y and got a response.
The fact that there's no response to X in a /timely/ manner is not a race;
it's a bug on the server side. If the machine were not known to MAAS, it
would work as expected. But since it is known and the response doesn't
come in a timely manner for grub, grub moves on. This is the same behavior
pxe, uboot and other network bootloaders follow.
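
For reference, the pxelinux lookup order quoted earlier in this thread
can be reproduced with a short sketch. The function name and signature
are my own, for illustration only; the "01-" prefix on the MAC entry is
the ARP hardware type for Ethernet.

```python
import ipaddress
from typing import List


def pxelinux_candidates(uuid: str, mac: str, ip: str,
                        prefix: str = "pxelinux.cfg") -> List[str]:
    """List the config files pxelinux asks for, in order."""
    names = [uuid.lower()]
    # MAC address: lowercase, dash-separated, prefixed with "01-".
    names.append("01-" + mac.lower().replace(":", "-"))
    # IPv4 address as 8 uppercase hex digits, then one digit shorter
    # on each attempt until a single digit is left.
    hex_ip = f"{int(ipaddress.IPv4Address(ip)):08X}"
    names.extend(hex_ip[:n] for n in range(8, 0, -1))
    # Last resort when nothing more specific was served.
    names.append("default")
    return [f"{prefix}/{name}" for name in names]


paths = pxelinux_candidates(
    uuid="b8945908-d6a6-41a9-611d-74a6ab80b83d",
    mac="88:99:AA:BB:CC:DD",
    ip="192.168.2.91",
    prefix="/mybootdir/pxelinux.cfg",
)
for path in paths:
    print(path)
```

For 192.168.2.91 the hex form is C0A8025B, so this prints the same
eleven paths as the listing quoted above, ending with
/mybootdir/pxelinux.cfg/default.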

And yes, you could argue that the config could be generated before the node
starts booting, but what you are not considering is that the node can boot
from any rack controller. That would require MAAS to send the same file to
all rack controllers in the VLAN the machine is booting from and to write
files to disk dynamically, which can in fact impact performance even more.
The config is generated on the fly because it is generated for the specific
rack controller the machine is booting from, and that's the intended design.

>
> Jason
>
> > On Mon, Feb 5, 2018 at 4:30 PM, Jason Hobbs <jason.hobbs at canonical.com>
> > wrote:
> >
> >> The packetdump (comment #35) of MAAS not responding to grub's request
> >> for the mac specific grub.cfg before grub times out, and then responding
> >> immediately to the generic-amd64 grub cfg, clearly shows a race
> >> condition in MAAS.
> >>
> >> MAAS's design of dynamically generating the interface specific grub
> >> config only after it receives the tftp request for it is susceptible to
> >> a race condition where grub times out before MAAS can respond.
> >>
> >> That design is not the only possible design.  All the information
> >> required for the interface specific grub.cfg is available before the
> >> machine ever powers on, and could be made available on the rack
> >> controllers at that time too.
> >>
> >> Doing so would eliminate that race condition, or at least reduce the
> >> opportunity greatly, as we see MAAS has no problems immediately
> >> responding and serving files that it doesn't need to dynamically
> >> generate at request time.
> >>
> >> There is still some question around what in the environment is
> >> contributing to MAAS not responding faster, and what MAAS is doing while
> >> it takes 60+ seconds to respond to the request, but that doesn't change
> >> the fact that the current MAAS design is racy (and that's a bug).
> >>
> >> Whatever we change in the environment to reduce the likelihood of
> >> hitting this issue there doesn't solve the underlying race condition in
> >> MAAS, and leaves open the possibility of hitting the issue other places
> >> too.
> >>
> >> --
> >> You received this bug notification because you are subscribed to MAAS.
> >> https://bugs.launchpad.net/bugs/1743249
> >>
> >> Title:
> >>   Failed Deployment after timeout trying to retrieve grub cfg
> >>
> >> To manage notifications about this bug go to:
> >> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
> >>
> >> Launchpad-Notification-Type: bug
> >> Launchpad-Bug: product=maas; milestone=2.4.x; status=New;
> >> importance=Undecided; assignee=None;
> >> Launchpad-Bug: distribution=ubuntu; sourcepackage=grub2; component=main;
> >> status=In Progress; importance=Medium; assignee=mathieu.tl at gmail.com;
> >> Launchpad-Bug-Tags: cdo-qa cdo-qa-blocker foundations-engine patch
> >> Launchpad-Bug-Information-Type: Public
> >> Launchpad-Bug-Private: no
> >> Launchpad-Bug-Security-Vulnerability: no
> >> Launchpad-Bug-Commenters: andreserl blake-rouse cgregan jason-hobbs
> >> vorlon
> >> Launchpad-Bug-Reporter: Jason Hobbs (jason-hobbs)
> >> Launchpad-Bug-Modifier: Jason Hobbs (jason-hobbs)
> >> Launchpad-Message-Rationale: Subscriber (MAAS)
> >> Launchpad-Message-For: andreserl
> >>
> >
> >
> > --
> > Andres Rodriguez (RoAkSoAx)
> > Ubuntu Server Developer
> > MSc. Telecom & Networking
> > Systems Engineer
> >
> > --
> > You received this bug notification because you are subscribed to the bug
> > report.
> > https://bugs.launchpad.net/bugs/1743249
> >
> > Title:
> >   Failed Deployment after timeout trying to retrieve grub cfg
> >
> > Status in MAAS:
> >   New
> > Status in grub2 package in Ubuntu:
> >   In Progress
> >
> > Bug description:
> >   A node failed to deploy after it failed to retrieve a grub.cfg from
> >   MAAS due to a timeout.  In the logs, it's clear that the server tried
> >   to retrieve the grub cfg many times, over about 30 seconds:
> >
> >   http://paste.ubuntu.com/26387256/
> >
> >   We see the same thing for other hosts around the same time:
> >
> >   http://paste.ubuntu.com/26387262/
> >
> >   It seems like MAAS is taking way too long to respond to these
> >   requests.
> >
> >   This is very similar to bug 1724677, which was happening
> >   pre-meltdown/spectre. The only difference is we don't see "[critical]
> >   TFTP back-end failed" in the logs anymore.
> >
> >   I connected to the console on this system and it had errors about
> >   timing out retrieving the grub-cfg, then it had an error message along
> >   the lines of "error not an ip" and then "double free".  After I
> >   connected but before I could get a screenshot the system rebooted and
> >   was directed by maas to power off, which it did successfully after
> >   booting to linux.
> >
> >   Full logs are available here:
> >   https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
> >
> >   This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
> >
> > To manage notifications about this bug go to:
> > https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
>


-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1743249




More information about the foundations-bugs mailing list