[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

Steve Langasek steve.langasek at canonical.com
Tue Feb 6 00:50:40 UTC 2018


On Tue, Feb 06, 2018 at 12:11:21AM -0000, Mike Pontillo wrote:
> Steve, can you be more specific about which packet capture showed the
> "stacked OACK" behavior?

This was the first packet capture that Jason posted, in comment #30.  The
UDP retransmits shown in packets 6262-6268 each receive an answering packet
in 6270-6271 and 6273-6277, in addition to 6269 as the answer to 6261.  For
whatever reason, Wireshark here does not decode these duplicate OACK
packets as OACK, but an examination of the raw packets shows that's clearly
what they are.
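
As a rough sketch of what "examining the raw packets" can look like: RFC 2347
assigns opcode 6 to OACK, followed by NUL-terminated option/value string
pairs.  The function name parse_oack and the sample payload in the usage note
are illustrative assumptions, not taken from the capture.

```python
import struct

OACK = 6  # opcode assigned to Option Acknowledgment in RFC 2347


def parse_oack(payload: bytes):
    """Return the acknowledged options as a dict if the UDP payload is a
    TFTP OACK, otherwise None.  (Hypothetical helper for illustration.)"""
    if len(payload) < 2:
        return None
    # The first two bytes of every TFTP packet are a big-endian opcode.
    (opcode,) = struct.unpack("!H", payload[:2])
    if opcode != OACK:
        return None
    # The rest is opt1 NUL value1 NUL opt2 NUL value2 NUL ...
    fields = payload[2:].split(b"\x00")
    if fields and fields[-1] == b"":
        fields.pop()  # drop the empty piece after the final NUL
    names, values = fields[0::2], fields[1::2]
    return {
        name.decode("ascii").lower(): value.decode("ascii")
        for name, value in zip(names, values)
    }
```

Feeding it a made-up payload such as b"\x00\x06blksize\x001468\x00" yields a
dict with a single "blksize" entry, while a DATA packet (opcode 3) yields None.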

> I looked at a packet capture Andres pointed me to, and don't see the
> "stacked OACKs" you describe. Each TFTP transaction (per RFC 1350) is
> indicated by the (source port, dest port) tuple, and I see that MAAS
> correctly OACKs each individual transaction (per RFC 2347) - not the
> retry packets within the same transaction.

Packets 6269-6271,6273-6277 are all answers to the same port on the client.
They don't have the same source port, because MAAS has allocated a separate
source port for each of these.  It's not acking a separate individual
transaction, it's MAAS /creating/ a separate transaction (with the
allocation of a separate source port) for each one.
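
One way to spot this pattern in a capture, sketched under the assumption that
the packets have already been reduced to (src_ip, src_port, dst_ip, dst_port)
tuples: group the server's packets by the client endpoint they target and
flag any client port that was answered from more than one server port.  The
function name and the tuple layout are assumptions made for this example.

```python
from collections import defaultdict


def find_stacked_responses(packets, server_ip):
    """Return {(client_ip, client_port): [server ports...]} for every client
    endpoint that was answered from more than one server source port.

    `packets` is an iterable of (src_ip, src_port, dst_ip, dst_port) tuples;
    this layout is an assumption of the sketch, not a capture file format.
    """
    ports_by_client = defaultdict(set)
    for src_ip, src_port, dst_ip, dst_port in packets:
        if src_ip == server_ip:
            ports_by_client[(dst_ip, dst_port)].add(src_port)
    return {
        client: sorted(ports)
        for client, ports in ports_by_client.items()
        if len(ports) > 1
    }
```

A client port that appears in the result was offered several server TIDs for
what the client considers a single request.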

RFC2347 does not speak to this; the discussion of the port negotiation is in
RFC1350 §4:

   In order to create a connection, each end of the connection chooses a
   TID for itself, to be used for the duration of that connection.  The
   TID's chosen for a connection should be randomly chosen, so that the
   probability that the same number is chosen twice in immediate
   succession is very low.  Every packet has associated with it the two
   TID's of the ends of the connection, the source TID and the
   destination TID.  These TID's are handed to the supporting UDP (or
   other datagram protocol) as the source and destination ports.  A
   requesting host chooses its source TID as described above, and sends
   its initial request to the known TID 69 decimal (105 octal) on the
   serving host.  The response to the request, under normal operation,
   uses a TID chosen by the server as its source TID and the TID chosen
   for the previous message by the requestor as its destination TID.
   The two chosen TID's are then used for the remainder of the transfer.

MAAS responds to 8 udp retransmits on srcport=25305, dstport=69 by sending 8
independent OACK packets back to dstport=25305 each from a different source
port.
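
For comparison, a minimal sketch of how a server could treat retransmitted
requests as part of one transfer: key in-flight transfers on the client's
(address, port) pair and reuse the already chosen server TID when the same
request arrives again.  This is not how MAAS is implemented; the class and
method names are hypothetical, and a real server would normally get its TID
by binding a UDP socket rather than drawing a number itself.

```python
import random


class TransferTable:
    """Tracks one server TID per client endpoint (illustrative only)."""

    def __init__(self, rng=None):
        self._by_client = {}  # (client_ip, client_port) -> server TID
        self._rng = rng or random.Random()

    def tid_for_request(self, client_addr):
        """Return the server TID for this client, allocating one only on the
        first request seen from that endpoint."""
        tid = self._by_client.get(client_addr)
        if tid is None:
            in_use = set(self._by_client.values())
            # Randomly chosen, as RFC 1350 suggests, skipping TIDs in use.
            tid = self._rng.randint(1024, 65535)
            while tid in in_use:
                tid = self._rng.randint(1024, 65535)
            self._by_client[client_addr] = tid
        return tid

    def finish(self, client_addr):
        """Forget a completed or timed-out transfer so its TID is released."""
        self._by_client.pop(client_addr, None)

    def __len__(self):
        return len(self._by_client)
```

With a table like this, eight copies of the same request from port 25305
would all map to a single server TID, and calling finish() on timeout keeps
unanswered OACKs from holding ports indefinitely.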

Since Andres confirms that these duplicate acks still only result in one
database query, this may be a negligible bug if the only impact is duplicate
small udp packets.  OTOH, depending on how MAAS implements this, it could
also result in port exhaustion on the server if unanswered OACKs are allowed
to linger.

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1743249

Title:
  Failed Deployment after timeout trying to retrieve grub cfg

Status in MAAS:
  New
Status in grub2 package in Ubuntu:
  In Progress

Bug description:
  A node failed to deploy after it failed to retrieve a grub.cfg from
  MAAS due to a timeout.  In the logs, it's clear that the server tried
  to retrieve the grub cfg many times, over about 30 seconds:

  http://paste.ubuntu.com/26387256/

  We see the same thing for other hosts around the same time:

  http://paste.ubuntu.com/26387262/

  It seems like MAAS is taking way too long to respond to these
  requests.

  This is very similar to bug 1724677, which was happening
  pre-Meltdown/Spectre. The only difference is we don't see
  "[critical] TFTP back-end failed" in the logs anymore.

  I connected to the console on this system and it had errors about
  timing out retrieving the grub-cfg, then it had an error message along
  the lines of "error not an ip" and then "double free".  After I
  connected but before I could get a screenshot the system rebooted and
  was directed by maas to power off, which it did successfully after
  booting to linux.

  Full logs are available here:
  https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar

  This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.

To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions


