[Bug 1896638] Re: Path to swapfile doesn't use a static device path

Alberto Contreras 1896638 at bugs.launchpad.net
Wed May 24 10:35:23 UTC 2023


It looks like AWS EC2 no longer allows requesting spot instances with the interruption behavior set to 'hibernate'.
I tried in multiple regions and with multiple valid instance types, and consistently got the following error:

```
launchSpecTemporarilyBlacklisted	Repeated errors have occurred processing the launch specification "t3.micro, ami-08d931621368a5861, Linux/UNIX, eu-west-3a while launching spot instance". It will not be retried for at least 13 minutes. Error message: The request with instanceType 't3.micro' and Linux/UNIX is not supported when instanceInterruptionBehavior is set to 'hibernate'. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidParameterCombination; Proxy: null)
```

I was able to verify that hibernation works and that this bug is fixed
by simulating the workflow on regular (non-spot) instances running
bionic, focal, jammy and kinetic:

apt purge ec2-hibinit-agent
apt-get update
apt-get upgrade -y

cat <<EOF >/etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed restricted main multiverse universe
EOF

apt-get update
apt-get install -y hibagent
apt-cache policy hibagent
systemctl is-active hibagent.target || /usr/bin/enable-ec2-spot-hibernation

# Verify no errors
systemctl status hibagent
journalctl -u hibagent

# Verify lp #1896638 (resume partition by PARTUUID)
grep PART /etc/default/grub.d/99-set-swap.cfg

systemctl hibernate

# Start the instance and verify that resuming from hibernation succeeded
systemctl status hibinit-agent
journalctl --reverse
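
The PARTUUID check in the steps above can also be scripted. This is only a minimal sketch, not part of the verification itself: the function name `resume_uses_partuuid` is hypothetical, and it assumes the agent writes a `resume=` option into the file given as the argument.

```shell
# Hypothetical helper: succeed only if every resume= option found in the
# given file points at a PARTUUID rather than a /dev/... device name.
resume_uses_partuuid() {
    cfg="$1"
    # Extract each resume=<value> token; resume_offset= is not matched
    # because the pattern requires "resume=" right after a space or quote.
    values=$(grep -oE '(^|[[:space:]"])resume=[^[:space:]"]+' "$cfg" \
        | sed 's/.*resume=//')
    [ -n "$values" ] || return 1      # no resume= option at all
    for v in $values; do
        case "$v" in
            PARTUUID=*) ;;            # stable identifier
            *) return 1 ;;            # device name or anything else
        esac
    done
    return 0
}
```

For example: `resume_uses_partuuid /etc/default/grub.d/99-set-swap.cfg && echo "resume device is stable"`.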


** Tags removed: verification-done-xenial verification-needed verification-needed-bionic verification-needed-focal verification-needed-jammy verification-needed-kinetic
** Tags added: verification-done verification-done-bionic verification-done-focal verification-done-jammy verification-done-kinetic

** Tags removed: verification-done

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to hibagent in Ubuntu.
https://bugs.launchpad.net/bugs/1896638

Title:
  Path to swapfile doesn't use a static device path

Status in ec2-hibinit-agent package in Ubuntu:
  Fix Released
Status in hibagent package in Ubuntu:
  Fix Released
Status in ec2-hibinit-agent source package in Xenial:
  Fix Released
Status in hibagent source package in Xenial:
  Won't Fix
Status in ec2-hibinit-agent source package in Bionic:
  Fix Released
Status in hibagent source package in Bionic:
  Fix Committed
Status in ec2-hibinit-agent source package in Focal:
  Fix Released
Status in hibagent source package in Focal:
  Fix Committed
Status in ec2-hibinit-agent source package in Groovy:
  Fix Released
Status in hibagent source package in Groovy:
  Won't Fix
Status in hibagent source package in Jammy:
  Fix Committed
Status in hibagent source package in Kinetic:
  Fix Committed

Bug description:
  [Impact]

  * Using the device name on the kernel cmdline in the resume= option
  leads to failure to resume from hibernation when the device name is
  not stable, which can be the case for nvme drives.

  [Test Case]

  * ec2-hibinit-agent

    * Set up an EC2 instance to allow hibernation
    * Wait until hibinit-agent.service has fully started
    * /etc/default/grub.d/99-set-swap.cfg should refer to the resume= partition by PARTUUID

  * hibagent

    * Spin up an EC2 spot instance with `hibernate` as `Interruption behavior` [1].
    * Install the latest hibagent: `sudo apt-get install hibagent`
    * Enable hibernation: `sudo /usr/bin/enable-ec2-spot-hibernation`
    * Create an AWS FIS experiment template that sends a spot-instance-interruption signal [2], point it at the created instance, and launch it.
      Note: This step is optional. One can instead wait for AWS EC2 to send the interruption signal, but that could take a long time.
    * After some minutes, EC2 will send a signal to resume the interrupted instance.
    * Verify the instance has correctly been resumed from hibernation.

  [1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/interruption-behavior.html#specifying-spot-interruption-behavior
  [2] https://catalog.us-east-1.prod.workshops.aws/workshops/5fc0039f-9f15-47f8-aff0-09dc7b1779ee/en-US/030-basic-content/078-ec2-spot/020-spot-ec2-interrup

  [Regression Potential]

  * Failure to discover the PARTUUID makes the system unable to resume. A
  potential crash would leave the system unable to set up hibernation or
  unable to resume. (On Focal, PARTUUID is already in use, even without
  this fix.)

  [Original Bug Text]

  When the agent inserts the resume device path and offset into the
  kernel cmdline, it uses device names such as the following:

  `resume_offset=223232 resume=/dev/nvme1n1p1`

  The issue is that `/dev/nvme1n1p1` is not static. After a reboot, the
  block device may appear as `/dev/nvme0n1p1`, resulting in failure to
  find the swapfile used to suspend.

  The solution should be to use a persistent block device naming scheme.
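
  One persistent scheme is the partition's PARTUUID. The following is a
  minimal sketch of building a stable `resume=` argument, assuming
  util-linux `blkid` is available. The names `partuuid_of`,
  `stable_resume_arg` and `PARTUUID_LOOKUP` are hypothetical, and this is
  not the agent's actual code:

  ```shell
  # Hypothetical helper: print the PARTUUID of a block device.
  # Assumes util-linux blkid; probing usually requires root.
  partuuid_of() {
      blkid -s PARTUUID -o value "$1"
  }

  # Hypothetical helper: turn a device path into a resume= argument that
  # survives device renaming. The lookup command can be overridden through
  # PARTUUID_LOOKUP (an assumption of this sketch, useful for dry runs).
  stable_resume_arg() {
      dev="$1"
      uuid=$("${PARTUUID_LOOKUP:-partuuid_of}" "$dev") || return 1
      [ -n "$uuid" ] || return 1      # device has no PARTUUID
      printf 'resume=PARTUUID=%s\n' "$uuid"
  }
  ```

  Called as `stable_resume_arg /dev/nvme1n1p1`, this prints a
  `resume=PARTUUID=...` argument for that partition, or fails if no
  PARTUUID can be found.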




