[Bug 1896638] Re: Path to swapfile doesn't use a static device path
Alberto Contreras
1896638 at bugs.launchpad.net
Wed May 24 10:35:23 UTC 2023
It looks like AWS EC2 has disabled the ability to request spot instances with the interruption behavior set to 'hibernate'.
I have tried to reproduce it in multiple regions and with multiple valid instance types and I consistently get the following error:
```
launchSpecTemporarilyBlacklisted Repeated errors have occurred processing the launch specification "t3.micro, ami-08d931621368a5861, Linux/UNIX, eu-west-3a while launching spot instance". It will not be retried for at least 13 minutes. Error message: The request with instanceType 't3.micro' and Linux/UNIX is not supported when instanceInterruptionBehavior is set to 'hibernate'. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidParameterCombination; Proxy: null)
```
I have been able to verify that hibernation works and that this bug is
fixed by simulating the workflow on regular (non-spot) instances running
bionic, focal, jammy and kinetic:
apt purge ec2-hibinit-agent
apt-get update
apt-get upgrade -y
cat <<EOF >/etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed restricted main multiverse universe
EOF
apt-get update
apt-get install -y hibagent
apt-cache policy hibagent
systemctl is-active hibagent.target || /usr/bin/enable-ec2-spot-hibernation
# Verify no errors
systemctl status hibagent
journalctl -u hibagent
# Verify lp #1896638 (resume partition by PARTUUID)
grep PART /etc/default/grub.d/99-set-swap.cfg
systemctl hibernate
# Start the instance again and verify that it resumed from hibernation correctly
systemctl status hibagent
journalctl --reverse
** Tags removed: verification-done-xenial verification-needed verification-needed-bionic verification-needed-focal verification-needed-jammy verification-needed-kinetic
** Tags added: verification-done verification-done-bionic verification-done-focal verification-done-jammy verification-done-kinetic
** Tags removed: verification-done
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to hibagent in Ubuntu.
https://bugs.launchpad.net/bugs/1896638
Title:
  Path to swapfile doesn't use a static device path
Status in ec2-hibinit-agent package in Ubuntu:
  Fix Released
Status in hibagent package in Ubuntu:
  Fix Released
Status in ec2-hibinit-agent source package in Xenial:
  Fix Released
Status in hibagent source package in Xenial:
  Won't Fix
Status in ec2-hibinit-agent source package in Bionic:
  Fix Released
Status in hibagent source package in Bionic:
  Fix Committed
Status in ec2-hibinit-agent source package in Focal:
  Fix Released
Status in hibagent source package in Focal:
  Fix Committed
Status in ec2-hibinit-agent source package in Groovy:
  Fix Released
Status in hibagent source package in Groovy:
  Won't Fix
Status in hibagent source package in Jammy:
  Fix Committed
Status in hibagent source package in Kinetic:
  Fix Committed
Bug description:
[Impact]
* Using the device name in the resume= option on the kernel cmdline
leads to failure to resume from hibernation when the device name is
not stable, which can be the case for NVMe drives.
[Test Case]
* ec2-hibinit-agent
  * Set up an EC2 instance to allow hibernation.
  * Wait for hibinit-agent.service to be fully started.
  * /etc/default/grub.d/99-set-swap.cfg should refer to the resume partition by PARTUUID.
* hibagent
  * Spin up an EC2 spot instance with `hibernate` as `Interruption behavior` [1].
  * Install the latest hibagent: `sudo apt-get install hibagent`
  * Enable hibernation: `sudo /usr/bin/enable-ec2-spot-hibernation`
  * Create an AWS FIS experiment template to send a spot-instance-interruption signal [2], point it at the created instance, and launch it.
    Note: This step is optional; one can instead wait for AWS EC2 to send the interruption signal, but that could take a long time.
  * After some minutes, EC2 will send a signal to resume the interrupted instance.
  * Verify that the instance has resumed from hibernation correctly.
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/interruption-behavior.html#specifying-spot-interruption-behavior
[2] https://catalog.us-east-1.prod.workshops.aws/workshops/5fc0039f-9f15-47f8-aff0-09dc7b1779ee/en-US/030-basic-content/078-ec2-spot/020-spot-ec2-interrup
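The PARTUUID check in the test case could be scripted roughly as follows. This is only a sketch: `uses_partuuid_resume` is a hypothetical helper, not something shipped by either package, and it assumes the agent writes the argument in the kernel's `resume=PARTUUID=<id>` form.

```shell
#!/bin/sh
# Hypothetical helper (not part of ec2-hibinit-agent or hibagent).
# Succeeds only if the given file contains a resume= argument that names
# the partition by PARTUUID rather than by a /dev/... device name.
# Assumption: the argument is written as resume=PARTUUID=<id>.
uses_partuuid_resume() {
    cfg=${1:-/etc/default/grub.d/99-set-swap.cfg}
    grep -Eq 'resume=PARTUUID=[0-9A-Fa-f-]+' "$cfg"
}
```

Running `uses_partuuid_resume && echo OK` checks the GRUB snippet named in the test case; after a resume, `uses_partuuid_resume /proc/cmdline` applies the same check to the cmdline the kernel actually booted with.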
[Regression Potential]
* Failure to discover the PARTUUID makes the system unable to resume. A
potential crash would leave the system unable to set up hibernation or
unable to resume. (On Focal, PARTUUID is already in use, even without
this fix.)
[Original Bug Text]
When the agent inserts the resume device path and offset into the
kernel cmdline, it uses device names such as the following:
`resume_offset=223232 resume=/dev/nvme1n1p1`
The issue is that `/dev/nvme1n1p1` is not static. On reboot, the
block device may appear as `/dev/nvme0n1p1`, resulting in a failure to
find the swapfile used to suspend.
The solution should be to use a persistent block device naming scheme.
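As an illustration of what a persistent naming scheme could look like here, the sketch below builds the two kernel arguments from a PARTUUID instead of a device name. `resume_args` is a hypothetical helper written for this example, not the code used in the actual fix; the PARTUUID is assumed to come from something like `blkid -s PARTUUID -o value <device>`.

```shell
#!/bin/sh
# Sketch only: `resume_args` is a hypothetical helper, not the agent's code.
# It prints resume_offset= and resume= arguments that refer to the swap
# partition by PARTUUID, which does not change when device names are
# reordered between boots.
resume_args() {
    partuuid=$1   # e.g. from: blkid -s PARTUUID -o value /dev/nvme1n1p1
    offset=$2     # swapfile offset, e.g. 223232 as in the report above
    if [ -z "$partuuid" ]; then
        # Without a PARTUUID there is nothing stable to point at, so bail
        # out rather than emit a resume= argument that cannot be resolved.
        echo "no PARTUUID found; not writing resume=" >&2
        return 1
    fi
    printf 'resume_offset=%s resume=PARTUUID=%s\n' "$offset" "$partuuid"
}
```

On a real system the device holding the swapfile could be looked up first, for example with `findmnt -no SOURCE -T /path/to/swapfile` (the path is a placeholder), and its PARTUUID passed to the helper.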