ACK/Cmnt: [SRU][jammy/linux-aws][kinetic/linux-aws][PATCH 00/20] UBUNTU: SAUCE: PM: Hibernate: Enable Hibernation for Xen Based Instance Types
Tim Gardner
tim.gardner at canonical.com
Wed Aug 17 13:24:45 UTC 2022
On 8/17/22 02:51, Gerald Yang wrote:
> BugLink: https://bugs.launchpad.net/bugs/1968062
>
> SRU Justification:
>
> [Impact]
>
> Hibernation currently fails for all AWS Xen instance types
> (c3/c4/i3/m3/m4/r3/r4/t2) with Jammy 5.15 and Kinetic 5.19 linux-aws kernels.
>
> When attempting to hibernate, the system gets stuck in sync_inodes_one_sb() when
> processing the rootfs, fails to hibernate, and shuts down. When you start the
> instance, it starts fresh, and does not resume from the incomplete hibernation
> image. Networking is also broken, and you cannot ssh in.
>
> Upon review of the jammy/linux-aws git log, it appears that the kernel is
> missing AWS hibernation enablement patches entirely. These need to be included
> to get hibernation working.
>
> [Fix]
>
> Hibernation currently works on the Amazon Linux 2 5.15 Kernel:
> https://github.com/amazonlinux/linux/tree/amazon-5.15.y/mainline
>
> After careful review of the amazon-5.15.y/mainline branch, we have found the
> below set of patches authored by Amazon AWS Hibernation team to be minimally
> sufficient to get hibernation working on both Jammy 5.15 and Kinetic 5.19.
>
> xen: Restore xen-pirqs on resume from hibernation
> xen-netfront: call netif_device_attach on resume
> xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
> xen: restore pirqs on resume from hibernation.
> block: xen-blkfront: consider new dom0 features on restore
> x86: tsc: avoid system instability in hibernation
> xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
> Revert "xen: dont fiddle with event channel masking in suspend/resume"
> PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
> x86/xen: close event channels for PIRQs in system core suspend callback
> xen/events: add xen_shutdown_pirqs helper function
> x86/xen: save and restore steal clock
> xen/time: introduce xen_{save,restore}_steal_clock
> xen-netfront: add callbacks for PM suspend and hibernation support
> xen-blkfront: add callbacks for PM suspend and hibernation
> x86/xen: add system core suspend and resume callbacks
> x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume
> xenbus: add freeze/thaw/restore callbacks support
> xen/manage: introduce helper function to know the on-going suspend mode
> xen/manage: keep track of the on-going suspend mode
>
> These patches will be carried as SAUCE patches, and their subjects marked with
> "UBUNTU: SAUCE [aws]". Their upstream is the Amazon Hibernation team, with the
> repo being the Amazon Linux 2 kernel repo.
>
> [Testcase]
>
> 1. Log into Amazon EC2.
> 2. Select Launch Instance.
> 3. Under Instance Type, select any from (c3/c4/i3/m3/m4/r3/r4/t2). I suggest t2.medium.
> 4. Select the "Ubuntu 22.04 LTS HVM (SSD type)" AMI in the quicklaunch pane.
> 5. Select your SSH keypair.
> 6. In storage, select 20 GB. Go to the advanced tab, and set Encrypted: Yes.
> 7. Under Advanced Settings for the instance, set "Stop - Hibernate" to Enable.
> 8. Create the Instance. SSH in.
> 9. Wait 5 minutes for hibinit-agent to create /swap-hibinit swapfile and configure grub.
> 10. Start a screen session. Echo some text and then detach with ctrl-a d.
> 11. Log out from instance.
> 12. In EC2, select "Instance State" > "Hibernate".
> 13. Wait 30 seconds to one minute. The state will go from "Stopping" to "Stopped".
> 14. Start the instance again.
> 15. SSH in.
> 16. Attempt to resume screen session with "screen -r".
>
> If you are not able to ssh into the instance, hibernation has failed. If ssh
> works and the screen session is still running, hibernation was successful.
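
The pass/fail check in steps 15 and 16 could be scripted roughly as below. This is only a sketch and not part of the submitted series: the function name, the verdict strings, and the sample "screen -ls" text are illustrative assumptions, and the "fresh boot" case is inferred from the failure behaviour described under [Impact].

```python
def hibernation_verdict(ssh_ok: bool, screen_ls_output: str) -> str:
    """Classify the test outcome (hypothetical helper, not from the patch set).

    ssh_ok           -- whether an SSH login succeeded after restarting the instance
    screen_ls_output -- text printed by `screen -ls` on the instance
    """
    if not ssh_ok:
        # Per the test case: no SSH access means hibernation failed.
        return "failed"

    # Assumption: `screen -ls` marks each live session line with
    # "(Detached)" or "(Attached)".
    session_alive = any(
        "(Detached)" in line or "(Attached)" in line
        for line in screen_ls_output.splitlines()
    )

    # SSH works but the session is gone: the instance most likely booted
    # fresh instead of resuming from the hibernation image.
    return "resumed" if session_alive else "fresh boot"


if __name__ == "__main__":
    # Illustrative sample output, not captured from a real instance.
    sample = (
        "There is a screen on:\n"
        "\t1234.pts-0.ip-10-0-0-1\t(08/17/22 13:00:00)\t(Detached)\n"
        "1 Socket in /run/screen/S-ubuntu.\n"
    )
    print(hibernation_verdict(True, sample))
```

Only the "resumed" verdict counts as a successful hibernation; the other two both indicate that the image was not resumed.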
>
> Alternatively, the CPC team can run their Hibernation testsuite over Jammy and
> Kinetic.
>
> We have built test kernels for Jammy and Kinetic with the patches, and they are
> available in the below ppa:
>
> https://launchpad.net/~gerald-yang-tw/+archive/ubuntu/aws-hibernate-test
>
> If you try and hibernate and resume with the test kernels, hibernation is
> successful.
>
> [Where problems could occur]
>
> We are adding a significant amount of code to the Xen subsystem, spread across
> many commits. This code has not been mainlined, and is instead maintained out
> of tree by the Amazon AWS Hibernation team.
>
> The changes target hibernation, block devices, and clock devices, specific to
> those used on AWS Xen instances. Most of these patches have been applied to
> Xenial, Bionic, Focal and other series for a long time, but some patches are
> new for 5.15 onward.
>
> The changes will only target linux-aws to try and limit regression risk to
> AWS users, and any regressions will be limited to users of Xen based instance
> types (c3/c4/i3/m3/m4/r3/r4/t2), covering both Xen 4.2 and Xen 4.11.
>
> If a regression were to occur, the instance would likely fail to hibernate, and
> at worst, write an incomplete hibernation image to the swapfile. The kernel will
> see this on start, and instead of resuming from the hibernation image, will
> start fresh. It is unlikely to cause any filesystem corruption on the rootfs,
> but any in progress computations at the time of hibernation could be lost. The
> current broken behaviour breaks networking, and users would have to power cycle
> the instance a few times before they can ssh in again.
>
> Aleksei Besogonov (1):
> PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
>
> Anchal Agarwal (4):
> x86/xen: Introduce new function to map HYPERVISOR_shared_info on
> Resume
> Revert "xen: dont fiddle with event channel masking in suspend/resume"
> xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
> xen: Restore xen-pirqs on resume from hibernation
>
> Eduardo Valentin (2):
> x86: tsc: avoid system instability in hibernation
> block: xen-blkfront: consider new dom0 features on restore
>
> Frank van der Linden (3):
> xen: restore pirqs on resume from hibernation.
> xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
> xen-netfront: call netif_device_attach on resume
>
> Munehisa Kamata (10):
> xen/manage: keep track of the on-going suspend mode
> xen/manage: introduce helper function to know the on-going suspend
> mode
> xenbus: add freeze/thaw/restore callbacks support
> x86/xen: add system core suspend and resume callbacks
> xen-blkfront: add callbacks for PM suspend and hibernation
> xen-netfront: add callbacks for PM suspend and hibernation support
> xen/time: introduce xen_{save,restore}_steal_clock
> x86/xen: save and restore steal clock
> xen/events: add xen_shutdown_pirqs helper function
> x86/xen: close event channels for PIRQs in system core suspend
> callback
>
> arch/x86/kernel/tsc.c | 29 ++++++
> arch/x86/xen/enlighten_hvm.c | 8 ++
> arch/x86/xen/suspend.c | 67 +++++++++++++
> arch/x86/xen/time.c | 3 +
> arch/x86/xen/xen-ops.h | 2 +
> drivers/block/xen-blkfront.c | 161 ++++++++++++++++++++++++++++--
> drivers/net/xen-netfront.c | 104 ++++++++++++++++++-
> drivers/xen/events/events_base.c | 30 +++++-
> drivers/xen/manage.c | 73 ++++++++++++++
> drivers/xen/time.c | 29 +++++-
> drivers/xen/xenbus/xenbus_probe.c | 99 +++++++++++++++---
> include/linux/irq.h | 2 +
> include/linux/sched/clock.h | 5 +
> include/xen/events.h | 2 +
> include/xen/xen-ops.h | 8 ++
> include/xen/xenbus.h | 3 +
> kernel/irq/chip.c | 4 +-
> kernel/power/user.c | 4 +
> kernel/sched/clock.c | 4 +-
> 19 files changed, 604 insertions(+), 33 deletions(-)
>
Acked-by: Tim Gardner <tim.gardner at canonical.com>
Nice work. Since I'm likely the one that will apply these patches, I'm
going to make 2 changes.
1) Add hibernation to the commit subject so that the intent of the patch
is clear.
2) Add the URL of the Amazon git repository to the commit message.
6 months from now those 2 bits of info will be a big help in remembering
what these patches are for, especially for those of us with goldfish
memories.
rtg
--
-----------
Tim Gardner
Canonical, Inc