ACK: [SRU][F/aws][PATCH 0/2] aws: fix hibernation issues on c5.18xlarge
Colin Ian King
colin.king at canonical.com
Thu Mar 11 16:03:37 UTC 2021
On 11/03/2021 15:50, Andrea Righi wrote:
> [Impact]
>
> Hibernation is still unreliable on c5.18xlarge instances. The system
> usually hibernates correctly, but on resume it either performs a
> regular reboot instead of resuming from hibernation, or it gets
> completely stuck after the hibernated kernel is loaded into memory
> (more precisely, while the resume callbacks of the hibernated kernel
> are being executed).
>
> [Test plan]
>
> Create a c5.18xlarge instance, run the memory stress test script (the
> same script that we use to stress test hibernation), trigger the
> hibernate event, then trigger the resume event. Repeat a few times;
> the problem is very likely to occur.
>
> [Fix]
>
> Amazon pointed out two fixes that should address both issues:
>
> 1) upstream patch "PM: hibernate: flush swap writer after marking": this
> prevents the regular reboot issue, because it ensures that the I/O is
> always flushed after, not before, writing the hibernation signature
>
> 2) we need to reserve more space for the hv_clock_boot array by
> increasing HVC_BOOT_ARRAY_SIZE: this is a temporary solution (SAUCE
> PATCH for now) suggested by Amazon, who are working on a proper (more
> elegant) fix. Doubling HVC_BOOT_ARRAY_SIZE seems to resolve the
> problem: we have tested this change extensively in the AWS cloud and
> it seems to prevent the "system stuck on resume" issue from happening
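
Why the flush ordering in fix 1 matters can be sketched with a toy model. This is not kernel code: the ToyDisk class, the hibernate() and boot() helpers, and the "HIBERNATE" signature string are all hypothetical names made up for illustration. The model also takes the worst case, in which any write that has not been flushed is lost at power-off.

```python
# Toy model, not kernel code: a disk whose writes sit in a volatile cache
# until an explicit flush makes them durable. All names are hypothetical.
class ToyDisk:
    def __init__(self):
        self.cache = {}    # written but not yet durable
        self.durable = {}  # survives power-off

    def write(self, key, value):
        self.cache[key] = value

    def flush(self):
        self.durable.update(self.cache)
        self.cache.clear()

    def power_off(self):
        self.cache.clear()  # worst case: anything unflushed is lost


def hibernate(disk, flush_after_signature):
    disk.write("image", "memory snapshot")
    if not flush_after_signature:
        disk.flush()                      # old ordering: flush, then mark
    disk.write("signature", "HIBERNATE")  # hypothetical signature value
    if flush_after_signature:
        disk.flush()                      # fixed ordering: mark, then flush
    disk.power_off()


def boot(disk):
    # Resume only if both the signature and the image are durable.
    if disk.durable.get("signature") == "HIBERNATE" and "image" in disk.durable:
        return "resume"
    return "regular boot"
```

In this model, hibernate(ToyDisk(), flush_after_signature=False) leaves the signature in the volatile cache, so the next boot falls back to a regular boot, while flushing after the signature makes both the image and the signature durable.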
>
> [Regression potential]
>
> The first patch touches only the hibernation code, so potential
> regressions could be experienced only in the hibernation scenario. The
> second patch is more of a hack at the moment and it affects kvmclock.
> Increasing HVC_BOOT_ARRAY_SIZE could potentially introduce regressions
> on small KVM systems; a better solution would be to allocate the
> hv_clock_boot array dynamically, which is the proper fix that Amazon
> is currently working on. When that fix is published upstream we
> should apply it and drop this SAUCE PATCH.
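
The trade-off between a larger static array and a dynamically allocated one can be illustrated with some rough arithmetic. The figures below are assumptions and are not taken from the patches: a 4096-byte page, a 64-byte per-vCPU clock entry, and a static array that originally spans one page. The function names are hypothetical.

```python
# Illustrative arithmetic only; all constants are assumptions, not values
# taken from the patches.
PAGE_SIZE = 4096   # assumed page size in bytes
ENTRY_SIZE = 64    # assumed size of one per-vCPU clock entry in bytes


def static_slots(pages):
    """Entries available in a static array spanning `pages` pages."""
    return pages * PAGE_SIZE // ENTRY_SIZE


def static_bytes(pages):
    """Memory reserved by the static array, regardless of vCPU count."""
    return pages * PAGE_SIZE


def dynamic_bytes(vcpus):
    """Memory needed if entries were allocated per vCPU, in whole pages."""
    pages = -(-vcpus * ENTRY_SIZE // PAGE_SIZE)  # ceiling division
    return pages * PAGE_SIZE
```

Under these assumptions, doubling the static array raises its capacity from static_slots(1) to static_slots(2) entries, but reserves static_bytes(2) bytes even on a single-vCPU guest, where dynamic_bytes(1) would need only one page. A dynamically sized array would grow to two pages only on guests with enough vCPUs to need them.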
>
>
Looks OK to me. Thanks, Andrea.
Acked-by: Colin Ian King <colin.king at canonical.com>