ACK: [SRU][F/aws][PATCH 0/2] aws: fix hibernation issues on c5.18xlarge

Colin Ian King colin.king at canonical.com
Thu Mar 11 16:03:37 UTC 2021


On 11/03/2021 15:50, Andrea Righi wrote:
> [Impact]
> 
> Hibernation is still unreliable on c5.18xlarge instances. The system
> usually hibernates correctly, but on resume it either performs a
> regular reboot instead of resuming from hibernation, or it gets
> completely stuck after the hibernated kernel is loaded into memory
> (more precisely, while the resume callbacks of the hibernated kernel
> are being executed).
> 
> [Test plan]
> 
> Create a c5.18xlarge instance, run the memory stress test script (the
> same script that we use to stress test hibernation), trigger the
> hibernate event, then trigger the resume event. Repeat a few times;
> the problem is very likely to reproduce.
> 
> [Fix]
> 
> Amazon pointed out two fixes that should address both issues:
> 
> 1) upstream patch "PM: hibernate: flush swap writer after marking": this
>    prevents the regular reboot issue, because it ensures that the I/O is
>    always flushed after, not before, writing the hibernation signature
> 
> 2) we need to increase HVC_BOOT_ARRAY_SIZE to reserve more space for
>    the hv_clock_boot array: this is a temporary solution (SAUCE PATCH
>    for now) suggested by Amazon. They are working on a proper (more
>    elegant) fix, but doubling HVC_BOOT_ARRAY_SIZE seems to resolve the
>    problem. We have tested this change extensively in the AWS cloud
>    and it seems to prevent the "system stuck on resume" issue.
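
For reference, the ordering described in 1) can be sketched in userspace
roughly as below. This is only an analogy under stated assumptions, not
the kernel code from the upstream patch: the header layout, the
signature string, and all function names are hypothetical, and fsync()
stands in for flushing the swap writer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical layout: a small header at offset 0, image data after it. */
#define SKETCH_HEADER_SIZE 512
#define SKETCH_SIG         "SKETCHSIG"
#define SKETCH_SIG_LEN     (sizeof(SKETCH_SIG) - 1)

/* Write the image, then the signature, then flush. Returns 0 on success. */
int write_image_then_mark(int fd, const void *image, size_t len)
{
    assert(fd >= 0 && image != NULL);

    /* 1. Image payload goes after the reserved header. */
    if (pwrite(fd, image, len, SKETCH_HEADER_SIZE) != (ssize_t)len)
        return -1;

    /* 2. Mark the image as valid by writing the signature. */
    if (pwrite(fd, SKETCH_SIG, SKETCH_SIG_LEN, 0) != (ssize_t)SKETCH_SIG_LEN)
        return -1;

    /* 3. Flush last, so the signature is covered by the flush too. */
    return fsync(fd);
}

/* Resume side: 1 if the signature matches, 0 if it does not,
 * -1 on a read error or short read. */
int signature_present(int fd)
{
    char buf[SKETCH_SIG_LEN];

    if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
        return -1;
    return memcmp(buf, SKETCH_SIG, SKETCH_SIG_LEN) == 0;
}

/* Exercise both helpers on a scratch file. Returns 0 on success. */
int run_sketch(void)
{
    char path[] = "/tmp/hib-sketch-XXXXXX";
    int fd = mkstemp(path);
    int ok;

    if (fd < 0)
        return -1;
    ok = write_image_then_mark(fd, "image", 5) == 0 &&
         signature_present(fd) == 1;
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

If the flush happened before step 2, the signature write could still be
pending at power-off, and the resume side would see no valid image and
fall back to a regular boot.
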
> 
> [Regression potential]
> 
> The first patch touches only the hibernation code, so potential
> regressions could be experienced only in the hibernation scenario. The
> second patch is more of a hack at the moment and it affects kvmclock.
> Increasing HVC_BOOT_ARRAY_SIZE could potentially introduce regressions
> on small KVM systems; a better solution would be to allocate the
> hv_clock_boot array dynamically, which is the proper fix that Amazon
> is currently working on. Once that fix is published upstream we
> should apply it and drop this SAUCE PATCH.
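
A back-of-the-envelope sketch of why a static array can run out of
slots on a large instance, assuming the array holds one page worth of
per-vCPU clock entries. The 4 KiB page size, the 64-byte entry size and
the 72 vCPU count used in the example are assumptions for illustration,
not values taken from the patches, and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, not taken from the patches: 4 KiB pages and a
 * 64-byte (cache-line sized) per-vCPU clock entry. */
#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_ENTRY_SIZE  64u

/* Number of per-vCPU clock entries that fit in `pages` pages. */
unsigned int boot_array_entries(unsigned int pages)
{
    assert(pages > 0);
    return pages * SKETCH_PAGE_SIZE / SKETCH_ENTRY_SIZE;
}

/* True when every vCPU gets its own slot in the static array. */
bool covers_all_vcpus(unsigned int pages, unsigned int vcpus)
{
    return vcpus <= boot_array_entries(pages);
}
```

Under these assumptions, covers_all_vcpus(1, 72) is false and
covers_all_vcpus(2, 72) is true, which is consistent with doubling the
array size helping on this instance type. A dynamically allocated array
would size itself to the actual vCPU count instead.
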
> 
> 

Looks OK to me. Thanks Andrea

Acked-by: Colin Ian King <colin.king at canonical.com>


