NAK[F]: [PATCH 0/1] Multiple kexecs in AWS nitro instances fail
seth.forshee at canonical.com
Thu Apr 2 18:43:11 UTC 2020
On Wed, Apr 01, 2020 at 06:40:25PM -0300, Guilherme G. Piccoli wrote:
> BugLink: https://bugs.launchpad.net/bugs/1869948
> * Currently, users cannot perform multiple kernel kexec loads on AWS Nitro
> instances (KVM-based); after the 2nd or 3rd kexec, an initrd corruption is
> observed, with the following signature:
> Initramfs unpacking failed: junk within compressed archive
> Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.5.0-rc7-gpiccoli+ #26 Hardware name: Amazon EC2 t3.large/, BIOS 1.0 10/16/2017
> Call Trace:
> ? csum_partial_copy_generic+0x150/0x170
> ? do_execve+0x25/0x30
> ? rest_init+0xb0/0xb0
> * After investigation (see LP comment 2), it was noticed the Amazon ena network
> driver doesn't provide a shutdown() handler, so the device could still be
> performing a DMA transaction to a previously valid address while the new
> kernel boots, which would then corrupt
> kernel memory. The following patch was proposed and fixed the issue, allowing
> 1000 kexecs to be executed successfully with no issues observed:
> 428c491332bc ("net: ena: Add PCI shutdown handler to allow safe kexec")
> [ git.kernel.org/linus/428c491332bc ].
> * Hence, we are hereby requesting SRU for this patch. It was tested in all
> supported series (4.4, 4.15 and 5.3) in Amazon Nitro instances with success,
> and reviewed/acked by the ena driver team and a kexec developer from another distro.
> Worth mentioning that we proposed an upstream multi-vendor discussion about
> this issue: marc.info/?l=kexec&m=158299605013194 .
> [Test case]
> * The basic test procedure consists of performing multiple kexecs sequentially;
> AWS does not provide a full console, so in case of failures one could check
> the instance screenshot or use pstore/ramoops in order to collect dmesg after
> a crash in a preserved memory area. The commands used to perform kexec are:
> kexec -l <kernel file> --initrd <initrd file> --reuse-cmdline
> systemctl kexec
> Alternatively, one could use "--append=" instead of "--reuse-cmdline" if a
> change in kexec command-line is desired; also, to execute the kexec-loaded
> kernel both "kexec -e" and "systemctl kexec" are equally valid.
> * On LP (comment 3) we proposed a script/approach to auto-test kexecs, used
> here to perform 1000 kexecs with the proposed patch.
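
For illustration only, one step of such an auto-test loop might look like the
sketch below. This is not the script from the LP comment; the function name,
the counter file, and the DRY_RUN switch are all hypothetical, and it assumes
something (an init script or systemd unit, for example) calls the function
once per boot. The kexec and systemctl invocations are the ones quoted above.

```shell
#!/bin/sh
# Sketch only: one step of a boot-time kexec loop. All names here are
# hypothetical; the counter file must live on storage that survives kexec.

kexec_step() {
    count_file=$1   # file holding the number of kexecs done so far
    limit=$2        # total number of kexecs wanted (e.g. 1000)
    kernel=$3       # kernel image to load
    initrd=$4       # matching initrd image

    count=0
    if [ -f "$count_file" ]; then
        count=$(cat "$count_file")
    fi

    # Stop once the requested number of kexecs has been reached.
    if [ "$count" -ge "$limit" ]; then
        echo "done after $count kexecs"
        return 0
    fi

    # Record the attempt before jumping into the next kernel.
    count=$((count + 1))
    echo "$count" > "$count_file"
    sync

    # With DRY_RUN set to a non-empty value, print the commands instead
    # of running them.
    ${DRY_RUN:+echo} kexec -l "$kernel" --initrd "$initrd" --reuse-cmdline
    ${DRY_RUN:+echo} systemctl kexec
}
```

Setting DRY_RUN=1 before calling kexec_step prints the two commands rather
than executing them, which is a safe way to check the counter logic on a
machine that should not be rebooted.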
> [Regression Potential]
> * Although the patch proposed here introduces a PCI shutdown handler, it keeps
> the remove handler identical and bases shutdown strongly on ena_remove(),
> changing just the netdev handling, following other upstream drivers. It was
> extensively tested and presented no issues. Also, it's self-contained and
> affects only one driver, so other cloud providers or non-cloud environments
> wouldn't even be affected by the patch.
> * In case of a potential regression, it could manifest as a delay or issue
> on the reboot/shutdown path, and only if the ena driver is in use.
This patch has already been applied to focal from upstream stable.