NAK[F]: [PATCH 0/1] Multiple kexecs in AWS nitro instances fail
gpiccoli at canonical.com
Thu Apr 2 18:54:16 UTC 2020
Great Seth, thank you!
I saw Greg's email, but thought of sending it to Focal anyway - better
safe than sorry =)
On Thu, Apr 2, 2020 at 3:43 PM Seth Forshee <seth.forshee at canonical.com> wrote:
> On Wed, Apr 01, 2020 at 06:40:25PM -0300, Guilherme G. Piccoli wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1869948
> > [Impact]
> > * Currently, users cannot perform multiple kernel kexec loads on AWS Nitro
> > instances (KVM-based); after the 2nd or 3rd kexec, an initrd corruption is
> > observed, with the following signature:
> > Initramfs unpacking failed: junk within compressed archive
> > [...]
> > Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.5.0-rc7-gpiccoli+ #26 Hardware name: Amazon EC2 t3.large/, BIOS 1.0 10/16/2017
> > Call Trace:
> > dump_stack+0x6d/0x9a
> > ? csum_partial_copy_generic+0x150/0x170
> > panic+0x101/0x2e3
> > ? do_execve+0x25/0x30
> > ? rest_init+0xb0/0xb0
> > kernel_init+0xfb/0x100
> > ret_from_fork+0x35/0x40
> > * After investigation (see LP comment 2), it was noticed the Amazon ena network
> > driver doesn't provide a shutdown() handler, hence it could be performing a DMA
> > transaction to a previously valid address during boot, which would then corrupt
> > kernel memory. The following patch was proposed and fixed the issue, allowing
> > 1000 kexecs to be executed successfully with no issues observed:
> > 428c491332bc ("net: ena: Add PCI shutdown handler to allow safe kexec")
> > [ git.kernel.org/linus/428c491332bc ].
> > * Hence, we are hereby requesting SRU for this patch. It was tested in all
> > supported series (4.4, 4.15 and 5.3) in Amazon Nitro instances with success,
> > and reviewed/acked by the ena driver team and a kexec developer from another distro.
> > Worth mentioning that we proposed an upstream multi-vendor discussion about
> > this issue: marc.info/?l=kexec&m=158299605013194 .
> > [Test case]
> > * The basic test procedure consists of performing multiple kexecs sequentially;
> > AWS does not provide a full console, so in case of failures one could check
> > the instance screenshot or use pstore/ramoops in order to collect dmesg after
> > a crash in a preserved memory area. The commands used to perform kexec are:
> > kexec -l <kernel file> --initrd <initrd file> --reuse-cmdline
> > systemctl kexec
> > Alternatively, one could use "--append=" instead of "--reuse-cmdline" if a
> > change in kexec command-line is desired; also, to execute the kexec-loaded
> > kernel both "kexec -e" and "systemctl kexec" are equally valid.
> > * On LP (comment 3) we proposed a script/approach to auto-test kexecs, used
> > here to perform 1000 kexecs with the proposed patch.
> > [Regression Potential]
> > * Although the patch proposed here introduces a PCI shutdown handler, it keeps
> > the remove handler identical and bases shutdown strongly on ena_remove(),
> > changing just the netdev handling, following other upstream drivers. It was
> > extensively tested and presented no issues. Also, it's self-contained and
> > affects only one driver, so other cloud providers or non-cloud environments
> > wouldn't even be affected by the patch.
> > * In case of a potential regression, it could manifest as a delay or issue
> > on the reboot/shutdown path, and only if the ena driver is in use.
> This patch has already been applied to focal from upstream stable.
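For anyone wanting to reproduce the quoted test case, the load-and-execute sequence could be wrapped roughly as below. This is only a sketch and not the script from LP comment 3: the run() and kexec_once() helpers, the DRY_RUN variable and the default /boot paths are assumptions made for illustration. It defaults to printing the commands instead of running them.

```shell
#!/bin/sh
# Sketch of one kexec iteration, following the commands in the test case.
# Assumptions (not from the original post): the default KERNEL/INITRD paths
# below follow the usual Ubuntu /boot naming; the helper names are hypothetical.
KERNEL="${KERNEL:-/boot/vmlinuz-$(uname -r)}"
INITRD="${INITRD:-/boot/initrd.img-$(uname -r)}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Load the kernel and initrd, then jump into the loaded kernel.
kexec_once() {
    run kexec -l "$KERNEL" --initrd "$INITRD" --reuse-cmdline &&
    run systemctl kexec
}
```

Calling kexec_once with DRY_RUN=0 as root would perform a single kexec; repeating it across boots (e.g. from a unit started at boot) is left out here, since that part depends on the approach described in the LP bug.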