ACK: [PATCH 0/1] Multiple kexecs in AWS nitro instances fail

Guilherme Piccoli gpiccoli at canonical.com
Thu Apr 2 11:53:54 UTC 2020


Thanks Andrea and Cascardo! =)

On Thu, Apr 2, 2020 at 3:41 AM Andrea Righi <andrea.righi at canonical.com> wrote:
>
> On Wed, Apr 01, 2020 at 06:40:25PM -0300, Guilherme G. Piccoli wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1869948
> >
> >
> > [Impact]
> >
> > * Currently, users cannot perform multiple kernel kexec loads on AWS Nitro
> > instances (KVM-based); after the 2nd or 3rd kexec, an initrd corruption is
> > observed, with the following signature:
> >
> > Initramfs unpacking failed: junk within compressed archive
> > [...]
> > Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.5.0-rc7-gpiccoli+ #26 Hardware name: Amazon EC2 t3.large/, BIOS 1.0 10/16/2017
> > Call Trace:
> >   dump_stack+0x6d/0x9a
> >   ? csum_partial_copy_generic+0x150/0x170
> >   panic+0x101/0x2e3
> >   ? do_execve+0x25/0x30
> >   ? rest_init+0xb0/0xb0
> >   kernel_init+0xfb/0x100
> >   ret_from_fork+0x35/0x40
> >
> > * After investigation (see LP comment 2), we noticed that the Amazon ena
> > network driver doesn't provide a shutdown() handler; hence, after a kexec the
> > device could still perform a DMA transaction to a previously valid address
> > while the new kernel boots, corrupting kernel memory. The following patch was
> > proposed and fixed the issue, allowing 1000 kexecs to be executed successfully
> > with no issues observed:
> > 428c491332bc ("net: ena: Add PCI shutdown handler to allow safe kexec")
> > [ git.kernel.org/linus/428c491332bc ].
> >
> > * Hence, we are hereby requesting an SRU for this patch. It was tested
> > successfully on all supported series (4.4, 4.15 and 5.3) in Amazon Nitro
> > instances, and reviewed/acked by the ena driver team and a kexec developer
> > from another distro. It is worth mentioning that we proposed an upstream
> > multi-vendor discussion about this issue: marc.info/?l=kexec&m=158299605013194 .
> >
> > [Test case]
> >
> > * The basic test procedure consists of performing multiple kexecs sequentially.
> > AWS does not provide a full console, so in case of failure one can check
> > the instance screenshot, or use pstore/ramoops, which keeps dmesg in a
> > preserved memory area, to collect it after a crash. The commands used to
> > perform kexec are:
> >
> > kexec -l <kernel file> --initrd <initrd file> --reuse-cmdline
> > systemctl kexec
> >
> > Alternatively, one could use "--append=" instead of "--reuse-cmdline" if a
> > change in the kexec command line is desired; also, to execute the kexec-loaded
> > kernel, both "kexec -e" and "systemctl kexec" are equally valid.
> >
> > * On LP (comment 3) we proposed a script/approach to auto-test kexecs, used
> > here to perform 1000 kexecs with the proposed patch.
> >
> > [Regression Potential]
> >
> > * Although the patch proposed here introduces a PCI shutdown handler, it keeps
> > the remove handler identical and bases shutdown closely on ena_remove(),
> > changing only the netdev handling, following other upstream drivers. It was
> > extensively tested and presented no issues. Also, it's self-contained and
> > affects only one driver, so other cloud providers and non-cloud environments
> > wouldn't be affected by the patch at all.
> >
> > * A potential regression could manifest as a delay or issue on the
> > reboot/shutdown path, and only if the ena driver is in use.
> >
> > Guilherme G. Piccoli (1):
> >   net: ena: Add PCI shutdown handler to allow safe kexec
> >
> >  drivers/net/ethernet/amazon/ena/ena_netdev.c | 51 ++++++++++++++++----
> >  1 file changed, 41 insertions(+), 10 deletions(-)
>
> Makes sense to me. Good job!
>
> Acked-by: Andrea Righi <andrea.righi at canonical.com>
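
For reference, the sequential-kexec procedure from the test case could be automated roughly as below. This is only a minimal sketch and not the script from LP comment 3: the counter file location, the iteration limit, the Ubuntu-style /boot paths and the RUN_KEXEC_LOOP guard variable are all assumptions made for illustration. Only the "kexec -l ... --reuse-cmdline" and "systemctl kexec" commands come from the test case itself.

```shell
#!/bin/sh
# Sketch only, NOT the script from LP comment 3.
# Assumptions (hypothetical): COUNT_FILE location, MAX_ITERATIONS default,
# Ubuntu-style /boot paths, and the RUN_KEXEC_LOOP guard variable.

COUNT_FILE="${COUNT_FILE:-/var/tmp/kexec-count}"
MAX_ITERATIONS="${MAX_ITERATIONS:-1000}"
KERNEL="${KERNEL:-/boot/vmlinuz-$(uname -r)}"
INITRD="${INITRD:-/boot/initrd.img-$(uname -r)}"

# Increment the persistent boot counter and print the new value.
next_iteration() {
    count=0
    if [ -f "$COUNT_FILE" ]; then
        count=$(cat "$COUNT_FILE")
    fi
    count=$((count + 1))
    echo "$count" > "$COUNT_FILE"
    echo "$count"
}

# Print the load command from the test case for a given kernel and initrd.
kexec_load_cmd() {
    printf 'kexec -l %s --initrd %s --reuse-cmdline\n' "$1" "$2"
}

# Only act when explicitly enabled, so sourcing this file has no side effects.
if [ "${RUN_KEXEC_LOOP:-0}" = "1" ]; then
    iteration=$(next_iteration)
    if [ "$iteration" -le "$MAX_ITERATIONS" ]; then
        echo "kexec iteration $iteration of $MAX_ITERATIONS"
        # Load the kernel, then kexec into it through systemd.
        kexec -l "$KERNEL" --initrd "$INITRD" --reuse-cmdline && systemctl kexec
    else
        echo "completed $MAX_ITERATIONS kexecs"
    fi
fi
```

The script would have to be started once per boot with RUN_KEXEC_LOOP=1 (for instance from a boot-time service, which is left out here). A boot that hits the initrd corruption never reaches the next iteration, so the value left in the counter file shows how many kexecs were attempted before the failure.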


