APPLIED(X,B,E)/cmt: [PATCH 0/1] Multiple kexecs in AWS nitro instances fail

Khaled Elmously khalid.elmously at canonical.com
Sat Apr 4 00:00:21 UTC 2020


Ack Guilherme - there are no 5.0 AWS kernels so not applying this to Disco. Thanks


On 2020-04-03 08:39:52 , Guilherme Piccoli wrote:
> Hi Khaled, thanks for raising this flag!
> 
> I understand Disco is EOL but kernel 5.0 for some flavors will still
> be released - this patch is only really useful on aws (ena driver is
> present there), so if kernel 5.0 will get released to -aws flavor, I'd
> say please apply in 5.0 too. Otherwise, I don't see a strong reason
> for it.
> 
> Cheers,
> 
> 
> Guilherme
> 
> On Thu, Apr 2, 2020 at 11:36 PM Khaled Elmously
> <khalid.elmously at canonical.com> wrote:
> >
> > Applied to X, B and E.
> >
> > Guilherme - does this need to go in Disco as well? Disco is EOL but there are still several 5.0 derivatives.
> > For what it's worth, the Eoan/Focal version of the patch applies cleanly to 5.0
> >
> > Thanks
> >
> >
> >
> > On 2020-04-01 18:40:25 , Guilherme G. Piccoli wrote:
> > > BugLink: https://bugs.launchpad.net/bugs/1869948
> > >
> > >
> > > [Impact]
> > >
> > > * Currently, users cannot perform multiple kernel kexec loads on AWS Nitro
> > > instances (KVM-based); after the 2nd or 3rd kexec, an initrd corruption is
> > > observed, with the following signature:
> > >
> > > Initramfs unpacking failed: junk within compressed archive
> > > [...]
> > > Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> > > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.5.0-rc7-gpiccoli+ #26 Hardware name: Amazon EC2 t3.large/, BIOS 1.0 10/16/2017
> > > Call Trace:
> > >   dump_stack+0x6d/0x9a
> > >   ? csum_partial_copy_generic+0x150/0x170
> > >   panic+0x101/0x2e3
> > >   ? do_execve+0x25/0x30
> > >   ? rest_init+0xb0/0xb0
> > >   kernel_init+0xfb/0x100
> > >   ret_from_fork+0x35/0x40
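
A minimal sketch of checking a saved kernel log for the failure signature quoted above. The function name has_initrd_corruption is hypothetical, and the log path is whatever dump was collected (for example a pstore/ramoops file); nothing here comes from the patch or the LP bug.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch or the LP bug):
# succeed if a saved dmesg dump contains either line of the
# initrd-corruption signature quoted in the report.
has_initrd_corruption() {
    # $1: path to a saved kernel log
    grep -q -e 'Initramfs unpacking failed' \
            -e 'No working init found' "$1"
}
```

On a system with ramoops configured, the dumps normally appear under /sys/fs/pstore after the next successful boot, so the helper could be run over each file found there.
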
> > >
> > > * After investigation (see LP comment 2), it was noticed the Amazon ena network
> > > driver doesn't provide a shutdown() handler, hence it could be performing a DMA
> > > > transaction to a previously valid address during boot, which would then corrupt
> > > > kernel memory. The following patch was proposed and fixed the issue, allowing
> > > 1000 kexecs to be executed successfully with no issues observed:
> > > 428c491332bc ("net: ena: Add PCI shutdown handler to allow safe kexec")
> > > [ git.kernel.org/linus/428c491332bc ].
> > >
> > > * Hence, we are hereby requesting SRU for this patch. It was tested in all
> > > supported series (4.4, 4.15 and 5.3) in Amazon Nitro instances with success,
> > > > and reviewed/acked by the ena driver team and a kexec developer from another
> > > > distro. It is worth mentioning that we proposed an upstream multi-vendor
> > > > discussion about this issue: marc.info/?l=kexec&m=158299605013194 .
> > >
> > > [Test case]
> > >
> > > > * The basic test procedure consists of performing multiple kexecs sequentially;
> > > AWS does not provide a full console, so in case of failures one could check
> > > the instance screenshot or use pstore/ramoops in order to collect dmesg after
> > > a crash in a preserved memory area. The commands used to perform kexec are:
> > >
> > > kexec -l <kernel file> --initrd <initrd file> --reuse-cmdline
> > > systemctl kexec
> > >
> > > > Alternatively, one could use "--append=" instead of "--reuse-cmdline" if a
> > > change in kexec command-line is desired; also, to execute the kexec-loaded
> > > kernel both "kexec -e" and "systemctl kexec" are equally valid.
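
The choice between "--reuse-cmdline" and "--append=" described above could be sketched as a small dry-run helper. The function name kexec_load_cmd is hypothetical; it only prints the load command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do not execute) the kexec load
# command for a kernel/initrd pair.
#   $1: kernel file   $2: initrd file   $3: optional new command line
# With a third argument the command uses --append=; without one it
# reuses the command line of the running kernel.
kexec_load_cmd() {
    kernel=$1
    initrd=$2
    if [ "$#" -ge 3 ]; then
        printf 'kexec -l %s --initrd %s --append="%s"\n' \
            "$kernel" "$initrd" "$3"
    else
        printf 'kexec -l %s --initrd %s --reuse-cmdline\n' \
            "$kernel" "$initrd"
    fi
}
```

After running the printed command, either "kexec -e" or "systemctl kexec" boots the loaded kernel, as noted above.
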
> > >
> > > * On LP (comment 3) we proposed a script/approach to auto-test kexecs, used
> > > here to perform 1000 kexecs with the proposed patch.
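
This is not the script from LP comment 3, only a rough sketch of how a repeated-kexec run could keep count across boots. The function name bump_kexec_count and the counter path in the comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch (not the LP script): keep a persistent count of
# kexecs across boots and report whether another one should be done.
#   $1: counter file   $2: maximum number of kexecs
# Returns success while the incremented count is within the limit.
bump_kexec_count() {
    count=0
    if [ -f "$1" ]; then
        count=$(cat "$1")
    fi
    count=$(( ${count:-0} + 1 ))
    echo "$count" > "$1"
    [ "$count" -le "$2" ]
}

# Possible use from a boot-time job (paths are assumptions):
#   if bump_kexec_count /var/lib/kexec-test/count 1000; then
#       kexec -l <kernel file> --initrd <initrd file> --reuse-cmdline
#       systemctl kexec
#   fi
```

The counter file has to live on persistent storage, since everything in memory is lost on each kexec.
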
> > >
> > > [Regression Potential]
> > >
> > > > * Although the patch proposed here introduces a PCI shutdown handler, it keeps
> > > > the remove handler identical and bases shutdown strongly on ena_remove(),
> > > > changing just the netdev handling, following other upstream drivers. It was
> > > > extensively tested and presented no issues. Also, it's self-contained and
> > > > affects only one driver, so other cloud providers and non-cloud environments
> > > > wouldn't even be affected by the patch.
> > >
> > > > * In case of a potential regression, it could manifest as a delay or issue
> > > > on the reboot/shutdown path, and only if the ena driver is in use.
> > >
> > > Guilherme G. Piccoli (1):
> > >   net: ena: Add PCI shutdown handler to allow safe kexec
> > >
> > >  drivers/net/ethernet/amazon/ena/ena_netdev.c | 51 ++++++++++++++++----
> > >  1 file changed, 41 insertions(+), 10 deletions(-)
> > >
> > > --
> > > 2.25.2
> > >
> > >
> > > --
> > > kernel-team mailing list
> > > kernel-team at lists.ubuntu.com
> > > https://lists.ubuntu.com/mailman/listinfo/kernel-team


