[PATCH 0/1] Multiple kexecs in AWS nitro instances fail

Guilherme G. Piccoli gpiccoli at canonical.com
Wed Apr 1 21:40:25 UTC 2020


BugLink: https://bugs.launchpad.net/bugs/1869948


[Impact]

* Currently, users cannot perform multiple kernel kexec loads on AWS Nitro
instances (KVM-based); after the 2nd or 3rd kexec, an initrd corruption is
observed, with the following signature:

Initramfs unpacking failed: junk within compressed archive
[...]
Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.5.0-rc7-gpiccoli+ #26
Hardware name: Amazon EC2 t3.large/, BIOS 1.0 10/16/2017
Call Trace:
  dump_stack+0x6d/0x9a
  ? csum_partial_copy_generic+0x150/0x170
  panic+0x101/0x2e3
  ? do_execve+0x25/0x30
  ? rest_init+0xb0/0xb0
  kernel_init+0xfb/0x100
  ret_from_fork+0x35/0x40

* After investigation (see LP comment 2), we noticed that the Amazon ena network
driver doesn't provide a shutdown() handler; hence the device could still be
performing a DMA transaction to a previously valid address while the new kernel
boots, which would then corrupt kernel memory. The following patch was proposed
and fixed the issue, allowing 1000 kexecs to be executed successfully with no
issues observed:
428c491332bc ("net: ena: Add PCI shutdown handler to allow safe kexec")
[ git.kernel.org/linus/428c491332bc ].

* Hence, we are hereby requesting SRU for this patch. It was tested successfully
in all supported series (4.4, 4.15 and 5.3) on Amazon Nitro instances, and was
reviewed/acked by the ena driver team and by a kexec developer from another
distro. It is worth mentioning that we also proposed an upstream multi-vendor
discussion about this issue: marc.info/?l=kexec&m=158299605013194 .

[Test case]

* The basic test procedure consists of performing multiple kexecs sequentially.
AWS does not provide a full console, so in case of failure one could check
the instance screenshot, or use pstore/ramoops to preserve dmesg in a reserved
memory area and collect it after a crash. The commands used to perform kexec are:

kexec -l <kernel file> --initrd <initrd file> --reuse-cmdline
systemctl kexec

Alternatively, one could use "--append=" instead of "--reuse-cmdline" if a
change in the kexec command-line is desired; also, to execute the kexec-loaded
kernel, both "kexec -e" and "systemctl kexec" are equally valid.

* On LP (comment 3) we proposed a script/approach to auto-test kexecs, used
here to perform 1000 kexecs with the proposed patch.
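
For illustration only, a repeated-kexec loop could be sketched as below. This
is NOT the script from LP comment 3; it is a minimal sketch under assumptions.
COUNT_FILE, MAX_KEXECS and RUN_KEXEC are hypothetical names, the /boot paths
assume the usual Ubuntu file naming, and the script is assumed to be started
on every boot by some mechanism not shown here (e.g. a systemd unit).

```shell
#!/bin/sh
# Hypothetical sketch of a kexec loop; not the script from LP comment 3.
# COUNT_FILE, MAX_KEXECS and RUN_KEXEC are illustrative names.
COUNT_FILE="${COUNT_FILE:-/var/tmp/kexec-count}"
MAX_KEXECS="${MAX_KEXECS:-1000}"

# Print the number of kexecs recorded so far (0 if no counter exists yet).
read_count() {
    if [ -f "$COUNT_FILE" ]; then
        cat "$COUNT_FILE"
    else
        echo 0
    fi
}

# Record one more kexec on disk and print the new total.
bump_count() {
    n=$(( $(read_count) + 1 ))
    echo "$n" > "$COUNT_FILE"
    echo "$n"
}

# Succeed while fewer than MAX_KEXECS kexecs have been recorded.
should_continue() {
    [ "$(read_count)" -lt "$MAX_KEXECS" ]
}

# Only act when RUN_KEXEC=1, so the helpers above can be sourced safely.
if [ "${RUN_KEXEC:-0}" = "1" ] && should_continue; then
    bump_count > /dev/null
    # Assumed Ubuntu-style paths for the running kernel and its initrd.
    kexec -l "/boot/vmlinuz-$(uname -r)" \
          --initrd "/boot/initrd.img-$(uname -r)" --reuse-cmdline
    sync    # flush the counter to disk before jumping to the new kernel
    systemctl kexec
fi
```

The counter is kept on disk because nothing in memory survives a kexec; a
run that stops before MAX_KEXECS is reached indicates a failed boot, to be
investigated through the instance screenshot or pstore/ramoops.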

[Regression Potential]

* Although the patch proposed here introduces a PCI shutdown handler, it keeps
the remove handler identical and bases shutdown strongly on ena_remove(),
changing just the netdev handling, following other upstream drivers. It was
extensively tested and presented no issue. Also, it's self-contained and
affects only one driver, so other cloud providers or non-cloud environments
would not be affected by the patch at all.

* In case of a potential regression, it could manifest as a delay or issue
on the reboot/shutdown path, and only if the ena driver is in use.
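
To check whether a given system is exposed at all, one could look at which
driver is bound to each network interface through sysfs. The helper below is
a minimal sketch, not part of the patch; the function names and the SYSFS_NET
override are hypothetical and exist only for illustration and testing.

```shell
#!/bin/sh
# Print the kernel driver bound to each network interface, one
# "iface driver" pair per line. Interfaces without a backing device
# (e.g. lo) are skipped. SYSFS_NET is a hypothetical override of the
# sysfs directory, used to make the helper testable.
list_net_drivers() {
    root="${SYSFS_NET:-/sys/class/net}"
    for dev in "$root"/*; do
        [ -L "$dev/device/driver" ] || continue
        printf '%s %s\n' "$(basename "$dev")" \
            "$(basename "$(readlink "$dev/device/driver")")"
    done
}

# Succeed if at least one interface is driven by ena.
uses_ena() {
    list_net_drivers | grep -q ' ena$'
}
```

For instance, "uses_ena && echo affected" prints "affected" only on systems
where some interface is bound to the ena driver.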

Guilherme G. Piccoli (1):
  net: ena: Add PCI shutdown handler to allow safe kexec

 drivers/net/ethernet/amazon/ena/ena_netdev.c | 51 ++++++++++++++++----
 1 file changed, 41 insertions(+), 10 deletions(-)

-- 
2.25.2



