[Bug 1836429] [NEW] friendly-recovery generates a bad grub.cfg in a narrow set of conditions

Fri Jul 12 23:45:15 UTC 2019

Public bug reported:

friendly-recovery (on 16.04 at least) runs update-grub in its postinst.
If friendly-recovery and a linux-image package get upgraded within the
same apt command (in my case linux-image-4.4.0-154-generic and
friendly-recovery 0.2.31ubuntu2.1, which landed a few weeks apart) then
that update-grub run finds the new kernel but *no* new initrd, because
the initrd isn't built until the /etc/kernel/postinst.d hooks run.  The
resulting grub.cfg is syntactically valid and passes grub-script-check,
but its first menuentry has only a "linux" line for the 154 kernel and
no "initrd" line at all, so the machine boots into

   Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

instead of a more graceful/less frightening grub error.
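
A rough way to spot this condition in an existing grub.cfg is to look
for menuentry blocks that load a kernel but never load an initrd.  The
sketch below is only a heuristic under a few assumptions of mine: the
function name is made up, it only treats "linux" lines whose path
contains "vmlinuz" as kernels (so memtest-style entries are ignored),
and it assumes each menuentry ends with a line containing only "}".  A
kernel with everything built in can legitimately boot without an
initrd, so a hit here is a warning, not proof.

```python
import re

def entries_missing_initrd(cfg_text):
    """Return titles of menuentry blocks with a vmlinuz 'linux' line but no 'initrd' line."""
    missing = []
    title = None  # title of the menuentry currently being scanned, if any
    has_linux = has_initrd = False
    for raw in cfg_text.splitlines():
        line = raw.strip()
        start = re.match(r"menuentry\s+(['\"])(.*?)\1", line)
        if start:
            title, has_linux, has_initrd = start.group(2), False, False
        elif title is not None:
            # Also accept the linux16/linuxefi and initrd16/initrdefi variants.
            if re.match(r"linux(16|efi)?\s+\S*vmlinuz", line):
                has_linux = True
            elif re.match(r"initrd(16|efi)?\s", line):
                has_initrd = True
            elif line == "}":
                # End of the menuentry: report it if it loads a kernel only.
                if has_linux and not has_initrd:
                    missing.append(title)
                title = None
    return missing
```

Calling entries_missing_initrd(open("/boot/grub/grub.cfg").read()) on
an affected machine should list the broken entry; an empty list means
nothing suspicious was found by this particular check.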

Under normal conditions this is hard to notice - as long as the rest
of the "dist-upgrade" runs to completion and the initrd actually gets
built, the kernel package hooks make sure update-grub gets run again,
and this time it finds the initrd and produces a bootable config.  This
is the "narrow" window - any crash after the bad grub.cfg gets written
and before it gets regenerated leads to a machine that panics on boot,
though you can extend the window arbitrarily simply by rebooting at the
right time.
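
To tell whether a machine is currently inside that window, one option
is to compare the kernels and initrds present in /boot.  This is a
minimal sketch, assuming the usual Ubuntu naming (vmlinuz-VERSION and
initrd.img-VERSION, plus an optional ".efi.signed" suffix on signed
kernels); the function name is my own.

```python
def kernels_without_initrd(filenames):
    """Return sorted kernel versions that have a vmlinuz file but no initrd.img file."""
    names = set(filenames)
    versions = set()
    for name in names:
        if name.startswith("vmlinuz-"):
            version = name[len("vmlinuz-"):]
            # Signed kernels share the initrd of the unsigned one.
            if version.endswith(".efi.signed"):
                version = version[:-len(".efi.signed")]
            versions.add(version)
    return sorted(v for v in versions if "initrd.img-" + v not in names)
```

Running it as kernels_without_initrd(os.listdir("/boot")) (after
"import os") in the middle of an upgrade would be expected to report
the new kernel until its initrd has been built.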

I caught it because I was also upgrading a package of my own that had
a buggy /etc/grub.d plugin.  When the kernel postinst.d hooks ran
update-grub, it generated a *syntactically* invalid grub.cfg, which
update-grub discarded.  That did *not* fail the dist-upgrade; it just
produced a bit more output among everything else apt prints
(fortunately for debugging, this was visible in /var/log/apt/term.log
later on).  Normally this discard mechanism saves the day, because any
previous grub.cfg can be assumed bootable, but in this case the
previous grub.cfg was the bad one, so the discard just forced the
window for this bug entirely open.

Is this obscure and hard to trigger? Yes.
Can the end user recover from it? Yes, as long as they're actually in
front of the machine to select an alternate grub menu item and still
have an older kernel installed.  That is likely, since the bug needs a
linux-image package upgrade to trigger; if the linux-image upgrade
happened in an earlier apt command, friendly-recovery.postinst finds an
initrd already in place.

I don't actually have advice on fixing this.  The obvious choices would
be the kernel-postinst mechanism, which isn't really available for
other packages to trigger, or the direct update-grub dpkg-trigger,
which went away after grub-legacy was replaced with grub2.  Still, I
think "never write an invalid grub.cfg" is a reasonable rule...
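
As an illustration of that rule (not a proposed patch, and not how
update-grub is actually implemented), the idea is to run every check
against the candidate config before it replaces the old one.  In this
sketch the function name and the "validators" parameter are
hypothetical: each validator is any callable that takes the candidate
text and returns a list of problems, for example a wrapper around
grub-script-check plus a check for kernel entries without an initrd.

```python
import os
import tempfile

def replace_if_valid(path, new_text, validators):
    """Replace path with new_text only if no validator reports a problem.

    Returns the list of problems; an empty list means the file was replaced.
    """
    problems = [p for check in validators for p in check(new_text)]
    if problems:
        return problems  # the previous file is left untouched
    directory = os.path.dirname(os.path.abspath(path))
    # Write next to the target so the final rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".grub.cfg.")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(new_text)
        os.replace(tmp, path)  # atomic swap: readers see the old or the new file
    except BaseException:
        os.unlink(tmp)
        raise
    return []
```

The point of the design is that a candidate which fails any check never
touches the existing file, so the "previous grub.cfg can be assumed
bootable" assumption only holds if the checks cover bootability and not
just syntax.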

** Affects: friendly-recovery (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to friendly-recovery in Ubuntu.
https://bugs.launchpad.net/bugs/1836429
