Software RAID and races in the boot process
Marius Gedminas
marius at pov.lt
Thu Dec 23 13:42:39 UTC 2004
Hi,
I am trying to configure Ubuntu with software RAID-1 on a server that is
about 1000 km from my physical location.  Here's what the setup looks
like (rough mdadm commands for building the arrays are sketched after
the list):
* two SATA disks (/dev/sda and /dev/sdb) with identical partition
tables (I did sfdisk -d /dev/sda | sfdisk /dev/sdb)
* /dev/sda1 is a regular 2 GB partition containing a custom (i.e. no
desktop) Ubuntu installation.  There's no RAID here; this partition is
left as a backup for recovery if the main setup gets fubared.
* /dev/sda2 and /dev/sdb2 comprise /dev/md0 which is the root.
* /dev/sda5 and /dev/sdb5 comprise /dev/md1 which is mounted on /home
* /dev/sda6 and /dev/sdb6 comprise /dev/md2 which is mounted on /var
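For the record, arrays with this layout can be created with something
along the following lines (quoting from memory, so treat it as a sketch
rather than the exact commands I ran):

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda5 /dev/sdb5
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda6 /dev/sdb6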
The MBRs of both disks contain the boot record from the 'mbr' package.
/dev/sda1 contains GRUB, which boots into the recovery partition (N.B.
LILO did not work here for obscure reasons).  /dev/md0 (i.e. both
/dev/sda2 and /dev/sdb2) contains LILO, which boots the system from
RAID.  I used LILO because GRUB claims not to support RAID-1.  /dev/sda2
and /dev/sdb2 are the only partitions marked as bootable.
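For completeness, the relevant part of lilo.conf for booting off the
mirror looks roughly like this (again from memory; there is also a
raid-extra-boot option that controls where the per-disk boot records end
up, whose exact value I won't swear to):

    boot=/dev/md0
    root=/dev/md0
    image=/vmlinuz
        label=Linux
        initrd=/initrd.img
        read-only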
The system almost works: the BIOS starts the MBR, which loads LILO from
/dev/sda2.  LILO loads the kernel and initrd.  Thanks to judicious use
of dpkg-reconfigure linux-image-$(uname -r), the scripts in the initrd
load raid1.ko and start up /dev/md0 (known as /devfs/md/0 at that point)
with mdadm.  The real root filesystem (/dev/md0) is then mounted,
checked, remounted read-write, etc.
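For reference, a minimal /etc/mdadm/mdadm.conf describing this layout
would look something like the following (mine may differ in detail,
e.g. by using UUIDs in the ARRAY lines):

    DEVICE /dev/sda* /dev/sdb*
    ARRAY /dev/md0 devices=/dev/sda2,/dev/sdb2
    ARRAY /dev/md1 devices=/dev/sda5,/dev/sdb5
    ARRAY /dev/md2 devices=/dev/sda6,/dev/sdb6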
PROBLEM: the boot process stops in S30checkfs.sh with fsck.ext3 claiming
that /dev/md1 and /dev/md2 do not exist.  When someone on site walks up
to the console and presses ^D, the system continues to boot and comes up
normally.  At that point I can ssh into the system and see that /dev/md1
and /dev/md2 do exist and, what's more, are actually mounted.
I suspect that there is a race condition: /etc/rcS.d/S25mdadm-raid
starts up the RAID devices, but udev creates the corresponding device
nodes a little too late, so fsck fails while the subsequent mount
succeeds.
I have tried to reproduce the setup on a machine that I have right here
in the office.  It is a much older and slower server (dual 233 MHz P2
rather than a 2.8 GHz P4).  It does not fail in fsck, but S25mdadm-raid
prints a couple of interesting error messages:
* Starting RAID devices... [done]
mdadm: error opening /dev/md1?: No such file or directory
mdadm: error opening /dev/md1?: No such file or directory
* Setting up LVM volume groups... [done]
...
I suspect that this is a symptom of the same problem, even though
/dev/md1 is apparently created soon enough there for fsck to succeed.
Can someone who understands udev and related issues tell me whether my
suspicion about the race condition is plausible? Should I file a bug in
bugzilla.ubuntu.com, and if so, for what package?
In the meantime I will either disable fsck by changing the last column
in /etc/fstab from 2 to 0, or try adding a 'sleep 5' after mdadm starts
but before fsck runs, or both.
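For clarity, the fstab change just means setting the sixth (fsck pass)
field to 0, i.e. something like

    /dev/md1  /home  ext3  defaults  0  0
    /dev/md2  /var   ext3  defaults  0  0

with whatever mount options are really there in place of 'defaults'.
And instead of a plain sleep, a slightly less crude hack might be a
small loop that waits for the device nodes to show up before checkfs
runs (untested sketch):

    # wait up to ~10 seconds for udev to create the md device nodes
    for i in 1 2 3 4 5 6 7 8 9 10; do
        [ -e /dev/md1 ] && [ -e /dev/md2 ] && break
        sleep 1
    done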
Marius Gedminas
--
"I may not understand what I'm installing, but that's not my job. I
just need to click Next, Next, Finish here so I can walk to the next
system and repeat the process"
-- Anonymous NT Admin