DANGER!!! Problems with 10.04 installer (RAID devices *will* get corrupted)

J dreadpiratejeff at gmail.com
Tue Apr 27 02:22:46 UTC 2010


I haven't lost interest, I've just been out of town over the weekend.

On Sat, Apr 24, 2010 at 01:07, Alvin Thompson <alvin at thompsonlogic.com> wrote:
> Top-posting for (my) convenience on this one...
>
> It's nice to see someone actually trying to recreate the bug instead of
> just flapping their gums trying to sound smart.  If you have the time,
> could you try recreating it again?   I have some suggestions to make it
> more like my scenario (in fact, number 3 below is required for the
> problem to occur, and number 4 below is likely to be required).  I know
> these suggestions are long, but it would be appreciated.  In for a
> penny, in for a pound, eh?  You'll have to do the first install again,
> but you won't have to actually go through with the second install.  You
> can cancel after getting past the partitioning screen.  I've noticed
> that when things go awry there are two tell-tale signs:

Be warned, I have to do this in a virtualized environment, as I don't
have enough physical hardware to recreate this.  My only multi-disk
system is my home fileserver, and I don't really fancy tearing that
apart and rebuilding it...  so keep in mind that I'm doing this in VM
space.  Given that this is software RAID, that shouldn't matter, as
the OS doesn't know it's a VM.

Rest of my reply is inline:

> 1. On the partitioning screen, the original (now defunct) file system(s)
> will be detected and show up.
>
> 2. Once you select "finish partitioning", the installer will show a list
> of partition tables that will be modified.  One or more RAID partitions
> will show up on the list, regardless of the fact that you didn't
> select them for anything.
>
> If those signs are present, the RAID array will be hosed.  If the signs
> are not there, the install will go fine and there's no need to continue.
>  Additionally, you don't have to worry about test data or even mounting
> the RAID array.
>
> When doing the second install, if you post exactly which file systems
> were detected on the manual partitioning screen and which partitions
> were shown on the "to be modified" list once you hit "finish
> partitioning", I'd appreciate it.  Now on to the suggestions:
>
> 1. It sounds like in your setup you installed for the second time after
> setting up the RAID array, but before the array finished resyncing for
> the first time.  In my setup, the array had been around for a while and
> was fully resynced.  In fact, I (likely) waited for the array to be
> fully resynced before even installing XFS on it.  If you *did* wait for
> the drive to finish resyncing before the second install,  please RSVP
> because your array was indeed corrupted, but since it was only one drive
> the array was somehow able to resync and recover.

Yes... I rebuilt the entire thing from the ground up.  This time, when
I added three disks (instead of two), I used --force and waited for
the entire thing to sync and become active before putting an XFS
filesystem on it.  After that point, I mounted /dev/md0 and copied
about 2GB of data to the new array.
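
For reference, the rebuild went roughly like the commands below (a
sketch rather than a transcript; the mount point is just an example):

    # create the RAID5 array with --force so it is NOT built degraded
    mdadm --create /dev/md0 --force --level=5 --raid-devices=4 /dev/sd[abcd]2

    # once the initial sync finished: filesystem, mount, test data
    mkfs.xfs /dev/md0
    mount /dev/md0 /data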

> 2. I was using the ubuntu 10.04 beta 2 server 64-bit disk for the
> (second) install when things went south.  Could you try that one?

Sadly, I cannot.  I no longer have access to the Beta 2 ISOs (I'm not
even sure where they are online at this point).  However, that's not
necessarily a bad thing in and of itself: if this issue was resolved
by the RC (though the first time I tried this I DID use a Beta 2 ISO),
then it may be OK...  BUT, I do understand the need to recreate the
original failure as well, because using a later version of the code
introduces a new variable that was not present originally.

But, it is what it is.

> 3. REQUIRED.  It sounds like when doing the second install, you just
> installed to an existing partition.  In order for the problem to occur,
> you have to remove/create a partition (even though you're leaving the
> RAID partitions alone).  If you recreate the partitions I used (6,
> below), this will be taken care of.

Aye, that I did.  This time around, I followed what you outlined:
deleted the original sda1 and created a couple of new arrays for the
new 10.04 install.

> 4. POSSIBLY REQUIRED.  When creating the RAID array with the default
> options as you did in your first test, by default the array is created
> in degraded mode, with a drive added later and resynced. This makes the
> initial sync faster.  Since I'm a neurotic perfectionist, I always
> create my arrays with the much more manly and macho "--force" option to
> create them "properly".  It's very possible that doing the initial resync
> with a degraded array will overwrite the defunct file system, while
> doing the initial sync in the more "proper" way will not.  Please use
> the "--force" option when creating the array to take this possibility
> into account.

This time around, I created the first array from sd[abcd]2 (11.5GB
each) using --force.

I also waited until the array had fully built before putting an XFS
filesystem on it, mounting and loading with about 2GB of data.
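
By "fully built" I mean I waited for the initial resync to finish
before touching the array; roughly the usual checks (not a verbatim
log):

    # resync progress is visible here while the array is building
    cat /proc/mdstat

    # once it finishes, the array state reports clean/active with all
    # members in sync
    mdadm --detail /dev/md0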

> 5. My array has 4 drives.  If possible, could you scare up a fourth
> drive?  If not, don't worry about it.  Especially if you do number 7.

Did that... one 20GB disk as sda, three 20GB disks as sd[bcd].

> 6. Prior to/during the infamous install, my partitions were as follows.
>  If feasible, recreating as much of this as possible would be appreciated:
>
>    sd[abcd]1 25GB
>    sd[abcd]2 475GB

For this second attempt, I started with:

sda1 10GB
sda2 10GB

>    My RAID5 array was sd[abcd]2 set up as md1, and my file systems were:
>
>    sda1:     ext4    /
>    md1:      xfs     /data
>    sd[bcd]1: (partitioned, but not used)

I've matched that with 9.10:

sda1:      ext4  /
md0:       xfs   /data
sd[bcd]1:  partitioned but not used

>    On the manual partition screen of the ill-fated install, I left the
> sd[abcd]2 partitions alone (RAID array), deleted all the sd[abcd]1
> partitions, and created the following partitions:
>
>    sd[abcd]1:  22GB   RAID5   md2   /
>    sd[abcd]3*:  3GB   RAID5   md3   (swap)

I should add that I also set the sd[abcd]2 partition flags to raid.

>    * POSSIBLY REQUIRED:  Note that partition 2 of the drives' partition
> tables was already taken by the RAID, so I created the 3GB partitions as
> partition 3, even though the sectors in partition 3 physically resided
> before the sectors in partition 2.  This is perfectly "legal", if not
> "normal" (fdisk will warn you about this).  Please try to recreate this
> condition if you can, because it's very possible that was the source of
> the problems.
>
>    BTW, all partitions were primary partitions.
>
>    If you don't have that much space, you can likely get away with
> making sd[abcd]2 as small as needed.

So then I set about installing 10.04 RC, following what you posted above.

Once in the partitioner, this is what was listed:

RAID 5 device #0 - 34.4 GB Software RAID device
    #1                  34.4 GB     xfs
                          196.6 kB     unusable
SCSI3 (0,0,0) (sda) - 21.5 GB ATA VBOX HARDDISK
    #1    primary    10.0 GB   B    ext4
    #2    primary    11.5 GB        K    raid
SCSI4 (0,0,0) (sdb) - 21.5 GB ATA VBOX HARDDISK
    #1    primary    10.0 GB
    #2    primary    11.5 GB        K    raid
SCSI5 (0,0,0) (sdc) - 21.5 GB ATA VBOX HARDDISK
    #1    primary    10.0 GB
    #2    primary    11.5 GB        K    raid
SCSI6 (0,0,0) (sdd) - 21.5 GB ATA VBOX HARDDISK
    #1    primary    10.0 GB
    #2    primary    11.5 GB        K    raid

I deleted sda1 and created 2 new partitions from the 10GB of newly
freed space:

sda1 9GB
sda3 1GB

I set each new partition to "physical volume for RAID" and made each a
primary, so no logicals were involved here.
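
For anyone doing this outside the installer, the repartitioning would
look roughly like this with parted (my assumption; I actually did it
all through the installer's partitioner):

    # drop the old 10GB sda1 and split the freed space into 9GB + 1GB
    parted /dev/sda rm 1
    parted /dev/sda mkpart primary 0% 9GB
    parted /dev/sda mkpart primary 9GB 10GB

    # flag both new partitions as RAID members; they come out as #1 and #3
    # because #2 is still occupied by the existing array member
    parted /dev/sda set 1 raid on
    parted /dev/sda set 3 raid on

(and the same on sdb, sdc, and sdd, as shown in the layout below)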

Now, my partitioning looked like this:

RAID 5 device #0 - 34.4 GB Software RAID device
    #1                  34.4 GB     xfs
                          196.6 kB     unusable
SCSI3 (0,0,0) (sda) - 21.5 GB ATA VBOX HARDDISK
    #1    primary      9.0 GB        K    raid
    #3    primary      1.0 GB        K    raid
    #2    primary    11.5 GB        K    raid
SCSI4 (0,0,0) (sdb) - 21.5 GB ATA VBOX HARDDISK
    #1    primary      9.0 GB        K    raid
    #3    primary      1.0 GB        K    raid
    #2    primary    11.5 GB        K    raid
SCSI5 (0,0,0) (sdc) - 21.5 GB ATA VBOX HARDDISK
    #1    primary      9.0 GB        K    raid
    #3    primary      1.0 GB        K    raid
    #2    primary    11.5 GB        K    raid
SCSI6 (0,0,0) (sdd) - 21.5 GB ATA VBOX HARDDISK
    #1    primary      9.0 GB        K    raid
    #3    primary      1.0 GB        K    raid
    #2    primary    11.5 GB        K    raid

I then set about creating md1 and md2:

md1: sd[abcd]1 (note that during RAID configuration, NONE of the
existing partitions belonging to md0 were listed as available; the
installer picked up the metadata and maintained the array, or at least
it appeared that way at this point)
md2: sd[abcd]3

md1 I set up as / using ext4, and md2 I set up as swap (also note that
unless I specifically state a setting, everything else was left at the
defaults).

So now I had this:

RAID 5 device #0 - 34.4 GB Software RAID device
    #1                  34.4 GB     xfs
                        196.6 kB      unusable
RAID 5 device #1 - 27.0 GB Software RAID device
    #1                  27.0 GB   f ext4        /
                        196.6 kB      unusable
RAID 5 device #2 - 3.0 GB Software RAID device
    #1                    3.0 GB   f swap      swap
                        196.6 kB      unusable

At that, I selected "write changes to disk", and the next screen showed:

The partition tables of the following devices are changed:
    RAID device #1
    RAID device #2

The following partitions are going to be formatted:
    partition #1 of RAID 5 device #1 as ext4
    partition #1 of RAID 5 device #2 as swap

I hit yes and off she went.

At no point did I see the two tell-tale failure signs you mentioned
above, but I let the install continue to see what would happen after
installation and reboot.

On reboot, after logging in, mdadm --detail /dev/md0 showed my
original RAID 5 array active and in good health.  Mounting the device
showed that the data copied into it prior to the 10.04 install was
still there and likewise in good health.
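
In case it's useful, the post-reboot check amounted to roughly this
(the mount point is just an example):

    # array state and member list -- all four devices showed as active
    # and in sync
    mdadm --detail /dev/md0

    # mount it and confirm the test data is still readable
    mount /dev/md0 /mnt
    ls -l /mnt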

> 7. I simplified when recreating this bug, but in my original scenario I
> had 2 defunct file systems detected by the installer: one on sda2 and
> one on sdd2 (both ext4).  That's why I couldn't just fail and remove the
> corrupted drive even if I had known to do so at that point.  I figure
> the more defunct file systems there are, the more chances you have of
> recreating the bug.  So how about creating file systems on all four
> partitions (sd[abcd]2) before creating the RAID array?

I may be willing to do this again, but it'll be a while as I've
neglected a lot of work today to try to recreate this (I'm writing
this after my second attempt today).

> 8. My original setup left the RAID partitions' type as "linux" instead
> of "RAID autodetect".  It's no longer necessary to set the partition
> type for RAID members, as the presence of the RAID superblock is enough.
>  When recreating the problem I did set the type to "RAID autodetect",
> but to be thorough, try leaving the type as "linux".

On my first attempt, I left them as linux and did not see the issue.
On this attempt, I set them all to RAID as I created them and, again,
saw no issue.
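
(For reference, setting the type by hand is just the usual fdisk 't'
followed by type 'fd' (Linux raid autodetect), or with parted something
like the line below; as far as I can tell the installer's "physical
volume for RAID" option ends up doing the equivalent.)

    parted /dev/sdb set 2 raid on    # flag an existing partition as a RAID member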

> 9. If you *really* have too much time on your hands, my original ubuntu
> install, used for creating the original file systems, was 8.10 desktop
> 64 bit.  I created the non-RAID file systems during the install and the
> RAID array after the install, after apt-getting mdadm.  I seriously doubt
> this makes a difference though.

> 10. I was using an external USB DVD-ROM drive to do the install.  It's
> very remotely possible that, since the drive has to be re-detected during
> the install process, it could wind up reshuffling the device letters.  If
> you have an external CD or DVD drive, could you try installing with it?

Via the power of VMs, everything has been installed from a virtual
CD-ROM; however, I could try using an external DVD-ROM passed through
to the VM...  again, see my previous comments regarding time.

> If you (or anybody) can try recreating the problem with this new
> information I'd very much appreciate it.
>
> Thanks,
> Alvin



