DANGER!!! Problems with 10.04 installer (RAID devices *will* get corrupted)

Alvin Thompson alvin at thompsonlogic.com
Sat Apr 24 05:07:00 UTC 2010


Top-posting for (my) convenience on this one...

It's nice to see someone actually trying to recreate the bug instead of 
just flapping their gums trying to sound smart.  If you have the time, 
could you try recreating it again?   I have some suggestions to make it 
more like my scenario (in fact, number 3 below is required for the 
problem to occur, and number 4 below is likely to be required).  I know 
these suggestions are long, but it would be appreciated.  In for a 
penny, in for a pound, eh?  You'll have to do the first install again, 
but you won't have to actually go through with the second install.  You 
can cancel after getting past the partitioning screen.  I've noticed 
that when things go awry there are two tell-tale signs:

1. On the partitioning screen, the original (now defunct) file system(s) 
will be detected and show up.

2. Once you select "finish partitioning", the installer will show a list 
of partition tables that will be modified.  One or more RAID partitions 
will show up on the list, regardless of the fact that you didn't 
select them for anything.

If those signs are present, the RAID array will be hosed.  If the signs 
are not there, the install will go fine and there's no need to continue. 
  Additionally, you don't have to worry about test data or even mounting 
the RAID array.
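
To make the two signs concrete, here is a minimal sketch of the check in Python.  The function name and its arguments are made up for illustration, and it assumes the defunct file systems show up on RAID member partitions; you would type in what the partitioner shows you by hand.

```python
def tell_tale_signs(detected_fs, to_be_modified, raid_members, selected):
    """Return the partitions matching each warning sign (hypothetical helper).

    detected_fs:    dict of partition name -> file system type shown on
                    the manual partitioning screen
    to_be_modified: partitions listed after "finish partitioning"
    raid_members:   partitions belonging to the existing RAID array
    selected:       partitions you deliberately changed
    """
    members = set(raid_members)
    chosen = set(selected)
    # Sign 1: a (defunct) file system is detected on a RAID member.
    stale = sorted(p for p in detected_fs if p in members)
    # Sign 2: a RAID member is on the "to be modified" list even though
    # it was never selected for anything.
    unexpected = sorted(p for p in to_be_modified
                        if p in members and p not in chosen)
    return stale, unexpected


# Example values only, not output copied from a real install.
stale, unexpected = tell_tale_signs(
    detected_fs={"sda1": "ext4", "sda2": "ext4", "sdd2": "ext4"},
    to_be_modified=["sda1", "sda2"],
    raid_members=["sda2", "sdb2", "sdc2", "sdd2"],
    selected=["sda1"],
)
print(stale, unexpected)  # ['sda2', 'sdd2'] ['sda2']
```

If either list comes back non-empty, cancel the install.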

When doing the second install, if you post exactly which file systems 
were detected on the manual partitioning screen and which partitions 
were shown on the "to be modified" list once you hit "finish 
partitioning", I'd appreciate it.  Now on to the suggestions:

1. It sounds like in your setup you installed for the second time after 
setting up the RAID array, but before the array finished resyncing for 
the first time.  In my setup, the array had been around for a while and 
was fully resynced.  In fact, I (likely) waited for the array to be 
fully resynced before even installing XFS on it.  If you *did* wait for 
the array to finish resyncing before the second install, please let me 
know, because that would mean your array was indeed corrupted but, since 
only one drive was affected, was somehow able to resync and recover.

2. I was using the Ubuntu 10.04 beta 2 server 64-bit disk for the 
(second) install when things went south.  Could you try that one?

3. REQUIRED.  It sounds like when doing the second install, you just 
installed to an existing partition.  In order for the problem to occur, 
you have to remove/create a partition (even though you're leaving the 
RAID partitions alone).  If you recreate the partitions I used (6, 
below), this will be taken care of.

4. POSSIBLY REQUIRED.  When a RAID5 array is created with the default 
options, as in your first test, it is created in degraded mode, with one 
drive added afterwards and resynced.  This makes the initial sync 
faster.  Since I'm a neurotic perfectionist, I always create my arrays 
with the much more manly and macho "--force" option to create them 
"properly".  It's very possible that doing the initial resync with a 
degraded array will overwrite the defunct file system, while doing the 
initial sync in the more "proper" way will not.  Please use the 
"--force" option when creating the array to take this possibility into 
account.

5. My array has 4 drives.  If possible, could you scare up a fourth 
drive?  If not, don't worry about it, especially if you do number 7.

6. Prior to/during the infamous install, my partitions were as follows. 
  If feasible, recreating as much of this as possible would be appreciated:

    sd[abcd]1 25GB
    sd[abcd]2 475GB

    My RAID5 array was sd[abcd]2 set up as md1, and my file systems were:

    sda1:     ext4    /
    md1:      xfs     /data
    sd[bcd]1: (partitioned, but not used)

    Note I had no swap partition originally.

    On the manual partition screen of the ill-fated install, I left the 
sd[abcd]2 partitions alone (RAID array), deleted all the sd[abcd]1 
partitions, and created the following partitions:

    sd[abcd]1:  22GB   RAID5   md2   /
    sd[abcd]3*:  3GB   RAID5   md3   (swap)

    * POSSIBLY REQUIRED:  Note that partition 2 of the drives' partition 
tables was already taken by the RAID, so I created the 3GB partitions as 
partition 3, even though the sectors in partition 3 physically resided 
before the sectors in partition 2.  This is perfectly "legal", if not 
"normal" (fdisk will warn you about this).  Please try to recreate this 
condition if you can, because it's very possible that was the source of 
the problems.

    BTW, all partitions were primary partitions.

    If you don't have that much space, you can likely get away with 
making sd[abcd]2 as small as needed.

7. I simplified when recreating this bug, but in my original scenario I 
had 2 defunct file systems detected by the installer: one on sda2 and 
one on sdd2 (both ext4).  That's why I couldn't just fail and remove the 
corrupted drive even if I had known to do so at that point.  I figure 
the more defunct file systems there are, the more chances you have of 
recreating the bug.  So how about creating file systems on all four 
partitions (sd[abcd]2) before creating the RAID array?

8. My original setup left the RAID partitions' type as "linux" instead 
of "RAID autodetect".  It's no longer necessary to set the partition 
type for RAID members, as the presence of the RAID superblock is enough. 
  When recreating the problem I did set the type to "RAID autodetect", 
but to be thorough, try leaving the type as "linux".

9. If you *really* have too much time on your hands, my original Ubuntu 
install, used for creating the original file systems, was 8.10 desktop 
64-bit.  I created the non-RAID partitions during the install and the 
RAID array after the install, after apt-getting mdadm.  I seriously 
doubt this makes a difference though.

10. I was using an external USB DVD-ROM drive to do the install.  It's 
remotely possible that, since the drive has to be re-detected during the 
install process, the device letters could wind up reshuffled.  If you 
have an external CD or DVD drive, could you try installing with it?
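
For the partition numbering oddity in number 6, here is a rough Python sketch of how to tell whether a disk's partition numbers are out of physical order.  The function name is made up, and the start offsets are approximations in GB worked out from the sizes above (22GB, then 3GB, then the RAID partition), not values read from a real partition table.

```python
def physical_order(partitions):
    """partitions: list of (number, start_offset) tuples for one disk.

    Returns the partition numbers sorted by where they start on disk,
    plus True if that differs from plain numeric order.
    """
    by_start = [num for num, _ in sorted(partitions, key=lambda p: p[1])]
    return by_start, by_start != sorted(by_start)


# Layout from the ill-fated install, start offsets in GB (approximate):
# partition 1 at 0, partition 3 at 22, partition 2 (RAID) at 25.
layout = [(1, 0), (2, 25), (3, 22)]
print(physical_order(layout))  # ([1, 3, 2], True)
```

A result of True is the condition I'd like you to recreate.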

If you (or anybody) can try recreating the problem with this new 
information I'd very much appreciate it.
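
To spell out the setup from numbers 4 and 7, here is a small Python sketch that only builds and prints the commands; it does not run anything.  The function name is hypothetical, the device names are the ones from my layout, and the exact command lines are my best reconstruction, so check them against the man pages before running them on real disks.

```python
def reproduction_commands(drives="abcd", partition=2, md="/dev/md1"):
    """Build (but do not execute) the commands for the first-stage setup."""
    members = [f"/dev/sd{d}{partition}" for d in drives]
    # Number 7: put a file system on every future RAID member first, so
    # there are defunct file systems for the installer to detect later.
    cmds = [["mkfs.ext4", m] for m in members]
    # Number 4: --force makes mdadm build the RAID5 array with all
    # members active instead of degraded with one drive added afterwards.
    cmds.append(["mdadm", "--create", md, "--level=5",
                 f"--raid-devices={len(members)}", "--force", *members])
    # Wait for the resync to finish (watch /proc/mdstat) before this one.
    # -f lets mkfs.xfs overwrite any old signature it detects.
    cmds.append(["mkfs.xfs", "-f", md])
    return cmds


for cmd in reproduction_commands():
    print(" ".join(cmd))
```

With only three drives, pass drives="abc" and the rest follows.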

Thanks,
Alvin


On 04/23/2010 02:25 PM, J wrote:
> Somehow, my reply to Alvin's original post ended up tacked on to the
> spinoff thread... so here it is, hopefully attached to the correct
> thread (I blame GMail's wonky ability to handle threads)
>
> Long reply below:
>
> On Wed, Apr 21, 2010 at 00:30, Alvin Thompson<alvin at thompsonlogic.com>  wrote:
>> Long story short: the only way to be safe right now is to physically
>> remove drives with important data during the install.
>>
>> I figured out the cause of my RAID problems, and it's a problem with
>> ubuntu's installer.  This will cost people their data if not fixed.
>> Sorry about the length of this post, but the problem takes a while to
>> explain.
>
> FWIW, this is what I just went through, step by step to try to
> recreate a loss of data on an existing software RAID array:
>
> 1: Installed a fresh Karmic system on a single disk with three partitions:
> /dev/sda1 = /
> /dev/sda2 = /data
> /dev/sda3 = swap
>
> all were primary partitions.
>
> 2: After installing 9.10, I created some test "important data" by
> copying the contents of /etc into /data.
> 3: For science, rebooted and verified that /data automounted and the
> "important data" was still there.
> 4: Shut the system down and added two disks.  Rebooted the system.
> 5: Moved the contents of /data to /home/myuser/holding/
> 6: created partitions on /dev/sdb and /dev/sdc (the two new disks, one
> partition each)
> 7: installed mdadm and xfsprogs, xfsdump
> 8: created /dev/md0 with mdadm using /dev/sda2, /dev/sdb1 and
> /dev/sdc1 in a RAID5 array
> 9: formatted the new raid device as xfs
> 10: configured mdadm.conf and fstab to start and automount the new
> array to /data at boot time.
> 11: mounted /data (my new RAID5 array) and moved the contents of
> /home/myuser/holding to /data (essentially moving the "important data"
> that used to reside on /dev/sda2 to the new R5 ARRAY).
> 12: rebooted the system and verified that A: RAID started, B: /data
> (md0) mounted, and C: my data was there.
> 13: rebooted the system using Lucid
> 14: installed Lucid, choosing manual partitioning as you described.
> **Note: the partitioner showed all partitions, but did NOT show the
> RAID partitions as ext4
> 15: configured the partitioner so that / was installed to /dev/sda1
> and the original swap partition was used. DID NOT DO ANYTHING with the
> RAID partitions.
> 16: installed.  Installer only showed formatting /dev/sda1 as ext4,
> just as I'd specified.
> 17: booted newly installed Lucid system.
> 18: checked with fdisk -l and saw that all RAID partitions showed as
> "Linux raid autodetect"
> 19: mdadm.conf was autoconfigured and showed md0 present.
> 20: edited fstab to add the md0 entry again so it would mount to /data
> 21: did an mdadm --assemble --scan and waited for the array to rebuild
> 22: after rebuild/re-assembly was complete, mounted /data (md0)
> 23: verified that all the "important data" was still there, in my
> array, on my newly installed Lucid system.
>
> The only thing I noticed was that when I did the assembly, it started
> degraded with sda2 and sdb1 as active and sdc1 marked as a spare with
> rebuilding in progress.
>
> Once the rebuild was done was when I mounted the array and verified my
> data was still present.
>
> So... what did I miss in recreating this failure?
>




