MD RAID-1 deadlocks?

Anders Karlsson trudheim at gmail.com
Thu Dec 8 07:05:08 UTC 2005


On 12/8/05, Scott Henson <scotth at csee.wvu.edu> wrote:
> Anders Karlsson wrote:
>
> >Hi,
> > - tried kernel 2.6.14 (took config from Ubuntu's 2.6.12 and went from there)
> > - tried kernel compiled with gcc 4.0 and with gcc 3.4, no difference
> > - tried with 2.6.14.3, kdb 4.4 and serial console and all available
> >debug options (no Oops anywhere, just a straight locked up box)
> > - currently using 2.6.15-rc4 with kdb4.4, most things compiled in to kernel
> >
> >
>
> I use raid 1 all the time and I've yet to see a hard lock that can be
> attributed to it.  My first stab at it is to try the ubuntu default
> kernel (linux-686 or linux-k7, whichever applies to your machine).  If
> that locks up too, I would also ask you where your swap space is.  You
> don't seem to mention it.  Also, where are the disks attached?  I see
> you have 2 IDE raid controllers, 1 ide controller, and 2 scsi controllers?

I have tried the Ubuntu default kernel and had the same problems with that.
The swap space is in an LV in the VG on md1, so perhaps not in the best
spot, however, I ran bonnie++ with swap both disabled and enabled last
night, and no lockups. I compiled a kernel several times both with and
without swap active. Kernel compile has been a surefire way to trigger
this problem before. Last night however, no problems at all.

The two IDE RAID controllers are the on-board SATA controller (unused)
and a Silicon Image PCI card (fakeraid, card used as plain two channel
IDE controller) where the two disks are attached. The onboard PATA
controller is where I have the DVD±RW drive attached.

The SCSI controller is a PCI card, two separate channels on that as
well (hence why it looks like there are two) but nothing attached to
that at the moment.

> At first blush it sounds like a memory problem.  But, I've also seen this
> kind of thing with a bad south bridge.  Maybe check the mother board to
> make sure you haven't let some smoke out somewhere.  Also, if you can
> reinstall, try without the raid and see if you still see the same
> crashes.  I truly suspect a hardware problem.

I thought the same, but memcheck has thrown up no problems with it. I
suppose it could be intermittent.
A bad south bridge is a possible option, but with the bonnie++ runs
last night, I was hoping to trigger the crash to pinpoint it. No joy
as I have 24h30min uptime at the moment (not had that long uptime for
over a month).

Smartmond does not find a problem with the two raided disks, apart
from seek time performance fluctuating, so I guess the disks are
alright.

I will try it without the raid, but even with the raid it has been
stable for longer than usual, so I wonder if me rummaging around in
the box reseating the CPU, applying new heatpaste etc in some way
caused a delayed fix. After all that, it was still crashing, but has
stabilised with a bit of time.

I am thoroughly confused by all this, as what used to trigger the lockup,
sometimes within seconds, can now be run several times over half an
hour without triggering it. No smoke has been let out as far as I can
tell, and nothing has left scorch marks or gaping holes in the
motherboard yet. ;-) I will keep an eye on things, as I do not believe
there are bugs/problems that go away by themselves.

Many thanks for your suggestions, I will try them as soon as I can. :)

--
Anders Karlsson <trudheim at gmail.com>

