[Bug 214814] [NEW] BUG: soft lockup - CPU#0 stuck for 61s!

TJ ubuntu at tjworld.net
Wed Apr 9 21:38:54 UTC 2008


Public bug reported:

See also upstream bug:

http://bugzilla.kernel.org/show_bug.cgi?id=10396

Systems based on the Intel 450NX chipset may fail to recognise devices,
which leads to driver failures, unhandled IRQs, and other serious boot
failures. The cause is that this chipset has 3 PCI root buses. When it
was first released some operating systems (read: Windows NT) didn't
always correctly discover the 2nd and 3rd PCI buses. As a result the PCI
BIOS tables were 'hacked' to have a fake bridge device on PCI bus 0 that
points to the same bus number as the peer bus, so that those buses would
be scanned correctly by the OS.

$ lspci
00:0a.0 PCI bridge: Intel Corporation 21154 PCI-to-PCI Bridge
00:10.0 Host bridge: Intel Corporation 450NX - 82451NX Memory & I/O Controller (rev 03)
00:12.0 Host bridge: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge (rev 04)
00:13.0 Host bridge: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge (rev 04)
00:14.0 Host bridge: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge (rev 04)

As a result, a well-behaved OS scans the 2nd and 3rd PCI buses twice:
once as secondaries of the 1st bus, and then as root buses in their own
right. This causes devices to be discovered twice.
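
To illustrate the double scan, here is a toy model in Python. It is not
kernel code. It assumes the fake bridge on bus 0 points at peer bus 2 and
that the chipset registers report root buses 0, 2 and 3; the device names
are invented.

```python
# Toy model of the double scan; not kernel code.
# Assumption: the fake bridge on bus 0 points at peer bus 2, and the
# chipset's peer bus registers report root buses 0, 2 and 3.
# Device names are invented for illustration.

DEVICES = {
    0: ["fake-bridge", "host-bridge"],
    2: ["raid-controller"],
    3: ["nic"],
}
FAKE_BRIDGE_SECONDARY = {"fake-bridge": 2}   # bridge name -> bus behind it
PEER_ROOT_BUSES = [0, 2, 3]                  # as read from the chipset


def scan_bus(bus, found):
    """Record every device on `bus`, descending behind any bridge."""
    for dev in DEVICES[bus]:
        found.append((bus, dev))
        if dev in FAKE_BRIDGE_SECONDARY:
            scan_bus(FAKE_BRIDGE_SECONDARY[dev], found)


def scan_all():
    """Scan each root bus in turn, as a well-behaved OS would."""
    found = []
    for bus in PEER_ROOT_BUSES:
        scan_bus(bus, found)
    return found


def duplicates(found):
    """Return the (bus, device) pairs that were discovered more than once."""
    return sorted({entry for entry in found if found.count(entry) > 1})
```

In this model the device on bus 2 is reached once through the fake
bridge and once through the root bus scan, so duplicates(scan_all())
reports it.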

A fix-up for all i450NX chipsets was introduced in
arch/i386/pci/fixups.c::pci_fixup_i450nx(). Note: arch/i386 was
subsequently merged into arch/x86/. The fix-up checks the PCI config
for the subsidiary buses and, if it finds them, scans them. This adds
them to the pci_root_buses list. Later in the boot process the ACPI/PCI
code reads the ACPI DSDT table, finds the PCI bus entries (PNP0A03) and
tries to scan them. It fails when scanning the 2nd and 3rd buses with:

[    0.910906] ACPI: PCI Root Bridge [PX0B] (0000:02)
[    0.912085] ACPI: Bus 0000:02 not present in PCI namespace
[    0.917111] ACPI: PCI Root Bridge [PX1A] (0000:03)
[    0.920085] ACPI: Bus 0000:03 not present in PCI namespace

Unfortunately, the message is misleading: the real reason is that the
bus is already registered, so the ACPI scan ignores it. The situation
can be worked around by booting with "pci=noacpi".
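
To check whether a machine shows this symptom, a saved kernel log can be
searched for the message quoted above. The following is a minimal
sketch; the function name is my own and the pattern only covers the
message format shown in this report.

```python
import re

# Matches the message format quoted in this report, e.g.
# "[    0.912085] ACPI: Bus 0000:02 not present in PCI namespace"
NOT_PRESENT = re.compile(
    r"ACPI: Bus ([0-9a-fA-F]{4}):([0-9a-fA-F]{2}) not present in PCI namespace"
)


def skipped_buses(dmesg_text):
    """Return (domain, bus) pairs that the ACPI scan reported as not present."""
    return [(int(m.group(1), 16), int(m.group(2), 16))
            for m in NOT_PRESENT.finditer(dmesg_text)]
```

Feeding it the four log lines above should yield buses 2 and 3 in
domain 0.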

The solution is to make the pci_fixup_i450nx() code selective based on
the system's DMI data. I've written a patch that does this. Initially
the only system it matches is the Dell PowerEdge 6300. If other systems
are found to be affected, the output of "sudo dmidecode" should be
captured and reported so that additional DMI_MATCH entries can be added
to the patch.
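
For anyone checking their own "sudo dmidecode" output, here is a rough
userspace model of a DMI match table. It is not the patch. The table
layout, the function names and the "Dell" vendor substring are
assumptions; the real strings should be taken from dmidecode on an
affected machine. Fields are compared by substring, as DMI_MATCH does.

```python
# Userspace model of a DMI match table; not the patch itself.
# Assumption: each entry lists substrings that must all appear in the
# corresponding dmidecode fields. The vendor string is a guess.

KNOWN_SYSTEMS = [
    {"Manufacturer": "Dell", "Product Name": "PowerEdge 6300"},
]


def parse_system_info(dmidecode_text):
    """Collect 'Key: value' fields from the System Information section."""
    fields = {}
    in_section = False
    for line in dmidecode_text.splitlines():
        if not line.strip():
            in_section = False          # a blank line ends the section
        elif line.strip() == "System Information":
            in_section = True
        elif in_section and ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields


def matches_known_system(fields):
    """True when every substring of some table entry is found."""
    return any(
        all(want in fields.get(key, "") for key, want in entry.items())
        for entry in KNOWN_SYSTEMS
    )
```

Adding another affected machine would then mean appending one more
entry to KNOWN_SYSTEMS, in the same way a DMI_MATCH entry would be
added to the patch.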

I found this reference to the issue in akpm's 2.6.0-mm tree and the
linux-scsi mailing list archive:

"I can tell you what's going on here.  This is a 450NX based
motherboard.  The 450NX chipset from Intel was the first chipset to have
peer PCI busses.  For backwards compatibility, some machine makers
hacked their PCI BIOS to have a fake bridge device on PCI bus 0 that
points to the same bus number as the peer bus.  This way if the OS
didn't know about the peer bus registers it would still find the devices
by scanning behind the bridge.  In this case we are scanning behind this
fake bridge and then also scanning based upon the peer bus registers in
the chipset, and as a result we are finding the device twice.  In order
to fix this problem you need to change the peer bus quirk code for the
450NX chipset to scan the list of bus 0 devices looking for a bridge
that has the same config as the peer bus registers and if so delete the
bridge from the list.  That will avoid double scanning and will avoid
having the PCI code try and configure sub busses via a fake bridge when
it should do all configurations via the 450NX peer bus registers.

-- 
  Doug Ledford <dledford at redhat.com>"

http://marc.info/?l=linux-scsi&m=106839680416899&w=2
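
The approach suggested in the quote above can be sketched as a toy model
in Python. It is not kernel code. It assumes a bridge counts as fake
when its secondary bus number equals a bus number already reported by
the peer bus registers; the dictionary layout and names are invented.

```python
# Toy model of the suggestion quoted above; not kernel code.
# Assumption: a bridge is "fake" when its secondary bus number equals a
# bus number already reported by the chipset's peer bus registers.
# The device dictionaries are invented for illustration.

def drop_fake_bridges(bus0_devices, peer_buses):
    """Return the bus 0 devices without bridges that duplicate a peer bus."""
    kept = []
    for dev in bus0_devices:
        secondary = dev.get("secondary_bus")
        if secondary is not None and secondary in peer_buses:
            continue  # this bus is already reachable as a peer root bus
        kept.append(dev)
    return kept
```

A bridge whose secondary bus is not one of the peer buses is left alone,
so genuine bridges on bus 0 are still scanned.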

In this particular case a Dell PowerEdge 6300 with a PERC 2 RAID array
controller (aacraid) fails to boot on any kernel after v2.6.20 (Feisty).
Reports show:

[ 0.000000] Linux version 2.6.24-15-generic (root at PowerEdge6300) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #1 SMP Fri Apr 4 09:18:39 BST 2008 (Ubuntu 2.6.24-15.26-generic)

[ 436.079664] Adaptec aacraid driver 1.1-5[2449]-ms

[ 492.476969] BUG: soft lockup - CPU#2 stuck for 11s! [modprobe:1376]
[ 492.483317]
[ 492.484874] Pid: 1376, comm: modprobe Not tainted (2.6.24-15-generic #1)
[ 492.491642] EIP: 0060:[<c0216641>] EFLAGS: 00000287 CPU: 2
[ 492.497226] EIP is at delay_tsc+0x41/0x50
[ 492.501302] EAX: 0000059e EBX: 0000003f ECX: 00000000 EDX: 0000003f
[ 492.507640] ESI: 17c02b3e EDI: df84f278 EBP: 17c025a0 ESP: df9dfd4c
[ 492.513972] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 492.519443] CR0: 8005003b CR2: 0812574c CR3: 1f97b000 CR4: 00000690
[ 492.525781] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 492.532114] DR6: ffff0ff0 DR7: 00000400
[ 492.536029] [<c02165c6>] __delay+0x6/0x10
[ 492.540264] [<f89496aa>] aac_fib_send+0x21a/0x2d0 [aacraid]
[ 492.546108] [<c012363a>] enqueue_task_fair+0x1a/0x30
[ 492.551318] [<f8945a94>] aac_get_adapter_info+0x74/0x620 [aacraid]
[ 492.557753] [<f8942f54>] aac_probe_one+0x224/0x450 [aacraid]
[ 492.563642] [<f8949b80>] aac_command_thread+0x0/0x6d0 [aacraid]
[ 492.569801] [<c0223136>] pci_device_probe+0x56/0x80
[ 492.574903] [<c027e85e>] driver_probe_device+0x8e/0x190
[ 492.580373] [<c027eace>] __driver_attach+0x9e/0xa0
[ 492.585385] [<c027dc7b>] bus_for_each_dev+0x3b/0x60
[ 492.590491] [<c027e6d6>] driver_attach+0x16/0x20
[ 492.595330] [<c027ea30>] __driver_attach+0x0/0xa0
[ 492.600259] [<c027e00a>] bus_add_driver+0x8a/0x1e0
[ 492.605281] [<c02232e3>] __pci_register_driver+0x53/0xa0
[ 492.610815] [<f8850033>] aac_init+0x33/0x74 [aacraid]
[ 492.616098] [<c0151511>] sys_init_module+0x151/0x1990
[ 492.621377] [<c01778fa>] __do_fault+0x21a/0x410
[ 492.626170] [<c0166421>] handle_fasteoi_irq+0x91/0xf0
[ 492.631465] [<c01053b2>] syscall_call+0x7/0xb
[ 492.636066] =======================

[   17.155571] irq 10: nobody cared (try booting with the "irqpoll" option)
[   17.155571] Pid: 0, comm: swapper Not tainted 2.6.25-rc8-custom #1
[   17.155571]  [<c025ad74>] __report_bad_irq+0x24/0x80

This was first thought to be part of bug #149071 "-server kernel variant
fails to boot on PowerEdge 2650 with AACRAID timeouts" but it now
appears likely that that bug has a different root cause.

Attached here are patches for Gutsy and Hardy. An upstream patch for
v2.6.25-rc8 is attached to the bugzilla report.

** Affects: linux
     Importance: Unknown
         Status: Unknown

** Affects: linux (Ubuntu)
     Importance: High
     Assignee: TJ (intuitivenipple)
         Status: In Progress

-- 
BUG: soft lockup - CPU#0 stuck for 61s!
https://bugs.launchpad.net/bugs/214814