[SRU][Bionic][PATCH 1/1] powerpc/mce: Fix a bug where mce loops on memory UE.
Joseph Salisbury
joseph.salisbury at canonical.com
Tue Jun 5 16:21:58 UTC 2018
From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
BugLink: http://bugs.launchpad.net/bugs/1774964
The current code extracts the physical address for UE errors and then
hooks it up into memory failure infrastructure. On successful
extraction of physical address it wrongly sets "handled = 1" which
means this UE error has been recovered. Since MCE handler gets return
value as handled = 1, it assumes that error has been recovered and
goes back to same NIP. This causes MCE interrupt again and again in a
loop leading to hard lockup.
Also, initialize phys_addr to ULONG_MAX so that we don't end up
queuing undesired page to hwpoison.
Without this patch we see:
Severe Machine check interrupt [Recovered]
NIP: [000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffd2755940
Physical address: 000020181a080000
...
Severe Machine check interrupt [Recovered]
NIP: [000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffd2755940
Physical address: 000020181a080000
Severe Machine check interrupt [Recovered]
NIP: [000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffd2755940
Physical address: 000020181a080000
Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
...
Watchdog CPU:38 Hard LOCKUP
After this patch we see:
Severe Machine check interrupt [Not recovered]
NIP: [00007fffaae585f4] PID: 7168 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffaafe28ac
Physical address: 00002017c0bd0000
find[7168]: unhandled signal 7 at 00007fffaae585f4 nip 00007fffaae585f4 lr 00007fffaae585e0 code 4
Memory failure: 0x2017c0bd: recovery action for dirty LRU page: Recovered
Fixes: 01eaac2b0591 ("powerpc/mce: Hookup ierror (instruction) UE errors")
Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
Cc: stable at vger.kernel.org # v4.15+
Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <bsingharora at gmail.com>
Reviewed-by: Balbir Singh <bsingharora at gmail.com>
Signed-off-by: Michael Ellerman <mpe at ellerman.id.au>
(cherry picked from commit 75ecfb49516c53da00c57b9efe48fa3f5504a791)
Signed-off-by: Joseph Salisbury <joseph.salisbury at canonical.com>
---
arch/powerpc/kernel/mce_power.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 644f704..aa15142 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -552,7 +552,6 @@ static int mce_handle_ierror(struct pt_regs *regs,
if (pfn != ULONG_MAX) {
*phys_addr =
(pfn << PAGE_SHIFT);
- handled = 1;
}
}
}
@@ -643,9 +642,7 @@ static int mce_handle_derror(struct pt_regs *regs,
* kernel/exception-64s.h
*/
if (get_paca()->in_mce < MAX_MCE_DEPTH)
- if (!mce_find_instr_ea_and_pfn(regs, addr,
- phys_addr))
- handled = 1;
+ mce_find_instr_ea_and_pfn(regs, addr, phys_addr);
}
found = 1;
}
@@ -683,7 +680,7 @@ static long mce_handle_error(struct pt_regs *regs,
const struct mce_ierror_table itable[])
{
struct mce_error_info mce_err = { 0 };
- uint64_t addr, phys_addr;
+ uint64_t addr, phys_addr = ULONG_MAX;
uint64_t srr1 = regs->msr;
long handled;
--
2.7.4
More information about the kernel-team
mailing list