Machine check exception, but what kind?

Joel Rees joel.rees at gmail.com
Fri Apr 21 22:23:52 UTC 2017


On Sat, Apr 22, 2017 at 1:31 AM, Kevin O'Gorman <kogorman at gmail.com> wrote:
> I've been having trouble with two of my personal computers.  One is from
> System76 and their great support staff suggested I load package mcelog to
> monitor for machine check exceptions (MCE).  Sounded good to me, so I did it
> on all my Ubuntu machines (I have 4 if you count laptops).

I assume you have been reading

    https://www.mcelog.org/

I'm seeing a lot of useful information there. Maybe I'll try it out.

If you haven't read the manpage and the FAQ, ...

> Lo and behold, one of the other machines glitched last night.  Not the
> System76 one, but a home-brew I built myself (with a little help from my
> friends).  It's got a medium-fast Core i7 on an ASUS board. It was a
> familiar occurrence:
> - It had rebooted on its own and when I woke up it was asking me to log in
> - On logging in I saw two popup dialogs that said there was an error
> detected by a system program (but absolutely no other information about it)
> and wanted permission to report it.  Even when I gave that permission, I did
> not get a copy or any further information about what happened.

Did you read the page on triggers? (Mentioned also in the FAQ.)

> - /var/log/syslog showed the reboot sequence, but nothing particularly
> helpful about the cause.
>
> Pretty frustrating, but because I had installed mcelog, I also got this:
> - /var/log/mcelog contained this:
> mcelog: failed to prefill DIMM database from DMI data

I saw something about that in the FAQ.

> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 4
> MISC 7fbc6369a0eb ADDR 7fbc6369a0eb
> TIME 1492751851 Thu Apr 20 22:17:31 2017
> MCG status:
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> Running trigger `unknown-error-trigger'
> STATUS be00000000800400 MCGSTATUS 0
> MCGCAP c09 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 60
>
> Hardware event. This is not a software error.
> MCE 1
> CPU 3 BANK 3
> MISC 7fbc6369a0eb ADDR 7fbc6369a0eb
> TIME 1492751851 Thu Apr 20 22:17:31 2017
> MCG status:
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> Running trigger `unknown-error-trigger'
> STATUS be00000000800400 MCGSTATUS 0
> MCGCAP c09 APICID 6 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 60
>
> So it looks like a hardware error.  It even says so, or at least "Hardware
> event. This is not a software error."

Two, in fact.
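
For what it's worth, the flag lines in both records ("Uncorrected error",
"Processor context corrupt", and so on) look like a decoding of the STATUS
word itself. Here is a rough Python sketch, assuming the IA32_MCi_STATUS bit
layout as I understand it from the Intel SDM (flags in bits 63 down to 57,
MCA error code in bits 15:0). The table and function names are my own
invention, not anything taken from mcelog.

```python
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# Bit positions are an assumption based on the Intel SDM machine-check
# chapter; STATUS_FLAGS and decode_mci_status are hypothetical names.

STATUS_FLAGS = [
    (63, "VAL", "MCi_STATUS register valid"),
    (62, "OVER", "Error overflow"),
    (61, "UC", "Uncorrected error"),
    (60, "EN", "Error enabled"),
    (59, "MISCV", "MCi_MISC register valid"),
    (58, "ADDRV", "MCi_ADDR register valid"),
    (57, "PCC", "Processor context corrupt"),
]


def decode_mci_status(status):
    """Split a 64-bit status word into flag names and error-code fields."""
    flags = [name for bit, name, _ in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # STATUS value copied from the log quoted above.
    decoded = decode_mci_status(0xBE00000000800400)
    for bit, name, text in STATUS_FLAGS:
        if name in decoded["flags"]:
            print(f"bit {bit} {name}: {text}")
    print(f"MCA error code: {decoded['mca_error_code']:#06x}")
    print(f"Model-specific code: {decoded['model_specific_code']:#06x}")
```

The low 16 bits come out as 0x0400, which matches the "Internal Timer
error" line in the log if I am reading the SDM's simple error code table
correctly. Both records carry the same STATUS value, so they decode
identically.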

> Thing is, the rest of this log is almost entirely opaque to me.  I do
> understand the timestamp and "Vendor Intel" but that's about it.  I'm
> wondering what actually happened, and if there's anyone on this list that
> can explain.  In particular, does that first line, containing "DIMM", suggest
> that there was a RAM-related problem?

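Not by itself, as far as I can tell: that line says mcelog could not fill
in its DIMM database from the DMI data, not that a DIMM failed. Did you
check the glossary?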

> I also wanted to alert anyone else who might be having trouble diagnosing a
> recurring problem.  This package is in the regular repository, but is not
> installed by default.  I think that's a shame.

You might want to look up EDAC. I see it mentioned in the FAQ.


> --
> Kevin O'Gorman
> #define QUESTION ((bb) || (!bb))   /* Shakespeare */
>
> Please consider the environment before printing this email.
>

Happy hunting.

-- 
Joel Rees

I'm imagining I'm a novelist:
http://joel-rees-economics.blogspot.com/2017/01/soc500-00-00-toc.html
More of my delusions:
http://reiisi.blogspot.jp/p/novels-i-am-writing.html




More information about the ubuntu-users mailing list