[PATCH 0/1] [Hardy] SRU: Disable 4MB page tables for Atom, work around errata AAE44

Sun Feb 21 16:55:31 UTC 2010

On Sat, 2010-02-20 at 17:21 +0100, Stefan Bader wrote:
> Andy Whitcroft wrote:
> > On Fri, Feb 19, 2010 at 03:16:34PM +0000, Colin King wrote:
> >> From: Colin Ian King <colin.king at canonical.com>
> >>
> >> BugLink: https://bugs.launchpad.net/bugs/523112
> >>
> >> SRU Justification:
> >>
> >> Impact: Without this patch Intel(R) Atom (TM) CPUs can sometimes
> >> seemingly randomly get oopses on legitimate executable pages.
> >> This only occurs when splitting large 4MB pages into 4KB pages
> >> so this patch disables 4MB pages for this class of processor.
> >>
> >> Fix: Detect processor type 0x1C and disable 4MB pages.
> > 
> > Do we understand why this is not exhibiting on later kernels, ie. why is
> > this patch only required on Hardy?
> > 
> > -apw
> > 
> 
> Not me, frankly. There has been more locking and safeguarding against other
> threads in the rewritten mm code. So it is well possible that this works around
> the problem in a way I don't understand.
> 
> -Stefan
> 
There have been some extra patches to reduce the impact of this bug.
First just to remind us about AAE44: 

"if a code fetch uses this PDE before the TLB entry for the large page
is invalidated then it may fetch from a different physical address than
specified by either the old large page translation or the new 4-KByte
page translation."

Unfortunately, one of the upstream fixes is racy: on the Atom a
hyperthread may be executing while the PDE is being updated.  One fix
that's not in Hardy is to do the update and then a flush, but there is
still a small time window between these two operations where this bug
may hit the code being executed by the hyperthread. Hence the upstream
fixes are not 100% foolproof.
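
To make the window concrete, here is a toy model in C of the
update-then-flush sequence described above. It is only a sketch and not
kernel code: the names toy_mmu, split_large_page, flush_tlb and
fetch_exposed are hypothetical, and the model tracks just one PDE and one
cached translation. It shows that between the PDE update and the flush
there is a state in which a code fetch on the sibling hyperthread would be
exposed to the erratum.

```c
#include <stdbool.h>

/* Toy model only: one PDE slot and the translation the TLB holds for it. */
enum pde_kind { PDE_LARGE_PAGE, PDE_PAGE_TABLE };

struct toy_mmu {
    enum pde_kind pde;        /* what the in-memory PDE says right now */
    enum pde_kind tlb_cached; /* what the TLB still has cached */
};

/* Step 1 of the split: rewrite the PDE to point at a 4KB page table. */
static void split_large_page(struct toy_mmu *m)
{
    m->pde = PDE_PAGE_TABLE;
}

/* Step 2 of the split: invalidate the stale large-page translation. */
static void flush_tlb(struct toy_mmu *m)
{
    m->tlb_cached = m->pde;
}

/*
 * A code fetch is exposed while the TLB still caches the large page but
 * the PDE has already been rewritten, i.e. between step 1 and step 2.
 */
static bool fetch_exposed(const struct toy_mmu *m)
{
    return m->tlb_cached == PDE_LARGE_PAGE && m->pde == PDE_PAGE_TABLE;
}
```

In this model fetch_exposed() is false before split_large_page(), true
after it, and false again once flush_tlb() has run. Shrinking the gap
between the two steps narrows the window but cannot close it while the
other hyperthread keeps running.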

So, the upstream fixes reduce the likelihood of the oopses caused by
this bug but don't fully remove them. In the cases I've worked on, we
have scenarios that either trip the bug on boot (Stefan's netbook) or
occasionally oops when constantly exercising the mm subsystem doing page
splits for days on end.

One other thing to consider is that newer machines probably have
microcode fixes from the BIOS, which may be a reason why this is not a
widely observed issue.
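
For reference, a minimal sketch in C of the kind of check the SRU
justification describes ("Detect processor type 0x1C"). This is not the
patch itself. It assumes "type 0x1C" means family 6, model 0x1C as decoded
from CPUID leaf 1 EAX, it takes the raw EAX value as a parameter instead of
executing CPUID, and it omits the vendor check. The names cpu_sig,
decode_cpuid_signature and needs_large_page_workaround are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Display family/model/stepping decoded from CPUID leaf 1, register EAX. */
struct cpu_sig {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

static struct cpu_sig decode_cpuid_signature(uint32_t eax)
{
    struct cpu_sig s;
    unsigned base_family = (eax >> 8) & 0xf;

    s.stepping = eax & 0xf;
    s.model = (eax >> 4) & 0xf;
    s.family = base_family;

    /* The extended family field is only added for base family 0xF. */
    if (base_family == 0xf)
        s.family += (eax >> 20) & 0xff;

    /* The extended model field supplies the high nibble for families 6 and 0xF. */
    if (base_family == 0x6 || base_family == 0xf)
        s.model |= ((eax >> 16) & 0xf) << 4;

    return s;
}

/* Assumption: the affected parts are family 6, model 0x1C. */
static bool needs_large_page_workaround(uint32_t eax)
{
    struct cpu_sig s = decode_cpuid_signature(eax);

    return s.family == 6 && s.model == 0x1c;
}
```

For example, a signature of 0x000106C2 decodes to family 6, model 0x1C,
stepping 2, so needs_large_page_workaround() returns true for it, while
0x000106A5 (model 0x1A) does not match.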