[Precise pre-up] Pick k(un)map_atomic fix

Stefan Bader stefan.bader at canonical.com
Thu Dec 1 16:04:17 UTC 2011

This has been verified in older releases and we carry it in Natty
and in Oneiric. Unfortunately its upstreaming into 3.2 is not
certain, for rather procedural reasons. Quoting Andrew (not amused):

"I sent this patch to the x86 maintainers two weeks ago.  It was
ignored, as were the other 11 patches I sent.  Later I will resend them
all.  If they are again ignored I will later send them yet again, and
so on."

But since it has been verified to cure rather nasty failures in the
cloud (Xen), we should put it into Precise right now. If it lands
upstream in time, it can simply be rebased out of existence.

Maybe it needs "UBUNTU SAUCE:" tagging...


From b39e4363068122a5d36a26cc656c365d2341c1d8 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk at oracle.com>
Date: Wed, 30 Nov 2011 15:03:08 +1100
Subject: [PATCH] x86/paravirt: PTE updates in k(un)map_atomic need to be
 synchronous, regardless of lazy_mmu mode

Fix an outstanding issue that has been reported since 2.6.37. On a
heavily loaded machine, processing "fork()" calls could crash with:

BUG: unable to handle kernel paging request at f573fc8c
IP: [<c01abc54>] swap_count_continued+0x104/0x180
*pdpt = 000000002a3b9027 *pde = 0000000001bed067 *pte = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in:
Pid: 1638, comm: apache2 Not tainted 3.0.4-linode37 #1
EIP: 0061:[<c01abc54>] EFLAGS: 00210246 CPU: 3
EIP is at swap_count_continued+0x104/0x180
.. snip..
Call Trace:
 [<c01ac222>] ? __swap_duplicate+0xc2/0x160
 [<c01040f7>] ? pte_mfn_to_pfn+0x87/0xe0
 [<c01ac2e4>] ? swap_duplicate+0x14/0x40
 [<c01a0a6b>] ? copy_pte_range+0x45b/0x500
 [<c01a0ca5>] ? copy_page_range+0x195/0x200
 [<c01328c6>] ? dup_mmap+0x1c6/0x2c0
 [<c0132cf8>] ? dup_mm+0xa8/0x130
 [<c013376a>] ? copy_process+0x98a/0xb30
 [<c013395f>] ? do_fork+0x4f/0x280
 [<c01573b3>] ? getnstimeofday+0x43/0x100
 [<c010f770>] ? sys_clone+0x30/0x40
 [<c06c048d>] ? ptregs_clone+0x15/0x48
 [<c06bfb71>] ? syscall_call+0x7/0xb

The problem is that in copy_page_range() we turn lazy mode on, and then in
swap_entry_free() we call swap_count_continued() which ends up in:

         map = kmap_atomic(page, KM_USER0) + offset;

and then later we touch *map.

Since we are running in batched (lazy) mode, the PTE update is deferred
rather than applied synchronously, so the kmap_atomic mapping is not yet
in place and we end up dereferencing a page whose PTE has not been set.

Looking at kmap_atomic_prot_pfn(), it uses 'arch_flush_lazy_mmu_mode' and
doing the same in kmap_atomic_prot() and __kunmap_atomic() makes the problem
go away.

Interestingly, commit b8bcfe997e4615 ("x86/paravirt: remove lazy mode in
interrupts") removed part of this to fix an interrupt issue - but it went
too far and did not consider this scenario.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk at oracle.com>
Cc: Thomas Gleixner <tglx at linutronix.de>
Cc: Ingo Molnar <mingo at redhat.com>
Cc: "H. Peter Anvin" <hpa at zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra at chello.nl>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge at citrix.com>
Cc: <stable at kernel.org>
Signed-off-by: Andrew Morton <akpm at linux-foundation.org>

BugLink: http://bugs.launchpad.net/bugs/854050
(cherry-picked from b39e4363068122a5d36a26cc656c365d2341c1d8 linux-next)
Signed-off-by: Stefan Bader <stefan.bader at canonical.com>
---
 arch/x86/mm/highmem_32.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index b499626..f4f29b1 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -45,6 +45,7 @@ void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
 	BUG_ON(!pte_none(*(kmap_pte-idx)));
 	set_pte(kmap_pte-idx, mk_pte(page, prot));
+	arch_flush_lazy_mmu_mode();
 
 	return (void *)vaddr;
 }
@@ -88,6 +89,7 @@ void __kunmap_atomic(void *kvaddr)
 		 */
 		kpte_clear_flush(kmap_pte-idx, vaddr);
 		kmap_atomic_idx_pop();
+		arch_flush_lazy_mmu_mode();
 	}
 #ifdef CONFIG_DEBUG_HIGHMEM
 	else {