[Hardy] SRU: Fix boot panic on Acer Aspire One (v2)

Andy Whitcroft apw at canonical.com
Wed Jul 15 16:19:17 UTC 2009


On Tue, Jul 14, 2009 at 04:33:52PM +0200, Stefan Bader wrote:
> SRU Justification:
>
> Impact: Certain kernel versions cause a kernel panic on the Acer Aspire
> One when the kernel changes its data to read-only pages. The underlying
> problem is very timing sensitive (adding printks for debugging makes the
> problem much less likely to occur).
>
> Fix: As far as we understand, another thread may be running at that
> point (very speculatively, one related to freeing init memory) and gets
> badly confused by the global_flush_tlb() (it could also be affected by
> the fact that changing the page attributes splits a large kernel page).
> To prevent that we force all other CPUs to run the stop work function,
> which basically acts like a big sync. The performance impact is minimal
> as this is done only once on boot. The change is quirked to only happen
> on an Acer Aspire One.
>
> Testcase: Boot the test kernel on the Acer Aspire One and, for
> comparison, on a Dell Inspiron 1521. The quirk gets activated and boot
> succeeds on the Acer, while the Dell goes through the same code as before.
>

> From 17f5fdd57b5148d56709c6506ccd6637eacde1fe Mon Sep 17 00:00:00 2001
> From: Stefan Bader <stefan.bader at canonical.com>
> Date: Mon, 13 Jul 2009 15:28:05 +0200
> Subject: [PATCH] UBUNTU: SAUCE: init: Add extra mark_rodata_ro quirk for Acer Aspire One
> 
> BugLink: https://bugs.launchpad.net/ubuntu/+bug/322867
> 
> Fix what seems to be a very machine- and timing-dependent problem on the
> Acer Aspire One. It looks like calling global_flush_tlb after changing the
> kernel data to read only can trigger a kernel panic on this platform. The
> problem is potentially present on other hardware too, but as it is very
> timing dependent, other machines may simply be too fast or too slow to run
> into it.
> Forcing all other parallel threads into a stop has experimentally solved the
> problem, and as this path is only run once during boot it will not have much
> impact.
> 
> Signed-off-by: Stefan Bader <stefan.bader at canonical.com>
> ---
>  arch/x86/mm/init_32.c |   32 ++++++++++++++++++++++++++++++++
>  1 files changed, 32 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 05180bb..92135ab 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -31,6 +31,8 @@
>  #include <linux/memory_hotplug.h>
>  #include <linux/initrd.h>
>  #include <linux/cpumask.h>
> +#include <linux/dmi.h>
> +#include <linux/stop_machine.h>
>  
>  #include <asm/processor.h>
>  #include <asm/system.h>
> @@ -816,6 +818,22 @@ static int noinline do_test_wp_bit(void)
>  
>  #ifdef CONFIG_DEBUG_RODATA
>  
> +static int mark_rodata_ro_stop_work(void *data)
> +{
> +	return 0;
> +}
> +
> +static struct dmi_system_id mark_rodata_ro_table[] = {
> +	{ /* Handle boot Oops on Acer Aspire One */
> +		.ident = "Acer Aspire One",
> +		.matches = {
> +			DMI_MATCH(DMI_SYS_VENDOR, "Acer"),
> +			DMI_MATCH(DMI_PRODUCT_NAME, "AOA"),
> +		},
> +	},
> +	{ }
> +};
> +
>  void mark_rodata_ro(void)
>  {
>  	unsigned long start = PFN_ALIGN(_text);
> @@ -844,7 +862,21 @@ void mark_rodata_ro(void)
>  	 * We do this after the printk so that if something went wrong in the
>  	 * change, the printk gets out at least to give a better debug hint
>  	 * of who is the culprit.
> +	 *
> +	 * https://bugs.launchpad.net/ubuntu/+bug/322867
> +	 *
> +	 * Calling global_flush_tlb() at this point on Acer Aspire One seems
> +	 * to trigger a panic if the timing is right. Delays caused by printk
> +	 * statements made the panic less likely. The panic itself looks like
> +	 * some other function is running in parallel at that time and seems
> +	 * to be losing the stack. There is no final explanation to this but
> +	 * it looks like forcing the other CPUs/HTs out of work fixes the
> +	 * problem without much risk for regression.
>  	 */
> +	if (dmi_check_system(mark_rodata_ro_table)) {
> +		printk(KERN_INFO "Adding stop_machine_run call\n");
> +		stop_machine_run(mark_rodata_ro_stop_work, NULL, NR_CPUS);
> +	}
>  	global_flush_tlb();
>  }
>  #endif

Though it is scary that this is needed at all, if it's fixing that machine
then the code looks sensibly limited to that machine only.  The impact, as
you say, should be near 0 from a user perspective.
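
As a rough illustration of why the quirk stays limited to that machine, here
is a userspace sketch of the table lookup. It assumes that DMI_MATCH compares
by substring (which is how I understand dmi_check_system to behave in this
kernel), so "AOA" would cover product names that merely contain it. The names
quirk_match, quirk_table and quirk_applies are hypothetical stand-ins, not
kernel APIs, and the sample strings mentioned afterwards are only illustrative.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for one dmi_system_id entry. */
struct quirk_match {
	const char *vendor_substr;   /* must occur in the system vendor string */
	const char *product_substr;  /* must occur in the product name string */
};

static const struct quirk_match quirk_table[] = {
	{ "Acer", "AOA" },   /* mirrors the two DMI_MATCH lines in the patch */
	{ NULL, NULL }       /* terminator, like the empty { } entry */
};

/*
 * Return 1 if some table entry matches both strings by substring,
 * 0 otherwise. Assumption: substring semantics, as described above.
 */
int quirk_applies(const char *vendor, const char *product)
{
	const struct quirk_match *q;

	for (q = quirk_table; q->vendor_substr != NULL; q++) {
		if (strstr(vendor, q->vendor_substr) != NULL &&
		    strstr(product, q->product_substr) != NULL)
			return 1;
	}
	return 0;
}
```

With this sketch, a vendor of "Acer" and a product name such as "AOA110"
would take the extra stop_machine_run path, while a "Dell Inc." system
reporting "Inspiron 1521" would fall straight through to global_flush_tlb()
as before.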

ACK

I believe we only see this in Hardy, right?

-apw



