[Maverick] [ti-omap4] SRU: A workaround for highmem issue on OMAP4 platform

Sun Sep 26 06:05:54 UTC 2010

On Fri, Sep 24, 2010 at 03:04:10AM -0300, Ricardo Salveti de Araujo wrote:
> On Fri, Sep 24, 2010 at 12:45:30AM -0300, Ricardo Salveti de Araujo wrote:
> > On Fri, Sep 24, 2010 at 11:21:01AM +0800, Bryan Wu wrote:
> > > SRU Justification:
> > > 
> > > Impact:
> > > There is a critical highmem issue on our latest OMAP4 ES2.0 platform. When we
> > > build the kernel package natively on the ES2.0 platform with mem=1G and
> > > highmem enabled, gcc soon fails with 'Bus Error' corruption, and the kernel
> > > logs 'Unhandled imprecise external abort' oops messages. After that the whole
> > > system becomes very unstable.
> > > 
> > > Fix: Debugging shows that this issue is related to highmem. If we don't use
> > > mem=1G (so no memory lands in highmem), the corruption is gone. The workaround
> > > is CONFIG_VMSPLIT_2G=y, which makes the user:kernel address space split 2G:2G
> > > instead of the default 3G:1G. We can then use all 1G of memory on ES2.0
> > > without putting any of it in highmem. As a result, the issue is gone.
> > 
> > Generally, when using highmem we can reproduce this issue quickly, within 10 to
> > 15 minutes of starting the kernel build. So far, without highmem, I have been
> > able to build the whole kernel 3 times without hitting this issue.
> > 
> > I just started another batch that will run at least 5 more times overnight,
> > and will reply tomorrow with the test results.
> > 
> > Meanwhile we're debugging the highmem issue with Nicolas's help.
> 
> Unfortunately something else doesn't seem to be right :-(
> After building it once successfully with -j 2, I changed to -j 3 and after 10
> minutes I got the following error:
> 
> Bad mode in data abort handler detected
> Internal error: Oops - bad mode: 0 [#1] PREEMPT SMP
> last sysfs file: /sys/devices/virtual/net/lo/type
> Modules linked in: twl4030_pwrbutton sg usb_storage
> CPU: 0    Not tainted  (2.6.35.3+ #52)
> PC is at 0xffff0010 
> LR is at 0x2abab896 
> pc : [<ffff0010>]    lr : [<2abab896>]    psr: 00000097 
> sp : bffcffb0  ip : 00000000  fp : 000b2de4 
> r10: 00000000  r9 : 000ac68c  r8 : 00000050 
> r7 : 00000022  r6 : 004c7108  r5 : 0008cab0  r4 : ca797762 
> r3 : 004d6a68  r2 : 00000037  r1 : 0008cab0  r0 : 00000038 
> Flags: nzcv  IRQs off  FIQs on  Mode ABT_32  ISA ARM  Segment user  
> Control: 10c53c7d  Table: bfffc04a  DAC: 00000015 
> Process dhclient-script (pid: 6199, stack limit = 0xbffce2f8) 
> Stack: (0xbffcffb0 to 0xbffd0000)
> ffa0:                                     00000038 0008cab0 00000037 004d6a68
> ffc0: ca797762 0008cab0 004c7108 00000022 00000050 000ac68c 00000000 000b2de4
> ffe0: 00000000 bffcffb0 2abab896 ffff0010 00000097 ffffffff 8102a021 8102a421
> Code: ef9f0000 ea0000dd e59ff410 ea0000bb (ea00009a)
> 
> And on the userspace side I got a segfault in GCC instead of a bus error.
> 
> Kernel boot log showing that highmem is not being used:
> Kernel command line: splash ro elevator=noop vram=32M root=/dev/sda5 fixrtc console=ttyO2,115200 mem=1G earlyprintk=ttyO2 omapdss.debug=1 loglevel=8 user_debug=16
> PID hash table entries: 4096 (order: 2, 16384 bytes)
> Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
> Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
> allocated 5242880 bytes of page_cgroup
> please try 'cgroup_disable=memory' option if you don't want memory cgroups
> Memory: 1024MB = 1024MB total
> Memory: 975176k/975176k available, 73400k reserved, 0K highmem
> Virtual kernel memory layout:
>     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
>     fixmap  : 0xfff00000 - 0xfffe0000   ( 896 kB)
>     DMA     : 0xffc00000 - 0xffe00000   (   2 MB)
>     vmalloc : 0xc0800000 - 0xf8000000   ( 888 MB)
>     lowmem  : 0x80000000 - 0xc0000000   (1024 MB)
>     pkmap   : 0x7fe00000 - 0x80000000   (   2 MB)
>     modules : 0x7f000000 - 0x7fe00000   (  14 MB)
>       .init : 0x80008000 - 0x8003f000   ( 220 kB)
>       .text : 0x8003f000 - 0x806ef000   (6848 kB)
>       .data : 0x80728000 - 0x80798180   ( 449 kB)
> 
> I will also test it with only one CPU to see if this could be related to SMP
> issues.
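
For reference, the layout in the quoted boot log can be sanity-checked with a
little arithmetic. The sketch below is only an illustration: the region
addresses are copied from the log, while the helper names and the assumption
that the vmalloc area still ends at 0xf8000000 under a 3G:1G split are mine.
It computes a loose upper bound on lowmem (it ignores the space vmalloc itself
needs), which is already enough to show why 1G of RAM cannot be all lowmem
with the default split.

```python
MB = 1024 * 1024

# Regions copied from the quoted boot log as (start, end), end exclusive.
LAYOUT = {
    "vmalloc": (0xC0800000, 0xF8000000),
    "lowmem": (0x80000000, 0xC0000000),
    "pkmap": (0x7FE00000, 0x80000000),
    "modules": (0x7F000000, 0x7FE00000),
}


def size_mb(region):
    """Size of a region from the boot log, in MB."""
    start, end = LAYOUT[region]
    return (end - start) // MB


def max_lowmem_mb(page_offset, vmalloc_end=0xF8000000):
    """Loose upper bound on linearly mapped RAM, in MB.

    Assumption: the vmalloc area ends at vmalloc_end (as in the log) and
    lowmem must sit between PAGE_OFFSET and that address. The real limit is
    lower, because vmalloc needs part of that window for itself.
    """
    return (vmalloc_end - page_offset) // MB


RAM_MB = 1024  # mem=1G from the kernel command line

for name, page_offset in (("3G:1G", 0xC0000000), ("2G:2G", 0x80000000)):
    bound = max_lowmem_mb(page_offset)
    print(f"{name}: at most {bound} MB lowmem, all RAM fits: {bound >= RAM_MB}")
```

With PAGE_OFFSET at 0xc0000000 the window is smaller than the installed RAM,
so part of it has to become highmem. With 0x80000000 the whole 1G fits, which
matches the "0K highmem" line in the log.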

OK, I tested the same kernel running with only one CPU for 40 hours (which gave
me 15 builds), and everything went fine, without any errors in either userspace
or kernelspace.

So it seems that this data abort exception could be related to concurrency and
SMP support in our kernel.
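
As a cross-check of the oops quoted above, the psr and pc values can be decoded
by hand. This is a minimal sketch, not kernel code: the function names are
hypothetical, the bit positions are the standard ARM CPSR ones (mode in bits
4:0, T in bit 5, F in bit 6, I in bit 7, NZCV in bits 31:28), and it ignores
the Jazelle bit. The vector base of 0xffff0000 is taken from the "vector" line
of the boot log.

```python
# Standard ARM processor mode encodings (CPSR bits 4:0).
MODES = {
    0x10: "USR", 0x11: "FIQ", 0x12: "IRQ", 0x13: "SVC",
    0x17: "ABT", 0x1B: "UND", 0x1F: "SYS",
}

# Exception vector offsets relative to the vector base.
VECTOR_BASE = 0xFFFF0000
VECTORS = {
    0x00: "reset", 0x04: "undefined instruction", 0x08: "supervisor call",
    0x0C: "prefetch abort", 0x10: "data abort", 0x18: "IRQ", 0x1C: "FIQ",
}


def decode_psr(psr):
    """Decode the fields that the oops prints on its 'Flags:' line."""
    flags = "".join(
        ch.upper() if psr & (1 << bit) else ch
        for ch, bit in (("n", 31), ("z", 30), ("c", 29), ("v", 28))
    )
    return {
        "flags": flags,
        "irqs": "off" if psr & 0x80 else "on",   # I bit set means masked
        "fiqs": "off" if psr & 0x40 else "on",   # F bit set means masked
        "isa": "Thumb" if psr & 0x20 else "ARM",  # T bit only, J bit ignored
        "mode": MODES.get(psr & 0x1F, "unknown"),
    }


def vector_name(pc):
    """Name the exception vector that pc points at, if any."""
    return VECTORS.get(pc - VECTOR_BASE, "not a vector entry")


print(decode_psr(0x00000097))
print(vector_name(0xFFFF0010))
```

The decoded fields agree with the "Flags: nzcv  IRQs off  FIQs on  Mode ABT_32
ISA ARM" line, and pc sits exactly on the data abort vector. That is
consistent with an abort being taken while the CPU was already in abort mode,
but it does not by itself say why.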

Cheers,
-- 
Ricardo Salveti de Araujo