Android 2GB Instability Issue

psidhu · ‎01-05-2015

Hi All,

I am having stability issues on a custom board (based on the SabreSD reference design) with 2GB of DDR3 SDRAM: four 4Gb-density chips on a single chip select (CS0). Part number: Micron MT41K256M16HA-125 IT:E (1600 speed grade), if anyone is interested.

Notable board specs:

Processor: i.MX6Q

Memory: 2GB

This issue is present on both Android JB4.3 (based on jb4.3_1.1.0-ga) and Android KK4.4 (based on kk4.4.3_2.0.0-beta). I have not been able to reproduce it on our Yocto BSP using a kernel based on the kk4.4.3_2.0.0-beta tagged kernel (3.10.31, same as the Android KK4.4 BSP). I have also been unable to reproduce it in 16-bit memory mode (a single DDR SDRAM chip) using the same memory part/density with the Android BSPs, or with the 128M16 parts (1GB of DDR3 SDRAM in total) on the same processor.
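As a quick cross-check of the memory configurations mentioned so far, the capacities can be worked out from the part organization. This is only a sketch; the function name is made up for illustration, and it assumes identical x16 chips sitting side by side on one chip select.

```python
MIB = 1024 * 1024

def ddr_capacity_bytes(words_per_chip: int, bits_per_word: int, chips: int) -> int:
    """Total capacity of identical DRAM chips sharing one chip select."""
    return words_per_chip * bits_per_word * chips // 8

# 256M x 16 parts (4 Gbit each), four chips on a 64-bit bus
full_bus = ddr_capacity_bytes(256 * MIB, 16, 4)
# Same part, one chip only (16-bit memory mode)
single_chip = ddr_capacity_bytes(256 * MIB, 16, 1)
# 128M x 16 parts (2 Gbit each), four chips
smaller_parts = ddr_capacity_bytes(128 * MIB, 16, 4)

print(full_bus // MIB, "MiB with four 256M16 chips")
print(single_chip // MIB, "MiB with one 256M16 chip")
print(smaller_parts // MIB, "MiB with four 128M16 chips")
```

Only the first configuration (2GB on a 64-bit bus) shows the instability described here.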

The instability problem is as follows:

On the JB4.3 BSP, the GUI appears fluid, but as soon as I enter the "Settings" app, the kernel hits a NULL pointer dereference and crashes (see the null_pointer.log attachment). I can also crash the kernel by switching in and out of any app quickly a few times (up to about 5 iterations). I have verified that fb0base does not overlap with the GPU memory, and I have tried setting fb0base=0x27b00000 as suggested in several other threads about memory problems in the 3.0.35 Android kernel.

The relevant part of my boot log, including the virtual kernel memory layout, looks like the following:

Memory policy: ECC disabled, Data cache writealloc
CPU identified as i.MX6Q, silicon rev 1.2
On node 0 totalpages: 474880
free_area_init_node: node 0, pgdat 80970460, node_mem_map 80b0e000
Normal zone: 2848 pages used for memmap
Normal zone: 0 pages reserved
Normal zone: 361440 pages, LIFO batch:31
HighMem zone: 1248 pages used for memmap
HighMem zone: 109344 pages, LIFO batch:31
PERCPU: Embedded 7 pages/cpu @81b1d000 s6592 r8192 d13888 u32768
pcpu-alloc: s6592 r8192 d13888 u32768 alloc=8*4096
pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 470784
Kernel command line: enable_wait_mode=off console=ttymxc1,115200 vmalloc=400M consoleblank=0 video=mxcfb0:dev=hdmi,bpp=32,1280x720M@60,if=RGB24 video=mxcfb1:off video=mxcfb2:off video=mxcfb3:off androidboot.hardware=freescale androidboot.bootdev=sdhci-esdhc-imx.2 debug
PID hash table entries: 4096 (order: 2, 16384 bytes)
Dentry cache hash table entries: 262144 (order: 8, 1048576 bytes)
Inode-cache hash table entries: 131072 (order: 7, 524288 bytes)
Memory: 767MB 848MB 240MB = 1855MB total
Memory: 1869836k/1869836k available, 227316k reserved, 442368K highmem
Virtual kernel memory layout:
    vector : 0xffff0000 - 0xffff1000   (   4 kB)
    fixmap : 0xfff00000 - 0xfffe0000   ( 896 kB)
    DMA     : 0xfbe00000 - 0xffe00000   ( 64 MB)
    vmalloc : 0xd9800000 - 0xf2000000   ( 392 MB)
    lowmem : 0x80000000 - 0xd9000000   (1424 MB)
    pkmap   : 0x7fe00000 - 0x80000000   (   2 MB)
    modules : 0x7f000000 - 0x7fe00000   ( 14 MB)
      .init : 0x80008000 - 0x80048000   ( 256 kB)
      .text : 0x80048000 - 0x808e8318   (8833 kB)
      .data : 0x808ea000 - 0x80985640   ( 622 kB)
       .bss : 0x80985664 - 0x80b0d568   (1568 kB)
Preemptible hierarchical RCU implementation.
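For anyone who wants to sanity-check the numbers in this log, the figures can be recomputed from each other. The sketch below assumes 4 KiB pages and uses helper names of my own; the constants are copied from the log above.

```python
PAGE_KB = 4  # assumption: 4 KiB pages

def pages_to_mb(pages: int) -> int:
    """Convert a page count to whole megabytes."""
    return pages * PAGE_KB // 1024

def range_mb(start: int, end: int) -> int:
    """Size of a virtual address range in whole megabytes."""
    return (end - start) // (1024 * 1024)

total_mb = pages_to_mb(474880)                 # "On node 0 totalpages"
bank_sum_mb = 767 + 848 + 240                  # "Memory: 767MB 848MB 240MB"
highmem_kb = (109344 + 1248) * PAGE_KB         # HighMem pages plus their memmap
lowmem_mb = range_mb(0x80000000, 0xD9000000)   # lowmem line
vmalloc_mb = range_mb(0xD9800000, 0xF2000000)  # vmalloc line
accounted_kb = 1869836 + 227316                # available + reserved

print(total_mb, bank_sum_mb, highmem_kb, lowmem_mb, vmalloc_mb, accounted_kb)
```

The available and reserved figures add up to exactly 2GB, so the kernel does see all of the installed memory.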

Besides trying to change the fb0base address, I have tried to change the gpu address in code via

phys = memblock_alloc_base(imx6q_gpu_pdata.reserved_mem_size, SZ_4K, SZ_2G);

I have also tried changing SZ_2G to 0x90000000, after which phys ends up at address 0x85000000.
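The overlap check between fb0base and the GPU reservation can be sketched as a simple interval test. The two base addresses are the ones quoted in this post; the framebuffer size, the GPU reservation size, and the DDR window (0x10000000 to 0x90000000 for 2GB) are assumptions made purely for illustration.

```python
def overlaps(start_a: int, size_a: int, start_b: int, size_b: int) -> bool:
    """True if the half-open ranges [start, start + size) intersect."""
    return start_a < start_b + size_b and start_b < start_a + size_a

def inside(start: int, size: int, lo: int, hi: int) -> bool:
    """True if [start, start + size) lies entirely within [lo, hi)."""
    return lo <= start and start + size <= hi

DDR_LO, DDR_HI = 0x10000000, 0x90000000  # assumption: 2GB mapped from 0x10000000

FB0_BASE = 0x27B00000
FB0_SIZE = 1280 * 720 * 4 * 3            # hypothetical: 32bpp, triple buffered
GPU_BASE = 0x85000000
GPU_SIZE = 128 * 1024 * 1024             # hypothetical reserved_mem_size

print("overlap:", overlaps(FB0_BASE, FB0_SIZE, GPU_BASE, GPU_SIZE))
print("fb0 in DDR:", inside(FB0_BASE, FB0_SIZE, DDR_LO, DDR_HI))
print("gpu in DDR:", inside(GPU_BASE, GPU_SIZE, DDR_LO, DDR_HI))
```

With the real sizes from the board file substituted in, the same three checks apply.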

I've tried running this memory at lower speeds (800 and 1066), verified CS0_END (tried 0x47 and others), and am now at a loss. KK4.4 is even worse on this particular board: it never makes it to the GUI at all and crashes earlier. I have been focusing on JB4.3, as the issue likely has the same root cause.
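For reference, this is how I understand the 0x47 value for CS0_END to come about. The sketch assumes the field counts 32MB (256Mbit) granules up to the absolute end address of CS0 and that DDR starts at 0x10000000; the function name is mine, not from the BSP.

```python
DDR_BASE = 0x10000000       # assumption: start of the DDR window on i.MX6
GRANULE = 32 * 1024 * 1024  # assumption: CS0_END counts 32MB units

def cs0_end(cs0_size_bytes: int, ddr_base: int = DDR_BASE) -> int:
    """Field value for a chip select of the given size starting at ddr_base."""
    return (ddr_base + cs0_size_bytes) // GRANULE - 1

GIB = 1024 ** 3
print(hex(cs0_end(2 * GIB)))  # 2GB on CS0
print(hex(cs0_end(1 * GIB)))  # 1GB on CS0
```

Under these assumptions, 2GB on CS0 gives 0x47, which matches the value tried above.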

----

One more thing I forgot to mention: I'm using a 2g/2g kernel split. I recently found this post for an older version of Android on the i.MX53 processor, where changing from a 3g/1g split required new gpulibs binaries compiled for the 2g/2g split (though I don't think this matters for this newer kernel/user space).

I have tried the default split of 3g/1g and found that the galcore daemon hangs and the GUI becomes unresponsive. When changing fb0base (in JB4.3) with this split to 0x27b00000, the system has the same null pointer issue as seen in the logs.

One question I have around fb0base: It shouldn't matter what this is if the splash isn't set in u-boot, is this correct?

----

If anyone has any extra questions, please ask me and I will get back to you as soon as I am able. Thank you!

Original Attachment has been moved to: null_pointer.log.zip

psidhu · ‎01-12-2015

While running some tests to see how the system would respond, I found that when I set the maximum CPU frequency to 396000 (kHz), the system seems very stable. However, when I push the frequency to 1GHz, the system crashes in the same manner as in the original post. Does this give anyone clues as to what the problem might be?
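One way to narrow this down further would be to step through the intermediate operating points and note the highest one that survives a stress run. The sketch below only shows the search logic: the frequency list and the `is_stable` callback are hypothetical, and on a real board the callback would cap the frequency (for example through cpufreq's scaling_max_freq) and then run a workload.

```python
from typing import Callable, Iterable, Optional

def highest_stable_freq(freqs_khz: Iterable[int],
                        is_stable: Callable[[int], bool]) -> Optional[int]:
    """Walk the frequencies in ascending order and return the last stable one."""
    best = None
    for khz in sorted(freqs_khz):
        if not is_stable(khz):
            break  # stop at the first failing operating point
        best = khz
    return best

# Hypothetical operating points and a fake stability result for illustration
candidates = [996000, 396000, 792000]
result = highest_stable_freq(candidates, lambda khz: khz <= 396000)
print(result)
```

Since higher operating points usually also run at higher core voltages, a failure that tracks frequency can point at either timing or supply problems.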

frankburgdorf · ‎03-13-2015

Hi Pushpal Sidhu,

We are facing similar problems here: https://community.freescale.com/message/491460#491460

Did you find a solution to your problem that you would like to share with us?

Greetings

Frank

psidhu · ‎03-16-2015

Hi Frank,

We've since determined that the cause of this specific problem was insufficient voltage on the VDD_SOC line. Because of trace loss and similar effects, the VDD_SOC voltage measured at the i.MX itself was too low by several tens of mV, even though the PMIC was providing the correct voltage. The problem got worse when we put the LDOs in bypass mode, which caused a further voltage drop on the LDO_SOC line (the actual voltage used internally in the chip).

I would suggest that you look at this voltage line. I would also suggest that you bump your setpoint voltage by ~35mV, since I found that the 25mV of slop that Freescale added was, in general, insufficient. You can see this patch to see what I mean. You can also test this by adding a wire in parallel with the trace to mitigate trace loss.
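As a rough illustration of how a new setpoint could be derived from a measurement at the SoC pins, here is a minimal sketch. All numbers in it are hypothetical (nominal voltage, measured voltage, extra margin, and a 25mV regulator step), so substitute the values for your own PMIC and board.

```python
def bumped_setpoint_mv(nominal_mv: int, measured_at_soc_mv: int,
                       margin_mv: int = 10, step_mv: int = 25) -> int:
    """Raise the setpoint by the measured drop plus a margin, rounded up to a step."""
    drop_mv = max(nominal_mv - measured_at_soc_mv, 0)
    needed_mv = nominal_mv + drop_mv + margin_mv
    steps = -(-needed_mv // step_mv)  # ceiling division on integers
    return steps * step_mv

# Hypothetical example: 1175 mV requested, 1150 mV seen at the SoC
print(bumped_setpoint_mv(1175, 1150))
```

Measure again at the SoC after the change, since the drop grows with load current.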

- Pushpal

frankburgdorf · ‎04-24-2015

Thank you for the hint about the voltage. We finally found the cause of our problem: it was an impedance problem on the address/data lines to the DDR3 RAM. We changed the source impedance on the i.MX6 and the problem went away. It was quite tricky to find, as we started with the impedance setup from the Sabre board, which did not work for us even though the layout was similar.

leoschwab · ‎01-08-2015

Random Guess: It almost sounds like you have flaky memory, or flaky memory timings. Just for laughs, have you tried running Google's stressapptest on the machine over as much memory as you can grab? It was very helpful identifying our memory issues.

psidhu · ‎01-08-2015

Hi, thanks for responding! I've been toying with the stressapptest in JB4.3. I was able to test as much as 1200MB, but the application always passed. If you have any other insights, please let me know!

leoschwab · ‎01-12-2015

Pushpal Sidhu wrote:

If you have any other insights, please let me know!

Not really, I'm afraid. It sounds like your RAM is probably okay. If you want to test more RAM, you could try adding to the kernel boot command line "init=/bin/bash" (or whatever shell Android uses). This will drop you into a shell immediately after the kernel boots, with none of the apps/daemons started. This should leave you with more free memory which you can hand over to stressapptest for testing.
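To size the test automatically, the amount handed to stressapptest could be derived from MemFree in /proc/meminfo. This is only a sketch: the function name, the headroom, and the run time are assumptions, while -M (megabytes to test) and -s (seconds to run) are regular stressapptest options.

```python
def stressapptest_argv(meminfo_text: str, headroom_mb: int = 64,
                       seconds: int = 600) -> list:
    """Build a stressapptest command line that tests most of the free memory."""
    free_kb = None
    for line in meminfo_text.splitlines():
        if line.startswith("MemFree:"):
            free_kb = int(line.split()[1])  # value is reported in kB
            break
    if free_kb is None:
        raise ValueError("MemFree not found in meminfo text")
    test_mb = max(free_kb // 1024 - headroom_mb, 0)
    return ["stressapptest", "-M", str(test_mb), "-s", str(seconds)]

# Hypothetical meminfo excerpt for illustration
sample = "MemTotal:        1869836 kB\nMemFree:         1638400 kB\n"
print(" ".join(stressapptest_argv(sample)))
```

On the target, the text would come from reading /proc/meminfo in the bare shell, before any Android services have started.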

jamesbone · ‎01-08-2015

We are discussing your issue internally. I apologize if this takes some time due to the holiday season.

psidhu · ‎01-08-2015

Great, thank you for this response. If any additional information is required, please let me know. I should mention that this board has been run through DRAM calibration from -40°C to 80°C and has been stress tested with the values we received.
