Specific Android 3D operations seem to corrupt the memory

frankburgdorf
Contributor II

We are experiencing memory corruption on an i.MX6 Dual (not Lite) with Android 4.3 when using the 3D GPU. Random memory locations are overwritten, mostly with zeroes. The affected locations range from user memory (resulting in application crashes) and kernel data, including page tables (resulting in Oopses in various locations), to kernel code (resulting in illegal instruction traps in internal kernel functions). Occasionally framebuffer memory is hit as well, which is visible as black pixels.

The hardware is our custom board based on the i.MX6 Dual, silicon rev. 1.3 (internal revision 5). The memory layout is 1 GB of DDR3_x64 memory. We have a 24-bit LCD with 800 x 480 pixel resolution attached to the parallel LCD interface (no HDMI or LVDS). We have already run the DDR stress test tool[1] from Freescale and applied the resulting timing parameters, with no visible change.

The issue can be reproduced with our custom Android distribution based on 4.3 with kernel version 3.16 from kernel.org and also with kernel version 3.10.53 from Freescale (git tag kk4.4.3_2.0.0-ga).

Running the same software on an i.MX6 Quad SABRE-SD board (i.MX6 Quad with silicon rev. 1.1) does NOT produce the issue. Running the same software on a Wandboard Quad with silicon rev. 1.2 (www.wandboard.org), however, does produce the issue. Even the original Wandboard image "android-4.4.2-wandboard-2014 0815" shows the problem.

How to trigger the problem:

The issue can be triggered by repeatedly starting the built-in Android web browser and rendering the Google homepage[2]. Other 3D rendering operations can also trigger the issue.

Disabling the hardware accelerated UI rendering in Android (build variable USE_OPENGL_RENDERER) prevents the issue. This is however not a viable solution because it makes the UI sluggish and the issue might also be triggered by other (still unknown) operations.

As a test, we applied a kernel patch that marks the kernel code section read-only in the MMU, so that write accesses by the CPU are trapped and result in a kernel error. The issue still persists (we see illegal instruction traps), so these writes do not seem to come from the CPU but from another SoC engine capable of writing to memory, such as the 3D GPU.

We have already spent a lot of time on this issue, reviewing the hardware, interfaces and power supply. As the problem can also be reproduced on the Wandboard, it seems to be independent of our specific hardware design.

The Wandboard also uses the parallel LCD interface, while the SABRE uses HDMI. So we disabled the physical display interface and ran tests that computed memory checksums after 3D operations. The problem was still there.

Maybe the i.MX6 silicon revision has some effect. We currently do not have a rev. 1.1 chip at hand, so we cannot check this on our hardware.
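For readers who want to try a similar check, here is a minimal sketch of a checksum-based corruption test. It is not the test code used above (which was not posted); the names `make_canary`, `checksum`, and `find_corruption` and the fill byte are assumptions chosen for illustration. It only covers memory owned by the test process itself, so it cannot see corruption in kernel or framebuffer memory.

```python
import zlib

# Hypothetical fill byte: non-zero, so stray zero writes stand out.
PATTERN = 0xA5

def make_canary(size: int) -> bytearray:
    """Allocate a buffer filled with a known non-zero pattern."""
    return bytearray([PATTERN]) * size

def checksum(buf) -> int:
    """CRC32 of the buffer, taken before and after the 3D workload."""
    return zlib.crc32(buf) & 0xFFFFFFFF

def find_corruption(buf):
    """Return (offset, length) runs where the buffer no longer holds PATTERN."""
    runs = []
    start = None
    for i, b in enumerate(buf):
        if b != PATTERN:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(buf) - start))
    return runs
```

The idea is to record `checksum(buf)` once, run the 3D workload, and compare. If the checksum changed, `find_corruption(buf)` gives the offsets and lengths of the damaged runs, which helps to see whether the overwrites are aligned or have a typical burst size.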

Any suggestions are helpful.

[1] https://community.freescale.com/docs/DOC-96412

[2] https://google.com

1 Solution
frankburgdorf
Contributor II

The cause of the problems was actually an impedance problem on the DDR3 address/data lines. We changed the source impedance of the lines in the i.MX6 and the problem went away. The initial impedance setup was copied from the SABRE board, which did not really work for our board. It was quite difficult to track down, as these problems only happened with combined 2D/3D operations.

5 Replies
psidhu
Contributor III

Hi Frank,

I posted a reply to your inquiry in another thread, but am reposting it here for completeness.

----

We've since determined that the cause of this specific problem was an insufficient voltage on the VDD_SOC line. Because of trace loss etc., we found that at the i.MX itself, the voltage for VDD_SOC was too low by several tens of mV even though the PMIC was providing the correct voltage. The problem was made worse when we put the LDOs in bypass mode, which caused an even further voltage drop on the LDO_SOC line (the actual voltage used internally in the chip).

I would suggest that you look at this voltage line. I would also suggest that you bump your setpoint voltage by ~35mV, since I found that the 25mV slop that Freescale added was, in general, insufficient. You can look at this patch to see what I mean. You can also test this by adding a wire in parallel with the trace to mitigate trace loss.

- Pushpal

frankburgdorf
Contributor II

While further analyzing the problem and checking the MMDC parameters, we discovered the Latency Hiding Disable feature (LHD, bit 18 in MMDC_MDMISC). From the reference manual:

This is a debug feature. When set to "1" the MMDC will handle one read/write access at a time. Meaning that the MMDC pipe-line will be limitted to 1 open access (next AXI address phase will be acknowledged if the current AXI data phase had finished)

Enabling this feature drastically improves the memory corruption situation. While we previously saw memory corruption after about 10-20 starts of the Android browser, we now have several test systems still running after more than 9000 starts. We suspect this might have a major effect on performance, so we are still looking for a better solution.
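As an illustration of the bit manipulation involved, here is a small sketch that sets, clears, and tests the LHD bit in a 32-bit MMDC_MDMISC value. Only the bit position (18) comes from the description above. The helper names and the example register value are hypothetical, and how the value is actually written to the register (boot loader, DCD table, or a debug tool) is not covered here.

```python
# Bit 18 of MMDC_MDMISC is LHD, as stated in the post above.
LHD_BIT = 18
LHD_MASK = 1 << LHD_BIT  # 0x00040000

def enable_lhd(mdmisc: int) -> int:
    """Return the register value with LHD set (one access at a time)."""
    return (mdmisc | LHD_MASK) & 0xFFFFFFFF

def disable_lhd(mdmisc: int) -> int:
    """Return the register value with LHD cleared (normal pipelining)."""
    return mdmisc & ~LHD_MASK & 0xFFFFFFFF

def lhd_enabled(mdmisc: int) -> bool:
    """True if the LHD bit is set in the given register value."""
    return bool(mdmisc & LHD_MASK)

if __name__ == "__main__":
    current = 0x00001740  # hypothetical example value, not from a real board
    print(hex(enable_lhd(current)))
```

All other bits of the register are left unchanged, which matters because MMDC_MDMISC also holds unrelated DDR configuration fields.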

psidhu
Contributor III

That's great, I'm glad you found your problem.

igorpadykov
NXP Employee

Hi Frank,

One can try to narrow down the issue further by trying the kernel command line cma parameter, as discussed in these threads:

kernel hangs or fails to allocate CMA in galcore with i.MX Solo Sabre

Linux kernel sometimes lock for a few instants

With android-4.4.2, one can also try disabling wait mode (enable_wait_mode=off) and decreasing the operating frequency to reduce noise (arm_freq=800, ldo_active=on).

Also, one can recheck that the parameter maxcpus=2 is used for the Dual.
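To keep such experiments reproducible, the parameters above can be merged into an existing kernel command line with a small helper. This is only a sketch: the function name `merge_cmdline` and the base command line in the test are assumptions, and the resulting string still has to be passed to the kernel through whatever mechanism the boot loader on the board uses.

```python
# Debug parameters taken from the suggestions above.
DEBUG_PARAMS = {
    "enable_wait_mode": "off",
    "arm_freq": "800",
    "ldo_active": "on",
    "maxcpus": "2",
}

def merge_cmdline(cmdline: str, overrides: dict) -> str:
    """Replace existing key=value tokens and append the missing ones."""
    seen = set()
    out = []
    for tok in cmdline.split():
        key = tok.split("=", 1)[0]
        if key in overrides:
            # Keep the position of the first occurrence, drop later repeats.
            if key not in seen:
                out.append(f"{key}={overrides[key]}")
                seen.add(key)
        else:
            out.append(tok)
    for key, value in overrides.items():
        if key not in seen:
            out.append(f"{key}={value}")
    return " ".join(out)
```

Changing one parameter at a time and rerunning the browser test makes it easier to tell which setting actually affects the corruption rate.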

Best regards

igor
