
Cache Corruption on MX6UL(L)

Question asked by Ahmad Fatoum on Sep 3, 2019

During work on the barebox bootloader, we noticed that a particular sequence
of events reliably triggers D-Cache and I-Cache corruption on the
i.MX6 UltraLite and UltraLiteLite. We are asking NXP to take a look at it.

Attached is a self-contained binary that, when loaded to and started at
0x9fe00000, parses an embedded device tree and then, at $pc = 0x9fe66ffc, resets
the SoC by writing to the watchdog at 0x020bc000. The only MMIO accesses the
binary performs are to the serial port at 0x02020000 and, after a successful
run, to the watchdog at 0x020bc000.
It doesn't do any cache maintenance and shouldn't need to:
it doesn't relocate itself, and it does no MMU reconfiguration and no DMA.
This can also be verified by running it in user mode under Linux using the
attached linux-loader.c.

The binary is called after data caches have been flushed and instruction caches
were invalidated, but with the MMU enabled.

Observations on the MCIMX6ULL-EVK:
- When run under Linux, the binary reaches the expected location.
- When run from U-Boot with data caches _off_, the binary reaches the expected
location.
- When run from U-Boot with data caches _on_, the binary experiences instruction
and data cache corruption. User-visible effects vary:

* system hangs without serial output
* corrupted strings are printed to the serial console then system hangs
* the U-Boot exception handler is triggered with data abort or undefined
instruction and system resets
* the U-Boot exception handler is triggered, but experiences corrupted
instructions itself and system locks up. Even issuing a CPU halt over JTAG
fails in this case

Steps to reproduce:

1) Flash an SD card with the 6ul-corruption.sdcard image in the attached zip file.
This image contains the NXP U-Boot as bootloader, as well as two binaries in
the FAT partition: "corruption-yes" and "corruption-no".

 host$ dd if=6ul-corruption.sdcard of=/dev/sdc



2) Load "corruption-yes" with U-Boot and wait until the system hangs:

 => fatload mmc 1:1 0x9fe00000 corruption-yes
reading corruption-yes
870648 bytes read in 160 ms (5.2 MiB/s)
=> dcache flush
=> icache flush
=> go 0x9fe00000
## Starting application at 0x9FE00000 ...
start.c: runtime offset at 0x00000000, text 0x9fe00000, barebox_base=0x9fe00000
start.c: memory at 0x9f800000, size 0x00800000
astart.c: initializing malloc pool at 0x9fb00000 (end 0x9fe00000)
start.c: starting barebox...

>core
uaaaasing boarddata provided DTB
start.c: barebox_arm_boot_dtb: using barebox_boarddata
using boarddata provided DTB
Will either experience cache corruption or continue...
usf1
error



(Note the gibberish in the line before "error", which originates from corrupted data.)

3) Reset and boot with the data cache off; the output is then as expected:

 => fatload mmc 1:1 0x9fe00000 corruption-yes
reading corruption-yes
870648 bytes read in 160 ms (5.2 MiB/s)
=> dcache flush
=> dcache off
=> icache flush
=> go 0x9fe00000
## Starting application at 0x9FE00000 ...
start.c: runtime offset at 0x00000000, text 0x9fe00000, barebox_base=0x9fe00000
start.c: memory at 0x9f800000, size 0x00800000
astart.c: initializing malloc pool at 0x9fb00000 (end 0x9fe00000)
start.c: starting barebox...

>core
uaaaasing boarddata provided DTB
start.c: barebox_arm_boot_dtb: using barebox_boarddata
using boarddata provided DTB
Will either experience cache corruption or continue...
1
2
3
4
Reached end successfully

U-Boot 2017.03-00887-g5a61b28d205f (Aug 27 2019 - 08:55:30 +0200)



(The system resets after printing the success message to the console.)

----

When it's possible to halt the SoC via JTAG, the fact that cache corruption
occurred can sometimes be observed with a JTAG debugger:

- read the memory region at 0x9fe00000+0x80000 from the CPU's viewpoint
(utilizing the caches)
- clean the data cache to the point of coherence
- again, read the memory region at 0x9fe00000+0x80000 from the CPU's viewpoint.

With OpenOCD on an i.MX6UL:

 host$ openocd --log_output 6ul-cache-corruption.log
host$ telnet 127.0.0.1 4444
Open On-Chip Debugger
> halt
> imx6.cpu.0 cache auto 0
> mdw 0x9fe00000 0x80000
> echo "-----clean-----"
> imx6.cpu.0 cache l1 d clean 0x9fe00000 0x80000
> mdw 0x9fe00000 0x80000



The observation is that the two memory dumps sometimes differ in a cache line.
This implies that, after cleaning, there remained non-dirty cache lines whose
content differs from what is in a lower-level cache, i.e. they had been corrupted:

 --- mem-pre-clean 2019-08-22 16:00:13.040244665 +0200
+++ mem-post-clean 2019-08-22 16:00:12.068260582 +0200

-0x9fe00100: 4606fffb f06fb928 46500a0b e8bdb009 46018ff0 f03e4628 4683f899 d0392800
-0x9fe00120: 4606fffb f06fb928 46500a0b e8bdb009 46018ff0 f03e4628 4683f899 d0392800
+0x9fe00100: 49214a20 f80cf03e b9e04682 463b68e2 f113a904 d21a33ff 46436922 33fff113
+0x9fe00120: 2301d21a 9300aa04 9b024658 f03d4917 4682ffbc 4240b1c8 fdbcf000 46024914



When dumping L1 I-Cache or L1 D-Cache with MCR p15, 3, , c15, c4
unexpected invalid instructions can also be observed.
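
The pre-clean/post-clean comparison described above can also be done
programmatically once both dumps are available as raw byte buffers. The
following is only a minimal sketch: the function name and the way the dumps
are obtained are hypothetical, and the 64-byte line size is an assumption
based on the Cortex-A7 L1 D-Cache (the two changed dump rows above,
0x9fe00100 to 0x9fe0013f, happen to span exactly 64 bytes).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: 64-byte lines, as in the Cortex-A7 L1 D-Cache. */
#define CACHE_LINE_SIZE 64u

/*
 * Compare two dumps of the same memory region, one taken before and one
 * after cleaning the D-Cache, and report every cache line that differs.
 *
 * pre, post: dump contents, len bytes each
 * base:      address of the first dumped byte (assumed line-aligned)
 * out:       stream for the report, or NULL to only count
 *
 * Returns the number of differing cache lines.
 */
static size_t diff_cache_lines(const uint8_t *pre, const uint8_t *post,
                               size_t len, uint32_t base, FILE *out)
{
    size_t differing = 0;

    for (size_t off = 0; off < len; off += CACHE_LINE_SIZE) {
        /* The last line may be partial if len is not a multiple of 64. */
        size_t n = len - off < CACHE_LINE_SIZE ? len - off : CACHE_LINE_SIZE;

        if (memcmp(pre + off, post + off, n) != 0) {
            if (out)
                fprintf(out, "cache line at 0x%08lx differs\n",
                        (unsigned long)(base + off));
            differing++;
        }
    }

    return differing;
}
```

A non-zero return value for a region that the program never writes to, such
as its text section, would point to the same kind of discrepancy as the diff
above.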


Also in the FAT partition is corruption-no, which differs in that just two
instructions have been swapped:

 --- corruption-yes.thumb 2019-08-27 13:00:11.119154084 +0200
+++ corruption-no.thumb 2019-08-27 13:00:11.359152538 +0200
@@ -124494,6 +124494,6 @@
3e662: 4614 mov r4, r2
3e664: bebe bkpt 0x00be
- 3e666: 2201 movs r2, #1
- 3e668: fab3 f383 clz r3, r3
+ 3e666: fab3 f383 clz r3, r3
+ 3e66a: 2201 movs r2, #1
3e66c: bebe bkpt 0x00be
3e66e: 0320 lsls r0, r4, #12



This transformation should always be safe, especially as these two
instructions are unreachable and never executed. However, on the i.MX6UL(L) it
alters the program flow and lets the program terminate successfully, as if the
caches were off.
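
The only layout difference between the two binaries follows from the Thumb-2
encoding: movs is a 16-bit instruction and clz a 32-bit one, so swapping them
moves clz from offset 0x3e668 to 0x3e666 while the following bkpt stays at
0x3e66c. The sketch below shows this arithmetic; the helper names are made up
for illustration and nothing here is taken from the attached binaries beyond
the halfwords shown in the listing.

```c
#include <stdint.h>

/*
 * Size in bytes of the Thumb instruction whose first halfword is hw1.
 * In Thumb-2, a first halfword with top five bits 0b11101, 0b11110 or
 * 0b11111 starts a 32-bit instruction; any other value is a 16-bit one.
 */
static unsigned thumb_insn_size(uint16_t hw1)
{
    unsigned top5 = hw1 >> 11;

    return (top5 == 0x1d || top5 == 0x1e || top5 == 0x1f) ? 4 : 2;
}

/* Offset of the instruction that follows the one at off. */
static uint32_t thumb_next(uint32_t off, uint16_t hw1)
{
    return off + thumb_insn_size(hw1);
}
```

With the halfwords from the listing, thumb_next(0x3e666, 0x2201) gives 0x3e668
for corruption-yes and thumb_next(0x3e666, 0xfab3) gives 0x3e66a for
corruption-no; in both cases the second instruction ends at 0x3e66c.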

Placing hardware breakpoints to cover 0x9fe3e666-0x9fe3e66b has the effect
of "correcting" the runtime behavior and the binary runs to completion.
However, neither breakpoints nor watchpoints at this location are ever triggered.
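
For reference, the run-time addresses of the two swapped instructions follow
from the load address and the offsets in the disassembly. This is a small
worked sketch with hypothetical names; it assumes, as the start.c output
indicates (runtime offset 0x00000000), that the image runs exactly where it
was loaded.

```c
#include <stdint.h>

/* Inclusive range of byte addresses. */
struct addr_range {
    uint32_t first;
    uint32_t last;
};

/* Map an offset within the image to the addresses it occupies at run time. */
static struct addr_range runtime_range(uint32_t load_addr, uint32_t offset,
                                       uint32_t len)
{
    struct addr_range r;

    r.first = load_addr + offset;
    r.last = r.first + len - 1;
    return r;
}

/* Non-zero if addr lies inside r. */
static int range_contains(struct addr_range r, uint32_t addr)
{
    return addr >= r.first && addr <= r.last;
}
```

For a load address of 0x9fe00000, an offset of 0x3e666 and a length of 6 bytes
(2-byte movs plus 4-byte clz), this yields 0x9fe3e666 through 0x9fe3e66b. The
reset location 0x9fe66ffc lies outside that range.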

This issue is reproducible when invoking the binary from U-Boot
imx_v2017.03_4.9.88_2.0.0_ga, U-Boot 2019.12.0-rc2 as well as barebox v2019.07.0.
To avoid a collision between the binary's load address and U-Boot's reserved
memory, two U-Boot patches are attached. Apply the first to uboot-imx and both to
upstream U-Boot. The U-Boot config used is mx6ull_14x14_evk_defconfig.
