imx8m nano 1GB DDR4 linux memory errors

andreas2
Contributor I

Hi,

We have a custom board based on the imx8mn-ddr4-evk board, with some modifications. Among them is a 1GB RAM (https://www.micron.com/-/media/client/global/documents/products/data-sheet/dram/ddr4/8gb_ddr4_sdram.... MT40A512M16LY-075:E - RPA settings attached), which differs from the part on the EVK. Earlier revisions of the board used a 2GB RAM with no issues, but since changing to 1GB we are seeing lots of random linux crashes that appear to be memory related (RAM corruption of some kind). I've used the DDR tool to generate a new ddr4_timing.c for u-boot, and its stress tests have run for 2 days with no issues. The same holds in u-boot: running the mtest command over the whole RAM (except the area occupied by u-boot itself) finds no issues.

I've had to disable optee support for the board, as it would always report an incorrect RAM size of 2GB to linux no matter what I changed the PHYS_SDRAM_SIZE define to. After disabling optee, u-boot correctly detects 1GB and passes this to linux.

All issues observed have been in linux, so I am really struggling to find the cause. We are running a version forked from imx linux 5.4.3-2.0.0 (u-boot 2009.04-5.4.3-2.0.0). What I have done is:

yocto:

- Remove "optee" from board config

u-boot:

- Update ddr4_timing.c with new data from mscale ddr tool. Two frequency points (1200MHz and 533MHz, same as DDR4 EVK). Stress test with no issues for 48 hours. 

- Change PHYS_SDRAM_SIZE to 0x40000000 (1GB)

- Kept CONFIG_NR_DRAM_BANKS at 2 (unsure if this should be 1 or 2; it is hard to find good documentation about what this actually configures). I have tried different options here; the issue still persists.
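
As a quick sanity check on the frequency points in the first u-boot item, the sketch below converts a DDR clock into a data rate and clock period and compares it with the part's speed grade. It assumes the 1200MHz figure is the DDR clock (not the transfer rate) and that the -075 suffix in MT40A512M16LY-075:E denotes a minimum clock period of 0.75 ns; both should be confirmed against the DDR tool output and the Micron datasheet. The function names are illustrative only.

```python
# Assumption: the "-075" speed grade means a minimum clock period (tCK) of 0.75 ns.
SPEED_GRADE_MIN_TCK_NS = 0.75


def tck_ns(clock_mhz):
    """Clock period in nanoseconds for a clock given in MHz."""
    return 1000.0 / clock_mhz


def data_rate_mts(clock_mhz):
    """DDR transfers data on both clock edges, so MT/s = 2 x clock in MHz."""
    return 2 * clock_mhz


def within_speed_grade(clock_mhz, min_tck_ns=SPEED_GRADE_MIN_TCK_NS):
    """True if the requested clock is not faster than the speed grade allows."""
    return tck_ns(clock_mhz) >= min_tck_ns


if __name__ == "__main__":
    for mhz in (1200, 533):
        print(f"{mhz} MHz -> {data_rate_mts(mhz)} MT/s, "
              f"tCK = {tck_ns(mhz):.3f} ns, "
              f"within speed grade: {within_speed_grade(mhz)}")
```

This only checks the upper frequency bound. It says nothing about the lower set point or about the individual timing parameters in ddr4_timing.c, which still need to come from the RPA spreadsheet and the datasheet.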

linux:

- Reduced cma size in imx8mn.dtsi (included from our DTS)

reserved-memory {
    #address-cells = <2>;
    #size-cells = <2>;
    ranges;

    /* global autoconfigured region for contiguous allocations */
    linux,cma {
        compatible = "shared-dma-pool";
        reusable;
        size = <0 0x200000>;
        alloc-ranges = <0 0x40000000 0 0x40000000>;
        linux,cma-default;
    };
};

- Reduced memory node size in imx8mn.dtsi

memory@40000000 {
    device_type = "memory";
    reg = <0x0 0x40000000 0 0x40000000>;
};

- Moved RPmsg shared memory to start of first gigabyte:

    reserved-memory {
        #address-cells = <2>;
        #size-cells = <2>;
        ranges;
        rpmsg_reserved: rpmsg@0x40000000 {
            no-map;
            reg = <0 0x40000000 0 0x400000>;
        };
    };
};

&rpmsg {
    /*
     * 64K for one rpmsg instance:
     * --0xb8000000~0xb800ffff: pingpong
     */
    vdev-nums = <1>;
    reg = <0x0 0x40000000 0x0 0x10000>;
    status = "okay";
};
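
One thing that can be checked mechanically is whether every address used in the device tree still lies inside the 1GB DRAM window (0x40000000 up to, but not including, 0x80000000). The following is a minimal sketch of such a check, using only the base/size pairs quoted in the snippets above. The dictionary keys and helper names are made up for illustration, and a real check would need to cover every node inherited from imx8mn.dtsi and the board DTS, not just these.

```python
DRAM_BASE = 0x40000000   # start of DRAM, from the memory node
DRAM_SIZE = 0x40000000   # 1GB, matching PHYS_SDRAM_SIZE

# (base, size) pairs copied from the snippets in the post.
REGIONS = {
    "rpmsg_reserved (no-map)": (0x40000000, 0x400000),
    "rpmsg vdev buffer": (0x40000000, 0x10000),
    "cma alloc-ranges": (0x40000000, 0x40000000),
    # Address still mentioned in the &rpmsg comment (older layout).
    "address in rpmsg comment": (0xB8000000, 0x10000),
}


def inside_dram(base, size, dram_base=DRAM_BASE, dram_size=DRAM_SIZE):
    """True if [base, base + size) lies entirely within the DRAM window."""
    return dram_base <= base and base + size <= dram_base + dram_size


def overlaps(a, b):
    """True if two (base, size) regions share at least one byte."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def outside_dram(regions=REGIONS):
    """Return the names of regions that are not fully inside DRAM."""
    return [name for name, (base, size) in regions.items()
            if not inside_dram(base, size)]


if __name__ == "__main__":
    print("outside DRAM:", outside_dram())
```

With the values above, only the 0xb8000000 address from the comment is flagged, which is expected since it is just a leftover comment. The same helpers could be pointed at any other carve-out (firmware, M-core images, OP-TEE) to see whether something from the 2GB layout still refers to memory above 0x80000000 or overlaps another reservation.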

Running stress-ng on the board will produce output similar to this:

stress-ng --vm 1 --vm-bytes 75% --vm-method all --verify -t 10m -v
stress-ng: debug: [489] 4 processors online, 4 processors configured
stress-ng: info: [489] dispatching hogs: 1 vm
stress-ng: info: [489] cache allocate: using defaults, can't determine cache details from sysfs
stress-ng: debug: [489] cache allocate: default cache size: 2048K
stress-ng: debug: [489] starting stressors
stress-ng: debug: [489] 1 stressor spawned
stress-ng: debug: [490] stress-ng-vm: started [490] (instance 0)
stress-ng: debug: [490] stress-ng-vm using method 'all'
stress-ng: fail: [491] flip: detected 247350 memory errors
[ 99.758827] ------------[ cut here ]------------
[ 99.763470] kernel BUG at arch/arm64/kernel/traps.c:405!
[ 99.768782] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 99.774268] Modules linked in: crct10dif_ce brcmfmac brcmutil snd_soc_dg_tlv320adcx140_b snd_soc_dg_tlv320adcx140_a snd_soc_pcm1774 rpmsg_char
[ 99.787062] CPU: 2 PID: 491 Comm: stress-ng-vm Not tainted 5.4.3-2.0.0+g38124af5f1d1 #1
[ 99.795062] Hardware name: CUT (DT)
[ 99.799939] pstate: 00000005 (nzcv daif -PAN -UAO)
[ 99.804739] pc : do_undefinstr+0x2e4/0x308
[ 99.808833] lr : do_undefinstr+0x1d8/0x308
[ 99.812926] sp : ffff800012113b10
[ 99.816237] x29: ffff800012113b10 x28: ffff0000224a8000
[ 99.821547] x27: 0000000000000002 x26: fffffe000069bb80
[ 99.826857] x25: ffff000027c63d80 x24: 0000000000000000
[ 99.832167] x23: 0000000060000005 x22: ffff8000101f7cd8
[ 99.837478] x21: ffff800012113cb0 x20: ffff0000224a8000
[ 99.842787] x19: ffff800012113b70 x18: 0000000000000000
[ 99.848097] x17: 0000000000000000 x16: 0000000000000000
[ 99.853406] x15: 0000000000000000 x14: 0000000000000000
[ 99.858716] x13: 0000000000000000 x12: 0000000000000000
[ 99.864026] x11: 0000000000000000 x10: ffff800012113bb0
[ 99.869336] x9 : 0000000000000001 x8 : ffff000022045230
[ 99.874646] x7 : 0000000000001240 x6 : ffff800012113b68
[ 99.879955] x5 : 00000000d5300000 x4 : ffff800011a449b0
[ 99.885265] x3 : 0000000049b00000 x2 : 0000000000000000
[ 99.890575] x1 : ffff0000224a8000 x0 : 0000000060000005
[ 99.895887] Call trace:
[ 99.898334] do_undefinstr+0x2e4/0x308
[ 99.902081] el1_undef+0x10/0x84
[ 99.905311] clear_huge_page+0x0/0x210
[ 99.909059] __handle_mm_fault+0x848/0x10b8
[ 99.913240] handle_mm_fault+0xdc/0x1a8
[ 99.917078] do_page_fault+0x130/0x460
[ 99.920826] do_translation_fault+0x5c/0x78
[ 99.925006] do_mem_abort+0x3c/0x98
[ 99.928492] el0_da+0x1c/0x20
[ 99.931462] Code: f9401bf7 17ffff7d a9025bf5 f9001bf7 (d4210000)
[ 99.937556] ---[ end trace eeb5c12cb1e867a2 ]---
[ 99.942175] note: stress-ng-vm[491] exited with preempt_count 1
[ 99.948342] ------------[ cut here ]------------

This is just one example. The type of kernel crash varies quite a lot, so something is getting corrupted somewhere.
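
Since the crashes vary, it may help to reduce each oops to a few comparable fields. The sketch below is a hypothetical helper, not part of any kernel tooling: it pulls out the BUG site, decodes the bracketed word on the Code: line, and reports the frame listed directly after el1_undef. For the trace above, the word d4210000 decodes as BRK #0x800, the immediate arm64 Linux uses for BUG(), so the BUG itself comes from do_undefinstr. The frame after el1_undef (clear_huge_page+0x0) is then presumably where the kernel hit an undefined instruction, which would be consistent with corrupted kernel text or a corrupted instruction fetch. That interpretation is an assumption and should be checked against the kernel source for this version.

```python
import re

BRK_MASK = 0xFFE0001F  # fixed bits of the AArch64 BRK encoding
BRK_BASE = 0xD4200000  # encoding of BRK #0

# Call-trace lines look like "[ 99.902081] el1_undef+0x10/0x84".
FRAME_RE = re.compile(r"\]\s+(\w+)\+(0x[0-9a-f]+)/0x[0-9a-f]+\s*$", re.M)


def decode_brk(word):
    """Return the 16-bit immediate if word is a BRK instruction, else None."""
    if word & BRK_MASK != BRK_BASE:
        return None
    return (word >> 5) & 0xFFFF


def summarize_oops(log):
    """Extract BUG site, BRK immediate and the frame after el1_undef."""
    summary = {"bug_site": None, "brk_imm": None, "interrupted": None}
    m = re.search(r"kernel BUG at (\S+):(\d+)!", log)
    if m:
        summary["bug_site"] = f"{m.group(1)}:{m.group(2)}"
    m = re.search(r"Code:.*\(([0-9a-f]{8})\)", log)
    if m:
        summary["brk_imm"] = decode_brk(int(m.group(1), 16))
    frames = FRAME_RE.findall(log)
    names = [name for name, _ in frames]
    if "el1_undef" in names:
        idx = names.index("el1_undef")
        if idx + 1 < len(frames):
            name, offset = frames[idx + 1]
            summary["interrupted"] = (name, int(offset, 16))
    return summary


# A few lines copied from the oops in the post.
SAMPLE = """\
[ 99.763470] kernel BUG at arch/arm64/kernel/traps.c:405!
[ 99.804739] pc : do_undefinstr+0x2e4/0x308
[ 99.898334] do_undefinstr+0x2e4/0x308
[ 99.902081] el1_undef+0x10/0x84
[ 99.905311] clear_huge_page+0x0/0x210
[ 99.909059] __handle_mm_fault+0x848/0x10b8
[ 99.931462] Code: f9401bf7 17ffff7d a9025bf5 f9001bf7 (d4210000)
"""

if __name__ == "__main__":
    print(summarize_oops(SAMPLE))
```

Collecting these summaries over many boots would show whether the failures cluster around particular functions or physical address ranges, or are spread evenly, which is a different signature from a single bad driver.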

Any input on how to debug or solve this would be appreciated. As far as I can see from the porting guide, I have done all the modifications that are needed. Also, both the stress test and the memory test in u-boot run successfully, so I doubt that there is a hardware issue with the RAM. Unfortunately we cut the JTAG connector on this revision, so I don't have an easy way to do JTAG debugging of the board.


igorpadykov
NXP Employee

Hi Andreas

 

In the attached rpa.png, the total number of banks is set to 8; it seems it should be 4.

 

>We are running a version forked from imx linux 5.4.3-2.0.0 (u-boot 2009.04-5.4.3-2.0.0)

 

It seems you are using a non-NXP U-Boot (not from the https://source.codeaurora.org/external/imx/uboot-imx repository).

For Linux 5.4.3-2.0.0, U-Boot v2019.04 should be used:

https://source.codeaurora.org/external/imx/uboot-imx/tree/?h=imx_v2019.04_5.4.3_2.0.0

 

In general, it is recommended to use the latest NXP L5.10.35_2.0.0 Linux/U-Boot release (from the source.codeaurora.org/external/imx repository):

https://source.codeaurora.org/external/imx/uboot-imx/tree/?h=lf_v2021.04

https://source.codeaurora.org/external/imx/linux-imx/tree/?h=lf-5.10.y

 

>kept CONFIG_NR_DRAM_BANKS at 2 (unsure if this should be 1 or 2; hard to find good
>documenation about what this actually configures.. ) Have tried different options here, issue still persists.

 

There is no such "CONFIG_NR_DRAM_BANKS" setting in NXP U-Boot v2019.04_5.4.3_2.0.0:

https://source.codeaurora.org/external/imx/uboot-imx/tree/include/configs/imx8mn_evk.h?h=imx_v2019.0...

https://source.codeaurora.org/external/imx/uboot-imx/tree/include/configs/imx8mn_evk.h?h=lf_v2021.04

 

Best regards
igor


andreas2
Contributor I

Hi Igor,

As far as I can see from the datasheet, the number of banks is correct: there are 2 bank groups with 4 banks in each group. But I might have misunderstood this?
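
The bank count can be cross-checked against the device density. The sketch below multiplies out the geometry and compares it with the 512M x 16 organisation implied by the part name. The 2 bank groups and 4 banks per group come from the post; the 16 row address bits and 10 column address bits are assumptions for an 8Gb x16 DDR4 die and should be confirmed in the datasheet addressing table. The function name is illustrative.

```python
def ddr4_capacity_bits(bank_groups, banks_per_group, row_bits, col_bits, width_bits):
    """Total device capacity in bits from its addressing geometry."""
    banks = bank_groups * banks_per_group
    return banks * (1 << row_bits) * (1 << col_bits) * width_bits


# From the post: 2 bank groups x 4 banks each (8 banks in total).
# Assumed, to be verified in the datasheet: 16 row bits, 10 column bits, x16.
GEOMETRY_BITS = ddr4_capacity_bits(2, 4, 16, 10, 16)

# From the part name MT40A512M16: 512 Meg locations x 16 bits.
PART_NAME_BITS = 512 * 2**20 * 16

if __name__ == "__main__":
    print("geometry :", GEOMETRY_BITS // 2**30, "Gbit")
    print("part name:", PART_NAME_BITS // 2**30, "Gbit")
    print("bytes    :", hex(GEOMETRY_BITS // 8))
```

Under these assumptions, 8 banks give exactly 8Gbit (1GB, i.e. 0x40000000 bytes), while 4 banks with the same row and column widths would give only half of that.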

 

We are using the NXP u-boot; the only modifications are the ones that add our own board. The version string I quoted is the one printed on boot.

 

https://source.codeaurora.org/external/imx/uboot-imx/tree/configs/imx8mn_ddr4_evk_defconfig?h=lf_v20... has the CONFIG_NR_DRAM_BANKS=2 define.

 

 


igorpadykov
NXP Employee

Hi Andreas

 

I asked internally; below is the answer from the team:

-----------------------

I let the board run for a few hours and observed no crash with the same settings the customer provided.

For debugging the kernel crashes, one can try kdump and find out more from a dump. [1]

 

[1] http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html

-----------------------

Best regards
igor
