mpc8536e - getting ecc errors (post initialization)

admin · 09-10-2012

Hello,

I am working on a VxWorks BSP for a custom board that is based on the mpc8536rdk (but with 1GB memory), and we are getting ECC errors when the system is coming up. The errors do not always happen, but they happen most of the time. When they don't happen, the system comes up fine. We are enabling ECC and waiting for DDR_SDRAM_CFG_2[D_INIT] to clear etc., so I don't think it is due to memory not being initialized correctly. Between runs, the address reported (CAPTURE_ADDRESS) isn't the same. Also, when I stop at the DDR exception handler and look at CAPTURE_ADDRESS, CAPTURE_DATA_HI and CAPTURE_DATA_LO, the memory at that address looks like what it ought to be (it is code, and I can compare the bytes against the disassembly dump).

The memory is a Micron MT47H128M8CF-25E IT:H. We are reusing almost all of the DDR memory settings from the MPC8536RDK U-Boot code (except that it is 1GB, hence some differences in CS0_BNDS and CS0_CONFIG; only one chip select is used). I have also consulted application note 3369.

I am wondering what underlying cause would produce this kind of symptom.

Thanks!

genuap · 09-12-2012

Are you initializing DDR with known values? DDR_DATA_INIT could be 0x0, or something like 0xDEAD_BEEF, and you need to make sure you have DDR_SDRAM_CFG_2[D_INIT]=1.

This will cause the controller to initialize all memory and all syndrome bits to a known value before using memory. If you don't do this, the syndrome bits aren't initialized and it'll give you an error.
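For reference, the sequence described above could be sketched in C roughly as follows. This is only a sketch under assumptions, not BSP code: `ddr_regs_t` and `ddr_init_with_ecc` are hypothetical names, the two bit masks are assumptions to verify against the MPC8536E reference manual, and ECC enable plus all timing/chip-select programming are assumed to have been done before this point. The poll is bounded so a controller that never clears D_INIT shows up as an error instead of a hang.

```c
#include <stdint.h>

/* Hypothetical view of just the three registers used here. On real hardware
 * they sit at fixed offsets inside the DDR controller block in CCSR space. */
typedef struct {
    volatile uint32_t sdram_cfg;    /* DDR_SDRAM_CFG   */
    volatile uint32_t sdram_cfg_2;  /* DDR_SDRAM_CFG_2 */
    volatile uint32_t data_init;    /* DDR_DATA_INIT   */
} ddr_regs_t;

#define SDRAM_CFG_MEM_EN   0x80000000u  /* assumed mask for MEM_EN */
#define SDRAM_CFG2_D_INIT  0x00000010u  /* assumed mask for D_INIT */

/* Returns 0 once D_INIT has cleared, -1 if it is still set after max_polls
 * reads. Real code would also need the barriers/delays the manual requires. */
int ddr_init_with_ecc(ddr_regs_t *r, uint32_t pattern, unsigned long max_polls)
{
    r->data_init = pattern;               /* value written to every location */
    r->sdram_cfg_2 |= SDRAM_CFG2_D_INIT;  /* request data + ECC initialization */
    r->sdram_cfg |= SDRAM_CFG_MEM_EN;     /* enable the controller; init starts */

    while (max_polls--) {
        if ((r->sdram_cfg_2 & SDRAM_CFG2_D_INIT) == 0)
            return 0;                     /* hardware cleared D_INIT: done */
    }
    return -1;                            /* timed out */
}
```

The important ordering is that DDR_DATA_INIT and D_INIT are written before MEM_EN is set, and that nothing touches DDR until D_INIT reads back as 0.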

admin · 09-12-2012

Yes. DDR_DATA_INIT is 0 and D_INIT is 1, and we wait for DDR_SDRAM_CFG_2[D_INIT] to become 0. Also, the ECC errors don't seem to happen on the first accesses to memory: after the above initialization there is code that copies a payload to RAM and then executes it, and we seem to get these errors well into the execution from RAM (and, as mentioned above, sometimes the errors don't happen and the system comes up fine). So I don't think it is memory not being initialized correctly.

genuap · 09-12-2012

Interesting - what sort of errors are you seeing in ERR_DETECT? Single-bit or multi-bit?

Is this always shortly after you jump to DDR (or start using DDR)? I wonder if this could be related to synchronization and/or caching. When you set up the MMU in romInit.s, I assume you set it up once, with DDR as cacheable. In that case the writes out to ECC are likely delayed.

One test could be to:

1. Set up DDR as non-cacheable.

2. Initialize DDR.

3. Change the MMU entry to allow caching of DDR.

Does that change the behavior?

Have you tried initializing just via Workbench (using a reg file to initialize memory)? Does that work? Can you initialize memory and then do a memory test?

admin · 09-12-2012

No, not immediately. I added printfs of the DDR_ERR_DETECT register (write-to-clear in romInit.s) at various spots, and it doesn't get set in sysHwInit2, i.e. early but still far enough into RAM code. However, I have seen it (albeit more rarely) at the start of sysHwInit itself, which again is not immediately in RAM code, but earlier.

I didn't have TLB entries for DDR when setting it up. I have just added non-cacheable entries and tried again. It didn't make a difference, i.e. DDR_ERR_DETECT was set again at various spots. (The frequency seemed reduced, but that is usually a red herring: I have seen it become more frequent and less frequent at different times.)

I had also written a RAM test where (after explicitly turning off the L2 cache), in a loop over each 32-bit address spanning the entire memory range, I write a pattern, read it back, restore the original value, and look for DDR errors after each iteration. Running this right after romInit.s rarely fails (I have seen it fail exactly once in almost 100 runs). Running it later seems to fail more often. But again, the frequency seems to vary too much for me to get a good handle on it.
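A minimal sketch of that kind of test loop is below. It is not the actual test from the BSP: `ram_test` and the `ddr_err_fn` hook are hypothetical names, and the hook stands in for "read ERR_DETECT and report whether anything is latched". On hardware, the read-back only exercises DRAM if the region is uncached or the line is flushed between the write and the read.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook: returns nonzero if the DDR controller has latched an
 * error (on hardware this would read ERR_DETECT). May be NULL. */
typedef int (*ddr_err_fn)(void);

/* Walks `words` 32-bit locations starting at `base`. For each one: save the
 * current value, write the pattern, read it back, restore the saved value,
 * then check the read-back and the error hook. Returns the number of failing
 * locations; the index of the first one is stored in *first_bad if non-NULL. */
size_t ram_test(volatile uint32_t *base, size_t words, uint32_t pattern,
                ddr_err_fn err_latched, size_t *first_bad)
{
    size_t failures = 0;

    for (size_t i = 0; i < words; i++) {
        uint32_t saved = base[i];
        base[i] = pattern;
        uint32_t readback = base[i];
        base[i] = saved;                  /* leave memory as we found it */

        if (readback != pattern || (err_latched && err_latched())) {
            if (failures == 0 && first_bad)
                *first_bad = i;
            failures++;
        }
    }
    return failures;
}
```

Recording the first failing index (rather than just pass/fail) makes it easier to compare failing addresses between runs.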

I don't know if there is something in the hardware that could be causing it as well. At this point, I am trying to eliminate software as a cause.

admin · 09-12-2012

The errors I am seeing are single-bit. The customer, on one of his boards, has seen multi-bit errors (regularly, when they show up).
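For anyone reading ERR_DETECT values by hand, here is a small decoding sketch. The masks are assumptions based on the usual layout of this register on these controllers (MME in the top bit, MBE = 0x8, SBE = 0x4, MSE = 0x1) and should be checked against the MPC8536E reference manual; the function and enum names are made up for illustration.

```c
#include <stdint.h>

/* Assumed ERR_DETECT bit masks; verify against the reference manual. */
#define ERR_DETECT_MME 0x80000000u  /* multiple errors of the same type */
#define ERR_DETECT_MBE 0x00000008u  /* multi-bit ECC error */
#define ERR_DETECT_SBE 0x00000004u  /* single-bit ECC error */
#define ERR_DETECT_MSE 0x00000001u  /* memory select error */

typedef enum {
    DDR_ERR_NONE,
    DDR_ERR_SINGLE_BIT,
    DDR_ERR_MULTI_BIT,
    DDR_ERR_MEM_SELECT
} ddr_err_kind_t;

/* Reports the most severe error type flagged in an ERR_DETECT value. */
ddr_err_kind_t ddr_err_kind(uint32_t err_detect)
{
    if (err_detect & ERR_DETECT_MBE) return DDR_ERR_MULTI_BIT;
    if (err_detect & ERR_DETECT_SBE) return DDR_ERR_SINGLE_BIT;
    if (err_detect & ERR_DETECT_MSE) return DDR_ERR_MEM_SELECT;
    return DDR_ERR_NONE;
}

/* Nonzero if the "multiple errors" flag is set alongside the error type. */
int ddr_err_multiple(uint32_t err_detect)
{
    return (err_detect & ERR_DETECT_MME) != 0;
}
```

Under these assumptions, a value of 0x80000004 decodes as a single-bit error with the multiple-errors flag also set.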

genuap · 09-12-2012

If you send me a few errors (i.e. errors you capture in the err_capture registers) I can look at it and tell you what bits are getting flipped. If it's the same bit, maybe it's a HW issue.

admin · 09-12-2012

Thanks! Here are 3 such errors. When an error happens at one address, it seems to occur a few (3-4) times, albeit not on consecutive runs.

	ERR_DETECT:	0x80000004
	CAPTURE_ADDRESS:	0x001064e0
	CAPTURE_DATA_HI:	0x419e0090
	CAPTURE_DATA_LO:	0x480000e4
	CAPTURE_ATTRIBUTES: 0x00102001
	CAPTURE_ECC:	0xdededede

Expected contents of 0x1064e0 (from objdump of the ELF file), which are the same as the captured data above:

1064e0: 41 9e 00 90 beq- cr7,106570

1064e4: 48 00 00 e4 b 1065c8

	ERR_DETECT:	0x80000004
	CAPTURE_ADDRESS:	0x00137a00
	CAPTURE_DATA_HI:	0x7c800124
	CAPTURE_DATA_LO:	0x4c00012c
	CAPTURE_ATTRIBUTES: 0x00102001
	CAPTURE_ECC:	0xfdfdfdfd

Expected contents of 0x137a00 (from objdump of the ELF file), which are the same as the captured data above:

137a00: 7c 80 01 24 mtmsr r4

137a04: 4c 00 01 2c isync

	ERR_DETECT:	0x80000004
	CAPTURE_ADDRESS:	0x00137a40
	CAPTURE_DATA_HI:	0x2f800000
	CAPTURE_DATA_LO:	0x419e0010
	CAPTURE_ATTRIBUTES: 0x00102001
	CAPTURE_ECC:	0x0b0b0b0b

Expected contents of 0x137a40 (from objdump of the ELF file), which are the same as the captured data above:

137a40: 2f 80 00 00 cmpwi cr7,r0,0

137a44: 41 9e 00 10 beq- cr7,137a54
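The by-hand comparison against objdump can be sketched as a small helper. It assumes CAPTURE_DATA_HI holds the four bytes at the lower address and CAPTURE_DATA_LO the next four, which is what the three dumps above show; `pack_be64` and `capture_diff` are hypothetical names.

```c
#include <stdint.h>

/* Packs 8 bytes in the order objdump prints them (lowest address first)
 * into one big-endian 64-bit value. */
uint64_t pack_be64(const uint8_t b[8])
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | b[i];
    return v;
}

/* XOR of the captured doubleword and the expected bytes. A result of 0 means
 * the 64 data bits match; otherwise each set bit marks a differing data bit.
 * This says nothing about the ECC byte itself. */
uint64_t capture_diff(uint32_t data_hi, uint32_t data_lo,
                      const uint8_t expected[8])
{
    uint64_t captured = ((uint64_t)data_hi << 32) | data_lo;
    return captured ^ pack_be64(expected);
}
```

For the first capture, passing 0x419e0090 and 0x480000e4 together with the bytes 41 9e 00 90 48 00 00 e4 gives 0, i.e. the data bits match the image.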

genuap · 09-12-2012

Doesn't seem like a pattern:

First one is bit 58

2nd is syndrome ecc3

3rd is syndrome ecc4

I'll send you the tool to check this yourself.

dubk · 03-04-2022

Hello @genuap, can you please send me the tool too? I'm fighting 2 bits that like to flip.