Hello,
I am working on a VxWorks BSP for a custom board based on the MPC8536RDK (but with 1GB of memory), and we are getting ECC errors while the system is coming up. The errors do not always happen, but they happen most of the time. When they don't happen, the system comes up fine. We are enabling ECC and waiting for DDR_SDRAM_CFG_2[D_INIT] to clear, etc., so I don't think it is due to memory not being initialized correctly. Between runs, the address reported (CAPTURE_ADDRESS) isn't the same. Also, when I stop at the DDR exception handler and look at CAPTURE_ADDRESS, CAPTURE_DATA_HI and CAPTURE_DATA_LO, the memory at that address is what it ought to be (it is code, and I can compare the bytes against the disassembly dump).
The memory is a Micron MT47H128M8CF-25E IT:H. We are reusing almost all of the DDR memory settings from the MPC8536RDK U-Boot code (except that it is 1GB, hence some differences in CS0_BNDS and CS0_CONFIG; only one chip select is used). I have also consulted application note 3369.
I am wondering what underlying cause could produce this kind of symptom.
Thanks!
Are you initializing DDR with known values? DDR_DATA_INIT can be 0x0, or something like 0xDEAD_BEEF, and you need to make sure DDR_SDRAM_CFG_2[D_INIT]=1.
This causes the controller to initialize all memory and all syndrome bits to a known value before memory is used. If you don't do this, the syndrome bits aren't initialized and you will get an error.
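A minimal sketch of that sequence in C, for illustration only. The helper name, the register offsets (relative to the DDR controller base) and the bit masks are assumptions written from memory of the reference manual and should be checked against it; the rest of the controller configuration (timing, chip selects, ECC enable) is assumed to be programmed already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Word indices into the DDR controller register block. The byte offsets
 * (0x110, 0x114, 0x128) and the bit masks are assumptions; verify them
 * against the MPC8536E reference manual. */
#define DDR_SDRAM_CFG    (0x110u / 4u)
#define DDR_SDRAM_CFG_2  (0x114u / 4u)
#define DDR_DATA_INIT    (0x128u / 4u)
#define DDR_CFG_MEM_EN   0x80000000u
#define DDR_CFG2_D_INIT  0x00000010u

/* Hypothetical helper: returns 0 once D_INIT clears, -1 on timeout. */
int ddr_init_with_pattern(volatile uint32_t *ddr, uint32_t pattern,
                          uint32_t max_polls)
{
    uint32_t i;

    assert(ddr != NULL);

    ddr[DDR_DATA_INIT] = pattern;             /* value written to all of DRAM */
    ddr[DDR_SDRAM_CFG_2] |= DDR_CFG2_D_INIT;  /* request data + ECC init */
    /* Real hardware needs a barrier here (e.g. sync/isync) and the usual
     * settling delay before the controller is enabled. */
    ddr[DDR_SDRAM_CFG] |= DDR_CFG_MEM_EN;     /* init runs once enabled */

    for (i = 0; i < max_polls; i++) {
        if ((ddr[DDR_SDRAM_CFG_2] & DDR_CFG2_D_INIT) == 0)
            return 0;                         /* controller finished */
    }
    return -1;                                /* still initializing */
}
```

Nothing should touch DDR until this returns 0. Run against a plain array on a host, the poll simply times out, since no hardware is there to clear D_INIT.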
Yes. DDR_DATA_INIT is 0, D_INIT is 1, and we wait for DDR_SDRAM_CFG_2[D_INIT] to become 0. Also, the ECC errors don't seem to happen on the first accesses to memory: after the above initialization there is code that copies a payload to RAM and then executes it, and we get these errors well into execution from RAM (and, as mentioned above, sometimes the errors don't happen and the system comes up fine). So I don't think it is memory not being initialized correctly.
Interesting - what sort of errors are you seeing in ERR_DETECT? Single-bit or multi-bit?
Is this always shortly after you jump to DDR (or start using DDR)? I wonder if this could be related to synchronization and/or caching. When you set up the MMU in romInit.s, I assume you set it up once, with DDR as cacheable. In that case the writes out to ECC are likely delayed.
One test could be to:
set up DDR as non-cacheable
initialize DDR
change MMU entry to allow caching of DDR
Does that change the behavior?
Have you tried to initialize just via Workbench? (using a reg file to initialize memory?) Does that work? Can you initialize memory and then do a memory test?
No, not immediately. I added printfs of the DDR_ERR_DETECT register (write-to-clear in romInit.s) at various spots, and it doesn't get set until sysHwInit2, i.e. early, but still well into RAM code. However, I have seen it (albeit more rarely) at the start of sysHwInit itself, which again is not immediately in RAM code, but earlier.
I didn't have TLB entries for DDR when setting it up. I just added non-cacheable entries and tried again. It didn't make a difference, i.e. DDR_ERR_DETECT was set again at various spots. (The frequency seemed reduced, but that is usually a red herring: I have seen it become more frequent and less frequent at different times.)
I had also written a RAM test where (after explicitly turning off the L2 cache), in a loop over each 32-bit address spanning the entire memory range, I write a pattern, read it back, restore the original value, and look for DDR errors after each iteration. Running this right after romInit.s rarely fails (I have seen it exactly once in almost 100 runs). Running it later seems to fail more often. But again, it seems to vary too much in frequency for me to get a good handle on it.
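For readers following along, a RAM test of that shape could look roughly like the sketch below. It is not the actual test code: the function name, the callback type and the stub are hypothetical, and on the target the callback would read DDR_ERR_DETECT while caching is off for the region under test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook that returns the current DDR_ERR_DETECT value. */
typedef uint32_t (*err_detect_fn)(void);

/* Host-side stand-in so the loop can be tried off-target. */
uint32_t stub_err_value = 0;
uint32_t stub_err_detect(void) { return stub_err_value; }

/* Write a pattern to each word, read it back, restore the old value, then
 * check the error hook. Returns the index of the first failing word, or
 * `words` if every word passed. */
size_t ram_walk_test(volatile uint32_t *base, size_t words, uint32_t pattern,
                     err_detect_fn read_err)
{
    size_t i;

    assert(base != NULL && read_err != NULL);

    for (i = 0; i < words; i++) {
        uint32_t saved = base[i];   /* this read can itself raise an ECC error */
        uint32_t seen;

        base[i] = pattern;
        seen = base[i];
        base[i] = saved;            /* leave memory as it was found */

        if (seen != pattern || read_err() != 0)
            return i;
    }
    return words;
}
```

Returning the index of the first failure makes it easy to log the failing address next to CAPTURE_ADDRESS and see whether the two agree.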
I don't know if there is also something in the hardware that could be causing it. At this point, I am trying to eliminate software as a cause.
The errors I am seeing are single-bit. The customer, on one of his boards, has seen multi-bit errors (regularly, when they show up).
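As an aside, the single-bit versus multi-bit distinction can be read straight off ERR_DETECT. The sketch below assumes the usual bit assignments for this controller family (MME = 0x80000000, MBE = 0x8, SBE = 0x4, MSE = 0x1); these masks are an assumption to confirm in the reference manual, and the enum and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed ERR_DETECT bit masks; confirm against the reference manual. */
#define ERR_MME 0x80000000u  /* multiple errors of the same type */
#define ERR_MBE 0x00000008u  /* multi-bit ECC error */
#define ERR_SBE 0x00000004u  /* single-bit ECC error */
#define ERR_MSE 0x00000001u  /* memory select error */

typedef enum {
    ERR_KIND_NONE,
    ERR_KIND_SINGLE_BIT,
    ERR_KIND_MULTI_BIT,
    ERR_KIND_MEM_SELECT,
    ERR_KIND_OTHER
} err_kind_t;

/* Classify an ERR_DETECT value; *multiple is set to 1 if MME is flagged. */
err_kind_t ddr_err_kind(uint32_t err_detect, int *multiple)
{
    assert(multiple != NULL);

    *multiple = (err_detect & ERR_MME) != 0;

    if (err_detect & ERR_MBE) return ERR_KIND_MULTI_BIT;
    if (err_detect & ERR_SBE) return ERR_KIND_SINGLE_BIT;
    if (err_detect & ERR_MSE) return ERR_KIND_MEM_SELECT;
    return (err_detect & ~ERR_MME) ? ERR_KIND_OTHER : ERR_KIND_NONE;
}
```

Under those assumptions, a value of 0x80000004 decodes as a single-bit error with the multiple-error flag set.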
If you send me a few errors (i.e. errors you capture in the err_capture registers) I can look at it and tell you what bits are getting flipped. If it's the same bit, maybe it's a HW issue.
Thanks! Here are 3 such errors. When an error happens at one address, it seems to occur a few (3-4) times, albeit not on consecutive runs.
ERR_DETECT:          0x80000004
CAPTURE_ADDRESS:     0x001064e0
CAPTURE_DATA_HI:     0x419e0090
CAPTURE_DATA_LO:     0x480000e4
CAPTURE_ATTRIBUTES:  0x00102001
CAPTURE_ECC:         0xdededede
Expected (same as above) Contents of 0x1064e0 (from objdump of elf file)
1064e0: 41 9e 00 90 beq- cr7,106570
1064e4: 48 00 00 e4 b 1065c8
ERR_DETECT:          0x80000004
CAPTURE_ADDRESS:     0x00137a00
CAPTURE_DATA_HI:     0x7c800124
CAPTURE_DATA_LO:     0x4c00012c
CAPTURE_ATTRIBUTES:  0x00102001
CAPTURE_ECC:         0xfdfdfdfd
Expected (same as above) Contents of 0x137a00 (from objdump of elf file)
137a00: 7c 80 01 24 mtmsr r4
137a04: 4c 00 01 2c isync
ERR_DETECT:          0x80000004
CAPTURE_ADDRESS:     0x00137a40
CAPTURE_DATA_HI:     0x2f800000
CAPTURE_DATA_LO:     0x419e0010
CAPTURE_ATTRIBUTES:  0x00102001
CAPTURE_ECC:         0x0b0b0b0b
Expected (same as above) Contents of 0x137a40 (from objdump of elf file)
137a40: 2f 80 00 00 cmpwi cr7,r0,0
137a44: 41 9e 00 10 beq- cr7,137a54
Doesn't seem like a pattern:
First one is bit 58
2nd is syndrome ecc3
3rd is syndrome ecc4
I'll send you the tool to check this yourself.
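For anyone without such a tool, the first step of this kind of analysis can be sketched generically. The code below is not that tool and does not contain the controller's ECC encoding matrix, which has to come from the reference manual. It only assumes the general SECDED property that a syndrome (computed check bits XOR captured check bits) with exactly one bit set points at a flipped check bit rather than a data bit. All names are hypothetical, and the bit index is counted from the least significant bit, which may not match the manual's numbering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    ECC_SYN_CLEAN,      /* syndrome is zero: stored and computed ECC agree */
    ECC_SYN_CHECK_BIT,  /* exactly one syndrome bit set: a check bit flipped */
    ECC_SYN_LOOKUP      /* anything else: consult the syndrome table */
} ecc_syn_t;

/* computed_ecc must be derived from the captured data with the controller's
 * encoding matrix (not reproduced here). captured_ecc is one byte of
 * CAPTURE_ECC. *check_bit receives the flipped bit index, or -1. */
ecc_syn_t ecc_classify(uint8_t computed_ecc, uint8_t captured_ecc,
                       int *check_bit)
{
    uint8_t syn = (uint8_t)(computed_ecc ^ captured_ecc);
    int bit;

    assert(check_bit != NULL);
    *check_bit = -1;

    if (syn == 0)
        return ECC_SYN_CLEAN;

    if ((syn & (syn - 1)) == 0) {        /* power of two: a single bit set */
        for (bit = 0; bit < 8; bit++) {
            if (syn & (1u << bit))
                *check_bit = bit;
        }
        return ECC_SYN_CHECK_BIT;
    }
    return ECC_SYN_LOOKUP;               /* data-bit or multi-bit error */
}
```

A result of ECC_SYN_LOOKUP still needs the syndrome table from the manual to name the affected data bit, or to conclude that more than one bit flipped.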
Hello @genuap, can you please send me the tool too? I'm fighting 2 bits that like to flip.