P5020 Multibit ECC

timothypark · ‎03-09-2017

Hello,

I am trying to test the multi-bit ECC capability on the P5020. However, I seem to be running into an issue where I hit multiple machine check errors after the first multi-bit ECC error is handled. I have the following setup to inject the multi-bit ECC error and to handle the machine check exception that follows:

DDR_ERR_DISABLE = 0

DDR_SDRAM_CFG[ECC_EN] = 1

DDR_ERR_INJECT_LO = 0x00000003

DDR_ERR_INJECT[EIEN] = 1
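
For readers who want to reproduce the setup, the register writes above can be sketched roughly as follows. This is only a sketch under assumptions: the register offsets and bit masks are recalled from the reference manual and should be verified, the helper names are hypothetical, and the register block is a plain array so the code runs on a host rather than on the P5020.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED offsets inside the DDR controller register block; verify them
 * against the P5020 reference manual before touching real hardware. */
#define DDR_SDRAM_CFG       0x110u
#define DATA_ERR_INJECT_LO  0xE04u
#define ERR_INJECT          0xE08u
#define ERR_DISABLE         0xE44u

/* ASSUMED bit masks for the fields named in the post. */
#define SDRAM_CFG_ECC_EN    0x20000000u
#define ERR_INJECT_EIEN     0x00000100u

/* Host-side stand-in for the memory-mapped register block (hypothetical). */
static uint32_t ddr_regs[0x1000 / 4];

static uint32_t reg_read(uint32_t off)
{
    assert(off < sizeof ddr_regs);
    return ddr_regs[off / 4];
}

static void reg_write(uint32_t off, uint32_t val)
{
    assert(off < sizeof ddr_regs);
    ddr_regs[off / 4] = val;
}

/* Mirrors the four settings listed in the post, in the same order. */
static void enable_mbe_injection(void)
{
    reg_write(ERR_DISABLE, 0);                      /* DDR_ERR_DISABLE = 0 */
    reg_write(DDR_SDRAM_CFG,                        /* ECC_EN = 1 (normally done at memory init) */
              reg_read(DDR_SDRAM_CFG) | SDRAM_CFG_ECC_EN);
    reg_write(DATA_ERR_INJECT_LO, 0x00000003u);     /* invert two data bits */
    reg_write(ERR_INJECT,                           /* EIEN = 1 */
              reg_read(ERR_INJECT) | ERR_INJECT_EIEN);
}

/* Turns the injector off again without disturbing the other bits. */
static void disable_mbe_injection(void)
{
    reg_write(ERR_INJECT, reg_read(ERR_INJECT) & ~ERR_INJECT_EIEN);
}
```

On a real board, `reg_read` and `reg_write` would be replaced by volatile accesses to the DDR controller's address range.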

The IVOR1 handler deals with the machine check exception by advancing MCSRR0 to the next instruction, so that execution resumes there once the rfmci instruction is executed.

After the first multi-bit ECC error is handled, the injection is turned off: DDR_ERR_INJECT[EIEN] = 0.

It is after this that every instruction step triggers a multi-bit error, and execution keeps vectoring to IVOR1.

Please let me know what else I should be doing.

Thanks,

Tim

timothypark · ‎03-10-2017

Hello Bulat,

Thanks for your reply.

Well, yes, the machine check error does prove that the DDR controller detected the multi-bit ECC error that I wanted. However, my problem arises after turning off the ECC injector: I keep running into multi-bit ECC errors and vectoring to IVOR1 after each instruction step. I know these are multi-bit ECC errors, even with the injector off, because DDR_ERR_DETECT[MBE] gets set again every time I clear it, with each instruction step.

Am I doing something wrong, or am I misunderstanding multi-bit ECC, and this is just the proper behavior after a multi-bit ECC error has been detected?

Thanks,

Tim

Bulat · ‎03-13-2017

It is difficult to guess what is really happening; the problem may relate to a particular configuration (such as cache on/off). Do you read the 'DDR_CAPTURE_n' registers of the DDR controller in the interrupt handler? Do you reset the error bits in the ERR_DETECT register?

Regards,

Bulat

timothypark · ‎03-13-2017

During initialization, the TLB entry for the DDR is set up so that caching is inhibited, so I do not believe it is a cache issue, unless there is something else I have to set to prevent DDR from being cached.

Within the interrupt handler, reading and clearing the DDR_CAPTURE_n registers doesn't seem to help. I do reset the ERR_DETECT register within the interrupt handler so that it is clear when execution returns to normal code. However, it gets set again by the new unexpected multi-bit error. I should also mention that DDR_CAPTURE_n gets updated with the new multi-bit error.
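
As a sketch of what "resetting ERR_DETECT" typically involves: the status bits are assumed here to be write-1-to-clear, and the MBE mask below is an assumption to check against the reference manual. The struct and function names are hypothetical, and the register is modeled in ordinary memory so the behavior can be checked on a host.

```c
#include <assert.h>
#include <stdint.h>

#define ERR_DETECT_MBE 0x00000008u  /* ASSUMED position of the MBE bit */

/* Host-side model of the controller's status register (hypothetical). */
typedef struct {
    uint32_t err_detect;
} ddr_model_t;

/* Models write-1-to-clear: every 1 written clears that status bit,
 * every 0 written leaves the bit untouched. */
static void err_detect_write(ddr_model_t *m, uint32_t val)
{
    m->err_detect &= ~val;
}

/* Acknowledges a pending multi-bit error and leaves other bits alone.
 * Returns 1 if MBE was pending, 0 otherwise. */
static int ack_mbe(ddr_model_t *m)
{
    assert(m);
    if ((m->err_detect & ERR_DETECT_MBE) == 0u)
        return 0;
    err_detect_write(m, ERR_DETECT_MBE);
    return 1;
}
```

If the bit reads back as set again right after such an acknowledge, as described above, that would point to a new detection rather than a failed clear.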

Thanks,

Tim

Bulat · ‎03-14-2017

So you can see a lot of different addresses reported via DDR_CAPTURE_ADDRESS. Do these address values look reasonable?

Are interrupts enabled in the DDRx_ERR_INT_EN register? If yes, try to disable it.

Regards,

Bulat

timothypark · ‎03-16-2017

Sorry for the late reply; I was busy with another task for a bit. I believe the address in DDR_CAPTURE_ADDRESS is reasonable, but I will have to confirm with you whether there is something odd. All I see is that the address captured there is similar to the address of the instruction location.

I've tried several different settings, including disabling DDRx_ERR_INT_EN for multi-bit errors (i.e. MBEE set to 0) and also enabling it. In every case I still get the same problem of multi-bit errors after turning off the injector.

Thank you for taking the time to help me investigate this.

Tim

Bulat · ‎03-17-2017

I cannot know what is happening on your board. However, something really looks odd.

You mentioned the following setting when describing the test setup:

DDR_SDRAM_CFG[ECC_EN] = 1

Is that done in the test or during memory initialization?

Can you provide the values of all DDR_CAPTURE registers and MCSRR0 during the first MCE?

And the same information during the second MCE?

Regards,

Bulat

timothypark · ‎03-20-2017

Yes, the board is acting odd. I'm sure it is something that I haven't set up, or have set up wrong, but I'm unsure what it is. Thanks again for working with me.

DDR_SDRAM_CFG[ECC_EN] = 1 is done during memory initialization; we are using CodeWarrior and its TCL script to set up the board for our testing.

Here are the registers before:

CAPTURE_DATA_HI: 0x919a0e08

CAPTURE_DATA_LO: 0x3b20fffc

CAPTURE_ECC: 0x49494949

CAPTURE_ATTRIBUTES: 0x10802001

CAPTURE_ADDRESS: 0x000050d0

MCSRR0: 0x000050d4

MCSRR1: 0x2000

Here are the registers after:

CAPTURE_DATA_HI: 0x919a0e08

CAPTURE_DATA_LO: 0x3b20fffc

CAPTURE_ECC: 0x49494949

CAPTURE_ATTRIBUTES: 0x12492001

CAPTURE_ADDRESS: 0x000050d0

MCSRR0: 0x000050d8

MCSRR1: 0x2000
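
To make the two snapshots easier to compare, here is a small host-side sketch that reports which fields differ. The struct layout, enum, and function names are hypothetical; the values are copied from the two dumps above.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per field, in the order the registers are listed above. */
enum {
    F_DATA_HI = 1u << 0, F_DATA_LO = 1u << 1, F_ECC    = 1u << 2,
    F_ATTR    = 1u << 3, F_ADDR    = 1u << 4, F_MCSRR0 = 1u << 5,
    F_MCSRR1  = 1u << 6
};

typedef struct {
    uint32_t data_hi, data_lo, ecc, attributes, address, mcsrr0, mcsrr1;
} mce_snapshot_t;

/* Values as posted ("before" and "after"). */
static const mce_snapshot_t snap_before = {
    0x919a0e08u, 0x3b20fffcu, 0x49494949u, 0x10802001u,
    0x000050d0u, 0x000050d4u, 0x2000u
};
static const mce_snapshot_t snap_after = {
    0x919a0e08u, 0x3b20fffcu, 0x49494949u, 0x12492001u,
    0x000050d0u, 0x000050d8u, 0x2000u
};

/* Returns a mask of the F_* bits for every field that differs. */
static unsigned changed_fields(const mce_snapshot_t *a, const mce_snapshot_t *b)
{
    unsigned mask = 0;
    assert(a && b);
    if (a->data_hi    != b->data_hi)    mask |= F_DATA_HI;
    if (a->data_lo    != b->data_lo)    mask |= F_DATA_LO;
    if (a->ecc        != b->ecc)        mask |= F_ECC;
    if (a->attributes != b->attributes) mask |= F_ATTR;
    if (a->address    != b->address)    mask |= F_ADDR;
    if (a->mcsrr0     != b->mcsrr0)     mask |= F_MCSRR0;
    if (a->mcsrr1     != b->mcsrr1)     mask |= F_MCSRR1;
    return mask;
}
```

With the posted values, `changed_fields(&snap_before, &snap_after)` reports only CAPTURE_ATTRIBUTES and MCSRR0 as changed: MCSRR0 moves forward by 4 bytes (one instruction), while the captured address, data, and ECC are identical in both snapshots.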

Thanks,

Tim

Bulat · ‎03-21-2017

Can you take a memory dump that includes address 0x000050d0 before errors are injected? How different is the data in memory from the data in the CAPTURE_DATA registers? Does the difference correspond to the DDR_ERR_INJECT_LO = 0x00000003 mask?

Also, it is not clear how the code can be affected by the error injection... What exactly are you doing after DDR_ERR_INJECT[EIEN] = 1?
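
One way to make this comparison concrete is to XOR the word expected in memory with the captured word and compare the result against the injection mask. This is a minimal sketch: it assumes the mask simply inverts the selected data bits, and the function names are hypothetical.

```c
#include <stdint.h>

/* Bits that differ between the word expected in memory and the word
 * latched in a CAPTURE_DATA register. */
static uint32_t flipped_bits(uint32_t expected, uint32_t captured)
{
    return expected ^ captured;
}

/* Counts how many bits are set in diff. */
static int flipped_count(uint32_t diff)
{
    int n = 0;
    while (diff != 0u) {
        diff &= diff - 1u;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Nonzero if the difference is exactly the injection mask. */
static int matches_inject_mask(uint32_t expected, uint32_t captured, uint32_t mask)
{
    return flipped_bits(expected, captured) == mask;
}
```

For example, with a hypothetical expected word of 0x3b20ffff (not taken from the thread) and the reported CAPTURE_DATA_LO value 0x3b20fffc, `flipped_bits` yields 0x00000003, which is two flipped bits and would match the mask. Any other difference would suggest the capture is not the result of the injection.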

Regards,

Bulat

timothypark · ‎03-21-2017

So interesting things happen when I do the memory dump before and after the first MCE. Here is the memory dump before the first MCE:

0x000050c0: 0x7d9cd830 0x81990e04 0x7d8ce378 0x91990e04

0x000050d0: 0x81990e08 0x618c0100 0x91990e08 0x3980ffff

0x000050e0: 0x919f0000 0x81990e08 0x558c062c 0x91990e08

Here is the memory dump after coming back from the first MCE:

0x000050c0: 0x7d9cd830 0x81990e04 0x7d8ce378 0x91990e04

0x000050d0: 0x7d9cd830 0x81990e04 0x7d8ce378 0x91990e04

0x000050e0: 0x7d9cd830 0x81990e04 0x7d8ce378 0x91990e04

As you can see, the contents of addresses 0x50d0 - 0x50ec change to almost match what is stored at 0x50c0 - 0x50cc. The data in these address ranges are the instructions of the code that tests multi-bit ECC.

They change just as I write data to a global variable, which is the point where the injector should kick in and alter the ECC for that global data.
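
The two dumps can also be compared mechanically. The sketch below is host-side only, with hypothetical names; it holds the posted rows as data and checks which 16-byte rows changed and whether a row duplicates an earlier one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROWS      3   /* rows at 0x50c0, 0x50d0, 0x50e0 */
#define ROW_WORDS 4   /* four 32-bit words per row */

static const uint32_t dump_before[ROWS][ROW_WORDS] = {
    { 0x7d9cd830u, 0x81990e04u, 0x7d8ce378u, 0x91990e04u },
    { 0x81990e08u, 0x618c0100u, 0x91990e08u, 0x3980ffffu },
    { 0x919f0000u, 0x81990e08u, 0x558c062cu, 0x91990e08u }
};
static const uint32_t dump_after[ROWS][ROW_WORDS] = {
    { 0x7d9cd830u, 0x81990e04u, 0x7d8ce378u, 0x91990e04u },
    { 0x7d9cd830u, 0x81990e04u, 0x7d8ce378u, 0x91990e04u },
    { 0x7d9cd830u, 0x81990e04u, 0x7d8ce378u, 0x91990e04u }
};

/* Returns the index of an earlier row identical to row r, or -1 if none. */
static int duplicate_of(const uint32_t dump[][ROW_WORDS], int r)
{
    assert(r >= 0 && r < ROWS);
    for (int i = 0; i < r; i++) {
        if (memcmp(dump[i], dump[r], sizeof dump[r]) == 0)
            return i;
    }
    return -1;
}

/* Nonzero if row r differs between the two dumps. */
static int row_changed(int r)
{
    assert(r >= 0 && r < ROWS);
    return memcmp(dump_before[r], dump_after[r], sizeof dump_before[r]) != 0;
}
```

With the values as posted, the rows at 0x50d0 and 0x50e0 both changed, and in the second dump each of them is a word-for-word copy of the row at 0x50c0.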

Here are the steps just after enabling injection (i.e. DDR_ERR_INJECT[EIEN] = 1):

1) Write of 0xFFFFFFFF to a global variable

2) Read from global variable to a local variable

3) Disable error injection.

Thanks,

Tim

Bulat · ‎03-22-2017

The chaos around the instructions does not look like an effect of the error injection.

Can you slightly change your sequence as follows:

1) Write of 0xFFFFFFFF to a global variable

2) Disable error injection.

3) Read DDR_ERR_INJECT to be sure it is written

4) Read from global variable to a local variable
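
The four steps above can be sketched in C roughly like this. It is a host-side sketch under assumptions: the EIEN mask is assumed, `mock_err_inject` is a hypothetical stand-in for the real DDR_ERR_INJECT register, and on hardware a synchronizing instruction between the steps may also be needed.

```c
#include <assert.h>
#include <stdint.h>

#define ERR_INJECT_EIEN 0x00000100u   /* ASSUMED position of the EIEN bit */

/* Stand-in for DDR_ERR_INJECT, starting with injection enabled. */
static volatile uint32_t mock_err_inject = ERR_INJECT_EIEN;

/* The global variable whose ECC is meant to be corrupted. */
static volatile uint32_t test_global;

/* Runs the sequence in the suggested order. The value read back from the
 * injection register is stored through inject_readback; the value read
 * from the global in step 4 is returned. */
static uint32_t run_sequence(uint32_t *inject_readback)
{
    assert(inject_readback);

    test_global = 0xFFFFFFFFu;             /* 1) write while injection is on */
    mock_err_inject &= ~ERR_INJECT_EIEN;   /* 2) disable error injection */
    *inject_readback = mock_err_inject;    /* 3) read back to confirm the write landed */
    return test_global;                    /* 4) only now read the global */
}
```

The point of the reordering is that only the single write in step 1 happens with the injector enabled, so the read in step 4 should be the only access that can hit the corrupted location.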

Regards,

Bulat

timothypark · ‎03-22-2017

I still wonder why those instructions are being corrupted when ECC is turned on, though.

I remember trying those steps, without success, while trying to figure out multi-bit ECC, but I will try them again and get back to you.

Thanks,

Tim

timothypark · ‎03-22-2017

Sadly, I am still getting the same result: the instructions are corrupted after an MCE when following the steps you recommended.

Thanks,

Tim

Bulat · ‎03-24-2017

Can you check the HID0[EMCP] value before the test starts?

timothypark · ‎03-27-2017

It is set: HID0[EMCP] = 1. Also, MSR[ME] = 0 before the test starts.

Bulat · ‎03-28-2017

Is this correct: MSR[ME] = 0?

As I wrote, it is really difficult to guess what is happening. However, it looks like your MCE events do not relate to ECC errors as they are supposed to. I think you need to simplify your test to get a positive result. This means the following: as soon as you read a faulty word, the processor should immediately capture the address of that word in the CAPTURE_ADDRESS register, not the address of the instruction that performs the read. I believe this can be done using step-by-step execution in a debugger.

Regards,

Bulat

Bulat · ‎03-10-2017

I believe your test proves that multi-bit ECC errors are detected by the memory controller. Or is the purpose of your test different?

Regards,

Bulat
