MPC5744P CPU crash under certain conditions, but unknown cause


chosp
Contributor III

Hello

We are implementing an automotive application on the MPC5744P (MAPBGA257 package, 1N15P cut), which is paired with an MC33908 SBC. The toolchain is S32DS 2.1 with -O1 optimization. We have been able to reproduce the problem on 3 different product units, so we hope it is not faulty silicon. This has been the workflow so far:

Application development had been uneventful until now. Adding a certain amount of code causes the core to hang completely under certain conditions.

- First and foremost, the SBC reset conditions are checked while reproducing the issue: the reset line does not go low, which means that the SBC does not trigger a reset due to abnormal operation. So we focus on the MPC core.

- The application has been instrumented with TRACE32 (Lauterbach Powerdebug Pro), and once the condition is reached, a DEBUG PORT FAIL condition arises. Using the reset detection capability, we are able to retrieve the trace information. The application crashes after a function returns a UINT8_T value equal to 1. This translates to loading 1 into R3, followed by SE_BLR.

- IVORs 0 to 32 are properly instrumented with branch-to-self instructions (except IVOR4, which jumps to the ISR prologue), so no IVOR trap is taking place.

- After verifying that the problem can be reproduced at the same address, a breakpoint is set at that address to check the system status before the crash. The system halts at the breakpoint. The LR register is OK, the stack is normal, everything looks fine.

- Then the code is single-stepped to try to reproduce the crash, with no success. The program steps correctly through the normal flow.

- But if, at any time, normal execution is resumed (no more stepping), the application crashes after some milliseconds (!!)

- Checking the trace dump again, we see that it is the same function but with a different return value. In a different session, it can even be a different function.

- Adding a UINT8_T global spy variable to the function changes the memory address at which the crash takes place.

- Changing a significant amount of code in a totally unrelated part of the application can eliminate that behavior.

 

As you see, the behavior is very unreliable, and we have no clue how to continue diagnosing this problem. Attaching code would be of little to no use, as the conditions that provoke this error are very particular, and the failure point changes depending on unknown factors.

 

PS: Our problem is similar to this: https://community.nxp.com/t5/MPC5xxx/Lauterbach-reports-MPC5602D-is-in-state-running-reset-what-does...

except that we are able to trace the execution up to the failing point. SWT is disabled at startup, by the way. How is it that the CPU does not reach an IVORx exception?

 

9 Replies

chosp
Contributor III

Due to the lack of response, should I open a support ticket?

davidtosenovjan
NXP TechSupport

I apologize for the lack of response. I am currently out of office for family reasons. Yes, please create a ticket for this if there is no other response. Thanks for understanding.

chosp
Contributor III

Hello, don't worry, I have opened a support ticket due to inactivity.

I hope we can come up with a solution soon, because the application is not reliable right now unless 100% code coverage testing is done after any minor software change (as minor as adding a 1-byte global debug variable). You will understand that this is not an acceptable development workflow.

chosp
Contributor III

Hello

Do you have any other advice for this issue? 

Thank you

davidtosenovjan
NXP TechSupport

Have you measured the RESET signal with a scope? Has the MCU been reset?

According to your description, the debugger affects the behavior in the sense that the fault does not happen. You also said the behavior is unpredictable but repeatable.

I would first check your clocking configuration. Datasheet section 3.16.3 prescribes the number of flash wait states.

Check 2.9.1 System clock frequency limitations

There are also several errata that would need to be checked:
ERR010639
ERR010640
ERR011073

https://www.nxp.com/docs/en/errata/MPC5744P_1N65H.pdf
https://www.nxp.com/docs/en/errata/MPC5744P_1N15P.pdf

chosp
Contributor III

Further information.

We have manually disabled DMA transfers and interrupts and provoked the issue synthetically (by manually setting up the conditions for it).

The crash still happens, so we can rule out DMA and interrupt influence.

 

Also, if we put a hardware breakpoint on the offending line (where the crash happens), it breaks, and stepping is OK. But if we put a software breakpoint, the crash still happens (it does not break).

chosp
Contributor III

In further testing, I have found that adding a SINGLE global variable to the system eliminates the crashing behavior.

What we can't guarantee at the moment is that the issue won't arise under other conditions.

I am really puzzled by this issue, and given the lack of answers I guess it's not a very common one.

What puzzles me most is that even the Lauterbach Powerdebug Pro system is unable to shed any light on it: it does not provide a single bit of information on the CPU registers that would help pinpoint the causing event.

chosp
Contributor III

The problem persists.

So far,

  • Flash timing has been adjusted in PFCR1 as per the datasheet values.
  • The MC_RGM DES and FES registers have been checked, and no functional/destructive reset takes place.

We have been able to create conditions for a 100% reproducible failure. When stepping through the instruction before the crash (using a breakpoint), there is no crash. But in free-running mode the system crashes, and according to the trace report, the crash happens at the very instruction where we previously broke.

Do you have any other idea we can try?

Thank you

chosp
Contributor III

Thank you for your response.

Have you measured the RESET signal with a scope? Has the MCU been reset?

The RESET signal has been measured during the reproducible incident that provokes this failure. The RESET signal is always high, so no external RESET takes place.

According to your description, the debugger affects the behavior in the sense that the fault does not happen. You also said the behavior is unpredictable but repeatable.

Yes, basically if you break at the exact point where the system hangs, you can step through the instructions successfully. However, when leaving stepping mode and running continuously, the system hangs at that exact point.

Also keep in mind that altering the software (adding variables, adding code) may or may not change the point where the hang takes place. But once changed, the hang is 100% reproducible.

 

I would first check your clocking configuration. Datasheet section 3.16.3 prescribes the number of flash wait states.

Check 2.9.1 System clock frequency limitations

There are also several errata that would need to be checked:
ERR010639
ERR010640
ERR011073

https://www.nxp.com/docs/en/errata/MPC5744P_1N65H.pdf
https://www.nxp.com/docs/en/errata/MPC5744P_1N15P.pdf

The system has been working for thousands of hours under different conditions and this problem has never shown up. Only certain software revisions provoke this behavior. What would the relationship be with the clocking configuration and the flash wait states? Just asking, so that I can look into those hypotheses.

Regarding the clocking, this issue happens in a known situation a long time after power-up. The errata you mention seem to be related to power-up, right?

The thing that puzzles me most is that the system does NOT hang in any known exception handler. The debugger detects the core as halted/off/reset, and there is no visible symptom beforehand: right before the core hangs, the LR is OK and the Stack Pointer is well within the stack limits.

I also checked the software version against the E200 branch checker software, and it is ok regarding that.

Checking with my colleagues, we have found that we previously experienced this on the MPC5744P QFP144 version, years ago. So it is not limited to this BGA257 1N15P part.

I don't know how to proceed, seriously.

 

I will look into your suggestions, but any additional hint is gratefully accepted!

 

PS: The clock mode entry code, executed at startup:

 

/* Enable All Modes */
MC_ME.ME.R = 0x000005E2;

/* Peripheral ON in every run mode */
MC_ME.RUN_PC[0].R = 0x000000FE;

/******************** Configure XOSC for DRUN **********************/
/* Enable EXT OSC First */
XOSC.CTL.B.OSCM = 0x1; /* Change OSC mode to LCP (Loop Controlled Pierce Mode) */
XOSC.CTL.B.EOCV = 0x80; /* Set the End of Count Value for when to check stabilization. */

/* Enable XOSC in DRUN mode and select as SYS_CLK */
MC_ME.DRUN_MC.R = 0x00130031;

/* Re-enter DRUN mode to update the configuration */
MC_ME.MCTL.R = 0x30005AF0; /* Mode & Key */
MC_ME.MCTL.R = 0x3000A50F; /* Mode & Key inverted */
while(MC_ME.GS.B.S_MTRANS == 1); /* Wait for mode entry to complete */
while(MC_ME.GS.B.S_CURRENT_MODE != 0x3); /* Check DRUN mode has been entered */
while(!MC_ME.GS.B.S_XOSC); /* Wait for clock to stabilise */

/******************** PLL0, PLL1 **********************/

/* Route XOSC to PLL1 */
MC_CGM.AC4_SC.B.SELCTL = 1;

/* Route XOSC to PLL0 */
MC_CGM.AC3_SC.B.SELCTL = 1;

/*
Configure PLL0 Dividers for 160 MHz
fPLL0_VCO = (fPLL0_ref x PLL0DV[MFD] x 2) / PLL0DV[PREDIV]
= 40MHz x 8 x 2 / 1 = 640 MHz

fPLL0_PHI = fPLL0_ref x PLL0DV[MFD] / (PLL0DV[PREDIV] x PLL0DV[RFDPHI])
= 40MHz x 8 / (1 x 2) = 160 MHz

fPLL0_PHI1 = fPLL0_ref x PLL0DV[MFD] / (PLL0DV[PREDIV] x PLL0DV[RFDPHI1])
= 40MHz x 8 / (1 x 8) = 40 MHz
*/
PLLDIG.PLL0DV.B.RFDPHI1 = 8;
PLLDIG.PLL0DV.B.RFDPHI = 2;
PLLDIG.PLL0DV.B.PREDIV = 1;
PLLDIG.PLL0DV.B.MFD = 8;

/*
Configure PLL1 Dividers for 200 MHz
fPLL1_VCO = fPLL1_REF x (PLL1DV[MFD] + PLL1FD[FRCDIV]/2^12)
= 40MHz x (20 + 0) = 800 MHz

fPLL1_PHI = fPLL1_REF * ( (PLL1DV[MFD] + PLL1FD[FRCDIV]/2^12) / (2 x PLL1DV[RFDPHI]) )
= 40MHz x (20 + 0) / (2 x 2) = 200 MHz

*/
PLLDIG.PLL1DV.B.RFDPHI = 2;
PLLDIG.PLL1DV.B.MFD = 20;

/* Enable PLL0/PLL1 in DRUN mode and set PLL1 as SYS_CLK */
MC_ME.DRUN_MC.R = 0x001300F4;

/******************** Configure Clock Dividers **********************/

SIUL2.MSCR[22].R = 0x22800001; /* Configure CLK_OUT (B6) */
//MC_CGM.AC6_SC.B.SELCTL = 0; /* source AC6 is internal RCOSC */
//MC_CGM.AC6_SC.B.SELCTL = 2; /* source AC6 is PLL0 PHI */
//MC_CGM.AC6_SC.B.SELCTL = 1; /* source AC6 is XOSC */
MC_CGM.AC6_SC.B.SELCTL = 4; /* source AC6 is PLL1 PHI */
MC_CGM.AC6_DC0.R = 0x80090000; /* Aux clock select 6 divider 0 --> div by 10 (CLK_OUT) */

MC_CGM.AC0_SC.B.SELCTL = 2; /* source AC0 is PLL0 PHI */
MC_CGM.AC0_DC0.R = 0x80000000; /* Aux clock select 0 divider 0 --> div by 1 (MOTC_CLK) */
MC_CGM.AC0_DC1.R = 0x80070000; /* Aux clock select 0 divider 1 --> div by 8 (SWG_CLK) */
MC_CGM.AC0_DC2.R = 0x80010000; /* Aux clock select 0 divider 2 --> div by 2 (ADC_CLK) */

MC_CGM.AC1_DC0.R = 0x80010000; /* Aux clock select 1 divider 0 --> div by 2 (FRAY_PLL_CLK) */
MC_CGM.AC1_DC1.R = 0x80030000; /* Aux clock select 1 divider 1 --> div by 4 (SENT_CLK) */

MC_CGM.AC2_DC0.R = 0x80030000; /* Aux clock select 2 divider 0 --> div by 4 (CAN_PLL_CLK) */

MC_CGM.SC_DC0.R = 0x80030000; /* Sys clock select divider 0 --> div by 4 (PBRIDGEx_CLK) -> DSPI clock = 50 MHz, PIT clock = 50 MHz (0x80070000 would be div by 8, 25 MHz) */

/******************** Start the core **********************/
/* Main and checker cores running in RUN3:0, DRUN, SAFE, TEST modes */
MC_ME.CCTL0.R = 0x00FE;

/* Set PRAM controller WS to 1 since running SYS_CLK as PLL1 (200 MHz) */
PRAMC.PRCR1.B.FT_DIS = 1;

/******************** Perform mode change **********************/
/* Mode change re-enter the DRUN mode, to start cores, clock tree & PLL1 */
MC_ME.MCTL.R = 0x30005AF0; /* Mode & Key */
MC_ME.MCTL.R = 0x3000A50F; /* Mode & Key inverted */

while(MC_ME.GS.B.S_MTRANS == 1); /* Wait for mode entry complete */
while(MC_ME.GS.B.S_CURRENT_MODE != 0x3); /* Check DRUN mode entered */

 

PS 2:

Can the PFCR1 register values affect this hang issue? Before the hang, the software is not performing any flash write.
