timing_pal S32SDK_for_S32K1xx_RTM_3.0.3 deadlock bug

jeremie_chirat · ‎08-30-2022

Hi,

We have been seeing random crashes (only 2-3 every 12 hours) on our S32K148-based board. The board has an external watchdog, so a deadlock causes a reboot. After a painstaking analysis, and following some advice from the NXP forums, we disabled the watchdog and managed to get a debug session by doing an "attach running" after a crash.

We confirmed the culprit: the board got stuck inside the FTMX_ChX_ChX_IrqHandler:

Example: FTM0_Ch0_Ch1_IrqHandler line 287:

if (chan0IntFlag && g_ftmChannelRunning[0][0])
{
    TIMING_Ftm_IrqHandler(0U, 0U);
}

The check on the running channel never passed, so TIMING_Ftm_IrqHandler was never called. As a result the interrupt was never disabled and kept firing continuously.

Our system uses an RTOS (SafeRTOS, based on FreeRTOS). After analysis we saw that in timing_pal.c, TIMING_StartChannel() line 769, the interrupt is enabled before the channel running status is updated:

/* Enable the channel by enable interrupt generation */
retVal = FTM_DRV_EnableInterrupts(ftmInstance, (1UL << channel));
(void)retVal;
/* Update channel running status */
g_ftmChannelRunning[ftmInstance][channel] = true;

The requested period is 500 us (we need a hardware timer because this is shorter than one RTOS tick of 1 ms). If the task happens to be interrupted between the interrupt enable and the channel status update, and the system takes too long to return, the timer interrupt fires while the channel is still marked as not running. The result is a deadlock, which is what happened to us.

I think that, to solve this issue, the order of the two actions should be inverted: update the channel status first, then enable the interrupt as the last action.
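To illustrate why the order matters, here is a minimal host-side model of the race. It is only a sketch, not SDK code: every name in it (model_irq_handler, model_enable_interrupt, irq_left_pending, run_model_checks) is hypothetical, and the worst case is simulated by having the interrupt fire immediately when it is enabled.

#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for g_ftmChannelRunning[x][y] and the channel flag. */
static volatile bool channel_running;
static volatile bool irq_flag_pending;

/* Models the guard in FTM0_Ch0_Ch1_IrqHandler: the flag is only
 * serviced (cleared) when the channel is marked as running. */
static void model_irq_handler(void)
{
    if (irq_flag_pending && channel_running) {
        irq_flag_pending = false;
    }
}

/* Models enabling the interrupt in the worst case: the compare event
 * has already happened, so the handler runs right away. */
static void model_enable_interrupt(void)
{
    irq_flag_pending = true;
    model_irq_handler();
}

/* Returns true if the interrupt is still pending (unserviced) after the
 * start sequence. On the real target this would mean endless re-entry. */
static bool irq_left_pending(bool status_first)
{
    channel_running = false;
    irq_flag_pending = false;

    if (status_first) {
        channel_running = true;      /* proposed order */
        model_enable_interrupt();
    } else {
        model_enable_interrupt();    /* order currently in timing_pal.c */
        channel_running = true;
    }
    return irq_flag_pending;
}

void run_model_checks(void)
{
    assert(irq_left_pending(false)); /* current order: flag stays stuck */
    assert(!irq_left_pending(true)); /* inverted order: flag is serviced */
}

In this model the handler is called only once. On the hardware, an unserviced flag would make the handler re-enter forever, so the task never gets the chance to set the status afterwards.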

Since we did not want to modify the SDK, the workaround we used is to enter a critical section just before the call to TIMING_StartChannel() and leave it just after. Long-run tests confirmed that this fixed the crashes.
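The shape of that workaround is roughly the following. This is a sketch under assumptions, written so it compiles on a host: port_enter_critical, port_exit_critical and timing_start_channel_stub are hypothetical stand-ins. On the target they would be the RTOS critical-section calls (taskENTER_CRITICAL() / taskEXIT_CRITICAL() in plain FreeRTOS, or the SafeRTOS equivalents) and the real TIMING_StartChannel().

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the RTOS critical-section nesting counter. */
static unsigned critical_nesting = 0U;
static bool started_while_masked = false;

static void port_enter_critical(void)
{
    critical_nesting++;
}

static void port_exit_critical(void)
{
    assert(critical_nesting > 0U);
    critical_nesting--;
}

/* Stand-in for TIMING_StartChannel(): it only records whether it was
 * called with interrupts masked. */
static void timing_start_channel_stub(uint8_t channel, uint32_t period_ticks)
{
    (void)channel;
    (void)period_ticks;
    started_while_masked = (critical_nesting > 0U);
}

/* Wrapper used by the application instead of calling the SDK directly. */
void start_channel_guarded(uint8_t channel, uint32_t period_ticks)
{
    port_enter_critical();
    timing_start_channel_stub(channel, period_ticks);
    port_exit_critical();
}

With the interrupt masked during the call, a timer event that occurs early simply stays pending until the critical section is left, and by then the channel status is already set. This assumes the critical section actually masks the FTM interrupt: on Cortex-M ports of FreeRTOS, interrupts with a priority above configMAX_SYSCALL_INTERRUPT_PRIORITY are not masked, so the FTM interrupt priority should be checked against that limit.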

Can you confirm the bug?

Best regards,

Jérémie Chirat

cuongnguyenphu · ‎09-07-2022

Hi @jeremie_chirat,
Thanks for your hard work. I confirm this is a potential bug in the SDK, and it is still present in our latest version, S32_SDK_S32K1xx_RTM_4.0.3.
I have raised this issue, together with your proposed solution, with our development team and asked them for a workaround. Updating the channel status before enabling the interrupt looks like a great idea.

jeremie_chirat · ‎09-07-2022

Thank you for your answer, glad that what I did can be useful.