How to tackle this very strange and difficult problem...

simmania

Hi all. I have a very strange problem and hope someone can give me advice on how to tackle it, because I have been working on this for days now and I just don't understand what is going wrong.

I'm working with MCUXpresso and a MIMXRT1010-EVK.

I have an ADC that generates an interrupt for each sample. And the sample rate is 288 kHz!

I do some processing in this interrupt depending on a global variable. By default the global flag is low and I do little processing.
I set an output pin high during the interrupt. On the oscilloscope I can see that the interrupt indeed occurs at a 288 kHz rate and that each interrupt only takes a fraction of a microsecond. That is about 8% processor load.
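As a rough check of these numbers, the load is simply the interrupt rate times the time spent in each interrupt. A minimal sketch in C, where the function names and the 278 ns and 1740 ns figures are illustrative assumptions rather than values taken from the project:

```c
#include <stdint.h>

/* Interrupt period in nanoseconds for a given interrupt rate in Hz
 * (rounded down). */
uint32_t isr_period_ns(uint32_t rate_hz)
{
    return 1000000000u / rate_hz;
}

/* CPU load in percent (rounded down): nanoseconds spent in the ISR per
 * second, divided by the number of nanoseconds in one second. */
uint32_t isr_load_percent(uint32_t rate_hz, uint32_t isr_time_ns)
{
    uint64_t busy_ns_per_s = (uint64_t)rate_hz * isr_time_ns;
    return (uint32_t)(busy_ns_per_s * 100u / 1000000000u);
}
```

At 288 kHz the period is about 3472 ns, so an interrupt routine of roughly 278 ns gives about 8% load, and one of about 1740 ns gives about 50%.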

In the main function a command loop is running. With this I can change the global variable and slowly increase the interrupt load, and I can see on the oscilloscope that the load for the interrupts goes up. Up to 80% or even 90% is normally no problem. The main command loop keeps running.

The problem:

Sometimes after a new compile the application starts correctly, but when I increase the global variable and thus the interrupt load, the main loop suddenly stops. But the interrupt keeps working. It always seems to happen at about the same interrupt load (about 40%).

And the next time I compile and run (after adding some code that has nothing to do with the interrupt) there is no problem at all, and the main loop keeps running while I increase the interrupt load. Does it depend on the position of code or variables in memory?

Any idea what could cause this and how to tackle this? I'm out of ideas.

In the interrupt routine I only read the ADC value and do some calculations. Then I write a global variable.

simmania

I think I found the problem!
I made a new thread about it because I think it can be very valuable information for developers working with high rate interrupts.

See this thread.

The problem seems to be that for the execution of the main loop external code reads are needed which are aborted again and again by some interrupt.
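One way to picture this explanation is a toy model, sketched below in C. It assumes that a code fetch from external memory restarts from scratch whenever an interrupt aborts it, so it can only finish if it fits inside the idle gap between two interrupts. The function names and the 2000 ns fetch time used afterwards are hypothetical, chosen only for illustration; they are not measurements from the board.

```c
#include <stdbool.h>
#include <stdint.h>

/* Idle time between the end of one ISR and the start of the next,
 * for a given interrupt period and ISR load in percent (0..100). */
uint32_t idle_gap_ns(uint32_t period_ns, uint32_t load_percent)
{
    uint32_t busy_ns = (uint32_t)((uint64_t)period_ns * load_percent / 100u);
    return period_ns - busy_ns;
}

/* Toy model (assumption): an aborted fetch restarts from the beginning,
 * so it completes only if it fits entirely inside one idle gap. */
bool fetch_can_complete(uint32_t period_ns, uint32_t load_percent,
                        uint32_t fetch_ns)
{
    return fetch_ns <= idle_gap_ns(period_ns, load_percent);
}
```

With a 3472 ns period (288 kHz), the idle gap at 50% load is 1736 ns, so the hypothetical 2000 ns fetch never completes in this model, while at 8% load it does. Code that is already cached would need no external fetch, which may be why starting with a low load avoids the freeze.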


mayliu1

Hi @simmania ,

Thank you so much for your interest in our products and for using our community.

If you use the MCUXpresso IDE with a bare-metal program, I suggest checking the Heap and Stack Usage view when the problem happens. Please also check the register status in the Registers, Faults, and Peripherals+ views.

Please check whether anything looks wrong.

Regarding what you wrote, "In the interrupt routine I only read the ADC value and do some calculations. Then I write a global variable": I suggest you do not do any calculations in the interrupt routine; you can do them in your main function.
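If processing is moved out of the interrupt as suggested here, one common pattern is a single-producer, single-consumer ring buffer: the ISR only stores the sample and the main loop does the calculations. The sketch below is a minimal version of that pattern. The names and the buffer size are assumptions, and it relies on a single-core MCU where the ISR is the only writer of the head index and the main loop is the only writer of the tail index.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 256u /* must be a power of two */

static volatile uint16_t ring_data[RING_SIZE];
static volatile uint32_t ring_head; /* written only by the ISR */
static volatile uint32_t ring_tail; /* written only by the main loop */

/* Called from the ISR: store one sample; report false if the buffer is
 * full and the sample had to be dropped. */
bool ring_push(uint16_t sample)
{
    uint32_t head = ring_head;
    if (head - ring_tail == RING_SIZE) {
        return false;
    }
    ring_data[head & (RING_SIZE - 1u)] = sample;
    ring_head = head + 1u; /* publish only after the data is written */
    return true;
}

/* Called from the main loop: fetch the oldest sample, if there is one. */
bool ring_pop(uint16_t *sample)
{
    uint32_t tail = ring_tail;
    if (tail == ring_head) {
        return false;
    }
    *sample = ring_data[tail & (RING_SIZE - 1u)];
    ring_tail = tail + 1u;
    return true;
}
```

At 288 kHz the main loop has to drain the buffer fast enough on average, otherwise ring_push starts returning false and samples are lost.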

Hope this helps.
If you still have questions about it, please let me know.

Best Regards

mayliu

simmania

Hello Mayliu1,

Thanks for your reply.

One of the strange things is that when the problem occurs and I pause the MCU, I see nothing strange. And when resuming the MCU again, it just works again!!

Note that the problem also occurs without the debugger. So it is not a problem with the debugger.

As for your remark about the calculations in the interrupt: I need to do some calculations in the interrupt; that is just what the application requires.
And currently my interrupt routine just does some dummy stuff to create some load to show the problem.

Do you have a MIMXRT1010-EVK board? I could send the project so that you can see the strange behavior yourself.

mayliu1

Hi @simmania ,

Thanks for your reply.

I suggest you disable all other modules so that only the ADC module is working, then increase the interrupt load and check whether the project works normally.

Also, I have a MIMXRT1010-EVK board. You can send the project to me and I will test it.

Best Regards

mayliu

simmania

Hi @mayliu1, did you have a chance to test it? I really do not have any clue why this is happening and I'm currently out of ideas.

mayliu1

Hi @simmania ,

Thanks for your reply.

I downloaded your project and debugged it. It ran normally without getting stuck.

I found that g_delay was fixed at 20 no matter what value I entered.

So I changed it to 60 as you suggested, and the project still ran without getting stuck.

Could you try the following:

1: Set a larger stack and heap size.

2: Please do not do any calculations or delays in the ADC interrupt handler (ADC1_IRQHANDLER).

It's not advisable to do floating-point calculations in the ADC interrupt handler (ADC1_IRQHANDLER). This kind of practice is not reasonable.

Hope this helps.
If you still have questions about it, please let me know.

Best Regards

mayliu

simmania

@mayliu1, when you start the demo it should say "Hit key for command ... " on the terminal. Then you can enter a number from 1 to 9, which defines the interrupt load. It then flashes the LED 20 times, sets the load back to 20, and you can enter another number.
You should see the LED flashing after pressing a number, and the flash frequency also depends on the load (selected digit). If this is not the case, the demo does not work.

Note that only setting g_delay does not work, because g_mode needs to be set too, which is also done by the main loop.

Please try again, because I think there is something seriously wrong.

I tried to increase the stack size, but that did not help.

And, yes I know, one should not do many calculations in an interrupt routine. But this is what the application is about. I cannot change that.
And anyway, I use an artificial interrupt load to make the problem easily visible.

simmania

And note that it does not seem to go wrong when you first enter some low digits (low interrupt load) after starting. Then for some reason the high load works OK.
But when the first digit you press after a restart is 8 or 9, it fails.

mayliu1

Hi @simmania ,

Thanks for your reply.

I did a validation on my side.

1: I changed where the PIT timer start function is called in the sequence, and I disabled a delay function.

2: I added some code to the ADC interrupt handler:

ADC_ETC_ClearInterruptStatusFlags

SDK_ISR_EXIT_BARRIER;

Hope this helps.
If you still have questions about it, please let me know.

Best Regards

mayliu

simmania

I was able to narrow it down to a very simple program.

It has an ADC interrupt routine that is called at 288 kHz. This routine does some dummy stuff depending on a global variable, so the load can be changed. Note that the load can also be monitored at pin 16 of J56.

The main loop just waits for a key (1-9) and then sets the global variable for the interrupt routine load accordingly. It then flashes the LED a few times using a software loop delay. So when the interrupt load is high, the LED will flash at a lower frequency.
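Because the delay is a software loop, it stretches as the interrupt takes more CPU time: if the interrupt uses a fraction p of the CPU, the main loop gets 1 - p, so the delay takes about 1 / (1 - p) times longer. A small sketch of that arithmetic in C (the function name is hypothetical, and the model assumes the measured load already includes interrupt entry and exit overhead):

```c
#include <stdint.h>

/* Wall-clock duration of a software delay loop that needs nominal_ms of
 * CPU time while interrupts consume load_percent of the CPU.
 * Returns UINT32_MAX when the interrupt leaves no CPU time at all. */
uint32_t stretched_delay_ms(uint32_t nominal_ms, uint32_t load_percent)
{
    if (load_percent >= 100u) {
        return UINT32_MAX;
    }
    return (uint32_t)((uint64_t)nominal_ms * 100u / (100u - load_percent));
}
```

So a delay that takes 100 ms with no interrupt load takes about 200 ms at 50% load and about 1000 ms at 90% load.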

When I start this program and hit '6' (which is about 50% load for the interrupt routine), the main loop freezes. The LED does not flash anymore. But the interrupt routine continues (pin 16 of J56 is still toggling).
When pausing and resuming with the debugger (a few times), the main loop starts again! And from then on, the main loop does not freeze for any number that I hit, even for the highest load of the interrupt routine.

Note that this does not always happen, so you may need to try it a few times.

Note also that when you first hit a low number like '3', the main loop does not freeze, even if you select a high number after that. The problem does not occur! So could it be cache related?

Note also that there is still some code in the program that is not executed or does useless things. But whenever I change something and recompile, the problem can go away. This version is the smallest I could make that shows the problem, at least on my MIMXRT1010-EVK board (I have two EVK boards and both show the same problem).

I used a recent SDK build, version 2.16.000 (847 2024-07-12), Manifest version 3.14.0

I used a recent install of MCUXpresso IDE: v24.9 [Build 25][2024-09-26]

simmania

I noticed something very strange (and interesting) when adding these lines to the interrupt routine (where g_mode case 3 is handled):

if (!GPIO_PinRead(BOARD_INITPINS_USER_BUTTON_GPIO, BOARD_INITPINS_USER_BUTTON_GPIO_PIN)) {
    g_delay = 20;
}

These extra lines make it possible to switch to a very low load for the interrupt routine (by making g_delay = 20) when the USER_BUTTON (SW4) is pressed.
(The USER_BUTTON also needs to be configured as an input, and a pull-up is needed.)

When the main loop freezes and you press the USER_BUTTON, the interrupt load goes to a low value and the main loop continues! It may freeze again, but press the button and it continues again. And at some point it will not freeze anymore at all.

So the main loop does not crash. It just cannot continue for some reason because the interrupt routine load is a bit high. Once the load is very low again, the main loop continues.

Note that the main loop can freeze when the interrupt routine load is only about 50% (checked with oscilloscope). And when all is ok a load of 90 or even 95% seems no problem.

simmania

Hello @mayliu1 , I added your changes, but that didn't help.

Did you manage to reproduce the problem? I think that is a very important first step. As I said, this is a very strange problem and I have already tried all kinds of things.

You moved the PIT initialization to the main loop at the end of all initialization. Why?
And why do you want to remove the delay?

Are these just guesses or is there a purpose? I would like to know, so that I can learn to solve this kind of problem myself.

As I mentioned earlier: changing the code in any way may result in the problem not manifesting itself. This does not mean there is no problem any more. It is a ticking time bomb. The bug can manifest itself after the next code change.

You asked to insert ADC_ETC_ClearInterruptStatusFlags.

I guess this is what you wanted me to do:

uint32_t flags = ADC_ETC_GetInterruptStatusFlags(ADC_ETC_PERIPHERAL, kADC_ETC_Trg0TriggerSource);
ADC_ETC_ClearInterruptStatusFlags(ADC_ETC_PERIPHERAL, kADC_ETC_Trg0TriggerSource, flags);

If this manual clear is really needed, I think the code would never work OK. But it does sometimes run perfectly.

Please run the demo, type a low digit and see the LED flash. Try some other digits too.
Then restart and type a '9' (being the largest interrupt load). Does it fail?

mayliu1

Hi @simmania ,

Thanks for your reply.

As I said before, it's not advisable to do floating-point calculations in the ADC interrupt handler (ADC1_IRQHANDLER). This kind of practice is not reasonable.

Question: You moved the PIT initialization to the main loop at the end of all initialization. Why?
And why do you want to remove the delay?

Answer: What I moved is the PIT timer start, not the PIT initialization. I suggest that the ADC_ETC, XBAR, and PIT modules are all fully initialized first, and only then the PIT timer is started to trigger the ADC_ETC module. Also, why do you need to add the delay function?

I think you can refer to SDK demo "evkmimxrt1010_adc_etc_hardware_trigger_conv".

Question: uint32_t flags = ADC_ETC_GetInterruptStatusFlags(ADC_ETC_PERIPHERAL, kADC_ETC_Trg0TriggerSource);
ADC_ETC_ClearInterruptStatusFlags(ADC_ETC_PERIPHERAL, kADC_ETC_Trg0TriggerSource, flags);
If this manual clear is really needed, I think the code would never work OK. But it does sometimes run perfectly.

Answer: I think that in the ADC_ETC interrupt handler it is usually necessary to clear the interrupt status flags (ADC_ETC_ClearInterruptStatusFlags).

If the flag bits haven't been cleared, the processor will regard the interrupt as still pending and will keep re-entering the ADC_ETC interrupt service routine. This results in excessive use of system resources and disrupts the normal execution flow of the program.

You can refer to the ADC_ETC interrupt handler in the SDK demo "evkmimxrt1010_adc_etc_hardware_trigger_conv".
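The re-entry behaviour described here can be illustrated with a small host-side model in C. It is not SDK code: irq_flag stands in for the peripheral's status flag, and the loop stands in for the interrupt controller re-entering the handler while the request is still pending. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stands in for the peripheral's interrupt status flag (assumption). */
static bool irq_flag;

/* Model of an interrupt handler; clear_flag selects whether it
 * acknowledges the interrupt source before returning. */
static void model_handler(bool clear_flag)
{
    if (clear_flag) {
        irq_flag = false;
    }
}

/* Raise one interrupt request and count how many times the handler is
 * entered. The count is capped at max_entries to keep the model finite. */
uint32_t handler_entries_for_one_request(bool clear_flag, uint32_t max_entries)
{
    uint32_t entries = 0;
    irq_flag = true;
    while (irq_flag && entries < max_entries) {
        model_handler(clear_flag);
        entries++;
    }
    return entries;
}
```

With the flag cleared, the handler runs once per request. Without it, the model re-enters the handler until the cap is reached, which on real hardware would mean the main loop never gets to run.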

BR

mayliu

simmania

Thanks @mayliu1 for the answers.

A very important question is: can you reproduce the problem or not?

mayliu1

Hi @simmania ,

I think you should reevaluate your project architecture, as it is unreasonable.

I also suggest that you modify the PIT trigger cycle time. Is it possible that the ADC conversion time does not match the PIT trigger time?

I modified the code as I described, and the project runs okay.

BR

mayliu

simmania

I think you are missing my point.

There is something weird going on that cannot be explained. The problem can disappear when code is changed, even when the change has nothing to do with the problem. Therefore, changing something and saying it is fixed when it works does not make sense. It is a ticking time bomb.

The only way to tackle this is to reproduce the problem and then find out what is going on. See it as an opportunity to find a serious bug in the SDK (or chip).

Saying that I should change my architecture is also not a way to solve this. If I want to use the interrupt this way, then I should be able to do that. But again, that is not the point. The point is that there may be a serious problem with the SDK (or chip) and I made a demo that shows the problem. Maybe I'm doing something wrong. But the changes you mentioned are not fundamental errors that I made.

And clearing an ADC_ETC interrupt inside an ADC interrupt makes no sense to me. And it works most of the time anyway. If an interrupt flag really were left uncleared, it would not work correctly at all. Anyway, I tried your changes and the problem is still present.

I am evaluating the NXP MCU for a new product line. If this problem is not solved then we will not be able to choose this MCU. I would really like to use this MCU, so please try to see this my way.

If you do not feel like you want to do that, then please add somebody else to this thread who is interested in finding a possible problem in the SDK (or chip).
