Wrong interrupt handler being called after period of normal operation (MCF5235)


sanhadrin
Contributor I

Hi all,

 

I'm having an issue when using the CAN controller; after a period of normal operation, the program will throw an "illegal instruction" exception. After some troubleshooting, I was able to determine that it was trying to call an interrupt handler for interrupt source 63 on controller 0 (unused interrupt). Unfortunately, I'm somewhat at a loss to determine why this is happening. SWIACK0 is reporting the interrupt source as 36 (PIT0) from within the source 63 interrupt handler, but that has a handler that is vectored without issue up to that point. I checked the vector table, thinking that there was corruption of the function pointers causing an issue, but there's no corruption present. I'm not quite sure where to look next to troubleshoot the issue.

 

One key thing I noticed is that if the interrupt level for the CAN interrupt sources (a single handler services all 18 sources) is set to level 7, the issue never happens, but if it is set to level 6 at the highest priority, it does. This seems to point to CAN interrupts not being handled in a timely fashion as the cause of this behaviour, but I would like to be able to determine the source of the issue with the available information if at all possible.

 

Thanks for any help you can give. Below is an account posted at Stack Overflow that has unfortunately gone unanswered so far.

 

At a certain point in my C application (running on bare metal, in supervisor mode), when using the CAN controller via a third-party library, an Illegal Instruction fault was occurring, which is caught in an ISR; by that point, the program counter, fault, and return address in the exception stack frame available to the ISR were already 0. When I first encountered it, I was able to back up the stack a bit, and saw a stack trace like this:

Thread [1] <main> (Suspended : Step)
  0x0
  0x41f42200
  ...
  timerInterrupt() at timer.c:1,175 0x2432ec
  0x41902210
  ...
  main() at main.c:1,433 0x211a44

I ran the application several times with a known state that could reproduce this issue quickly, usually down to the exact same stack trace and saved instruction at the point of the interrupt/exception before the jump to 0x0. Through testing, I noticed that the jump would only happen on the instruction following interrupts being re-enabled after being disabled, or in a section of code where interrupts weren't masked. So, I figured that this must be a user interrupt causing the issue, though I wasn't sure why it would appear to try to call a handler that wasn't set when the interrupt wasn't enabled in the mask. I'm not 100% sure of the meaning of the addresses in the IPSBAR range that precede an ISR being called, but since they're the same for each call of that ISR, I figured I could use them to indicate the source of the last interrupt/exception.

So, I added a default interrupt handler to all interrupt vectors on interrupt controller 0 before the normal handlers were added and ran the application again - and lo and behold, a breakpoint set in the default handler was hit when that suspected interrupt was fired (eg, stack looked like this):

Thread [1] <main> (Suspended : Step)
  __DefaultInterrupt() at interrupts.c 0x41f42200
  ...
  timerInterrupt() at timer.c:1,175 0x2432ec
  0x41902210
  ...
  main() at main.c:1,433 0x211a44

Observing the value of SWIACK0 in that function, I saw that the interrupt source was 100 (user interrupt 36, PIT0 interrupt). Well, that already has an ISR (timerInterrupt() in the stack above). I next checked the area of RAM where ISR function pointers were saved to see if the timer interrupt handler function pointer was corrupted, but there was no change between the time all interrupt handlers were set, and when the breakpoint in the default handler was hit.
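The relationship between the value read from SWIACK0 and the interrupt source number can be written down explicitly. This is a minimal sketch, assuming (as the numbers above imply) that INTC0 maps source n to vector 64 + n, so that vector 100 corresponds to source 36. The helper name is hypothetical and not part of any vendor header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: INTC0 maps interrupt source n to vector 64 + n, which is
 * consistent with SWIACK0 reading 100 for PIT0 (source 36). */
#define INTC0_VECTOR_BASE 64u

/* Hypothetical helper: convert a vector number read from SWIACK0 into
 * the INTC0 interrupt source number (1..63). */
static uint8_t intc0_source_from_vector(uint8_t vector)
{
    assert(vector > INTC0_VECTOR_BASE && vector < INTC0_VECTOR_BASE + 64u);
    return (uint8_t)(vector - INTC0_VECTOR_BASE);
}
```

Calling intc0_source_from_vector(100) gives 36, matching the PIT0 source reported above.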

I also noticed that if I set the interrupt level of the interrupt handler for the CAN controller to 7 (the same interrupt handles all 18 FlexCAN interrupt sources), the issue doesn't occur. I'm not sure what to make of it just yet, but the issue does absolutely point to either the CAN library or controller being at issue.

EDIT - I wasn't sure at this point exactly which ISR was handling the interrupt, but I've added individual handlers to the initially suspected interrupt sources, and it's always interrupt source 63 - which is an unused interrupt, according to the documentation, and the last one on interrupt controller 0.

EDIT 2: It occurred to me that the active interrupt source in SWIACK0 is actually correct, but there might be another issue, like the vector base address might be getting rewritten. Unfortunately I'm not sure how to read it back as it's a write-only value. I initially thought that the interrupt source for PIT0 was in that register because the default interrupt handler was getting called from within the timer interrupt handler, but it's also indicated if the timer interrupt isn't in the stack. The reference manual indicates that the on-chip debug device can be used to read back control registers and therefore VBR, but I don't see any information in the debug manual to do this.

To make a rambling story short, I want to find out the source of the jump to hyperspace, or what information I can use to get it.

  • What's the meaning of the addresses in the IPSBAR range getting pushed onto the stack?
  • Since those addresses seem to be completely tied to their source, is there a way to use a value in the stack (eg, 0x41f42200 in the first example) to determine the source of the interrupt/exception that pushed it onto the stack?
  • Am I going about this completely wrong? I'm more than happy to abandon any and all of this line of thinking.

Thanks for any help or insight, and I'll update this with more (concise) information when I can rub two brain cells together to do it.

1 Solution
TomE
Specialist II

> It occurred to me after talking about the interrupt level/priority

> for the ISR that this doesn't actually make sense,


When did that ever matter in electronic design and programming? Most design decisions are "accidents" based on compatibility with previous "accidents". The number of design bugs I've had to work around in the last few years...

The Interrupt chapter says the ICRs MUST be programmed with "unique and non-overlapping level and priority definitions". It also states "Thus, a maximum of 8 fully-programmable interrupt sources are mapped into a single interrupt level." A paranoid reading of that says you have to burn six priorities across three levels for 18 sources if you're using FlexCAN, or at least use fewer message buffers. I suggest you program it this way and see if your problems go away. We have MCF5235 FlexCAN code at work. I'll check it on Tuesday and see how we programmed the levels.


> as levels are based on source, not handlers,


A source is a track on the chip coming from a hardware module to the interrupt controller. There are 18 of these connecting the FlexCAN controller to the Interrupt Controller.


> and the issue actually is the case in #1 - the same interrupt level/priority

> is actually used across all interrupt sources for the CAN controller.


Not according to the manual.

21.5.1 Interrupts

There are three interrupt sources for the FlexCAN module. A combined interrupt for all 16 MBs ...

The other two interrupt sources (bus off and error) act in the same way, and are located in the ERRSTATn register


Now it might be possible to read the above to say that you'll only get ONE MB interrupt request (from FlexCAN to the interrupt controller) at the one time, but there's nothing in the FlexCAN documentation to say there's an inbuilt fixed priority in the interrupt requests, so I'd say that the 16 IFLAGn bits are combined with the 16 IMASKn bits and routed directly to the interrupt controller. Normally with CAN you can only get one MB going active at any one time as you can only send or receive one buffer at a time. If your mainline code is fast enough to guarantee to service that interrupt before the next one comes along you'll only get one at a time.

But what happens if your code is "busy somewhere else" in an ISR or in the mainline with interrupts disabled, and TWO MBs are now requesting interrupts? What happens if MB15 is requesting interrupt 58 (0x3a 111010) and MB10 is requesting 53 (0x35 110101)? The Interrupt Controller has to use the Levels to decide which vector to generate. If they're the same then the vectors will BOTH be driven onto the internal bus and will AND or OR or simply short each other out. The words "undefined behavior" mean "the chip can do whatever it wants, and it won't be what you want".

On the other hand "combined interrupt" may mean "the first one set wins" and blocks the other ones until it is cleared, but it isn't "combined" on the one track to the interrupt controller as it needs extra information to generate the right vector. Don't always trust that the person who wrote the documentation knew what they were doing. They may mean "combined" because the status and mask bits appear in the same registers. That may be all it means.

Bus Off and Bus Error are separately generated, and they certainly need separate levels/priorities.


> I only remember swapping one level/priority combination for the CAN controller,


I'm pretty sure you need 18, or you need "2 plus the number of MBs in use".


> Setting "the" interrupt level/priority for all CAN interrupt sources to level 7 likely

> prevented pending interrupts from accumulating on the CAN bus while interrupts were disabled


I think you've explained why the "IPL7 fix" worked. It stopped you getting a second CAN interrupt while a previous one was pending. Do you realise that unless you cleared all the 16 IMASKn and the other two mask bits at the start of the interrupt (and enabled them at the end) then you're likely to have recursive interrupts, with new CAN interrupts interrupting the previous CAN interrupt service routine? ISRs normally aren't written to handle that sort of abuse properly and it causes all sorts of other rare and impossible to diagnose problems.
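The mask-at-entry, restore-at-exit pattern described above might look roughly like this. It is only a sketch: sim_imask and sim_iflag are plain variables standing in for the FlexCAN IMASK and IFLAG registers so the code compiles off-target, the function names are hypothetical, and the bus-off and error mask bits mentioned above are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the FlexCAN IMASK and IFLAG registers (assumption:
 * on hardware these are memory-mapped and accessed through volatile
 * pointers from the vendor header). */
static volatile uint16_t sim_imask;
static volatile uint16_t sim_iflag;
static unsigned handled_count;   /* hypothetical counter of serviced buffers */

/* Placeholder for the real per-buffer work. */
static void flexcan_handle_mb(unsigned mb)
{
    assert(mb < 16u);
    handled_count++;
    /* Modelled as a plain clear; the acknowledge sequence on real
     * hardware is not shown here. */
    sim_iflag = (uint16_t)(sim_iflag & ~(1u << mb));
}

/* ISR sketch: block all message buffer interrupts while servicing. */
static void flexcan_isr_sketch(void)
{
    uint16_t saved = sim_imask;
    sim_imask = 0;                          /* no new MB requests from here on */

    uint16_t pending = (uint16_t)(sim_iflag & saved);
    for (unsigned mb = 0; mb < 16u; mb++) {
        if (pending & (1u << mb))
            flexcan_handle_mb(mb);
    }

    sim_imask = saved;                      /* re-enable on the way out */
}
```

Because the mask is cleared before any buffer is serviced, a second buffer going active during the routine cannot re-enter it; it is picked up on the next entry instead.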

Of course you can't be the only person on the planet with this problem. So I'm ashamed I wrote all of the above without SEARCHING for previous reports first.

I wonder how the Freescale operating systems like MQX set up the FlexCAN interrupts?

The wrong way of course! They set them up to the same level and this simply doesn't work as the following customer found.

Re: Unhandled Interrupt vector 0x9f (159)

- Second, MQX does exactly what the manual says not to do. In function FLEXCAN_Int_enable, called as follows:

Here's someone else programming them all with unique values, so they know that's the way to do it:

MCF5282 FlexCAN receive masks corrupted

Tom


6 Replies
TomE
Specialist II

You've provided plenty of details, but I'm going to be lazy today. I'll let you know about two common interrupt problems that you should check for. It is likely one or both won't match your problem in which case I'll read your report properly.

Priority Level Assignment

The MCF52xx interrupt controllers have a nasty "gotcha" that the MCF53xx ones don't have. The MCF53xx ones all have fixed priorities within the one level. That means you can't change the priorities, but it also means you can't get them wrong. With the MCF52xx ones you have the flexibility of reordering the priorities within the one level, with the risk that you will mess it up:

13.2.1.6 Interrupt Control Register (ICRnx, (x = 1, 2,..., 63))

Note: It is the responsibility of the software to program the ICRnx registers with unique and non-overlapping level and priority definitions. Failure to program the ICRnx registers in this manner can result in undefined behavior.

With the MCF53xx you can have all 128 interrupt sources set to the same level. With the MCF52xx you can only have EIGHT sources on the one level. Since you have 103 programmable sources, and only 6 maskable levels (1 to 6) that tells me you can only use 6*8=48 out of the 103 interrupts in a design. If your software lists all the interrupts in the system in one file, in Level and Priority order, it is pretty easy to check that you've got them unique. If (as is more likely) the ICR registers are accessed all over the place in the different device drivers, it is difficult to make sure they're all different and too easy to change the level on one device without changing the priority, and colliding with a different device. That's why moving to "7" fixes your problem. Moving to "5" or "4" might fix it too.

You're seeing PIT0 Interrupt on vector 36. That's 0x24 or 00100100. CAN generates 43-60 (0x2b to 0x3c). If you're getting a CAN interrupt with the bit pattern  00x11x11 (meaning 00111011 or 0x3B or 59) which is the "Error Interrupt" at the same time as PIT0 then if they have the same level and priority they'll add together to generate a fake interrupt of (36 | 59 = 63).
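The arithmetic is easy to check. The sketch below only models the bitwise-OR hypothesis described above; the manual calls the real behaviour undefined, so the hardware is not guaranteed to combine colliding sources this way. The function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Models the hypothesis that two sources sharing a level/priority pair
 * have their 6-bit source numbers ORed together. Illustration only:
 * the actual result of such a collision is undefined. */
static uint8_t colliding_source(uint8_t a, uint8_t b)
{
    assert(a < 64u && b < 64u);
    return (uint8_t)((a | b) & 0x3Fu);
}
```

With PIT0 (36, binary 100100) and the FlexCAN error interrupt (59, binary 111011), colliding_source(36, 59) yields 63, the unused source whose handler was being called.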


Spurious Interrupt
This probably isn't your problem. If it was you'd be seeing the Spurious Interrupt (number 24). It is worth making sure that all accesses in the code to the IMRs are always protected by setting the CPU IPL to "7". It may not be a problem now, but at some time you'll change the code (or someone will copy it for a different project) and the problem will bite them then.


13.2.1.2 Interrupt Mask Register (IMRHn, IMRLn)

NOTE If an interrupt source is being masked in the interrupt controller mask register (IMR) or a module’s interrupt mask register while the interrupt mask in the status register (SR[I]) is set to a value lower than the interrupt’s level, a spurious interrupt may occur. This is because by the time the status register acknowledges this interrupt, the interrupt has been masked. A spurious interrupt is generated because the CPU cannot determine the interrupt source. To avoid this situation for interrupts sources with levels 1-6, first write a higher level interrupt mask to the status register, before setting the mask in the IMR or the module’s interrupt mask register.


To be clear, all accesses to the IMRs that mask a source should be wrapped in a "disable everything / restore" pair.
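A "disable everything / restore" wrapper could be sketched as follows. To keep it compilable off-target, sim_sr and sim_imrl are plain variables standing in for the CPU status register and the controller's IMRL register; on the real part the status register is accessed with move-to/from-SR instructions. All names here are hypothetical, and the sketch assumes that a set bit in the IMR masks the source, as the wording of the note above suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the CPU status register and INTC0 IMRL (assumption). */
static uint16_t sim_sr;
static uint32_t sim_imrl;

#define SR_IPL_MASK 0x0700u   /* interrupt priority mask field, SR bits 10..8 */

/* Raise the CPU interrupt mask to level 7 and return the previous SR. */
static uint16_t irq_disable_all(void)
{
    uint16_t old = sim_sr;
    sim_sr |= SR_IPL_MASK;
    return old;
}

/* Restore the SR value saved by irq_disable_all(). */
static void irq_restore(uint16_t saved_sr)
{
    sim_sr = saved_sr;
}

/* Mask one source in the low mask register with the CPU held at IPL 7. */
static void intc_mask_source(unsigned source)
{
    assert(source < 32u);
    uint16_t saved = irq_disable_all();
    sim_imrl |= (uint32_t)1u << source;   /* assumed: set bit = source masked */
    irq_restore(saved);
}
```

Saving and restoring the whole SR, rather than forcing the mask back to a fixed level, keeps the wrapper safe to call from code that already runs at a raised IPL.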


Tom

sanhadrin
Contributor I

EDIT: It occurred to me after talking about the interrupt level/priority for the ISR that this doesn't actually make sense, as levels are based on source, not handlers, and the issue actually is the case in #1 - the same interrupt level/priority is actually used across all interrupt sources for the CAN controller. I only remember swapping one level/priority combination for the CAN controller, so I imagine that to be the case, and recalling how the CAN interrupts are enabled in the first place, there was definitely no way to split the 18 interrupt sources across three separate levels. The scenario I set up which could cause this failure involved sending a high number of PDOs over CAN at the same time as doing a high number of object value reads from another device via SDO; sniffing the bus showed that the failure always occurred at a point where a response from the remote device was retrieved, but not processed. D'oh. Setting "the" interrupt level/priority for all CAN interrupt sources to level 7 likely prevented pending interrupts from accumulating on the CAN bus while interrupts were disabled.

Hi Tom,

Thank you very much for your quick and detailed response. I'm sure I missed some important information since I've been so wrapped up in it.

As for priority/level assignment, we do have a header of defines for priority/level to have some measure of surety that there isn't a collision (with that very note from the interrupt controller chapter as the header). Reading all of the ICRnx registers from within the default interrupt handler was one of the things I've been considering over the weekend, but the scenario you give, which would produce the interrupt source exactly as I'm seeing it, has convinced me that it's the first thing to check on Monday.

As for setting the CAN interrupt level to 7 solving the issue, I did also set the CAN ISR to level 6, priority 7, with all other level 6 interrupts given unique, lower priorities, and the problem continued with the exact same symptoms. But, without looking at all ICRnx registers, I can't guarantee that I didn't just rearrange deck chairs on the Titanic.

As for the spurious interrupt, I did consider that situation after reading that section in the interrupt controller manual last week, but the only time the IMR is modified is during the setting of interrupt vectors during initialization. All masks of interrupts during normal operation are via the interrupt level mask in the status register alone.

I think it would be difficult for me to say anything with any certainty without the code in front of me, but I'll update tomorrow when I do. Again, thanks.

TomE
Specialist II

I wrote:

> We have MCF5235 FlexCAN code at work. I'll check it on Tuesday and see

> how we programmed the levels.

I'm surprised by this. Here is how we programmed them:

#include <stdint.h>

/* Dimensions inferred from the initializers below. */
#define CAN_NUM_BUSES     2
#define CAN_NUM_MSG_BUFS  16

/* Interrupt controller, source number, level and priority of one source. */
typedef struct
{
    uint8_t controller;
    uint8_t source;
    uint8_t level;
    uint8_t priority;
} INT_INFO;

/* Message buffer interrupts: eight per level, priorities 7 down to 0. */
const INT_INFO int_can_buf[CAN_NUM_BUSES][CAN_NUM_MSG_BUFS] =
{
    {
        { 1, 8, 6, 7 }, { 1, 9, 6, 6 }, { 1, 10, 6, 5 }, { 1, 11, 6, 4 },
        { 1, 12, 6, 3 }, { 1, 13, 6, 2 }, { 1, 14, 6, 1 }, { 1, 15, 6, 0 },
        { 1, 16, 5, 7 }, { 1, 17, 5, 6 }, { 1, 18, 5, 5 }, { 1, 19, 5, 4 },
        { 1, 20, 5, 3 }, { 1, 21, 5, 2 }, { 1, 22, 5, 1 }, { 1, 23, 5, 0 }
    },
    {
        { 0, 43, 6, 7 }, { 0, 44, 6, 6 }, { 0, 45, 6, 5 }, { 0, 46, 6, 4 },
        { 0, 47, 6, 3 }, { 0, 48, 6, 2 }, { 0, 49, 6, 1 }, { 0, 50, 6, 0 },
        { 0, 51, 5, 7 }, { 0, 52, 5, 6 }, { 0, 53, 5, 5 }, { 0, 54, 5, 4 },
        { 0, 55, 5, 3 }, { 0, 56, 5, 2 }, { 0, 57, 5, 1 }, { 0, 58, 5, 0 },
    }
};

/* Error interrupts, one per bus. */
const INT_INFO int_can_err[CAN_NUM_BUSES] =
{
    { 1, 24, 4, 7 } , { 0, 59, 4, 7 }
};

/* Bus-off interrupts, one per bus. */
const INT_INFO int_can_boff[CAN_NUM_BUSES] =
{
    { 1, 25, 4, 6 } , { 0, 60, 4, 6 }
};

So we're using IPL levels 4, 5 and 6. You'll also note we're using the *SAME* IPL/Priority pairs for the same buffers on the two different FlexCAN controllers. How can we get away with that? This paragraph:

13.3 Prioritization Between Interrupt Controllers

The interrupt controllers have a fixed priority, where INTC0 has the highest priority, and INTC1 has the lowest priority. If both interrupt controllers have active interrupts at the same level and priority, then the INTC0 interrupt will be serviced first. If INTC1 has an active interrupt that has a higher level or priority than the highest INTC0 interrupt, then the INTC1 interrupt will be serviced first.

Which needs to be read in conjunction with the following, which isn't quite correct:

It is the responsibility of the software to program the ICRnx registers with unique and non-overlapping level and priority definitions.

What the above should say is "ICRnx registers within each interrupt controller ..." as you can re-use the level/priority pairs BETWEEN the controllers!

So when I read the above and concluded:

Since you have 103 programmable sources, and only 6 maskable levels (1 to 6) that tells me you can only use 6*8=48 out of the 103 interrupts in a design.

That's wrong. It is "2*6*8=96 out of the 103 interrupts", which is better. But NOT obvious!
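A table in this form lends itself to an automatic check. The following is a sketch of one possible checker, assuming two interrupt controllers, levels 1 to 7 and priorities 0 to 7, and treating level/priority pairs as needing to be unique only within each controller, as argued above. The function name is hypothetical, and INT_INFO is repeated so the fragment stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint8_t controller;
    uint8_t source;
    uint8_t level;
    uint8_t priority;
} INT_INFO;

/* Returns 1 if no two entries on the same controller share a
 * level/priority pair, 0 on a collision or an out-of-range entry. */
static int int_info_unique(const INT_INFO *tab, size_t n)
{
    /* One priority bitmap per controller and level. */
    uint8_t used[2][8] = { { 0 } };

    assert(tab != NULL);
    for (size_t i = 0; i < n; i++) {
        const INT_INFO *e = &tab[i];
        if (e->controller > 1 || e->level < 1 || e->level > 7 || e->priority > 7)
            return 0;
        uint8_t bit = (uint8_t)(1u << e->priority);
        if (used[e->controller][e->level] & bit)
            return 0;               /* same pair already used on this controller */
        used[e->controller][e->level] |= bit;
    }
    return 1;
}
```

Run once at start-up over every table that feeds the ICRs, a check like this catches the case where one driver's level is changed without its priority. Two entries with level 4, priority 7 pass when they sit on different controllers and fail when both are on controller 0.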

Tom

sanhadrin
Contributor I

Hi Tom,

Thanks a lot for the advice and reading material - it was exactly what I needed. Setting the ICRnx registers for the FlexCAN sources to unique values was the major part of the solution in this case - my test case, which could reliably cause the failure within several seconds, has now run for an hour with absolutely zero issues.

Again, thank you!

TomE
Specialist II

There's another thing to watch out for. I mentioned that if you're using IPL7 then you may get "recursive interrupt calls". If you're using (like we do) interrupt levels 4, 5 and 6, then an IPL6 FlexCAN interrupt can interrupt a currently running FlexCAN interrupt service routine for IPL4 or IPL5. The only really safe way to handle this is for all the FlexCAN ISRs to set the CPU IPL to the highest one used by any of the FlexCAN requests. So in this case they should all force the CPU to IPL6. We don't have to do that as we're using 14 buffers as a "queue", filling them up and only enabling an interrupt on the LAST one to be sent. So we only expect one transmit interrupt at a time.

Tom
