Spurious interrupt with __declspec(interrupt)

kevb
Contributor I

Hi all,

I am using an MCF5474 with CodeWarrior 6.4. I recently introduced an exception handler, running at interrupt level 7, that counts SLT0 overflows. It is set up with

static void setup_time_counter()
{
    MCF_SLT_SLTCNT0 = 0xFFFFFFFF;   /* count down from 2^32 - 1 */
    MCF_SLT_SCR0 = MCF_SLT_SCR_IEN  /* enable interrupt */
                 | MCF_SLT_SCR_TEN  /* enable timer */
                 | MCF_SLT_SCR_RUN; /* run continuously */
    MCF_INTC_ICR54 = 0x39;          /* level 7, priority 1 */
    MCF_INTC_IMRH &= ~MCF_INTC_IMRH_INT_MASK54; /* unmask source 54 (SLT0) */
}

The interrupt handler looks like this

static unsigned long counter_wraps = 0;

__declspec(interrupt) void slice_time_counter()
{
    /* just count the overflows of the slice timer */
    ++counter_wraps;
    MCF_SLT_SSR0 = MCF_SLT_SSR_ST; /* write 1 to ST to clear the interrupt request */
}

And in the vector table

.extern _slice_time_counter

....

vector76: .long _slice_time_counter

I now randomly get spurious interrupts. They happen about 2-8 times a day.

If I change the __declspec(interrupt) declaration to a #pragma interrupt on .... #pragma interrupt off block, the problem goes away. The only difference I see in the generated assembly is that with __declspec the first instruction in the handler is "move.w #0x2700, sr"; with #pragma it is missing. Writing to the status register should not be necessary anyway, since the IRQ level is already 7, so the instruction is redundant and can go away. But why does it cause spurious interrupts?
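For reference, the version that works is the same handler wrapped in the pragmas instead of using __declspec, roughly like this:

#pragma interrupt on
void slice_time_counter()
{
    /* just count the overflows of the slice timer */
    ++counter_wraps;
    MCF_SLT_SSR0 = MCF_SLT_SSR_ST;
}
#pragma interrupt off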

Is this a known problem? What is the cause?

TomE
Specialist II

> I now randomly get spurious interrupts.

Are they "a spurious interrupt" or "THE spurious interrupt"? The "definitive spurious interrupt" is Vector 24, which happens if "there is no active interrupt source at the time of the level-n IACK". It is possible you have all unused interrupt vectors pointing at an "uninitialised interrupt" handler (a good idea, to catch that error case) and are referring to that as a "spurious interrupt".

Tom

kevb
Contributor I

Yes, it is "THE spurious interrupt" aka vector 24.

We have initialised the whole vector table. Every unused vector is set to a crash handler that prints the CPU registers and if possible a backtrace to the serial console. It then enters an endless loop.

TomE
Specialist II

I think I know what's going on.

But first: what percentage (or parts per million) of the timer expiries is "2 to 8 times per day"? How infrequent is this?

And what is the expiry rate of the Slice Timers? Since I have your Slice Timer setup code I should be able to work that out myself, except there's nothing in the Slice Timer chapter or the Clocking chapter that actually says which clock goes into the Slice Timer! At least not where I can quickly find it. I expect to find a "clock tree diagram" in the manual like there is in most other ones, but in this manual - nada, zip.

So I'm guessing it is the XLB clock, probably running at 100MHz, which for a 32-bit counter means it times out at 0.023Hz, or every 42.9 seconds. That's about 2011 times per day, so "2 to 8 per day" is 0.1% to 0.4%.
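Spelling that arithmetic out (all on the 100MHz assumption):

    2^32 / 100e6  = 42.9 seconds per expiry
    86400 / 42.9  = ~2011 expiries per day
    2 / 2011 = ~0.1%,   8 / 2011 = ~0.4%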

How's my guessing going so far?

So here's what I think is happening. The timer expires, sets the "ST" bit and that works its way through the interrupt controller to be gated with the masks, combined with other interrupts and priority encoders to make the interrupt request. At an appropriate point in the CPU's execution, this causes the CPU to start the exception sequence, perform the "Space Read" on the bus and get the vector from the interrupt controller. Then it runs the service routine.

Meanwhile the hardware interrupt request is still active.

Most interrupt service routines read the status, clear the request, and then have a whole lot of work to do, accessing whatever data caused the interrupt, pushing it into ring buffers and triggering threads. So the code guarantees a minimum time between when the request is TOLD to go away and when the interrupt routine returns.

Then, the last thing (really, literally the LAST thing) your code in the service routine does is to launch a write back to the Slice Timer to make the interrupt request go away. On these CPUs, a write cycle to a peripheral can take 10 to 20 CPU clocks to execute. The Slice Timer looks to be in a "faster clock domain" than the usual slow peripherals, but it is going to take a while.

When the write finally executes, the hardware request is removed, and that removal has to ripple back through the interrupt controller BEFORE the CPU has executed the RTE and restored its IPL.

If that hasn't happened yet you'll get THE Spurious Interrupt.

In the usual parlance, I think you need a "Write Barrier Instruction" after the write to "MCF_SLT_SSR0".

Or maybe swapping that instruction and "++counter_wraps" will make the problem go away.
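That reordering would look like this: the same handler, just with the clear done first so the write gets a head start before the RTE:

__declspec(interrupt) void slice_time_counter()
{
    MCF_SLT_SSR0 = MCF_SLT_SSR_ST; /* clear the request as early as possible */
    ++counter_wraps;               /* then count the overflow */
}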


Remember this CPU is "lightly superscalar" and can do some operations in parallel, or at least one-per-clock.

To try and prove this without having to wait for a day, I'd suggest dropping the slice timer timeout so it expires at least 1000 times faster, or more. That should let you see within a few seconds whether the problem is still happening.
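For example, assuming the 100MHz guess above is right, a much smaller terminal count in the setup code would do it:

MCF_SLT_SLTCNT0 = 0x003FFFFF; /* ~2^22 counts: ~1024 times faster, one expiry every ~42ms at 100MHz */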

So why is this intermittent? Why does the "declspec" make it happen more? I suspect that has something to do with either the instruction that was interrupted (unlikely) and/or the caches.

If the caches are flushed, then the ISR takes a while to read in the instructions before it can execute them. If the ISR code is still in the instruction cache when the interrupt happens, maybe it can get from the last instruction to the interrupt return faster, and fast enough to trip the hazard. The cache lines are flushed "randomly" so maybe about one in 500 times the cache line with the service routine got "lucky" and didn't get overwritten.

The execution speed of "++counter_wraps" will depend on whether that variable is still in the data cache or whether it has to be fetched from main RAM. Ditto "lucky cache preservation".

The "declspec" inserts an instruction that takes FOUR CPU clocks, and during that time the CPU can probably load up its instruction pipeline further than when that instruction isn't there. A "NOP" (6 clocks) or one or more "TPF" instructions might have the same effect.

I think the proper fix for your problem is to add a "NOP" after the write to MCF_SLT_SSR0, or to read back that register (to guarantee write completion). The syntax for the read-back is usually just "MCF_SLT_SSR0;", as long as the register macro is "volatile". Note that on ColdFire "NOP" isn't a plain "No Operation": "TPF" is the true no-op, while "NOP" also flushes the pipeline.
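A minimal sketch of the read-back version, assuming MCF_SLT_SSR0 expands to a volatile register access (if it doesn't, an inline-assembly NOP after the write would do the same job):

__declspec(interrupt) void slice_time_counter()
{
    ++counter_wraps;
    MCF_SLT_SSR0 = MCF_SLT_SSR_ST; /* write 1 to ST to clear the request */
    (void)MCF_SLT_SSR0;            /* read back: stalls until the write has
                                      actually reached the Slice Timer */
}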

Tom

TomE
Specialist II

There are two "usual" causes of spurious interrupts on 68k and ColdFire chips.

The first is disabling an interrupt at the peripheral or in the interrupt controller's mask without first raising the IPL high enough to block it. The hazard happens when the interrupt asserts very close to the point where it is disabled. "Spurious" means the CPU asked for a vector during the IACK and none was forthcoming.

The second is getting the MCF52-series interrupt controller programming wrong and not giving every interrupt a unique "priority". This was fixed in the MCF53 series by hard-wiring the priority levels. So which model does the MCF54 follow? The tricky one: the "ICRn" registers have both "IL" and "IP" fields, and "13.2.1.6 Interrupt Control Registers 1–63 (ICRn)" warns about getting this wrong. That can be nasty, as it can generate wrong interrupts (usually the logical OR or AND of the vectors).
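As a sanity check, something like this keeps the level/priority encoding visible at each assignment. The IL-in-bits-5:3, IP-in-bits-2:0 layout matches the 0x39 (level 7, priority 1) value in your setup code, but verify it against 13.2.1.6 for your part:

/* pack an interrupt level (1-7) and priority (0-7) into an ICRn value;
   every enabled source on one controller should get a unique pair */
#define ICR_VALUE(level, priority) ((unsigned char)(((level) << 3) | (priority)))

MCF_INTC_ICR54 = ICR_VALUE(7, 1); /* SLT0: level 7, priority 1 */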

It is unlikely that this is your problem, but I'd check it anyway.

Another possibility is that you've got some mainline code accessing the "slice timer" and doing something that affects its interrupt, either disabling it, reprogramming it or writing the status bit. Do you have that anywhere? Good programming practice requires these actions to be performed with the IPL raised. Even if you have code doing that, it isn't going to work here, as this interrupt is at IPL7 and thus both edge-sensitive and un-maskable. So the CPU may be handling the interrupt at the very moment it is being cleared.
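For lower-level interrupts the usual pattern is a sketch like this, using a set_ipl()-style helper (hypothetical here; Freescale example code often provides one called asm_set_ipl(), otherwise it's a few lines of assembly that write the SR and return the old level). As noted, it cannot protect against an IPL7 source:

/* hypothetical helper: raise the IPL field in the SR to 'level' and
   return the previous level */
extern int asm_set_ipl(int level);

void stop_slice_timer(void)
{
    int old_ipl = asm_set_ipl(6);   /* blocks IPL1-6 sources; an IPL7
                                       source stays un-maskable */
    MCF_SLT_SCR0 = 0;               /* stop the timer */
    MCF_SLT_SSR0 = MCF_SLT_SSR_ST;  /* discard any pending request */
    asm_set_ipl(old_ipl);           /* restore the previous level */
}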

Using IPL7 is tricky, dangerous, and best avoided unless absolutely necessary. The fact that it is un-maskable removes all the normal "safety barriers", and disables a lot of normal programming practices. If you can, I'd drop it to IPL6.

Or if you're using IPL7, make sure you only have one of them or you'll have IPL7 interrupt routines interrupting each other, with the risk of going recursive. Do you have this?

If you're using IPL7 because you've run out of unique interrupt levels and priorities (the "IL" and "IP" fields), remember that you CAN have two interrupts with identical "IL" and "IP" values as long as they're on different interrupt controllers! That doubles the number of combinations you'd otherwise expect: 7 levels x 8 priorities x 2 controllers gives about 112, or 96 if you exclude all the IPL7 ones, for the 63 interrupt sources.

The first interrupt cycle is "special" in that it can't be interrupted. That's there so you can put a "move.w #0x2700, sr" there to force "single level interrupts" if that's what you need. You really don't want it here, though: that instruction takes FOUR clocks, which means it must be messing with the internal execution pipeline. The extra four clocks might be what is causing the problem, by making the service routine longer and exposing a timing hazard somewhere.

You may have found a bug where writing to the Status Register somehow re-enables interrupts, and you're getting a second IPL7 interrupt from that source or from a different one.

Are you using the MMU? Could it be involved in this problem?

Let us know what you find.

Tom

kevb
Contributor I

Hi Tom,

At first we did indeed have a second interrupt at the same level and priority, but changing that did not help. The slice timer IRQ is never disabled anywhere else; once unmasked it runs all the time.

We often write to the status register to disable interrupts for "atomic" operations; that is why we made the SLT0 IRQ non-maskable. We use it for timing measurements, so level 7 is not absolutely necessary. What I did not mention in my original post is that using another level did not help either: we tried levels 1, 3 and 5 and always got the spurious interrupt. Only replacing __declspec with #pragma helped.

TomE
Specialist II

> This is why we set the SLT0 IRQ to non-maskable. We use it to do timing measurements.

So you're reading the SLT0's SCNT0 register to get a timestamp, and using "counter_wraps" to extend it to 64 bits and beyond the (approximately) 43-second wrap. How do you detect "wrap during read"? Are you reading "counter_wraps" twice and retrying in a "do ... while" construct?
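If not, a wrap-safe read usually looks something like this sketch. It assumes the count register macro is MCF_SLT_SCNT0, that the timer counts down from 0xFFFFFFFF, and that counter_wraps is declared volatile:

unsigned long long read_time(void)
{
    unsigned long wraps, count;

    do {
        wraps = counter_wraps;
        count = MCF_SLT_SCNT0;        /* down-counter: 0xFFFFFFFF .. 0 */
    } while (wraps != counter_wraps); /* retry if an overflow sneaked in */

    /* turn the down-count into an up-count and extend with the wrap count */
    return ((unsigned long long)wraps << 32) | (0xFFFFFFFFUL - count);
}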

I use the DMA Timer on MCF52 and MCF53 chips to do the same thing, conveniently set to count at 1MHz. The quickest way to code the timing is

uint32_t t1 = - MCF_TIMER_DTCN1;  /* negate the starting count ... */

/* ... the something that I'm timing ... */

t1 += MCF_TIMER_DTCN1;            /* ... so adding the ending count leaves
                                     t1 = end - start elapsed ticks */

With a 1MHz timer I don't have to worry about it overflowing for over 71 minutes.

Have you noticed that "Table 12-4. SCNTn Field Descriptions" says:

Timer count. GPIO output bit set. Provides the current state of the timer counter.

"GPIO output bit set"? Where did that get pasted in from? It is also in "Table 12-2. STCNTn Field Descriptions". I can usually trace through where these things came from (which bits of other manual got incorrectly copied) by searching on NXP's site or performing an "exact match" search on Google. In this case that phrase only shows up in the MCF547x and MCF548x manuals, and surprisingly, Google doesn't get any matches, anywhere on the planet. It will probably find the phrase in this post after I've sent it.

Tom
