Coldfire, Interrupts and the "Halting problem"

TomE · ‎05-07-2015

I've been working on a nasty problem on an MCF5235 for about 2 weeks, and what I've found might help others.

If you don't know the background to "The Halting Problem" you might want to read up on it a bit. It isn't really about "halting" as such, but about whether you can decide, by inspection, if a particular program will run the way you expect it to.

I've just found a simple class of program that doesn't run the way I'd expect it to on the MCF5235, and what it actually did resulted in stack corruption and crashes.

It could also result in a buggy program running, but really S L O W L Y and without any obvious reason for it running like that.

Here's an example program. The critical part of this is that it triggers an interrupt, but then doesn't clear it. For simplicity this code is using the "MCF_INTC0_INTFRCL" Interrupt Forcing registers, but forgetting to clear a Peripheral Interrupt would do the same thing.

So, by inspection, what would you experienced Coldfire programmers expect to happen when I run the following code? All 29 of you following this forum apparently.

When I've accidentally done this on other CPUs I've used, the code ended up stuck continuously executing the interrupt service routine, and the watchdog usually went off some time later.

volatile int nStuckCounter = 0;

__attribute__((interrupt_handler))
static void stuck_int(void)
{
    /* INT_UNFORCE(int_task_switch); */   /* deliberately left out, so the interrupt stays asserted */
    nStuckCounter += 1;
}

void test(void)
{
    int i;
    int counts[40];   /* snapshots of nStuckCounter, one per pass of the loop */

    INT_SET_VECTOR(int_task_switch, stuck_int);

    printf("Prior to Force, nStuckCounter = %d\n", nStuckCounter);

    INT_FORCE(int_task_switch);

    /* Counting loop: this is the loop shown in the disassembly */
    for (i = 0; i < 40; i++) {
        counts[i] = nStuckCounter;
    }
    for (i = 0; i < 40; i++) {
        printf("Loop %d, nStuckCounter = %d\n", i, counts[i]);
    }

    INT_UNFORCE(int_task_switch);

    printf("After Unforce, nStuckCounter = %d\n", nStuckCounter);
}

Note:

If it locked up like I expected I wouldn't be asking.
In the Interrupt Service Routine the proper thing to do is to clear the "Force" using the "INT_UNFORCE()" macro, but I've deliberately commented that out.
"INT_FORCE", "INT_UNFORCE" and "INT_SET_VECTOR" work as expected. They're not important and not a problem here.
"printf()" works as expected (polled printing to the serial port)

For "bonus points", here's the disassembly of the counting loop above, and with that information you should be able to tell me what the printed "nStuckCounter" values will be...

I would have pasted the disassembly straight in, but after years of complaining about not being able to simply "paste code", and TWO "updates", this forum STILL can't do that. It tries to paste the disassembly into a table, and then word-wraps and mangles it!

After several more tries (you can't see how long this took), I had to manually remove all the spaces, paste it in and then reinstate all the spaces. Now it looks OK.

80101eea: 280e            movel %fp,%d4
80101ef8: 0684 ffff ff60  addil #-160,%d4

for (i = 0; i < 40; i++) {
    counts[i] = nStuckCounter;
}
801024d4: 2044            moveal %d4,%a0
801024d6: 2039 8080 003c  movel 8080003c <nStuckCounter>,%d0
801024dc: 20c0            movel %d0,%a0@+
801024de: bdc8            cmpal %a0,%fp
801024e0: 66f4            bnes 801024d6 <main+0xb34>

I'll post the printout I have next week.

More bonus points. Where, and in what manual is this behaviour documented?

Tom

TomE · ‎05-12-2015

I was hoping that someone would have a try at this one. There are some important consequences, and there may be a lot of devices out there with these problems.

This is what the inner loop compiles to:

80101eea: 280e movel %fp,%d4
80101ef8: 0684 ffff ff60 addil #-160,%d4
for (i = 0; i < 40; i++) {
    counts[i] = nStuckCounter;
}
801024d4: 2044 moveal %d4,%a0
801024d6: 2039 8080 003c movel 8080003c <nStuckCounter>,%d0
801024dc: 20c0 movel %d0,%a0@+
801024de: bdc8 cmpal %a0,%fp
801024e0: 66f4 bnes 801024d6 <main+0xb34>

The inner loop consists of four machine code instructions. The output is thus:

Prior to Force, nStuckCounter = 0 
Loop 0, nStuckCounter = 3 
Loop 1, nStuckCounter = 7 
Loop 2, nStuckCounter = 11 
Loop 3, nStuckCounter = 15 
Loop 4, nStuckCounter = 19 
Loop 5, nStuckCounter = 23 
... Loops 6 to 34 exactly as you'd expect ...
Loop 35, nStuckCounter = 143 
Loop 36, nStuckCounter = 147 
Loop 37, nStuckCounter = 151 
Loop 38, nStuckCounter = 155 
Loop 39, nStuckCounter = 159 
After Unforce, nStuckCounter = 208841

Even though the interrupt routine is "solidly stuck", the mainline gets to execute ONE instruction between each interrupt.
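That explanation can be checked against the printout with a minimal host-side model (plain C on a PC, not target code). It assumes exactly one pass through the stuck handler after every mainline instruction, a four-instruction loop as in the disassembly, and three interrupts taken between the force and the first load. That last number is simply read off the "Loop 0" line. The names `simulate_stuck_interrupt`, `retire_one_instruction`, `PREAMBLE_INSNS` and `LOOP_INSNS` are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define LOOP_INSNS     4   /* movel (load), movel (store), cmpal, bnes */
#define PREAMBLE_INSNS 3   /* assumption: read off the "Loop 0" printout */

/* Model of the stuck handler: it only counts, it never clears the force. */
static int model_counter;

/* One mainline instruction retires, then the pending interrupt is taken. */
static void retire_one_instruction(void)
{
    model_counter += 1;
}

/* Fill counts[0..n-1] with the values the load instruction would see. */
static void simulate_stuck_interrupt(int *counts, size_t n)
{
    size_t i;
    int k;

    assert(counts != NULL);
    model_counter = 0;

    /* Instructions executed between the force and the first load. */
    for (k = 0; k < PREAMBLE_INSNS; k++) {
        retire_one_instruction();
    }

    for (i = 0; i < n; i++) {
        counts[i] = model_counter;          /* the load samples the counter */
        for (k = 0; k < LOOP_INSNS; k++) {  /* one interrupt per instruction */
            retire_one_instruction();
        }
    }
}
```

Called with n = 40, the model fills the array with 3, 7, 11 and so on up to 159, which is the same sequence as the printout.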

So what harm could that do?

On pretty much any other CPU, if you have a stuck interrupt, you find out about it really quickly as the device locks up solid in the interrupt routine. It doesn't get out of the door with a bug like that.

With THESE CPUs, the code still runs! It runs between 10 and 100 times slower than usual, depending on how many instructions the interrupt service routine has in it, but it RUNS. It runs slower than when you forget to enable the Program Cache (in the higher end and faster chips that have that). So if you're wondering why your code is running slower than you think it should be, maybe you have a stuck interrupt. Or you should turn the cache on. Or both.
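As a rough back-of-envelope model for that slowdown (an assumption, not a measurement): if every mainline instruction is followed by one complete pass through the stuck handler, the stretch factor is about (mainline cost + handler cost) / mainline cost. The function name and the cycle counts are hypothetical; real figures depend on exception entry, the handler body, the RTE and the memory the code runs from.

```c
#include <assert.h>

/*
 * Estimated slowdown with a stuck interrupt.
 *   main_cycles: average cycles per mainline instruction (assumed)
 *   isr_cycles:  cycles for exception entry + handler body + RTE (assumed)
 */
static double stuck_irq_slowdown(unsigned main_cycles, unsigned isr_cycles)
{
    assert(main_cycles > 0);
    return (double)(main_cycles + isr_cycles) / (double)main_cycles;
}
```

In this model a handler path costing 9 times an average mainline instruction gives a factor of 10, and one costing 99 times gives a factor of 100.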

I found this due to a different problem. The code it was running was a simple multi-threaded system using setjmp() and longjmp(), but with preemption from a high priority timer interrupt, that forced a lower priority interrupt to perform the context switch. The secondary thread looks like this (example, WAY simplified):

static void task_loop(sTask_t *a_psTask)
{
    while (true)
    {
        a_psTask->eState = TASK_STATE_RUN;   /* Allow switching */
        (*(a_psTask->pTaskFunc))(a_psTask->pUserRef);
        a_psTask->eState = TASK_STATE_IDLE;  /* Disallow switching */
        if (setjmp(a_psTask->sContextTask) == 0)
        {
            longjmp(a_psTask->sContextMain, 1);
        }
    }
}

When the timer goes off it checks "psTask->eState", and only forces a switch if it is "TASK_STATE_RUN". "TASK_STATE_IDLE" means "don't switch, I'm about to do it myself".

The higher priority interrupt checked for that and scheduled the lower priority one. It all went wrong if there were a bunch of medium priority interrupts (Ethernet, CAN, serial port) that got in between those other two interrupts. They allowed the code above to step into the "setjmp()" at one instruction per intervening interrupt. When the second interrupt went off it called "setjmp()" in the middle of the above call, which corrupted the context and later caused a stack corruption.

The simple fixes were to have the second interrupt check "psTask->eState" or to disable interrupts completely around the switch.
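The first of those fixes can be sketched as follows. This is a simplified, host-compilable model, not the real handlers: `eTaskState_t`, `sModelTask_t`, `timer_isr_body()`, `switch_isr_body()` and the `bSwitchForced` / `nSwitches` fields are invented for the sketch, and the increment stands in for the real setjmp()/longjmp() switch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TASK_STATE_IDLE, TASK_STATE_RUN } eTaskState_t;

typedef struct {
    volatile eTaskState_t eState;
    bool bSwitchForced;   /* stands in for INT_FORCE(int_task_switch) */
    int  nSwitches;       /* stands in for the real context switch */
} sModelTask_t;

/* High priority timer: only request a switch while the task allows it. */
static void timer_isr_body(sModelTask_t *psTask)
{
    assert(psTask != NULL);
    if (psTask->eState == TASK_STATE_RUN) {
        psTask->bSwitchForced = true;
    }
}

/* Low priority switch interrupt: returns true if it actually switched. */
static bool switch_isr_body(sModelTask_t *psTask)
{
    assert(psTask != NULL);
    if (!psTask->bSwitchForced) {
        return false;
    }
    psTask->bSwitchForced = false;   /* always clear the force */

    /* The fix: the task may have crept forward since the timer looked. */
    if (psTask->eState != TASK_STATE_RUN) {
        return false;                /* task is about to yield by itself */
    }

    psTask->nSwitches += 1;          /* placeholder for the context switch */
    return true;
}
```

If the task sets TASK_STATE_IDLE between the two interrupts, the second handler now clears the force and returns without switching, instead of switching in the middle of the task's own setjmp().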

So where's this in the Reference Manual? Nowhere that I can find. The CFPRM's documentation of "RTE" says that it restores the state, but doesn't say if a subsequent interrupt happens before or after that instruction is executed. The way it works may be a side-effect of this feature:

ColdFire Family Programmer’s Reference Manual, Rev. 3, Chapter 11 "Exception Processing", Section 11.1 "Overview":

"ColdFire processors inhibit sampling for interrupts during the first instruction of all exception handlers. This allows any handler to effectively disable interrupts, if necessary, by raising the interrupt mask level in the SR."

It seems the same also applies to the first instruction executed after any exception return.

Don't trust the "Auto Save" on this forum. It just lost me an hour's work and the "Auto Save" had only saved the first minute or two. Fortunately I copy/save to the clipboard regularly, and that saved me from having to type it all in again.

Tom
