Hi All,
I have a bug that occurs around once every 15 million operational hours, and we cannot reproduce it on the bench. The only thing I know is that once the hardware gets into this "state" it reboots continually every 3 seconds, so I believe it's the Backup Watchdog Timer firing (one of the first things I turn on), as that is programmed to fire at 3.3 seconds. Also, every 120 seconds an external watchdog holds reset low for 100 ms if the software is not running correctly. Neither of them recovers the processor back to normal running. The only way to recover is to remove power for a second and put it back on.
Anyway, the question is: how do I clear all pending interrupts? I have read the reference manual and it's a little confusing how the interrupts work. My understanding is that the processor powers up with global interrupts disabled and MQX enables them at some point during runtime. My concern is that some interrupt ends up firing that is not handled (I do have an unhandled-interrupt handler that requests a reset, which could also possibly be the reset cause). Can I manually clear all interrupts before MQX boots? I do disable all the debug pins already.
The plan is to configure all the IO to a known state, clear pending interrupts, then start the BSP init code for MQX. Any other suggestions?
Problem solved!
If you have things like EEPROMs or other I2C devices (24LC64 in my case), be sure to manually send a couple of clock pulses on power-up. This covers the case of a slave device that is halfway through writing data and waiting for a clock pulse. The clock pulses will release SDA so the I2C bus can work again; otherwise, no matter how many times you reset, SDA will stay low and nothing will ever happen.
Still working on a solution, time to start setting IO so I can see what part of the software is causing it to get stuck in the endless loop of resetting...
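A minimal sketch of that recovery in C, not the poster's actual code. The pin helpers (scl_set, sda_set, sda_get, half_bit_delay) are hypothetical names; here they drive a tiny simulated slave so the logic can be run on a PC. On real hardware they would wrap GPIO accesses, with SCL and SDA temporarily configured as open-drain GPIO instead of I2C pins. The limit of 9 pulses is the usual I2C "bus clear" figure, an assumption rather than something taken from this board.

```c
#include <assert.h>
#include <stdbool.h>

/* --- Simulated bus, standing in for real GPIO registers (hypothetical) --- */
static bool     sim_scl = true;         /* current SCL level                  */
static bool     sim_sda_master = true;  /* level the master drives on SDA     */
static unsigned sim_slave_bits_left;    /* slave holds SDA low until this many
                                           clocks have been seen              */
static unsigned sim_clock_pulses;       /* SCL rising edges counted so far    */

static void scl_set(bool level)
{
    if (level && !sim_scl) {            /* rising edge on SCL */
        sim_clock_pulses++;
        if (sim_slave_bits_left > 0)
            sim_slave_bits_left--;
    }
    sim_scl = level;
}

static void sda_set(bool level) { sim_sda_master = level; }

/* Open-drain wire-AND: SDA reads high only if nobody pulls it low. */
static bool sda_get(void) { return sim_sda_master && sim_slave_bits_left == 0; }

static void half_bit_delay(void) { /* on target: wait about half an SCL period */ }

/* --- Bus recovery ------------------------------------------------------- */
/* Clock SCL until the slave releases SDA (at most max_pulses times), then
 * issue a STOP. Returns true if SDA is high afterwards. */
bool i2c_bus_recover(unsigned max_pulses)
{
    assert(max_pulses > 0);

    sda_set(true);                      /* master releases SDA */
    scl_set(true);

    for (unsigned i = 0; i < max_pulses && !sda_get(); i++) {
        scl_set(false);
        half_bit_delay();
        scl_set(true);
        half_bit_delay();
    }

    if (!sda_get())
        return false;                   /* still stuck: needs a power cycle */

    /* STOP condition: SDA goes low to high while SCL is high. */
    scl_set(false);  half_bit_delay();
    sda_set(false);  half_bit_delay();
    scl_set(true);   half_bit_delay();
    sda_set(true);   half_bit_delay();

    return sda_get();
}
```

Calling something like i2c_bus_recover(9) before the I2C driver is initialised, on every boot, means a watchdog reset that lands in the middle of a transfer cannot leave the bus locked.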
Could you change the Subject to make it describe your problem better?
How do you clear all pending interrupts?
They should (meaning "must") clear when the CPU resets. That's a "hard reset".
If MQX is attempting a "soft restart" as a result of a watchdog interrupt, then "there's your problem".
But the BWT (Backup Watchdog Timer) and your other external watchdog both drive Reset. That should be guaranteed to reset the CPU.
These "hard resets" will reset all the masking bits in the Interrupt Controller, and more importantly, they will reset all of the internal devices that can cause the interrupts.
While on the subject of interrupts, a big trap on all of the MCF52 series chips is that you have to make sure you obey the Note on "16.3.6 Interrupt Control Registers (ICRnx)":
"It is the responsibility of the software to program the ICRnx registers with unique and non-overlapping level
and priority definitions. Failure to program the ICRnx registers in this manner can result in undefined
behavior."
That is very important, but slightly wrong. There are two interrupt controllers in there, "INTC0" and "INTC1". You only have to guarantee unique Level and Priority values within one interrupt controller. It is OK to have an interrupt in INTC0 that has the same priority and level as in INTC1.
I can see three possibilities if the CPU is actually wedged (which I don't think it is, see later).
Maybe you have some external hardware on the board that isn't being reset by the internal watchdog (which drives RSTOUT) or by the external watchdog (which drives RSTIN, and then the CPU drives RSTOUT). So maybe the external hardware is stuck in a state that causes the software to lock up and crash.
Maybe the crystal didn't start, or is running at a harmonic. Crystals are tricky things, and it is very easy to get the design wrong. You should google for "crystal margin test" which finds this good description:
http://www.nxp.com/assets/documents/data/en/application-notes/AN3208.pdf
More likely is that the CPU is locked up and wedged. Maybe it got some ESD and has now got a combination of junctions turned on that simply can't be fixed without a complete power removal. Maybe you've got undershoot and overshoot on external buses that have got the CPU or an external chip into a bad state. Do you have anything on the mini-flexbus?
I think you're on the wrong track with interrupts. After a Reset there can't be any interrupts enabled. So there's nothing to reset before MQX starts.
Do you have a loader or bootstrap that runs first? Does it turn any interrupts on and maybe leave them on by accident?
Do you have any external indication that the software is actually running? Can you change the code so it turns a LED on or something very early on (and maybe blinks it before it starts MQX and have code in your Application that flashes it after MQX has started) so that when it happens in the field you know the difference between "CPU is dead" and "code is locked up"?
The backup Watchdog is DISABLED by power-on reset. So if it is resetting the CPU then the CPU must have enabled it. So the CPU is at least running the code that enables the backup Watchdog.
The problem with the watchdog is that it doesn't leave much of a record as to why it bit. It DOES set the BWT bit in the RSR though. You can test this when your code starts up and then do "something different" if you know this has happened. One thing you can do that is useful is to have some periodic interrupt (like a timer interrupt) "look back" on the Stack and copy the interrupted program counter to a reserved location in the SRAM. This is "where the code was when the timer last went off". Then on reset, if RSR[BWT] is set you print that stored value, save it to EEPROM or send it somewhere. If you can get that program counter value it might tell you where the code was stuck on the last interrupt before the reset happened. Looking at the source it might be obvious why it was stuck.
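A rough sketch of that "breadcrumb" idea in C, under stated assumptions. All the names (pc_crumb_t, crumb_record, crumb_fetch_after_reset) are made up for illustration. RSR_BWT_MASK is an assumed bit position and must be checked against the Reset Status Register description in the reference manual. The sketch also assumes the SRAM contents survive a watchdog reset and that the variable is placed in a section the startup code does not clear. How the timer ISR obtains the interrupted program counter from the exception stack frame is target-specific and not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RSR_BWT_MASK 0x10u        /* ASSUMPTION: verify in the reference manual */
#define CRUMB_MAGIC  0xC0DEFACEu  /* arbitrary marker for "record is valid"     */

typedef struct {
    uint32_t magic;    /* CRUMB_MAGIC when a record has been written */
    uint32_t last_pc;  /* program counter seen by the last timer ISR */
    uint32_t check;    /* CRUMB_MAGIC ^ last_pc, guards against junk */
} pc_crumb_t;

/* On target: place this in a no-init SRAM section (linker-specific). */
static pc_crumb_t crumb;

/* Call from the periodic timer ISR with the interrupted PC. */
void crumb_record(uint32_t interrupted_pc)
{
    crumb.last_pc = interrupted_pc;
    crumb.check   = CRUMB_MAGIC ^ interrupted_pc;
    crumb.magic   = CRUMB_MAGIC;
}

/* Call once at startup with the value read from RSR. Returns true and fills
 * *pc_out only if the reset was caused by the backup watchdog and a valid
 * record exists. The record is consumed either way. */
bool crumb_fetch_after_reset(uint8_t rsr, uint32_t *pc_out)
{
    assert(pc_out != NULL);

    bool valid = (rsr & RSR_BWT_MASK) != 0
              && crumb.magic == CRUMB_MAGIC
              && crumb.check == (CRUMB_MAGIC ^ crumb.last_pc);

    if (valid)
        *pc_out = crumb.last_pc;

    crumb.magic = 0;  /* do not report the same record twice */
    return valid;
}
```

At startup the returned value can be printed, written to EEPROM, or sent somewhere, and then looked up in the linker map file to see which function the code was in.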
If that doesn't do it (like it was stuck somewhere with all interrupts disabled), then you add a software watchdog. I would suggest that you program a timer at the same time as programming the BWT, set to a time a bit shorter than the BWT. Program it to interrupt at IPL7. That's an unmaskable interrupt. Change your watchdog patting code so it pats BOTH of them. I think you can set up the "Core Watchdog" to do this (generate an interrupt). Then if the soft one goes off you know the hard one is about to go off and the code is stuck somewhere. Dump the stack to somewhere non-volatile so you can get it back and inspect it later.
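The ordering of the two watchdogs can be illustrated with a small host-side model in C. This is only a model of the timing, not driver code: dual_wdt_t and its functions are hypothetical, the 3300 ms figure is the BWT period mentioned earlier in the thread, and the 3000 ms soft timeout is an arbitrary choice for "a bit shorter". On target, the point where soft_fired is set corresponds to the IPL7 timer interrupt, which is where the stack would be dumped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HARD_TIMEOUT_MS 3300u  /* BWT period quoted in the thread          */
#define SOFT_TIMEOUT_MS 3000u  /* ASSUMPTION: any value a bit shorter works */

typedef struct {
    uint32_t soft_left_ms;  /* time until the IPL7 "warning" interrupt */
    uint32_t hard_left_ms;  /* time until the BWT resets the CPU       */
    bool     soft_fired;
    bool     hard_fired;
} dual_wdt_t;

/* Pat BOTH watchdogs together, as the application's service code should. */
void dual_wdt_pat(dual_wdt_t *w)
{
    assert(w != NULL);
    w->soft_left_ms = SOFT_TIMEOUT_MS;
    w->hard_left_ms = HARD_TIMEOUT_MS;
}

/* Advance simulated time by elapsed_ms without any pat in between. */
void dual_wdt_tick(dual_wdt_t *w, uint32_t elapsed_ms)
{
    assert(w != NULL);
    w->soft_left_ms = (elapsed_ms >= w->soft_left_ms) ? 0 : w->soft_left_ms - elapsed_ms;
    w->hard_left_ms = (elapsed_ms >= w->hard_left_ms) ? 0 : w->hard_left_ms - elapsed_ms;

    if (w->soft_left_ms == 0)
        w->soft_fired = true;  /* on target: IPL7 handler dumps the stack here */
    if (w->hard_left_ms == 0)
        w->hard_fired = true;  /* on target: the CPU is reset here             */
}
```

Because both timers are reloaded together and the soft one is shorter, the soft handler always gets a window (300 ms in this model) to save diagnostics before the hard reset arrives.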
Good luck.
Tom
Thank you for the reply. I am working on debugging my code now. I can easily reproduce the "1 in 15 million hour" bug by heating my board up to 80C then it gets stuck in this loop.
I do have bootstrap and boot loader code before MQX, which I know runs as I can see the SRAM on the Flexbus being read just before MQX boots...
Thank you for ruling out the interrupts.
I should get to the bottom of this in the next 24 hours and will post the issue.
80C! Now that's hot.
Have you checked SECF194 and SECF195 in the Chip Errata? SECF194 may match your problem.
Is your design causing any "Injection Current" into the CPU's pins? That can get worse at higher temperatures as the diode voltages drop.
I seem to remember the following problem was temperature sensitive too:
https://community.nxp.com/message/66171
Have you got the pins configured as "High Drive Strength" (10mA) or "Low Drive Strength" (2mA)? Try the other setting.
Tom