Lock Up Problem with 9S12A64 Part

tocard · ‎12-22-2006

I am using a 9S12A64 part on a custom board to control all aspects of a commercial dishwasher. I have been experiencing a non-cyclic repeating, but intermittent, problem where the control seems to lock up for no apparent reason. By “lock up”, I mean that no processing _seems_ to be occurring, any outputs that were high stay high; any that were low stay low; there is no communication out the SCI ports; other scheduled events do not seem to occur. (It is analogous to what would happen if you removed the clock completely; though the problem does not seem to be loss of clock, as an oscilloscope shows both the 4.1MHz crystal and 24.6MHz PLL waveforms.)

I am using the internal COP and it does not always reset the processor. Sometimes, I’ve gotten reports from the field that the machine “turned itself off” (which is what would happen after a COP reset). But much more often, I receive the complaint that the system and screen froze with no way to continue without shutting off main power at the wall.

Furthermore, the problem seems to occur at random intervals; though, it does repeat at all of the field sites that we are currently testing. The longest time we’ve seen between failures in the field is approx. 25 days. The shortest was just under 1/2 hour. The mean seems to be between 12 and 14 days. We have not been able to find any way to repeat the failure in the lab, save for just running a machine continuously for some amount of time (typically, the 12-14 days). Once the failure occurs, the only way to continue is to toggle the micro’s /RESET line low or completely remove power to the machine.

We have been and still are in communication with Freescale AEs, but would like your input, too. We have tried most of the basic things without being able to reproduce the problem on demand (including physically removing the crystal and external PLL circuitry, forcing an infinite software loop, etc.). We would appreciate any feedback you might have, be it vanilla or exotic.
Also, we do not know if the problem is hardware or software. All of our standard hardware tests (EMI/RFI, UL, etc.) have passed without incident.

Has anyone heard of a lock-up problem like this? Does anyone have any suggestions on a possible cause or solution? This is the first time we’ve used any of the S12 parts and we don’t have any experience with it. (We have extensive knowledge of some of Freescale’s 8-bit parts, however).

---------------------------------------
To give a little more detail:

I am using several of the A/D channels, both SCI ports, the PLL set to 24.576MHz (using a 4.096MHz crystal), and the internal watchdog/COP (set to a 4.096-second timeout period) on the microcontroller. I also use the onboard EEPROM to hold some system setup parameters and a log of recent errors that have occurred in the system. The machine itself uses a variable frequency inverter drive (VFD) as part of its operation. There is 1 RS232 serial display connected to the control (via 1 of the SCI ports) and 1 RS485 communication line connected to the VFD through an RS232/RS485 transceiver connected to the other SCI port.
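For anyone checking the numbers, here is a small host-side sketch of the clock arithmetic. It assumes the usual S12 CRG relationships (PLLCLK = 2 x OSCCLK x (SYNR+1)/(REFDV+1), and a COP timeout counted in OSCCLK cycles as a power of two). The function names are made up for illustration, and SYNR=2 / REFDV=0 is only one register combination that would give 24.576MHz, not necessarily the one used on this board.

```c
#include <assert.h>
#include <stdint.h>

/* PLL output in Hz, assuming PLLCLK = 2 * OSCCLK * (SYNR + 1) / (REFDV + 1). */
uint32_t pll_clock_hz(uint32_t osc_hz, uint8_t synr, uint8_t refdv)
{
    return (uint32_t)((2ull * osc_hz * (synr + 1u)) / (refdv + 1u));
}

/* COP timeout in microseconds, assuming the COP counts 2^exponent
 * oscillator cycles before it times out. */
uint64_t cop_timeout_us(uint32_t osc_hz, unsigned exponent)
{
    assert(osc_hz != 0 && exponent < 32);
    return ((1ull << exponent) * 1000000ull) / osc_hz;
}
```

With a 4.096MHz crystal, pll_clock_hz(4096000, 2, 0) works out to 24,576,000 Hz, and a 2^24-cycle COP period works out to 4,096,000 microseconds, which matches the 4.096-second timeout quoted above.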
I am using a 10ms real-time interrupt to schedule tasks in a super-loop. I use the PLL lock interrupt to begin (or stop) using the PLL upon a lock. I use both SCI interrupts to asynchronously control the flow of data into/out of the ports. All of the other unused interrupts cause the control to reset by loading the reset vector location into the index register X and jumping to that location:
LDX 0xFFFE
JMP 0,X

Finally, I have been authorized to bring in outside help. If anyone has specific knowledge with the 9S12 series (and, ideally, the A family) and would be willing to contract some time to us for pay, please PM me with your E-mail address and/or phone number to discuss more details. We are located in southwest Ohio and would prefer someone in the vicinity, but are prepared to pay for travel.

Thanks for any help you can give on the forum!!

admin · ‎02-27-2007

Heh, I came here looking for answers to a similar random lockup on my 68HC908, which I'm using in a prototype engine control system for commercial use. Maybe I can offer something useful though if not to you, at least perhaps to others who end up looking at this thread. You probably know all this if you've had plenty of experience with the 8 bit devices but here goes:
I have seen lockups, with a number of Motorola and now Freescale micros, where the ports (particularly inputs) are not well protected with either buffering or well designed low pass filtering. Longish tracks, and more especially wiring run straight from a switch or sensor to the port pin, pick up spikes and cause random port latchup, and sometimes apparent complete lockup of the processor. Reset doesn't get the thing running again; I know because I've had external watchdogs, as well as the internal COP, keeping an eye on the program. The only thing that gets the device going again is a complete power off / on cycle. Anyway, sometimes just the addition of a simple R/C filter network has been sufficient to alleviate the problem; other times, fully buffering the port pins with TTL for digital inputs, or in the case of the AtoD inputs, unity gain op amps, was necessary.
Have you tied ALL unused inputs to a logic level, at least through a resistor?
Hope this helps somebody. In the meantime, I'll keep looking for a solution to my own woes with the HC908

mke_et · ‎12-22-2006

First off, can you cause the problem in the lab with test equipment handy to look at things? I know, with a single chip, that can sound like a dead end, but the thing is do you KNOW what is actually happening when it appears to freeze?

I had a project on the A128 that was locking up. Turned out I had an EEPROM issue. But it was really tough finding out what was causing it. At least I could duplicate the problem while I had it on the bench!

In another case I had a hint for a different lockup. I was using a timer running at .001sec to 'dim' an LCD backlight by pulsing it. Turned out with the type of backlight I was using I could cut the power by 40% and only see a barely perceptible dimming of the display. Anyway, when my system hung, I'd see the backlight either solid on or totally off.

Another time I had each routine set some output pins to a known state (like putting up a 'post code') so that let me find about where it locked. Once I found that, I then put codes all over that area and found out what was going on.

I guess it all depends on if you can get this to fail where you can catch it.

tocard · ‎01-04-2007

Thanks for the input, MKE! Happy New Year!

We do not know exactly what is happening (or where or why, for that matter) when the control freezes. Tracking this down has been daunting and all our basic troubleshooting methods (like toggling outputs at various points, setting breakpoints, etc.) have failed.

We can reproduce the problem in the lab, but cannot cause the problem to occur "on demand". As out in the field, it takes approx. 2 weeks before we'll see a failure in the lab. However, I can hook up all sorts of equipment to the machine here in the lab, including a BDM. And, I do have 8 LEDs on unused outputs that I am toggling at different suspect places in the code to see if that's where I'm getting stuck. I've placed them at different places around the code, but so far, they've told me nothing. And once the control locks up, communication with the BDM also fails and I cannot reconnect without resetting the microcontroller. Thus, I cannot look at registers, stack, etc. after the problem occurs.

Unfortunately, one of two outcomes occurs when the control fails:
1) The COP resets the control, like it is supposed to, but resets the LEDs and erases any useful history I could have gotten from them.
2) The COP _tries_ to reset the control, but doesn't finish. It gets far enough to enter the COP vector, which means a reset has already occurred and all GPIO registers are set to inputs (thereby turning off my LEDs), but does not actually jump back to the POR vector location (0x4000). [I know this because I have an LED turn on as one of the first events when entering the COP reset vector. When the control locks up, this LED is on.]

Let me clarify #2 a bit more. When the problem was first reported in the field, I had forgotten to remove debug code that allows the COP to be disabled in STOP mode (which allows for BDM debugging) and I did not have an LED to indicate that a COP vector occurred. My initial assumption was that the control was somehow getting to a STOP or BGND opcode and waiting for some external stimulus to continue. So I recompiled without the debug code and forced STOP to be an illegal opcode and allowed the COP to continue to run in both STOP and WAIT states. I also added in the LED for the COP vector and some code to dump the stack out to EEPROM during a COP vector before jumping to 0x4000.
It now seems that sometimes the COP will vector, turn on the LED, dump the stack, and reset properly. However, much more often, it will "freeze", vector to the COP vector location, turn on the LED, begin to erase the EEPROM in preparation for dumping the stack, and then freeze permanently.

This freezing during EEPROM erase is actually a second failure mode that may or may not be related to the first. By the control locking up during the stack dump, I believe I have simply exposed the first failure mode problem (the lock up issue in my first post), instead of masking it by a reset. So while I'd like to get the stack dump to work every time, it is simply a debugging technique and I am willing to ignore it for now. That's because I need to find the source and solution to the first problem. (Hopefully the solution to the second will follow...)

That said, I do use the EEPROM on occasion during normal operation (once at start up to read system variables, and once at power down to save the state of any messages displayed). I don't think it is the cause of the problem because the lock ups do not occur when I am powering up or down. However, I would like to know more about your EEPROM issue, if you are able to divulge anything that might be useful.

mke_et · ‎01-07-2007

I HATE these kinds of problems. Once I worked on a laptop keyboard controller that the complaint was it would generate a reset after maybe 18-22 hours of heavy use. AARRGGHHH!!!! But I found it. Discovered it was hardware. And actually got a workaround fix!

Anyway, the first thing you need to do is find where the problem is occurring. Do you have access to the serial port? I'd start streaming status codes to a PC. In EVERY SINGLE SUBROUTINE, put an entry and an exit code, and stream them immediately. If nothing else, you can then look at the 'trap' on the PC and with a simple program see if you have a stack growth issue, or multi-thread issues.

If you feed your status to a serial buffer, then realize that the freeze may occur long after a given status code was actually generated, and that other status codes may still be sitting in the TX queue when the freeze occurs. If you suspect that, kick the baud rate up as fast as you can handle and send 'on demand', that is, with no background send queue. The other option is to grab 2 bits and do something like an on demand I2C data send with a data and clock line, and set up a PC to just trap it with a decoder. Just realize that if you do the send 'realtime', it will impact your program timing.
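A rough sketch of the two-pin "data and clock" idea, in case it helps. This is not real I2C, just a clocked shift-out, MSB first. Everything here is hypothetical: the pin masks, the write_port hook (on the target it would write whatever port register your two spare pins live on), and the sim_ functions, which only stand in for the pins and the PC-side decoder so the logic can be checked on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_DATA 0x01u   /* hypothetical data pin mask  */
#define TRACE_CLK  0x02u   /* hypothetical clock pin mask */

/* Hypothetical hook that writes both trace pins at once. */
typedef void (*port_write_fn)(uint8_t value);

/* Shift one status code out, MSB first. Blocking, with no queue, so the
 * code is on the wire before the caller continues. */
void trace_send_byte(port_write_fn write_port, uint8_t code)
{
    assert(write_port != NULL);
    for (int bit = 7; bit >= 0; --bit) {
        uint8_t data = ((code >> bit) & 1u) ? TRACE_DATA : 0u;
        write_port(data);                        /* data valid, clock low         */
        write_port((uint8_t)(data | TRACE_CLK)); /* rising edge: receiver samples */
    }
    write_port(0u);                              /* idle: both lines low          */
}

/* Host-side stand-in for the pins plus a decoder that samples the data
 * line on each rising clock edge. */
uint8_t sim_last;    /* previous pin state       */
uint8_t sim_shift;   /* last eight bits captured */

void sim_port_write(uint8_t value)
{
    if ((value & TRACE_CLK) && !(sim_last & TRACE_CLK))
        sim_shift = (uint8_t)((sim_shift << 1) | (value & TRACE_DATA));
    sim_last = value;
}
```

Calling trace_send_byte(sim_port_write, 0xD5) leaves 0xD5 in sim_shift. On real hardware each call costs sixteen-plus port writes, which is the timing impact mentioned above.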

If you see the problem happening after a specific sequence, now you know where to look, and even to start duplicating the errors.

Heck, if you do this 'on demand', you may find that the program is not 'locked up', just into a 'dedicated racetrack' where it's just looping and looping a set of instructions.

Hmm, you mentioned EEPROM... I had a similar problem with my EEPROM, where I was trying to read it after a write. Turned out I was getting in WAY too quick, and it was hosing... I never really looked into all the ramifications of the error, as once I realized what I was doing in the low level routine and fixed it, my other lockups all were solved, but it appeared I could hang the CPU with bad status. What was weird was it didn't seem to hang in the routine I did the bad stuff in, it would hang when another routine touched it. Doesn't matter, once I fixed it, the rest worked fine.
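To illustrate the "getting in too quick" point: a minimal sketch of waiting for an EEPROM command to finish before touching the array again, with a bounded poll count so the wait itself cannot hang forever. It assumes the on-chip S12 EEPROM and its status register layout as I remember it (CCIF = bit 6, PVIOL = bit 5, ACCERR = bit 4), so check the masks against the data sheet. The function name and result codes are hypothetical, and the register is passed as a pointer only so the logic can be exercised on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Status bits, assumed from the S12 EEPROM block; verify before use. */
#define ESTAT_CCIF   0x40u   /* command complete */
#define ESTAT_PVIOL  0x20u   /* protection violation */
#define ESTAT_ACCERR 0x10u   /* access error */

typedef enum { EE_OK, EE_ERROR, EE_TIMEOUT } ee_result;

/* Poll the status register until the command completes, an error flag
 * appears, or max_polls reads have been made. */
ee_result eeprom_wait_done(const volatile uint8_t *estat, uint32_t max_polls)
{
    assert(estat != NULL);
    for (uint32_t i = 0; i < max_polls; ++i) {
        uint8_t status = *estat;
        if (status & (ESTAT_PVIOL | ESTAT_ACCERR))
            return EE_ERROR;      /* do not read back; flags need clearing first */
        if (status & ESTAT_CCIF)
            return EE_OK;         /* safe to access the EEPROM array again */
    }
    return EE_TIMEOUT;            /* give up instead of spinning forever */
}
```

The caller would only read the EEPROM back after EE_OK. The timeout path matters if the wait runs somewhere the COP is not being serviced.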

By the way, one tip that may help. If you do the codes for entry and exit of every routine, do something 'simple' to flag them as exit/entry. Like set bit 7 on exit. That way code 01 is your routine, 81 is exit. 55 is entry, D5 is exit. That way you have a real simple way to scan your trap of the data for balanced entries and exits, as well as a visual 'look see' to see how things are going, as opposed to having to look each code up in a table. Doing this means almost that you don't care about the code until you find where it's hanging. And it's real simple to just count entry/exits so you know how deep you are in the stack.
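The PC-side scan for balanced entries and exits could look something like this. It is only a sketch: the function name and the depth limit are made up, and it assumes the trace comes from a single context, since codes streamed from interrupt handlers would interleave and need separate handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_EXIT_FLAG 0x80u   /* bit 7 set marks an exit code */
#define TRACE_MAX_DEPTH 64      /* arbitrary limit for this sketch */

/* Walk a captured trace. Returns the nesting depth at the end of the
 * capture (0 means every entry had its exit), or -1 if an exit does not
 * match the most recent open entry or the nesting exceeds the limit. */
int trace_depth(const uint8_t *codes, size_t count)
{
    uint8_t open[TRACE_MAX_DEPTH];   /* routine ids entered but not yet exited */
    int depth = 0;

    assert(codes != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        uint8_t id = codes[i] & (uint8_t)~TRACE_EXIT_FLAG;
        if (codes[i] & TRACE_EXIT_FLAG) {
            if (depth == 0 || open[depth - 1] != id)
                return -1;           /* exit without a matching entry */
            --depth;
        } else {
            if (depth == TRACE_MAX_DEPTH)
                return -1;           /* runaway nesting */
            open[depth++] = id;
        }
    }
    return depth;
}
```

For the trace 01 55 D5 81 this returns 0. For 01 55 D5 it returns 1, meaning the capture ended inside routine 01, which is where you would start looking.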


StephenRussell · ‎01-05-2007

These problems are very frustrating to find.

My experience with similar "lockup" problems with HC-12 and HCS-12 parts, but not the A64, is that the causes are:

An erratic clock, such as selecting the PLL before it is locked, or having a marginal crystal external clock.

Power supply out of spec or noisy.
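On the first cause above (selecting the PLL before it is locked), a minimal sketch of the guard I mean: only switch the system clock to the PLL once the lock flag is set, and fall back to the oscillator if lock never arrives. The bit positions (LOCK = bit 3 of CRGFLG, PLLSEL = bit 7 of CLKSEL) are from memory of the S12 CRG and should be verified. The function name is hypothetical, and the registers are passed as pointers only so this can be run on a host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit masks assumed from the S12 CRG block; verify against the data sheet. */
#define CRGFLG_LOCK   0x08u
#define CLKSEL_PLLSEL 0x80u

/* Select the PLL as system clock only after it reports lock. Returns true
 * if the PLL was selected, false if lock was not seen within max_polls. */
bool pll_select_when_locked(const volatile uint8_t *crgflg,
                            volatile uint8_t *clksel,
                            uint32_t max_polls)
{
    assert(crgflg != NULL && clksel != NULL);
    for (uint32_t i = 0; i < max_polls; ++i) {
        if (*crgflg & CRGFLG_LOCK) {
            *clksel |= CLKSEL_PLLSEL;          /* locked: run from the PLL */
            return true;
        }
    }
    *clksel &= (uint8_t)~CLKSEL_PLLSEL;        /* no lock: stay on the oscillator */
    return false;
}
```

The same check applies in a lock interrupt handler: read the LOCK bit when the interrupt fires and deselect the PLL if it shows the lock was lost.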

Since it takes a long time to occur, the cause is probably a pretty rare event, so the signals may look fine except when the lockup happens.

Since you sometimes get a reset, you might be able to get some useful ideas by end triggering a digital scope on the fall of /RESET and looking at power and EXTAL pins.

If you can, enable the ECLK output and look at that. That gives you a signal derived from the system clock which will reflect the health and state of the internal system clock.

A similar triggering scheme with a logic analyzer would allow you to see the LED history to get a line on where the program is going when the accident happens.

I would also see if the problem can be provoked by having the MCU and crystal hotter or colder than normal.

Hope this helps.

Please keep us updated on your progress.
