Debugging Unknown Runtime Errors


MichaelA
Contributor I
Hey,
 
I'm having some major problems with my software, trying to find strange runtime errors that are giving me no clue on where to start looking!
 
I am constantly collecting CAN data and processing it in an XGATE interrupt service routine.  When I turn the CAN off (i.e., don't connect my CAN simulator), the rest of the application works just fine.
 
But when CAN is connected and processing data (i.e., causing CAN Rx interrupts on the XGATE), I sometimes get really weird runtime errors.  So far I've had the following errors:
- ILLEGAL_BP in the debugger (though I never set a breakpoint - I sometimes just leave the BDM connected).
- Program restart.  For some reason the program just seems to start over, even though there's no location in my code that says it should do this...
- I've had it ONCE go into the COP interrupt service routine (only ONCE!)
 
(note that every ISR has an endless loop in it, with a unique code that I can look at with LEDs to tell me the program crashed at that ISR)
 
The first 2 errors are pretty common and usually happen anywhere between 5 minutes and one hour into a run (the rest of the time it works just as it should).  Because it works until the error happens, I suspect something weird happens under a specific condition on the CAN bus, either a value that's not being processed correctly, or something...
 
Does anyone have any suggestions on how I could debug this beast and get this running flawlessly?  Has anyone had similar problems with their application when processing a lot of CAN data?
 
Any help or ideas would be greatly appreciated - it's starting to get quite frustrating since it works so well, except for this sporadic error!
 
Thanks!
 
- Michael

SteveRussell
Contributor III

Michael,

By providing service routines for all possible interrupts you've covered the most common causes.  Have you also got routines under the "reserved" interrupts that "should never happen"?

When you manage to fix the problem, please let us know what the fix(es) are.

Here are some additional ideas:

Make sure that the startup code you are using is writing to ALL the write-once registers, even if the reset values are what you want.  This makes sure that they don't get changed by misguided code in the running application.

Fill unused flash locations with something that stops execution faster than 0xFF.  Freescale AN2400/D "HCS12 NVM Guidelines" discusses various possibilities.  I think that they all will work for the S12X, but you should check.

The idea is that erased flash (0xFF) executes as "garbage the stack pointer and proceed", which is not helpful, to say the least.  The various fill values stop execution in one or two cycles and don't garbage the stack.

Scowl at the clock signals and the power supply.  Glitches in the clock and power supply noise can cause completely unexpected operation of the MCU.   For the power supply, it may work to put a digital scope on the power pins and set it to trigger on higher or lower voltage.  You could try the same with the XTAL and EXTAL pins, but that may not work too well.

If you can cause a COP watchdog reset reliably when the problem occurs, you could trigger on RESET.

Enable the clock monitor if it isn't already enabled, though I don't think it will detect every glitch.

If you are using the PLL, you might have some subtle stability problems.  If the PLL filter doesn't have suitable values, the system clock frequency will wander around too much or too fast, and you could get unexpected CPU operation.  I've seen this with no PLL filter, but the working range for the filter is much wider than the Freescale PLL calculator indicates, so if your component selection is recommended by the PLL calculator and the actual values are close to what they should be, there should be no problem.

The on-chip trace can be set to trigger on accesses inside or outside a range, but you can only set up two ranges, so if your code is in several flash pages, you will have to make many runs, triggering on the unused gaps between pages. 

A bad solder joint or trace somewhere is a remote possibility, so tapping on the circuit board around the MCU might trigger the problem.  If it does, some experimenting might identify the physical area of the problem.

MichaelA
Contributor I

Hey Steve,

I'll give those suggestions a try.

I am also using a semaphore to save data into a shared area of RAM - the XGATE part is the only place that writes to this RAM, but I also take the semaphore in the CPU part, where it only reads the data.  Do I even need a semaphore at the CPU end if it only reads, and does not modify, the data?  Could this also be causing a problem, in that the CPU is holding the semaphore too long?  (It holds the semaphore for every byte, about 50 or so, while sending them out on the SCI bus - but at the same time, there may be incoming CAN messages to save to the same array...)

Thanks for the reply,

- Michael

Steve
NXP Employee

Michael,

 The semaphore is needed when reading more than one byte/word at a time to ensure coherency of the message (you don't want to read one half of a message and then the other half from another message).

  You should minimise the time that each core holds the semaphore because you are blocking the other core. Can you use more than one semaphore?

  It's not obvious how that would cause the problem, but it may be worth a look.
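
One way to keep the hold time short, as a rough sketch only: take the semaphore, copy the whole message into a private buffer, release, and only then do the slow SCI transmission from the private copy. The code below is a host-side illustration in C, not S12X code. `sem_try_lock`, `sem_unlock`, `shared_msg`, `snapshot_msg` and `MSG_LEN` are hypothetical names, and the boolean flag merely stands in for the real hardware semaphore, which is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSG_LEN 8u  /* assumed message size; adjust to the real layout */

/* Stand-in for the hardware semaphore (assumption: the real part uses a
 * semaphore register shared between CPU and XGATE, not a plain flag). */
static volatile bool sem_taken = false;

static bool sem_try_lock(void)
{
    if (sem_taken) {
        return false;      /* the other core holds it */
    }
    sem_taken = true;
    return true;
}

static void sem_unlock(void)
{
    sem_taken = false;
}

/* Shared area that the XGATE side would write (hypothetical name). */
static volatile uint8_t shared_msg[MSG_LEN];

/* Copy the complete message while holding the semaphore, then release it
 * immediately. Returns false if the semaphore was busy so the caller can
 * retry later instead of blocking. */
static bool snapshot_msg(uint8_t out[MSG_LEN])
{
    assert(out != NULL);
    if (!sem_try_lock()) {
        return false;
    }
    for (size_t i = 0; i < MSG_LEN; i++) {
        out[i] = shared_msg[i];
    }
    sem_unlock();
    return true;
}
```

The CPU side would then call `snapshot_msg(local)` and send `local` over the SCI with no semaphore held, so the XGATE is blocked only for the duration of the copy rather than for the whole transmission.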

Steve
NXP Employee

Michael, the common errors are indicative of code runaway on the CPU.

Some thoughts:

1/ Could be unrelated to XGATE - could your CPU algorithm fail if the data coming from XGATE has a particular value?

2/ I assume you use XGATE to copy data into the RAM. Are you copying into the wrong location? Are you overwriting the CPU stack or destroying some other CPU info? Have a look at your CPU stack and variable contents to see if they are in range. Set up the RAM protection scheme.

3/ You may be able to get something from the trace buffer (CW 4.5). Have it continually running and see if you can extract a failure flow. Try having the trace cause a hardware break if you execute code from somewhere you don't expect.
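
For point 2, a software check can complement the RAM protection scheme. The sketch below is an illustration under assumptions, written as portable C rather than XGATE code: it bounds-checks every store into the shared array and puts guard words on either side of it, so an overrun shows up as a corrupted guard instead of a silent stack overwrite. `guarded_buf`, `guarded_store`, `guarded_intact`, `CAN_BUF_LEN` and the `GUARD` pattern are all made-up names and values, not anything from the original application.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAN_BUF_LEN 50u     /* assumed size of the shared array */
#define GUARD       0xA5C3u /* arbitrary pattern unlikely to occur by chance */

/* Shared buffer with a guard word before and after the data. */
struct guarded_buf {
    uint16_t head_guard;
    uint8_t  data[CAN_BUF_LEN];
    uint16_t tail_guard;
};

static void guarded_init(struct guarded_buf *b)
{
    assert(b != NULL);
    b->head_guard = GUARD;
    b->tail_guard = GUARD;
    for (size_t i = 0; i < CAN_BUF_LEN; i++) {
        b->data[i] = 0;
    }
}

/* Store one byte; refuse out-of-range indices instead of writing past
 * the end of the array. */
static bool guarded_store(struct guarded_buf *b, size_t index, uint8_t value)
{
    assert(b != NULL);
    if (index >= CAN_BUF_LEN) {
        return false;
    }
    b->data[index] = value;
    return true;
}

/* True while both guard words still hold the pattern. */
static bool guarded_intact(const struct guarded_buf *b)
{
    assert(b != NULL);
    return b->head_guard == GUARD && b->tail_guard == GUARD;
}
```

Calling `guarded_intact()` periodically from the main loop, and trapping with a unique LED code when it fails, would narrow down whether the copy routine is the one walking over neighbouring RAM. The same fill-and-check idea can be applied to the bottom of the CPU stack.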

 

Message Edited by Steve on 05-11-2006 04:09 PM
