strange QT4 behavior

irob · ‎10-19-2006

Hey all,

I've got some QT4's on a mature design that are behaving strangely. In a controlled test environment, I'm seeing a few boards fail at random times while others from the same batch do not. All QT's are from the same date code batch (RMCGWm).

The method of failure is the same. I have input switches that are "freezing up". In other words, immediately upon power up these switches are not functioning when they very clearly should be. I have (what I thought to be) very stable state machines and adequate COP timer resets such that these switches are polled constantly.

Unfortunately, it's difficult to debug this problem. I have the QT4 demo board, but that requires porting the code into that board. Plus, that is different hardware, so I could be chasing a phantom bug that might be hardware related.

After swapping out boards, I no longer saw the problem. Here's the question: what is it about this problem board that would cause this? I've already tested for intermittent shorts on the inputs. I'm thinking it's something about the microcontroller. Any thoughts?

bigmac · ‎10-25-2006

Hello iRob,

It is not entirely clear whether your "freezing up" problem is a permanent fault, or can be cleared with a POR.

I generally adopt a particular strategy to ensure that I/O control registers cannot remain corrupt - a strategy that I have never actually seen advocated by Freescale in sample programs, etc. In most code that I see, the initialisation of I/O registers (data direction and pull-up enable) seems to be done only once after a reset. It is then typically assumed that these will remain uncorrupted until the next reset (assuming the program code has no need to change them).

To allow for possible corruption to occur, by whatever cause, I would usually "re-state" the required register values as the first task on each entry to the main program loop (followed by reset of the COP timer). This would also apply to any other control registers that influence the way the I/O pins behave, and are not updated within other routines, after initialisation.

I see this as a precautionary measure that does no harm, and requires minimal overhead.
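
A minimal sketch of this idea in C, built so that it compiles on a host PC. The register variables below are plain stand-ins for the memory-mapped registers (on the target they would come from the device header), and the initial values and pin assignments are hypothetical, not taken from iRob's design.

```c
#include <stdint.h>

/* Host stand-ins for the memory-mapped registers. On the target these
   names would be provided by the device header instead. */
static volatile uint8_t DDRA;    /* port A data direction */
static volatile uint8_t PTAPUE;  /* port A pull-up enable */
static volatile uint8_t COPCTL;  /* any write services the COP */

/* Hypothetical configuration: PTA0 and PTA1 are outputs,
   pull-ups are enabled on the PTA2..PTA5 switch inputs. */
#define DDRA_INIT   0x03u
#define PTAPUE_INIT 0x3Cu

/* Write the required values again, whatever the registers hold now. */
static void restate_io_registers(void)
{
    DDRA   = DDRA_INIT;
    PTAPUE = PTAPUE_INIT;
}

/* The value written to the COP control location is irrelevant. */
static void service_cop(void)
{
    COPCTL = 0u;
}

/* One pass of the main loop: re-state first, then service the COP,
   then do the normal work. */
static void main_loop_pass(void)
{
    restate_io_registers();
    service_cop();
    /* ... poll the switches and run the state machines here ... */
}
```

On the target, main_loop_pass() would be the body of the endless loop. The cost is two register writes per pass.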

Regards,

Mac

irob · ‎10-25-2006

Mac, the button "freeze up" problem can be corrected with POR. So yours is an interesting solution. I'll give it a go and see if we have this anymore.

irob · ‎11-08-2006

Here's an update:

I've got a version of my firmware written that re-writes the data direction registers each time at the top of the main program loop. Regardless, on some boards the problem of "button freeze" is still reproducible. I'm going to hang an LED on an output pin to try to debug this further.

Also, my COP timeout has always been set to the long cycle (262,128 x BUSCLKX4) in the COPRS bit of CONFIG1. Is there any other register I should be setting for longer COP timeout?

Also, if my COP is disabled, should I remove every instance of __RESET_WATCHDOG();?

bigmac · ‎11-09-2006

Hello iRob,

On the "freeze up" problem, have you been able to identify that the firmware is still operating correctly, aside from the lack of response from the pushbuttons?

Does the project use an LED, or do you have a spare pin that an LED could be connected to, for diagnostic purposes? If so, you might modify the code for the following mutually exclusive tests -

1. At the completion of your initialisation code, flash the LED for a period of 100-200 milliseconds (making sure the COP is reset periodically during the delay period). You will be able to determine that the POR occurs properly, and that the remainder of the code does not cause COP timeout, i.e. only a single flash on power-up.

2. Set up a timing register that the timer overflow ISR will decrement each time it is entered, provided its value is not already zero. This will provide a timeout period in multiples of 20 milliseconds (for a TIM prescale of 1). Then at each commencement of the main loop, test for timeout (zero register value), and if so set the register for a suitable delay period and toggle the test LED. If the LED then continuously cycles, this will verify that the firmware remains operational, and does not head off to places unknown.
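
Test 2 could look roughly like the following C sketch. The port variable is a host stand-in for the real data register, and the LED pin and tick count are hypothetical choices. The two functions only show the countdown and toggle logic; hooking tim_overflow_isr() to the actual timer overflow vector and clearing the overflow flag are left out.

```c
#include <stdint.h>

static volatile uint8_t led_timeout;  /* ticks remaining, decremented by the ISR */
static volatile uint8_t PTA;          /* host stand-in for the port A data register */

#define LED_MASK  0x02u  /* hypothetical: test LED on PTA1 */
#define LED_TICKS 25u    /* roughly 0.5 s at about 20 ms per overflow */

/* Body of the timer overflow ISR: count down, but never below zero. */
void tim_overflow_isr(void)
{
    if (led_timeout != 0u) {
        led_timeout--;
    }
    /* On the target, the overflow flag would also be cleared here. */
}

/* Call at the start of each main loop pass. When the timeout has
   expired, reload it and toggle the LED. */
void heartbeat_poll(void)
{
    if (led_timeout == 0u) {
        led_timeout = LED_TICKS;
        PTA ^= LED_MASK;
    }
}
```

If the LED keeps toggling while the buttons are dead, both the main loop and the timer interrupt are still running.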

If both tests were to verify correct operation when the pushbuttons cease operation, this would localise the problem to the push button code. Are you using keyboard interrupts for this purpose, or are you polling the pushbuttons?

Regards,

Mac

irob · ‎11-09-2006

I'm not using keyboard interrupts, I'm just polling with state machines.

These are all great suggestions. I will look into them very soon. In the meantime, I think I may have a successful update. I've enabled LVI on these problem boards and this button issue (among others) seems to have disappeared. More info to follow...

bigmac · ‎11-09-2006

Hello iRob,

I had assumed that the LVI would be enabled, and am surprised that you would have disabled it. This is now very suggestive of a power ramp up problem, as previously explained by Alban. What voltage regulator do you use, and what are the input and output capacitor values? During your tests, did you apply power by switching the DC voltage, or by switching the AC mains voltage?

Regards,

Mac

irob · ‎11-09-2006

Well, I have begun using LVI on other more recent firmware. This button bug started showing up on an older set of firmware, where LVI was still disabled. Never fear, LVI (or LVD in the case of 9S08's) is now my firm policy.

On all these targets, I've been using small SOT23 LDOs from TI, TPS76301. I've got 0.1uF caps on the input, 10uF caps on the output. The input to my target is +5V.

For testing, power cycling varies. In my software test environment, I'm using a bench DC supply, so I typically switch that off (still technically DC switching). In the final assembly lab, our technicians use batteries to cycle power.

peg · ‎11-09-2006

Hi Rob,

There are only two choices of COP timeout here, controlled by COPRS in CONFIG1 like you say.
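
To put numbers on the two settings, here is a small host-side C calculation. The long cycle count (262,128) is the one quoted earlier in the thread. The short cycle count (8176) and the 12.8 MHz BUSCLKX4 figure used in the note below are my recollection of the data sheet for the internal oscillator case, so treat them as assumptions and check them against your part and clock source.

```c
#include <stdint.h>

#define COP_LONG_CYCLES  262128UL  /* COPRS = 0, as quoted in the thread */
#define COP_SHORT_CYCLES 8176UL    /* COPRS = 1, assumed from the data sheet */

/* COP timeout in whole microseconds for a given BUSCLKX4 frequency in Hz. */
uint32_t cop_timeout_us(uint32_t cycles, uint32_t busclkx4_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000ULL) / busclkx4_hz);
}
```

With BUSCLKX4 at 12.8 MHz this gives about 20.5 ms for the long cycle and about 0.64 ms for the short one, so any delay loop longer than that needs to service the COP along the way.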

As for the other question.....

Kick a dead dog and you get.... well, a dead dog!

Set a broken clock and it's correct twice a day!

It will just waste a few cycles.

Regards

David

Alban · ‎10-22-2006

Hi,

Have a look at the Vdd rising profile.
If you have high capacitors on the power supply, the time to charge them may mean Vdd rise-time is too slow.

This would cause code run-away, matching the observed behaviour.
A way to check this would be to see, when you boot it up, if the LVI bit in the SRSR is set.

With a normal fast power supply, you are likely to only get POR and PIN bits. PIN would come from the reset line kept low by an RC filter.
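
A rough C sketch of that check. The bit masks are as I recall them from the QT4 data sheet and should be verified before use; the enum and function names are made up for this example. The function takes the SRSR value as an argument, so on the target the register would be read once at start-up and the saved copy passed in (as far as I know, reading SRSR clears it).

```c
#include <stdint.h>

/* SRSR bit masks, assumed from the data sheet; verify for your part. */
#define SRSR_POR 0x80u
#define SRSR_PIN 0x40u
#define SRSR_COP 0x20u
#define SRSR_LVI 0x02u

typedef enum {
    BOOT_NORMAL,    /* POR, optionally with PIN, and nothing else */
    BOOT_LVI_SEEN,  /* low-voltage inhibit reset was recorded */
    BOOT_COP_SEEN,  /* watchdog reset was recorded */
    BOOT_OTHER      /* any other combination */
} boot_cause_t;

/* Classify a saved copy of SRSR. LVI is checked first because it is
   the flag that points at a slow or sagging Vdd. */
boot_cause_t classify_reset(uint8_t srsr)
{
    if (srsr & SRSR_LVI) {
        return BOOT_LVI_SEEN;
    }
    if (srsr & SRSR_COP) {
        return BOOT_COP_SEEN;
    }
    if ((srsr & SRSR_POR) && ((srsr & ~(SRSR_POR | SRSR_PIN)) == 0u)) {
        return BOOT_NORMAL;
    }
    return BOOT_OTHER;
}
```

The result could drive a diagnostic LED pattern so that a board showing the fault can report which reset sources it saw.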

Alban.

bigmac · ‎10-23-2006

Hello Alban,

Alban wrote:

Have a look at the Vdd rising profile.
If you have high capacitors on the power supply, the time to charge them may mean Vdd rise-time is too slow.

This would cause code run-away, matching the observed behaviour.

I am curious as to the run-away mechanism - does the reset vector not correctly load to the PC? Since the LVI module is initially enabled by default, doesn't this offer some protection if the voltage is still too low after the clock stabilisation interval? Do you know of any application notes that detail this possible run-away phenomenon?

Regards,

Mac

Alban · ‎10-23-2006

Hi Mac,

With a very low voltage, the POR is kicking in on top of the LVI.

The POR will kick at about 1V, but Vdd needs to go below 100mV for rearming the circuitry.

So you have a possibility of code run-away under the condition described in the figure on page 4 of AN2640.

I also think that if the Vdd rise time is not fast enough, you can get some normal latch-up effects.

The minimum Vdd rise time is characterized and stated in the electrical specifications.

Alban.

AN2640 (Page 4)
HC908QY4/QT4 Microcontroller (MCU) Application Hints

AN2105 (Page 2)
Power-On, Clock Selection, and Noise Reduction Techniques for the MC68HC908GP32

EB398 (other interest)
Techniques to Protect MCU Applications Against Malfunction Due to Code Run-Away

AN1744 (also of interest)
Resetting Microcontrollers During Power Transitions

irob · ‎10-24-2006

Alban wrote:

EB398 (other interest)
Techniques to Protect MCU Applications Against Malfunction Due to Code Run-Away

Great stuff, Alban. Very interesting. I'm particularly curious if this code-runaway situation might possibly be affecting my other issues with QT4s. See here.

Listen to this from the above Engineering Bulletin:

In applications which include non-volatile memory there is a possibility that its contents could be corrupted by uncontrolled behavior of the MCU. This would be particularly serious in the case of Flash or EEPROM memory that contains application code. If this is corrupted, the application may be rendered nonfunctional with no possibility of recovery short of reprogramming.

That describes exactly my situation with regards to lost jump vectors.

Alban · ‎10-24-2006

Hey iRob,

I replied to your vector table erase post before reading this.

I don't think this point is related to your application.

You can screen this out by watching your power supplies.

Even though the EB explains what is possible, it would mean that your runaway code always goes and executes the Flash erase routine, since you don't observe any other erratic behaviour. That would be very bad luck.

Furthermore, when you remove the use of the erase routine, you still get the failure. That means the runaway code would have to jump to the start of the ROM erase routine 100% of the time, or to a part of memory that changes the Flash programming registers accordingly...

My bet is on the COP !

Alban.

irob · ‎10-24-2006

But listen to the second workaround listed in the MSE908QY4_3L69J errata:

Disable the COP watchdog, or use a timeout period that is longer than the page erase time (4 ms) and write to $FFFF immediately before and after doing any page erase.

I've disabled the COP on all versions which use the ROM-resident erase function. But I'm still dealing with product returns which exhibit the symptom of erased reset vectors.

Alban · ‎10-25-2006

Yes, it's because the writer probably meant "don't use the COP or if you want to use it, make sure its time-out is long enough and you don't refresh it during the page erase."
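
Read that way, the second workaround amounts to something like the C sketch below, written for a host build. The COP register variable and the counters are stand-ins, the erase routine is passed in as a function pointer because the real ROM routine's name and calling convention are not given in this thread, and fake_erase() exists only so the sketch can be exercised on a PC.

```c
#include <stdint.h>

static volatile uint8_t COPCTL;  /* host stand-in for the location at $FFFF */
static unsigned cop_writes;      /* host-only: counts COP services */

/* Any write to the COP location restarts the timeout. */
static void service_cop(void)
{
    COPCTL = 0u;
    cop_writes++;
}

/* Hypothetical signature for whatever routine performs the page erase. */
typedef void (*page_erase_fn)(uint16_t page_addr);

/* Service the COP right before the erase so the full timeout window is
   available, leave it alone during the erase, and service it again
   right after. The timeout must be longer than the erase time. */
void erase_page_with_cop(page_erase_fn erase, uint16_t page_addr)
{
    service_cop();
    erase(page_addr);
    service_cop();
}

/* Host-only fake erase: records its argument and how many COP
   services had happened by the time it ran. */
static uint16_t last_erased;
static unsigned writes_seen_during_erase;

static void fake_erase(uint16_t page_addr)
{
    last_erased = page_addr;
    writes_seen_during_erase = cop_writes;
}
```

This only makes sense with the long COP timeout selected, since the errata text quotes a 4 ms page erase time.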
