Unexpected COP Timeout on ColdFire V1

markthedark · 04-20-2017

We are implementing the COP feature on the MCFJE256 MCU and *appear* to be having an issue where the COP times out prematurely.

The following is our configuration, which should give a timeout of 1.024 s (1024 cycles of the 1 kHz LPOCLK):
   COPW = 0 (normal mode)
   COPCLKS = 0 (uses 1 kHz LPOCLK)
   COPT = 3
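
As a sanity check on the expected timeout, here is a minimal sketch of the arithmetic. The exponent table is an assumption based on the S08-style COP (it is not quoted anywhere in this thread, so check it against the reference manual), and cop_timeout_us is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed COP timeout exponents (S08-style table, verify against the manual).
 * Row 0: COPCLKS = 0, counts 1 kHz LPOCLK cycles.
 * Row 1: COPCLKS = 1, counts bus clock cycles.
 * COPT = 0 disables the COP, marked here with exponent 0. */
static const uint8_t cop_exp[2][4] = {
    { 0,  5,  8, 10 },
    { 0, 13, 16, 18 }
};

/* Timeout in microseconds for the selected clock (in Hz); 0 if the COP is off. */
uint64_t cop_timeout_us(unsigned copclks, unsigned copt, uint32_t clk_hz)
{
    assert(copclks < 2u && copt < 4u && clk_hz > 0u);
    uint8_t e = cop_exp[copclks][copt];
    if (e == 0u) {
        return 0u;
    }
    return (((uint64_t)1 << e) * 1000000ULL) / clk_hz;
}
```

With COPCLKS = 0, COPT = 3 and a nominal 1000 Hz LPOCLK this gives 1024000 us, i.e. about one second rather than one millisecond.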

To prove that the application is servicing the COP, the code that resets the COP timer (sequential write of 0x55 and 0xAA to SRS) is wrapped with a general purpose output being set and cleared. On the scope, it is very clear that the code is being executed every 250 ms.

Here is my watchdog reset macro:

 #define WDOG_KICK() DBG_YW_SetVal(), SRS = 0x55U, SRS = 0xAAU, DBG_YW_ClrVal()

When the timeout occurs, my debugger halts and the Debug window says "Suspended: Signal 'Halt' received. Description: User halted thread."

What is strange is that sometimes the issue occurs after the initial download, but not on subsequent restarts. Other times, the first run after the initial download is fine, but subsequent restarts fail. Each time, I scope the feeding of the COP as outlined above, and it still seems to be writing the reset sequence.

Another thing that's strange is that it always seems to halt after the instruction that follows the writing of 0xAA to SRS. See the last line in the following disassembly.

                    WDOG_KICK();
00012fd6:   clr.b d0
00012fd8:   bset d0,0xFFFF8002 (0xffff8002)
00012fdc:   moveq #85,d0
00012fde:   move.b d0,0xFFFF9800 (0xffff9800)
00012fe2:   moveq #-86,d0
00012fe4:   move.b d0,0xFFFF9800 (0xffff9800)
00012fe8:   clr.b d0
00012fea:   bclr d0,0xFFFF8002 (0xffff8002)   <--- halts here

Lastly, if I disconnect and run without the debugger, everything seems fine.

Can anyone share any insight or experience with using the COP on any of the ColdFire V1 parts?

Thanks,
Mark

TomE · 04-20-2017

Very unlikely, but make sure you're not in COP Windowed Mode.

Debuggers and Watchdogs just don't get on. Some chips have Watchdogs with "debug" settings. That means the Watchdog is halted while the Debugger is messing with the CPU (and stopping it from running). The MCFJE256 doesn't seem to have anything like that. You might have control bits in the debug module that manage this, but I can't find any.

If you want the ability to debug while the watchdog is running then you need to use a higher spec chip to get that feature.

Otherwise, work out some way to disable the COP when you're running on the debugger. Either have the debugger startup script actively write to the SOPT1 register to disable the COP, or have something in your startup code recognise "running under the debugger" and disable it there.
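
A minimal sketch of the "disable it in startup code" option, using a build flag rather than true debugger detection. Everything named here is an assumption for illustration: SOPT1_sim is a plain variable standing in for the real SOPT1 register so the sketch runs on a host, the COPT field is assumed to sit in the top two bits, COPT = 0 is assumed to switch the COP off, and DEBUG_BUILD is a hypothetical project define.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for SOPT1 (hypothetical). On target, use the register
 * definition from the device header instead. 0xC0 pretends COPT = 3. */
volatile uint8_t SOPT1_sim = 0xC0u;

/* Assumption: COPT occupies the two most significant bits of SOPT1. */
#define SOPT1_COPT_MASK 0xC0u

/* Hypothetical build switch, e.g. -DDEBUG_BUILD=0 for release builds. */
#ifndef DEBUG_BUILD
#define DEBUG_BUILD 1
#endif

void cop_init(void)
{
    uint8_t v = SOPT1_sim;
#if DEBUG_BUILD
    v &= (uint8_t)~SOPT1_COPT_MASK;   /* COPT = 0: COP assumed disabled */
#else
    v |= SOPT1_COPT_MASK;             /* COPT = 3: longest timeout */
#endif
    /* One single write: the register is assumed to be write-once after reset,
     * so nothing else may write it earlier in startup. */
    SOPT1_sim = v;
#if DEBUG_BUILD
    /* Read back to confirm the write actually took effect. */
    assert((SOPT1_sim & SOPT1_COPT_MASK) == 0u);
#endif
}
```

If the register really is write-once, the debugger startup script and the startup code cannot both write it, so only one of the two approaches should be used.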

Why is it triggering? The Debugger is probably stopping and starting the CPU when you're not expecting it to. If it is stopping at that same instruction, then maybe you have an "untriggered breakpoint" on that location. Maybe a breakpoint is present, but it is set so the debug pod ignores it. But it may still be stopping and restarting when this happens.

You could have your main loop toggle a GPIO and then you'd be able to see if the debugger is stopping the code for a while.

Tom

markthedark · 04-25-2017

Hi Tom - Thanks for the reply.

From the JE256 reference manual, I believe the processor does manage/halt/reset the COP while the BDM is active, so I'm not sure this is the source of our troubles. Here's a snippet from section 5.3.1:

If the 1 kHz LPOCLK source is selected, the COP counter is re-initialized to zero upon entry to
background debug mode or stop mode and begins from zero upon exit from background debug mode or
stop mode.

From this, it sounds like I should be able to debug without issue.

If I remove the debugger and simply power up my target, I see that it still struggles to boot, leading me to think that this is NOT just an issue with the BDM, but an overall issue with the COP system.

I added more "instrumentation" by configuring an output pin to toggle unique patterns at different times. At POR the output toggles 5x. When the reset pattern 0x55 0xAA is written to the SRS, the output toggles 1x.

With these patterns I can confirm that the MCU resets several times before it is able to "get going". For example, the reset times are 715 ms, 464 ms, 715 ms, before the application is able to run, all of which are under the COP timeout of 1.024 s. Within the reset times, the maximum time between the writes to the COP is 250 ms.

Once the target is running, I'm able to see from my target's GUI that the only reset source reported in the SRS is the COP bit.

I've attached an annotated screen capture from my logic analyser that illustrates the timing.

I can't seem to find any errata sheets for the JE256, which I find surprising. Are you aware of any errata?

Thanks,

Mark

TomE · 04-25-2017

I missed that part on the COP interaction with the debugger. That is a simple and sensible implementation.

Your CPU is getting reset, that much is certain. I think we can trust the SRS register when it says the source is the COP.

That must mean that either the COP isn't being restarted properly, or it is biting due to some other "violation", or the clock is running too fast.

The intervals between the CPU resets are a lot less than the expected timeout (1 second). If the COP restarts weren't working I'd expect the CPU resets to be 1 second apart.

The COP will reset the CPU if it is programmed in "Windowed" mode and the resets are outside the window. This is easy to check and isn't happening. The COP will also reset the CPU if any value other than "AA" or "55" is being written to the SRS. Is it possible you have an uninitialised variable or pointer somewhere in your code, and it is writing to the SRS? This could keep happening until the variable ends up with a value that has it not writing to the SRS address.
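
To make those rules concrete, here is a small host-side model of the non-windowed COP as described above: 0x55 followed by 0xAA restarts the counter, any other value written to SRS resets immediately, and the counter reaching the timeout resets. It is a sketch only. The type and function names are hypothetical, windowed mode is not modelled, and treating a stray 0xAA without a preceding 0x55 as "ignored" is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t counter;  /* COP clock cycles since the last valid service */
    uint32_t timeout;  /* cycles after which the COP bites */
    bool     armed;    /* true once 0x55 has been written */
    bool     reset;    /* latched when the model would reset the MCU */
} cop_model_t;

void cop_model_init(cop_model_t *c, uint32_t timeout_cycles)
{
    assert(timeout_cycles > 0u);
    c->counter = 0u;
    c->timeout = timeout_cycles;
    c->armed   = false;
    c->reset   = false;
}

/* Advance the COP clock by the given number of cycles. */
void cop_model_tick(cop_model_t *c, uint32_t cycles)
{
    c->counter += cycles;
    if (c->counter >= c->timeout) {
        c->reset = true;
    }
}

/* Model a byte write to SRS. */
void cop_model_write_srs(cop_model_t *c, uint8_t value)
{
    if (value == 0x55u) {
        c->armed = true;               /* first half of the service sequence */
    } else if (value == 0xAAu) {
        if (c->armed) {                /* second half: restart the counter */
            c->counter = 0u;
            c->armed   = false;
        }
    } else {
        c->reset = true;               /* any other value: immediate reset */
    }
}
```

Feeding the model a 1024-cycle timeout and a service every 250 cycles never sets reset, which is why the observed resets point at either a stray SRS write or a clock problem rather than at a slow service loop.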

The "1kHz LPOCLK" could be running fast. Slow clocks (like 32.768 kHz watch crystals) are notorious for being sensitive to any electrical noise, as noise pulses are seen as extra clock transitions. Check the power supply for any strange noise. Add some more capacitors. Power the CPU from something else. Check all the VDD and VSS pins (all 8) to make sure none are floating. Check for a capacitor on VREFO if the reference is enabled. Can you get to CLKOUT? If you can, you can program SIMCO to output LPOCLK on that pin so you can watch it with an oscilloscope to see if it is running at the right speed.

The MCF51 parts derive from the S08 ones. It might be worth searching for keywords ("SOP", "LPOCLK") in all forums.

I can't find any Errata either.

Tom

markthedark · 04-26-2017

Hi Tom,

Thanks for the thorough list of suggestions. Your critical-thinking skills are *very* much appreciated.

I sent LPOCLK to CLKOUT and it always looks clean. The output frequency is about 1.023 kHz. I had a moment where the target really struggled to get going, and the output was consistent the entire time. We can rule that out.

You suggested checking to see if there is an uninitialized pointer writing an invalid value to SRS. I redefined my WDOG_KICK() macro to be blank, disabled the watchdog module and then set a debugger watchpoint to halt on any writes to the SRS register. The watchpoint never halted. To be sure it was working, I manually wrote to that memory address through our background terminal and it did halt, meaning the watchpoint was working. As an additional check, I re-enabled all watchdog code, BUT changed WDOG_KICK() to write an invalid value, and that restarted the MCU instantly, confirming that an invalid write to SRS does reset the part. I *think* we can rule that out.

Power Supplies:

Our schematic shows that none of the VSS and VDD pins are floating and we have the recommended 0.1 uF bypass caps on VSS1, VDD1, VSS2, VDD2, VSS3, VDD3, as well as VREFH and VREFL. I noticed the note about the 10 uF CBLK capacitor. We are using Digikey p/n: (CAP CER 10UF 25V X5R 0805), but the JE256 tower board uses 718-1121-1-ND (CAP TANT 10UF 10V 10% 1206).

We will take a look at cleanliness of the power supplies and VREFO pin and will get back to you shortly on that.

Thanks again, Tom.

Mark

TomE · 04-27-2017

I was only suggesting checking the power supplies on the assumption that it might be messing the LPOCLK up, but you've proved that it is stable. So it is unlikely to be a power supply issue, but it is certainly something worth checking.

So you've got me on this one.

I would still suspect something going wrong in the code somehow. I'd suggest checking the stack pointer initialisation, and also adding code to zero the stack. My suspicion is that there is an uninitialised stack variable that powers up "bad", but eventually gets overwritten with something innocuous.
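
A minimal sketch of the "zero the stack" idea. The function name and the way the stack bounds are passed in are hypothetical. On target the bounds would come from linker symbols, and the fill has to run either before any C code uses the stack (early startup) or only over the unused region below the current stack pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Fill value: zero as suggested. A recognisable pattern would also work
 * and makes stack usage visible in a memory dump. */
#define STACK_FILL 0x00000000u

/* Fill [bottom, top) with STACK_FILL, one 32-bit word at a time.
 * volatile keeps the compiler from dropping the "dead" stores. */
void stack_fill(volatile uint32_t *bottom, volatile uint32_t *top)
{
    assert(bottom <= top);
    for (volatile uint32_t *p = bottom; p < top; ++p) {
        *p = STACK_FILL;
    }
}
```

If the start-up failures disappear with the stack zeroed, that points strongly at an uninitialised local variable.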

The way I'd go about tracking this down (finding the bug or proving that it isn't software) is to start disabling (by commenting out) large chunks of your code and running with that. Run with a minimal "event loop" that pretty much just pats the watchdog and see if it can start up cleanly. Then start putting other parts of your code back in until it fails again. Then narrow down which bits of that code have to be enabled/disabled to cause the problem.

Another way to find where it is going wrong is to set up a high priority (IPL7) Timer Interrupt. Have the Service Routine capture the interrupted PC and stash it somewhere in Static RAM. The last stashed PC before the COP Reset (which you can detect from the SRS on the next reset) will be "close" to where the COP triggered. The ultimate version of this causes an interrupt and then doesn't clear it. Since the CPU is guaranteed to execute one instruction on return from an interrupt, this effectively "single-steps" the code at 1/20 to 1/50th of the normal execution rate.
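
A sketch of the "stash the PC" bookkeeping only. How the service routine obtains the interrupted PC is deliberately left out: it would have to be read from the exception stack frame, which typically needs a small assembly wrapper. All names here (pc_crumb, pc_crumb_store, pc_crumb_fetch, CRUMB_MAGIC) are hypothetical, and on target the variable must live in a RAM section that the startup code does not clear, which is linker-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRUMB_MAGIC 0xC0DEFACEu  /* arbitrary marker: last_pc holds a real sample */

typedef struct {
    uint32_t magic;    /* CRUMB_MAGIC when last_pc is valid */
    uint32_t last_pc;  /* most recent interrupted PC */
} pc_crumb_t;

/* On target: place in a no-init RAM section so it survives a COP reset. */
pc_crumb_t pc_crumb;

/* Call from the timer service routine with the interrupted PC. */
void pc_crumb_store(uint32_t interrupted_pc)
{
    pc_crumb.last_pc = interrupted_pc;
    pc_crumb.magic   = CRUMB_MAGIC;
}

/* Call early after reset, once SRS shows a COP reset.
 * Returns 1 and fills *pc if a sample survived, otherwise 0.
 * The marker is cleared so a stale sample is not reported twice. */
int pc_crumb_fetch(uint32_t *pc)
{
    assert(pc != NULL);
    if (pc_crumb.magic != CRUMB_MAGIC) {
        return 0;
    }
    *pc = pc_crumb.last_pc;
    pc_crumb.magic = 0u;
    return 1;
}
```

After a power-on reset the RAM content is random, so the magic word is what separates a real sample from noise. A short ring of the last few PCs would give more context at the cost of a little more RAM.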

Otherwise it might even be possible that the CPU isn't executing the code you think it should. If there's something wrong with the FLASH timing or programming then it might be reading bad values occasionally on startup, providing the wrong instruction or address. Highly unlikely, but possible. Can you run periodic checksums of the code and data in FLASH to see if it changes after power-on?
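
A sketch of a periodic checksum over a memory region. The algorithm (rotate left by one, then add the byte) is just a simple choice for illustration, not the checksum used by the internal boot code, and region_checksum is a hypothetical name. On target the start address and length would come from the linker map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate-and-add checksum over len bytes starting at start.
 * The rotate makes the result depend on byte order, and any single
 * changed byte always changes the result. */
uint32_t region_checksum(const uint8_t *start, size_t len)
{
    assert(start != NULL || len == 0u);
    uint32_t sum = 0u;
    for (size_t i = 0u; i < len; ++i) {
        sum = (sum << 1) | (sum >> 31);  /* rotate left by one bit */
        sum += start[i];
    }
    return sum;
}
```

Computing this once right after reset and again later, then comparing the two values (or comparing against a value stored at build time), would show whether the CPU ever reads different FLASH content.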

I see the internal boot already does this unless the "checksum bypass" is set. Do you have it checking the checksum or not? Could BLMS be doing anything? You could also experiment with the bits in the CPUCR.

You should also check your main clock for stability. I assume you're using a crystal. Maybe it isn't properly stable for some reason, like the wrong gain or a DC offset. You can run the "clock test" (CCSCTRL) at startup and later to see if it gets different results. Send other clocks to CLKOUT and see if they all remain stable.

Tom