I can't figure out my Lost CPU time.

ynaught · 12-03-2021

The symptom I have been chasing for a couple weeks now is driving me bonkers!!!

I am running a custom (in-house) OS on an MCF54417 CPU.

Whereas there may be "better" ways to get the job done, my solution for consistent timing on my transmitted UDP frames is to have an interrupt routine (PIT3, configured to fire every 1ms) that watches counters to determine when to transmit and does so if appropriate. 10ms transmit interval is typical, but we'd like to go faster, closer to 5 or 3ms intervals. However I cannot seem to go faster than 10ms without failure BECAUSE...

Sometimes, my PIT3 interrupt routine (which normally finishes in about 500us when it *does* transmit, about 10us when it doesn't) takes 2 or 3ms to complete! I initially observed this by driving one of my serial port RTS pins active at the start of the ISR and driving it inactive at the end, and watching this signal on my oscilloscope. The task of "transmitting a packet" is simplified as much as possible: It copies a template frame into a buffer, then copies new data into the data portion of the frame and then fixes the UDP and IP headers/checksums before transmitting (in other words, the *complexity* of its task is unchanging).
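The scope measurement described above could be complemented by a software record of the same envelope, so that slow runs are counted even when nobody is watching the scope. This is only a sketch: `isr_stats_t`, `isr_enter` and `isr_exit` are hypothetical names, and `now` is assumed to come from some free-running 32-bit counter on the target (which timer that is, and its tick rate, are not stated in the post).

```c
#include <stdint.h>

/* Hypothetical per-ISR statistics; none of these names come from the post. */
typedef struct {
    uint32_t calls;       /* completed ISR runs                        */
    uint32_t max_ticks;   /* longest run seen, in counter ticks        */
    uint32_t over_limit;  /* runs longer than limit_ticks              */
    uint32_t limit_ticks; /* threshold that counts as a "slow" run     */
    uint32_t start;       /* counter value latched at ISR entry        */
} isr_stats_t;

/* Call first thing in the ISR, where the RTS pin is driven active. */
static void isr_enter(isr_stats_t *s, uint32_t now)
{
    s->start = now;
}

/* Call last thing in the ISR, where the RTS pin is driven inactive. */
static void isr_exit(isr_stats_t *s, uint32_t now)
{
    /* Unsigned subtraction stays correct across one counter wrap. */
    uint32_t elapsed = now - s->start;

    s->calls++;
    if (elapsed > s->max_ticks)
        s->max_ticks = elapsed;
    if (elapsed > s->limit_ticks)
        s->over_limit++;
}
```

With a 1 MHz counter, for example, `limit_ticks = 1000` would flag every run longer than 1ms, and `over_limit / calls` gives a rough rate for the slow runs.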

In the course of investigating this, I realized that my "Realtime clock" maintained by the OS (using PIT0 @ 1kHz to count time) was *also* losing time (to the tune of 4 minutes 30 seconds per day!!!)
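To put that drift in terms of ticks: 4 minutes 30 seconds is 270 seconds, which at 1kHz is 270,000 missed ticks per day, or a little over 3 per second. The helpers below (hypothetical names, written only for this arithmetic) work that out. One plausible reading, offered as an assumption rather than a diagnosis, is that a timer which can hold only one pending interrupt will drop ticks whenever interrupts stay masked for more than one tick period, which a 2 or 3ms handler would do.

```c
#include <stdint.h>

#define TICK_HZ          1000u   /* PIT0 tick rate given in the post */
#define SECONDS_PER_DAY  86400u

/* Ticks that must have been missed to lose this many seconds per day. */
static uint32_t lost_ticks_per_day(uint32_t lost_seconds_per_day)
{
    return lost_seconds_per_day * TICK_HZ;
}

/* The same drift expressed in parts per million of elapsed time. */
static uint32_t drift_ppm(uint32_t lost_seconds_per_day)
{
    return (uint32_t)(((uint64_t)lost_seconds_per_day * 1000000u)
                      / SECONDS_PER_DAY);
}
```

For 270 seconds per day this gives 270,000 lost ticks and 3125 ppm, far more than crystal tolerance alone would normally explain.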

The Interrupts in my system are:
Int Src   Level   Purpose
PIT1      7       Supervisory (see ** below)
PIT0      6       TICK - maintains system timers and clock.
EPORT     6       Hardware interrupt from peer CPU
UART0     5       Serial port 1
UART2     5       Serial port 2
PIT3      4       UDP Data Tx function (the one I'm diagnosing)
ENET0     3       RxFrame - Mostly unused, evaluating time to send.
ENET0     2       TxFrame - Mostly unused, can timestamp Rx packets.
ICR62     2       Core Bus error handler (debug)

PIT3 handler has a low INT level (other Ints are more *urgent*) but as soon as its interrupt handler starts, it disables other interrupts by writing the SR to 0x2700. So, it should be impossible for this function to be interrupted by other interrupt functions (except for ones configured at Level 7).
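For readers less familiar with the status register: in 0x2700, bit 13 is the supervisor bit and bits 10-8 are the interrupt priority mask, here set to 7. The sketch below decodes that and applies the usual 68K/ColdFire rule that a request is accepted only if its level is above the mask, with level 7 always accepted. The function names are made up for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define SR_S_BIT        0x2000u  /* supervisor state               */
#define SR_IMASK_MASK   0x0700u  /* interrupt priority mask field  */
#define SR_IMASK_SHIFT  8

/* Extract the 3-bit interrupt priority mask from an SR value. */
static unsigned sr_interrupt_mask(uint16_t sr)
{
    return (unsigned)((sr & SR_IMASK_MASK) >> SR_IMASK_SHIFT);
}

static bool sr_supervisor(uint16_t sr)
{
    return (sr & SR_S_BIT) != 0;
}

/* Would an interrupt request at this level be taken with this SR? */
static bool interrupt_accepted(uint16_t sr, unsigned level)
{
    if (level == 7)
        return true;                  /* level 7 cannot be masked */
    return level > sr_interrupt_mask(sr);
}
```

With SR = 0x2700, `interrupt_accepted` is false for levels 1 through 6 (including the level 6 PIT0 tick) and true only for the level 7 PIT1, which matches the expectation stated above.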

** Thinking that another interrupt *must* be happening, I configured and enabled PIT1 (*only DURING the PIT3 handler*) to interrupt fairly frequently (about 62kHz) and had it peek up the stack to see if my PIT3 handler was somehow calling (or otherwise allowing to run) functions I did not intend to run. Every time the PC is in expected functions.

I also wondered if the Ethernet DMA'ing its frames to and from RAM could be affecting this. To evaluate that I triggered my oscilloscope on the aforementioned RTS envelope from the Int handler, and used another probe to monitor the RxD and TxD Clocks between the CPU and its Phy (actually a switch chip, but that's immaterial). I saw no correlation between the Ethernet activity and the "slowdowns". (In fact, the Enet frames were absurdly short in comparison to the Interrupt handler timeframe).

At this point I am stumped. Any suggestions for diagnosing this would be appreciated.

TomE · ‎12-04-2021

I think I've seen something like this before in this forum. I was able to find it searching. It is an interesting enough problem to have made it to an external blog:

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/mcf5441x-uart-dma/m-p/287763

http://unreasonablerocket.blogspot.com/2013/04/finding-firmware-bug.html

The problem was that the CPU was prefetching past a not-taken branch that was checking for a null function pointer. When the pointer was null the CPU would still try to prefetch the NULL and the CPU would then die until the bus timeout timer went off. At which point, 0.5ms later, the CPU would start running again.

Something similar might be happening to you.

Do you have a debug pod attached? If you want to find out where the CPU is spending a lot of time, just keep stopping the CPU in the debugger and see if it is in some code more than you'd expect. After 100 breaks you might spot a pattern.
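If no pod is handy, the same sampling can be done in software, for instance from the level 7 PIT1 interrupt the original post already uses to peek at the stacked PC. A rough sketch, assuming the sampler can read the interrupted PC and that you know the address ranges of the functions of interest (`pc_range_t`, `pc_sample` and the addresses in any table you build are all hypothetical):

```c
#include <stdint.h>
#include <stddef.h>

/* One address range to account samples against. */
typedef struct {
    uint32_t start;  /* first address of the function         */
    uint32_t end;    /* one past the last address             */
    uint32_t hits;   /* samples that landed in [start, end)   */
} pc_range_t;

/* Samples that matched none of the listed ranges. */
static uint32_t pc_unexpected;

/* Record one sampled program counter value. */
static void pc_sample(pc_range_t *ranges, size_t n, uint32_t pc)
{
    for (size_t i = 0; i < n; i++) {
        if (pc >= ranges[i].start && pc < ranges[i].end) {
            ranges[i].hits++;
            return;
        }
    }
    pc_unexpected++;
}
```

The point of keeping counts, rather than only checking that the PC is somewhere expected, is that a function collecting far more hits during a slow run than during a normal one is a candidate for where the CPU is stalling.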

It could be stuck waiting for slow memory, slow IO or it might be looping in an interrupt routine because you're not clearing it properly. Put counters in all of your ISRs (and in *ALL* of your exception handlers) and dump the counts once per minute.
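Something along these lines would do for the counters. It is a sketch only: the source list mirrors the interrupt table in the original post plus a catch-all slot, the names are invented, and `fprintf` stands in for whatever console output the in-house OS provides.

```c
#include <stdint.h>
#include <stdio.h>

/* One slot per interrupt or exception source (hypothetical names). */
enum {
    SRC_PIT0, SRC_PIT1, SRC_PIT3, SRC_EPORT, SRC_UART0, SRC_UART2,
    SRC_ENET_RX, SRC_ENET_TX, SRC_BUS_ERROR, SRC_OTHER, SRC_COUNT
};

static const char *const src_name[SRC_COUNT] = {
    "PIT0", "PIT1", "PIT3", "EPORT", "UART0", "UART2",
    "ENET_RX", "ENET_TX", "BUS_ERROR", "OTHER"
};

static volatile uint32_t isr_count[SRC_COUNT]; /* bumped inside handlers  */
static uint32_t isr_prev[SRC_COUNT];           /* snapshot from last dump */

/* Put this at the top of each ISR and exception handler. */
#define ISR_COUNT(src) (isr_count[(src)]++)

/* Print the counts since the previous dump; returns their sum. */
static uint32_t isr_dump(FILE *out)
{
    uint32_t total = 0;

    for (int i = 0; i < SRC_COUNT; i++) {
        uint32_t now = isr_count[i];
        uint32_t delta = now - isr_prev[i];  /* wrap-safe difference */

        isr_prev[i] = now;
        fprintf(out, "%-10s %10lu\n", src_name[i], (unsigned long)delta);
        total += delta;
    }
    return total;
}
```

Called once per minute, a 1kHz PIT0 should show close to 60000; a noticeably lower figure means ticks are being dropped, and an unexpectedly large figure on any other line points at a handler that is re-entering because its source is not being cleared.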

The MCF54417 should be running at "up to" 250MHz and that means roughly 250 MIPS. Unless you haven't set up the PLL properly in which case it might be limping along. Unless you haven't set up the instruction cache properly, in which case it may be running from external FLASH with 63 wait states at whatever clock rate the Flexbus is running at.

You're running Ethernet, so how are you doing cache coherency with that? Are you flushing, running with the data cache disabled or do you have the control rings and buffers in uncached internal static RAM?

There are lots of things to get wrong with these things. The first time I wrote code on a PPC (MPC860) without the caches set up it ran slower than the 68000 it replaced.

Tom