MCF52259 Overclocking

narcisnadal · 04-14-2022

Hi All,

Has anyone done a study on overclocking an MCF52259?

In my company I usually work with an MCF52259CAG80 board with a 48MHz fundamental crystal, 80MHz internal frequency and external SRAM, and I recently spent a few minutes testing PLL settings to run it at 96MHz. I didn't put the board in our 90ºC oven, but the results looked very good at room temperature.

It would be interesting to know whether anyone has tried different crystals or PLL settings to approach the chip's real maximum frequency, running from flash or internal RAM, which must be well above the 80MHz given in the datasheet.

This is not about testing USB or Ethernet, only the MCU speed when running from flash or internal RAM.

Thanks

Narcis Nadal

TomE · 04-14-2022

This would be an interesting thing for a hobbyist to try, but you'd never want to do it for anything in production. The part is only guaranteed for a specific range. Anything outside of that might work "in one batch at one temperature", but might fail randomly in the future.

If you need more performance out of the part, make the code better. It is very rare that you can't get some (or a LOT) of improvement by finding out where the CPU is spending its time and then making those parts better. Faster. Smarter. That can get you far higher performance than overclocking can.

For the parts I'm familiar with (MCF5235, 150MHz with cache and external RAM), moving things into internal RAM makes a big difference. The MCF52259 doesn't have cache, but you ARE using external SRAM. So make sure the stack (or if multithreaded, the main stack or the most used ones) is in INTERNAL SRAM. That gets you single-clock access to that data. Find variables that are being used a lot. Put them in internal SRAM.

I was working on code that ran on multiple different models of devices. At a lot of points the code had "if (IS_TYPEA) then ... else if (IS_TYPEB)...". Unfortunately "IS_TYPEA" turned into comparing a hardware number against SEVEN different values. Getting the Serial Number (which used a lot of these things) generated 35 instructions. Precalculating all of the options into a bitfield in static memory saved 30% of the event loop execution time! The code was spending 1/3 of its time just working out what model it was.
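A minimal sketch of that precalculation, assuming made-up hardware numbers and only two model types (the real product's IDs and flag names are not given in this thread, so every identifier here is hypothetical):

```c
#include <stdint.h>

/* Hypothetical feature flags, one bit per model family. */
enum {
    MODEL_IS_TYPEA = 1u << 0,
    MODEL_IS_TYPEB = 1u << 1
};

/* Cached result. On the target this belongs in internal SRAM. */
static uint32_t g_model_flags;

/* Hypothetical hardware numbers that count as "type A" and "type B". */
static const uint16_t type_a_ids[] = {
    0x0101, 0x0102, 0x0107, 0x0110, 0x0111, 0x0120, 0x0121
};
static const uint16_t type_b_ids[] = { 0x0201, 0x0202 };

static int id_in_list(uint16_t id, const uint16_t *list, unsigned count)
{
    for (unsigned i = 0; i < count; i++) {
        if (list[i] == id)
            return 1;
    }
    return 0;
}

/* Run once at start-up: all the comparisons happen here,
 * not in the event loop. */
void model_flags_init(uint16_t hw_id)
{
    g_model_flags = 0;
    if (id_in_list(hw_id, type_a_ids,
                   sizeof type_a_ids / sizeof type_a_ids[0]))
        g_model_flags |= MODEL_IS_TYPEA;
    if (id_in_list(hw_id, type_b_ids,
                   sizeof type_b_ids / sizeof type_b_ids[0]))
        g_model_flags |= MODEL_IS_TYPEB;
}

/* Hot-path checks become one load and one bit test. */
#define IS_TYPEA ((g_model_flags & MODEL_IS_TYPEA) != 0)
#define IS_TYPEB ((g_model_flags & MODEL_IS_TYPEB) != 0)
```

Call model_flags_init() once after reading the hardware number; existing "if (IS_TYPEA)" code then stays unchanged.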

Check the "memcpy()" function. A lot of them are really slow and reduce to "byte at a time" when misaligned. This CPU supports really fast copies using MOVEML. I've usually written my own "custom memory move" for when it matters. Otherwise, stop moving memory around. Code it so it uses data where it is. For DRAM, the fastest way to copy data from DRAM to DRAM is via the SRAM. Yes, copying it twice is faster than copying it once. This doesn't apply in your case.

It is worth fully profiling the code. If you have the right debugger you should be able to do this easily. Something as simple as running with the debugger and stopping the CPU a lot of times (to get lots of data points) and then record where in the code you've interrupted it the most. Then find out why it is there and if you can make that part more efficient. With external RAM you could do what I do, which is to have an array the size of the code and run a fast (100kHz) interrupt sampling the PC and incrementing the array according to the PC interrupted. Then combine that with the symbols in the MAP file to find which functions it is spending the time in.
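A rough sketch of the sampling side, assuming a small made-up code window and one 32-bit counter per 16-bit instruction slot. The base address, window size and function names are illustrative, and reading the interrupted PC out of the exception stack frame is target-specific and not shown:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed code region; take the real values from the linker MAP file. */
#define CODE_BASE 0x80130000u
#define CODE_SIZE 0x00001000u   /* 4 KB window for this sketch */

/* ColdFire instructions are 16-bit aligned, so one counter per 2 bytes. */
static uint32_t pc_hist[CODE_SIZE / 2];
static uint32_t pc_out_of_range;

/* Call from the fast timer ISR with the interrupted PC. */
void profile_sample(uint32_t pc)
{
    uint32_t offset = pc - CODE_BASE;   /* wraps to a huge value if pc < base */
    if (offset < CODE_SIZE)
        pc_hist[offset / 2]++;
    else
        pc_out_of_range++;
}

/* Return the address with the most samples, or 0 if nothing was sampled. */
uint32_t profile_hottest(uint32_t *count_out)
{
    size_t best = 0;
    uint32_t best_count = 0;
    for (size_t i = 0; i < CODE_SIZE / 2; i++) {
        if (pc_hist[i] > best_count) {
            best_count = pc_hist[i];
            best = i;
        }
    }
    if (count_out)
        *count_out = best_count;
    return best_count ? CODE_BASE + (uint32_t)(best * 2) : 0;
}
```

With these choices the histogram needs twice as many bytes as the code it covers. Coarser buckets (one counter per 4 or 16 bytes) trade resolution for memory.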

Minimise floating point arithmetic. Minimise use of "divide" especially in code executed a lot. Use 16 bit integers where you can as you might get the compiler using the CPU's "MUL" instruction instead of calling the multiply libraries. And so on.
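As a small illustration of that advice, here is a hypothetical scaling routine (a 12-bit reading converted to millivolts with a 3300 mV full scale; the function and its values are invented for the example). It uses 16-bit operands and a shift instead of floating point and a divide. Whether the compiler really emits a single 16 x 16 multiply depends on the compiler and its options:

```c
#include <stdint.h>

/* Hypothetical example: scale a 12-bit reading to millivolts,
 * assuming 3300 mV full scale.
 * The floating point version would be raw * 3300.0 / 4096.0. */
static inline uint16_t raw_to_mv(uint16_t raw)
{
    /* 16 x 16 -> 32-bit multiply, then a shift instead of "/ 4096". */
    uint32_t product = (uint32_t)raw * 3300u;
    return (uint16_t)(product >> 12);
}
```

The result truncates rather than rounds, which is usually acceptable for this kind of scaling.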

We never did work out what this problem was:

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/MCF52259-overheated/m-p/215614

Tom

narcisnadal · 04-16-2022

Thanks for your answer, Tom. In fact I have my own memmove() which checks whether the block lengths are multiples of 2 or 4 bytes and the addresses are aligned, so it can take advantage of word or long moves, without MOVEM.L/W, which is not interruptible.

I have also tried using the DMA to run concurrently with the ALU, with good results.

I have also tried copying code to internal RAM, sometimes with surprises in execution time depending upon the address where the code is located, or maybe it is my imagination.

I usually work in asm with fast inner loops, with stack and variables in internal RAM, and just need 20-30% more power.

I think I understand what you are saying: the part is certified at 80MHz @ 85ºC. But I suspect that execution can be sped up in flash, and much more in internal RAM, so I am evaluating the possibility of changing the PLL parameters a little for flash, or a little more for some inner loops executed in internal RAM.

To check the speed of a function, I use a timer with ns resolution; I imagine that is what you are saying.

Narcis Nadal

TomE · 04-16-2022

> in fact I have my own memmove() which checks whether the block lengths are
> multiples of 2 or 4 bytes and the addresses are aligned, so it can take advantage
> of word or long moves, without MOVEM.L/W, which is not interruptible.

The first thing the code should do is check to see if the copy is long enough (16 bytes or more) to bother with running more code to work out how best to do the copy. Then the usual thing is byte-copies until source or destination (doesn't matter in your case) are 4-byte aligned and then do as many 4/8/16 unrolled or MOVEML copies as it can, before finishing up the end with word and then byte copies until finished.
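A sketch of that sequence in portable C, for non-overlapping buffers only. The name fast_copy and the load32/store32 helpers are made up for this example. On the ColdFire a plain long move handles misaligned addresses in hardware; the fixed-size memcpy() calls in the helpers are only there to keep the sketch valid C on any host, and compilers usually reduce them to a single move. The tail here is finished with byte copies only, and a hand-written version would use MOVEML for the unrolled body:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load/store 32 bits without assuming alignment (see note above). */
static inline uint32_t load32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static inline void store32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

/* Copy n bytes from src to dst; the regions must not overlap. */
void *fast_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n >= 16) {   /* short copies skip the set-up work entirely */
        /* Byte copies until the destination is 4-byte aligned (max 3). */
        while (((uintptr_t)d & 3u) != 0) {
            *d++ = *s++;
            n--;
        }
        /* Unrolled body: 16 bytes per pass. */
        while (n >= 16) {
            uint32_t a = load32(s);
            uint32_t b = load32(s + 4);
            uint32_t c = load32(s + 8);
            uint32_t e = load32(s + 12);
            store32(d, a);
            store32(d + 4, b);
            store32(d + 8, c);
            store32(d + 12, e);
            s += 16;
            d += 16;
            n -= 16;
        }
        /* Remaining whole longs. */
        while (n >= 4) {
            store32(d, load32(s));
            s += 4;
            d += 4;
            n -= 4;
        }
    }
    /* Tail bytes, or the whole of a short copy. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```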

You don't worry about misalignment with this CPU. It can perform misaligned transfers faster than anything you can do. If you're performing 8 or 16-bit copies because of 8 or 16-bit misalignment, then you're going really slowly for no reason.

What's the problem with not being interruptible? A four-register MOVEML takes one clock for the instruction fetch and four for the reads or writes. Lots of normal instructions take the same. MOVEC takes NINE clocks. DIVSW takes 20 and DIVSL takes 35. Your function preambles are probably using MOVEML to stash the registers on the stack they're going to use in the function. At least MUL only takes 4 clocks on this CPU. Can you put the EMAC to any good use?

> sometimes with surprises in execution time depending upon the address

That may be the compiler doing something unexpected that makes it take longer. Are you using CW or are you using GCC?

You might be able to get your extra speed by using gcc (with it unrolling everything).

> but I suspect that execution can be sped up in flash, and
> much more in internal RAM,

80MHz is hard. The higher clock rates you see in more modern parts are hard won, and often due to pipelining. Yes, modern DDR3 runs at 266MHz, but it takes 6 or more clocks before a read returns data. This SRAM is returning data on the SAME clock. Go look for CFMCLKSEL. Note the Flash is already set at the factory for 1-1-1-1 or 2-1-1-1? Someone's been working hard to make it as fast as possible. Note how the FLASH is double-bank-interleaved? That means its access time is really 40MHz and it reads 64 bits at once to be able to read at 80MHz. Also the FLASHBAR[AFS] bit where it prefetches? That means it is trying hard to keep up. That means two things - linear code is faster than code with a lot of branches (where the prefetch fails) and "they're already pushing it at 80MHz". I wouldn't try to go any faster EXCEPT as a "hobby" and not for "production". Remember it is going to start by randomly corrupting data rather than hard failing when it can't keep up at some temperature or some data access pattern.

> To check the speed of a function, I use a timer with ns
> resolution; I imagine that is what you are saying.

What I'm saying is that I have an interrupt service routine sample the program counter, and then print out the results, like this:

log_filt_init_normal           1034   0.54%
memcmp                         1074   0.56%
mip_udpconn_init               2141   1.12%
can_comm                       2255   1.18%
mip_pbuf_decref                3274   1.72%
pit0_isr                       3336   1.75%
do_pin_cal                     3444   1.81%
get_time_us                    3620   1.90%
vimcom_cycle                   7315   3.85%
log_usb_data_send              8458   4.45%
log_to_flash                   9311   4.90%
main_loop                     10279   5.41%
memcpy                        10919   5.74%
fec_rx_isr                    21048  11.08%
main_proc_loop.lto_priv.383   93529  49.23%

They're the function names on the left. The next column is the number of samples. The percentage is the execution time of that function as a percentage of the whole. The Ethernet interrupt is the function taking the most time (the 49% item is the idle loop).

The raw data (that is captured and is used to generate reports like the above) looks like this:
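One way to get from per-address sample counts to per-function totals like these is sketched here. It assumes the symbols have already been extracted from the MAP file into a table sorted by start address; the struct layouts and function names are invented for the example and are not the tool that produced this report:

```c
#include <stddef.h>
#include <stdint.h>

struct symbol { uint32_t addr; const char *name; };  /* sorted by addr */
struct sample { uint32_t addr; uint32_t count; };

/* Index of the last symbol whose start address is <= pc,
 * or -1 if pc lies below every symbol. */
int find_symbol(const struct symbol *syms, size_t nsyms, uint32_t pc)
{
    size_t lo = 0, hi = nsyms;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (syms[mid].addr <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (int)lo - 1;
}

/* Add each sample to the function containing it.
 * totals[] must have nsyms entries. Returns the grand total,
 * which is the divisor for the percentage column. */
uint32_t accumulate(const struct symbol *syms, size_t nsyms,
                    const struct sample *samples, size_t nsamples,
                    uint32_t *totals)
{
    uint32_t grand = 0;
    for (size_t i = 0; i < nsyms; i++)
        totals[i] = 0;
    for (size_t i = 0; i < nsamples; i++) {
        int s = find_symbol(syms, nsyms, samples[i].addr);
        if (s >= 0)
            totals[s] += samples[i].count;
        grand += samples[i].count;
    }
    return grand;
}
```

Each function's percentage is then 100 * totals[i] / grand, and sorting by total gives the ordering used in the report.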

0x80130734 = 70
0x80130736 = 748
0x8013073a = 3
0x8013073e = 166
0x80130742 = 83
0x80130746 = 538
0x80130748 = 3
0x8013074c = 460
0x8013074e = 47
0x80130752 = 33730
0x80130756 = 114427
0x80130758 = 9881
0x8013075e = 31205
0x80130760 = 21261
0x80130762 = 9375
0x80130764 = 39
0x80130768 = 26
0x8013076e = 388
0x80130770 = 18
0x80130774 = 27
0x8013077a = 273
0x8013077e = 110
0x8013078c = 28

That says the instruction at "0x80130756" is taking the most time (it was sampled 114427 times) so then I look at a disassembly to find out why that one is taking so long. That lets me work out exactly where within a function the time is going. I can then make changes and measure the difference that change made. The above often finds that the function preamble and postamble (pushing registers to the stack and getting them back) takes longer to execute than anything else in some functions. That's why it is worth in-lining simple functions if the compiler isn't smart enough to do that automatically. Here's an example of that where the samples per instruction are in the left column (the 287 samples are the time taken by the "moveml" on the previous line):

     801110fa <rijndael_ecb_encrypt>:
59:  801110fa: 720a moveq #10,%d1
42:  801110fc: 4e56 ff94 linkw %fp,#-108
56:  80111100: 226e 0010 moveal %fp@(16),%a1
56:  80111104: 48d7 3cfc moveml %d2-%d7/%a2-%a5,%sp@
287: 80111108: 206e 0008 moveal %fp@(8),%a0
60:  8011110c: 2029 01e0 movel %a1@(480),%d0
59:  80111110: b280 cmpl %d0,%d1
9:   80111112: 6700 063a beqw 8011174e <rijndael_ecb_encrypt+0x654>

Tom

narcisnadal · 04-17-2022

The method of copying with MOVEM is interesting and I may use it in the future, I will test if it is better than with DMA.

To your question, I work with CodeWarrior, which is complicated with Windows 10 because I have to open a virtual session to be able to use the BDM.

I program in ASM the fast loops.

The sampling of the PC is also interesting, but it must take up a lot of memory if you have instruction-level resolution. Do you have a lot of external memory?

Maybe I should compile some functions with GCC, because I don't want to buy it, now that we are reaching cpu maturity. Is there a free version that you would recommend?

Thanks a lot for your answers.

Narcis Nadal

TomE · 04-17-2022

Previous posts on GPIO and peripheral register delays, FlexBUS delays and DMA:

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Flexbus-delay/m-p/226366

Other things to check with SRAM and external buses:

https://community.nxp.com/thread/62389

This one says accessing GPIO registers takes 12 CPU clocks - that may be happening to your peripheral register accesses:

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/overcoming-the-12-cycle-GPIO-waitstate-fo...

GPIO specific measurements on the MCF52259:

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/MCF5225x-peripheral-access-speed/m-p/1674...

Read the quote from John Bodnar in this one:

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/excution-time/m-p/155301

Tom

TomE · 04-17-2022

Narcisnadal, you had a post in this thread about 15 minutes ago, and now it has disappeared. Maybe you deleted it. Here are some answers to it anyway.

I see you've been using this chip since 2011 (in the overheating thread). It looks like there was probably bus contention/overlap between the CPU timing and those of the 74AHCT574 causing the overheating. That should be visible with a good oscilloscope if you want proof. Anyway.

What was that about MOVEML and interrupts again?

DMA controllers are surprisingly slow. They are really only good for copying to and from peripherals, replacing interrupt driven data transfer. So if your CPU has to wait for the DMA transfer to complete, it is taking way too long. The CPU can always copy data faster. Does this chip have a crossbar? Yes, that's the Arbiter in this chip. That means that the DMA controller can be transferring to and from external RAM during the same cycles that the CPU is running from Flash and accessing the internal SRAM. If the DMA controller is accessing the internal SRAM then those cycles might block the CPU unless the CPU is using one half while the DMA is using the other.

It is also very common for peripheral register reads and writes to take a LOT more CPU cycles than you'd expect due to intervening bus controllers and the peripherals running on slower clocks than the CPU. So it may take a lot longer to load the DMA controller registers (to set up the memory copy) than you might expect (like 40 or 50 clocks or more). I'd suggest you measure this with your nanosecond timer.

Are you performing any CRCs or checksums of network data? That can always be optimised. You can perform checksums while copying and not as a separate cycle.
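A sketch of "checksum while copying", assuming the 16-bit ones'-complement Internet checksum (RFC 1071 style) as the example. The thread does not say which checksum is in use, and copy_and_checksum is a made-up name. The bytewise loop is written for clarity; a tuned version would move longs as discussed earlier in the thread:

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes from src to dst and return the 16-bit ones'-complement
 * checksum of the data in the same pass. The 32-bit accumulator is
 * adequate for packet-sized buffers (up to roughly 128 KB). */
uint16_t copy_and_checksum(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint32_t sum = 0;

    while (n >= 2) {
        d[0] = s[0];
        d[1] = s[1];
        sum += ((uint32_t)s[0] << 8) | s[1];   /* big-endian 16-bit word */
        d += 2;
        s += 2;
        n -= 2;
    }
    if (n) {                                   /* odd trailing byte */
        *d = *s;
        sum += (uint32_t)*s << 8;
    }
    while (sum >> 16)                          /* fold carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The data is read once instead of twice, which matters most when the source sits in slow external memory.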

How slow is your external SRAM? Is it 8 or 16 bits wide? How many clocks for a 32-bit read and write? Avoiding those would be good (but the DMA controller might be a good match for copying to and from that memory).

The QSPI is really good. If you're moving a lot of SPI data, don't have the CPU move one word at a time while waiting. Set up the QSPI and let it do that and interrupt when finished.

GCC is free. We run it on Linux to compile for our ColdFire chips. That's a lot of work to set up.

I'm working on an MCF5235 running at 150MHz CPU speed with 75MHz bus clock with code and data cache, 64k internal SRAM, 512M external 32 bit wide Flash and 16M external 32 bit wide DRAM. With about 900k of firmware in Flash but executed out of DRAM.

Tom