MCF5307, execution speed question

angelo_d
Senior Contributor I

Hi all,

I have an MCF5307-based custom board set up with a 45 MHz oscillator, a 90 MHz CPU clock (x2) and a 45 MHz BUS_CLK (div. 2).

 

I am measuring the duration of some basic instructions with an oscilloscope to see if the execution speed is really 90 MHz, since I have some doubts about this.

 

move.w #0xAAAA, (ADDR_PADAT)   (I am producing a square wave on PADAT)

For this single "move.w #xxx, (A)" I get 360 ns with the cache disabled, or 132 ns with the cache enabled or when running from internal RAM.

This instruction should take 2 clocks, so I was expecting 11.1 ns x 2 = 22 ns.

 

Then I measured a single short jmp, again executing from internal RAM. It takes 50 ns.

But I was expecting 11 ns (1 clock).

 

Then i measured the output clock on PSTCLK (delayed CPU clock). It is 90Mhz, as expected.

 

Can someone explain these long instruction timings?

 

Many thanks

Angelo
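For reference, the measured times above can be converted to CPU clocks at the 90 MHz seen on PSTCLK. This is only a minimal C sketch of that arithmetic; the function names are made up for illustration and do not come from any Freescale header.

```c
#include <assert.h>

/* Convert a measured duration in nanoseconds into CPU clocks at cpu_hz. */
double ns_to_clocks(double ns, double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return ns * 1e-9 * cpu_hz;
}

/* Duration in nanoseconds of a given number of CPU clocks at cpu_hz. */
double clocks_to_ns(double clocks, double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return clocks * 1e9 / cpu_hz;
}
```

With cpu_hz = 90e6, 360 ns comes out at about 32.4 clocks, 132 ns at about 11.9 clocks and 50 ns at 4.5 clocks, against the 2 clocks (about 22 ns) and 1 clock (about 11 ns) that were expected.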

TomE
Specialist II

It isn't the CPU timing you're measuring; it is most likely the SLOW GPIO peripheral timing.

This isn't documented anywhere. This is something you have to reverse-engineer after deciding on a chip.

I've just checked the MCF5307 manual, and it doesn't document the timing.

This has been addressed in newer chips. Read the "Rapid GPIO" chapter for a chip like the MCF51QE128. The intro says:

"The Rapid GPIO (RGPIO) module provides a 16-bit general-purpose I/O module directly connected to the processor’s high-speed 32-bit local platform bus. This connection to the processor’s high-speed platform bus plus support for single-cycle, zero wait-state data transfers allows the RGPIO module to provide improved pin performance when compared to more traditional GPIO modules located on the internal slave peripheral bus."


That doesn't really say that the normal GPIOs are glacially slow, but the term "slave peripheral bus" usually means "all the peripherals are on a bus bridge controller that connects the fast CPU bus to the peripherals, which run on a MUCH slower clock".


Type "slow GPIO" into the Search Box at the upper right of this page. One post will tell you about the known "12 wait states on GPIO access on MCF52255". The worst I've come across was an ARM-core chip that threw 200 wait states on GPIO because the CPU ran at 400MHz and the GPIO ran a multi-state state machine running from 10MHz.


I've just found an excellent previous reference, giving reasons, timing and references, here:


Re: excution time


I would paste some excerpts from there into here, but you can't paste any more than one line into this forum without it going crazy. I hope they fix that bug soon.


Here are some one-line excerpts and previous measurements:

I've measured on an MCF5329:

So a write to a port register takes about 76ns or about 18 CPU clocks and a "port |= bit" read-modify-write takes about 33 clocks.

"This is a limitation of the platform architecture used on the MCF528x, MCF523x, MCF521x, MCF522xx, and MCF5270/1/4/5. The PORTS module resides on the other side of a bus bridge that runs at half the platform clock rate.  That means that for a 150 MHz MCF5234, the PORTS block is running in a 37.5 MHz clock domain, and that doesn't help."

MCF523x quote:

"Platform peripherals (defined as FEC, UARTs, QSPI, I2C, and DTIMs) on the MCF523x run at the platform clock frequency, which is 1/2 the CPU clock frequency.  AFAIK, these accesses incur no wait states.

Off-platform peripherals (defined as eTPU, FlexCAN(s), PITs, EPORT, GPIO, and crypto blocks) run at 1/2 the platform clock frequency, which is 1/4th the CPU clock frequency.  As noted by others, accesses to the PORT registers require 3 clock cycles, a remnant of when we used to support port replacement units for emulation purposes."
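The clock-domain arithmetic in these two quotes can be sketched in C as follows. This is a rough illustration only: the function names are invented, and it assumes the "3 clock cycles" for a PORT access are counted in the slow off-platform domain, which the quote does not actually say.

```c
#include <assert.h>

/* Clock rate of a domain that runs at the CPU clock divided by `divider`. */
double domain_hz(double cpu_hz, unsigned divider)
{
    assert(cpu_hz > 0.0 && divider > 0);
    return cpu_hz / divider;
}

/* Time in ns taken by `cycles` cycles of that divided-down domain. */
double access_ns(unsigned cycles, double cpu_hz, unsigned divider)
{
    return cycles * 1e9 / domain_hz(cpu_hz, divider);
}

/* The same access expressed in CPU clocks (assumption: the CPU simply
   waits for the slow domain for the whole access). */
unsigned access_cpu_clocks(unsigned cycles, unsigned divider)
{
    return cycles * divider;
}
```

For the 150 MHz MCF5234 in the quote, a divider of 4 gives the 37.5 MHz PORTS clock, and under the assumption above a 3-cycle access would last 80 ns, or 12 CPU clocks.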


If you want to measure CPU speed, program a timer to free-run and then read the counter before a series of instructions, then again after, and subtract the two. Make sure you also read the timer twice in a row and see how long reading the timer takes - it is likely to be as slow as writing to the GPIO is.
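A minimal C sketch of that timer-based measurement is shown here. Everything named in it is hypothetical: the timer is passed in as a function pointer because the real register differs per chip, the counter is assumed to be a 32-bit free-running up-counter (a narrower counter would need the difference masked to its width), and fake_read_timer / fake_code exist only so the logic can be tried on a host PC.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*read_timer_fn)(void);

/* Ticks between two reads of a free-running up-counter. Unsigned
   subtraction gives the right answer across a single wrap-around. */
uint32_t elapsed_ticks(uint32_t before, uint32_t after)
{
    return after - before;
}

/* Cost of reading the timer itself: two reads in a row. */
uint32_t timer_read_overhead(read_timer_fn read_timer)
{
    uint32_t t0 = read_timer();
    uint32_t t1 = read_timer();
    return elapsed_ticks(t0, t1);
}

/* Ticks spent in code(), with the timer read cost subtracted. */
uint32_t measure_ticks(read_timer_fn read_timer, void (*code)(void))
{
    uint32_t overhead, t0, t1, raw;

    assert(read_timer != 0 && code != 0);
    overhead = timer_read_overhead(read_timer);
    t0 = read_timer();
    code();
    t1 = read_timer();
    raw = elapsed_ticks(t0, t1);
    return raw > overhead ? raw - overhead : 0;
}

/* Host-side stand-ins (hypothetical): each timer read costs 4 ticks and
   the code under test costs 100 ticks. */
uint32_t fake_now;
uint32_t fake_read_timer(void) { fake_now += 4; return fake_now; }
void fake_code(void) { fake_now += 100; }
```

On real hardware, read_timer would return the chip's timer counter register and code would be the instruction sequence being timed; the result is in timer ticks and still has to be scaled by the timer's input clock.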


ARM and PPC chips have dedicated timers in the CPU that can be used for measuring CPU timing without these problems. ColdFire chips don't, so you have to rewrite the measurement code for each different chip and then "calibrate" it as well.



When you've taken your measurements of CPU speed and the GPIO Read and Write speed, post them back to this thread.


Tom


angelo_d
Senior Contributor I

Hi Tom,

many thanks,

I am very surprised that changing the level on GPIOs is so slow. In the end they are TTL circuits, and I don't see why they cannot change level as fast as a chip select can, but I accept that they are slow, as pointed out in your reply.

So, for example, if I take 2 clocks (from the manual) + 3 for the GPIO access, we are at 5 clocks; even at 11 x 6 = 66 ns that is still not the 122 ns I am seeing.

Also, I am sure the remaining additional time I see in the square wave on the oscilloscope is the "jmp" instruction:

#ifdef CPU_SPEED_TEST
    asm volatile (
    "start:                    \n\t"
    "move.w  #0xAAAA,(%0)    \n\t"
    "move.w  #0x5555,(%0)    \n\t"
    "jmp     start            \n\t"
    : : "a"(MCFSIM_PADAT) : );
#endif


The "jmp start" "must" take one clock, but I see 44 ns, which is still really strange as it is 4 times the expected value.

Regards,

Angelo

TomE
Specialist II

> The "jmp start" "must" take one clock,

If you read section 2.7 in the Reference Manual you'll find that the BRANCH instruction takes one clock, but the JUMP takes FIVE clocks. So replace the "jmp" with a "bra" and it should go faster.

But read the note that says "Assumes branch acceleration. Depending on the pipeline status, execution times may vary from 1 to 3 cycles.". These are complicated chips, so YMMV.

Note in that section that BYTE and WORD instructions are often one clock slower than LONG ones, although not in the case of the "move.w #xxxx" you're using in your test.

You should try 4 or 5 copies of the "move.w" pair. You may be able to see the effect of the "jmp" more clearly then.
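One way to use such a test is to measure the loop period with two different numbers of "move.w" pairs and solve for the two unknowns. The C sketch below assumes a simple linear model, period = 2 * pairs * move time + jump time, which ignores pipeline effects; the struct and function names are invented for illustration.

```c
#include <assert.h>

/* Per-instruction times recovered from two loop measurements. */
typedef struct {
    double move_ns;   /* time of one move.w to the port */
    double jump_ns;   /* time of the jmp/bra closing the loop */
} loop_timing;

/* Solve period = 2 * pairs * move_ns + jump_ns from two measurements
   taken with different numbers of move.w pairs in the loop. */
loop_timing split_loop_period(unsigned pairs1, double period1_ns,
                              unsigned pairs2, double period2_ns)
{
    loop_timing t;

    assert(pairs1 != pairs2);
    t.move_ns = (period2_ns - period1_ns)
                / (2.0 * ((double)pairs2 - (double)pairs1));
    t.jump_ns = period1_ns - 2.0 * pairs1 * t.move_ns;
    return t;
}
```

As a purely made-up example (not a measurement), periods of 300 ns with 1 pair and 1340 ns with 5 pairs would give 130 ns per move.w and 40 ns for the jump.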

> i am very surprised that changing level on gpios is so slow, at the end they are TTL circuits,

Technically for the last 40 years at least they've been CMOS and not TTL. :-)

They're a lot more complex than "a bunch of gates". Most of them are complicated clocked state machines.

Many of the peripheral blocks (timers, UARTs, ADCs) are the same designs that were used on the old slow 8-bit micros. They were designed to connect to CPUs running on clocks of 10MHz or less. New micros have been made by replacing the old 8-bit CPUs with 16 and 32-bit ones, running faster clocks, with a "bus converter" connecting the multiple-hundred MHz 32-bit wide CPU bus down to the old and slow 8-bit or 16-bit peripheral bus. This may seem lazy, but doing this lets us reuse the software that controls those peripherals. As well, running the peripherals from a slower clock usually makes them lower power than if they ran from something faster.

Tom

angelo_d
Senior Contributor I

Hi Tom,

many thanks,

yes, sorry, CMOS not TTL is what i was thinking.

About jmp: from the manual, I don't see where you read those 5 clocks. If you look at 2.7.5 (2-46), "jmp" to a label should be the "xxx.wl" case, so 1.

But I now also see the "note" there that says, if I understand correctly, that there can be a variation of 1 to 3 additional clocks.

So we can reach the total of 4 clocks (44 ns) I am seeing.

It is all much clearer to me now.

Thanks Again

Angelo

TomE
Specialist II

> About jmp, from manual, i don't see where you read those 5 clocks.

It depends on whether the compiler/assembler is set to generate absolute or position-independent code (PIC). The Branch instructions are all PC-relative and therefore PIC. The "xxx.wl" form of JMP is absolute. If you're using that form then the execution time is probably 1 clock. But it might have been set up to generate PIC.

That "note" is a cut-down version of the one that I have in the "Version 3 Cold Fire Core User Reference Manual":

"1. For the jmp <ea> instructions, where <ea> is (d16,PC) or xxx.wl, the branch acceleration logic of the Instruction Fetch Pipeline calculates the target address and begins prefetching the new path. Since the Instruction Fetch and Operand Execution Pipelines are decoupled by the FIFO instruction buffer, the execution time can vary from 1 to 3 cycles, depending on the amount of decoupling."


So if the previous instructions are "slow" and the pipeline is full, then it will run fast. Your ones writing to GPIO are probably "very slow". If the previous instructions were fast ones (register operations), or the cache missed or the memory system is slow or the code just missed the prediction on a previous conditional branch or any number of other cases, the pipeline might be empty and the instruction would be slower.


Unfortunately that note applies to instructions listed to take 1 clock and ones listed to take 5 clocks, so I have to assume that "vary from 1 to 3 cycles" means "5 to 8" in the case of the JMP instructions.


Tom

