MCF5475 performance question

DavidHearn · 06-06-2006

We're trying to use an MCF5475 for some high-speed data logging, so we ran a quick benchmark using a simple app and an oscilloscope. We're getting performance significantly below what we expected from a 266MHz/410MIPS processor.

As a basic test I wrote a simple application that is essentially just a loop (important bits detailed below):

typedef struct
{
    uint32 value;
    unsigned char status;
} test_struct;

test_struct source;
test_struct destination;

while (1)
{
    for (temp_loop = 0; temp_loop < 1000; temp_loop++)
    {
        memcpy(&destination, &source, sizeof(test_struct));
        memcpy(&source, &source, sizeof(test_struct));
    }
    // Set output pin high
    MCF_GPIO_PODR_DSPI |= MCF_GPIO_PODR_DSPI_PODR_DSPI2;

    // Set output pin low
    MCF_GPIO_PODR_DSPI &= ~MCF_GPIO_PODR_DSPI_PODR_DSPI2;
}

Basically we're looping 1000 times, each time copying about 5 bytes of memory using memcpy (provided from Freescale sample code). At the end of that loop we set some GPIO pins high and then low again and repeat the loop. We then use the oscilloscope to measure the time it takes between each GPIO toggle.

We're seeing that it's taking:

a.) 1.5ms to do the whole process if we don't have any memcpy in the loop (just an empty for loop).
b.) 15ms to do the whole process if we have 1 memcpy in there
c.) 28ms to do the whole process with 2 memcpys in there.

Using a debugger, it appears that one cycle of the loop with a single memcpy takes about 60 instructions.

The difference between the empty loop and the 1 memcpy loop is about 13.5ms (for 1000 iterations). So that's 13.5us for 60 instructions which works out to be 4,440,000 instructions per second - 4MIPs.
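
The arithmetic above can be checked with a small helper. This is only a sketch: the function name estimate_mips is made up for illustration, and the 60 instructions per iteration is the debugger estimate quoted above, not a measured figure.

```c
/* Hypothetical helper: estimate millions of instructions per second
 * from the extra time a block of loop iterations takes and an assumed
 * instruction count per iteration. */
double estimate_mips(double extra_ms, unsigned iterations,
                     unsigned instr_per_iter)
{
    /* Time attributable to one iteration, in seconds. */
    double seconds_per_iter = (extra_ms / 1000.0) / iterations;

    /* Instructions retired per second, then scaled to millions. */
    double instr_per_second = instr_per_iter / seconds_per_iter;
    return instr_per_second / 1e6;
}
```

Calling estimate_mips(13.5, 1000, 60) gives roughly 4.4, which matches the figure above.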

Any idea what accounts for the factor of 100 difference between the value in the specs (410MIPS) and our example? I realise that each benchmark is different, but a factor of 100?

Thanks

JWW · 06-09-2006

David,

Could you provide a few more details on how the hardware is configured? Example: Clock rates used (bus vs. CPU). Is the Cache enabled? Where is the code executed from?

The 547x and 548x family code execution speed can dramatically change based on memory used and whether the cache is enabled.

Thanks

-JWW

DavidHearn · 06-09-2006

Hi thanks for the reply.

I've since discovered the wonders of cache! I was using some Freescale example code as a skeleton for this basic application and after someone on the comp.arch.embedded newsgroup suggested I look at the cache settings I discovered that the example code disabled the cache.

After just setting the code, branch and instruction cache enable bits (I didn't understand enough about the other bits!) I found that performance increased about 10x. Turning on basic optimisation in gcc gave a further 2x improvement.

The code is being executed from RAM (downloaded via P&E's ICDCFZ debugger) BTW.

I've not looked at the clock rates. The board is a MCF5475EVB from LogicPD, no changes made. From the docs it says 66MHz system clock, 66MHz PCI, DDR memory at 133MHz and core CPU of 266MHz. I assume these aren't configurable except through hardware.

As an example of the speed now, I have a simple while(1) loop which, on each iteration, copies a 32-bit value from a slice timer into a large array and also copies an 8-bit value from the DSPI status reg (set up as GPIO) into a similar array. It then toggles a DSPI pin and the loop starts again.

Monitoring the time between rising edges of the output pin, I've got the loop time down to 404ns (2.4 million loops per second, 2.4MHz). This includes the approx 50ns it takes for the pin to rise and fall again. I work that out to be around 125MIPS.

Still short of the 410MIPS, but I expect that is a very specific benchmark, whereas I'm copying data to RAM (and never reading it back), and also using GPIO pins - which I'd imagine are slow.

Does 125MIPS sound reasonable for this sort of operation? It's far better than the 4MIPS we saw before, and I'm guessing we're I/O and memory bound mostly now.

JWW · 06-09-2006

David,

Looks like you are now limited by the bus and memory accesses. You mention copying things to RAM, but I'm not sure which RAM...

Accesses to SDRAM will significantly slow down most benchmarking exercises.

At a high level, it looks like you are getting closer to a real number. MIPS ratings for CPUs are typically produced with benchmarks that run within cache boundaries, so these types of benchmarks typically run code that tests a cache's branch prediction capability as well as a variety of other interesting performance metrics. The V4e core, with its 32K I and D caches, typically does turn in good scores when testing the CPU. In your case, it looks as if you are testing the I/O access time from the processor's bus to SoC-based peripherals.

Most CPU-style benchmarks use an internal timer, like the slice timers, to measure the execution time. So you'll get much higher scores if the test does memory reads and writes to internal SRAMs and you time it with a slice timer, versus writing to a GPIO.
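
For reference, timing a code section against a free-running counter might look like the sketch below. It assumes a 32-bit down counter that runs over the full 32-bit range between reloads. The function names and the clock frequency parameter are placeholders for illustration, not anything taken from the Freescale headers.

```c
#include <stdint.h>

/* Elapsed ticks between two reads of a free-running 32-bit DOWN counter.
 * Assumption: the counter wraps from 0 to 0xFFFFFFFF, so unsigned
 * subtraction gives the right answer across a single wrap. */
uint32_t elapsed_ticks(uint32_t start_count, uint32_t end_count)
{
    return start_count - end_count;
}

/* Convert a tick count to nanoseconds for a given timer clock in Hz.
 * The clock rate is passed in because it depends on board setup. */
double ticks_to_ns(uint32_t ticks, double timer_hz)
{
    return (double)ticks * 1e9 / timer_hz;
}
```

Reading the counter once before and once after the code under test, then passing both values to elapsed_ticks, avoids the GPIO write altogether.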

What are you trying to benchmark? If your final product needs to do a lot of I/O manipulation, then your approach seems logical so far. If you really are worried about CPU intensive algorithms, then I would suggest a different approach.

I would also check the compiler output to see how optimized it really is...

I'm always intrigued by benchmarking exercises... Keep us posted.

-JWW

DavidHearn · 06-09-2006

The actual application we're planning to develop will try to capture input pin state changes (currently the 5 DSPI pins on the ITX header connector, set up as GPIO) over a short (~15 second) period, but at as high a rate as possible. We need to capture the timing of the pin changes, so we're using the slice timer as a reference (4 bytes). This therefore requires 5 bytes of data to be captured on each iteration of the loop.

Ideally we'd like to capture at a rate close to 1.7MHz, which it appears might be possible as I've exceeded 2MHz on this simple benchmark (copying 5 bytes in a while loop). The only practical difference between the benchmark and the real app, is we don't need to toggle output pins, just read the input pins.

Capturing 5 bytes, 1.7 million times per second will consume around 9MB every second. For 15 seconds, that's 135MB, a little more than the 128MB max of a LogicPD FireEngine, but the time etc is adjustable.
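
As a quick check on the figures above, the storage requirement can be computed without rounding the per-second rate. This is a minimal sketch; capture_bytes is a made-up name for illustration.

```c
#include <stdint.h>

/* Total bytes needed to log fixed-size samples at a given rate for a
 * given duration. A 64-bit result avoids overflow on large captures. */
uint64_t capture_bytes(uint32_t bytes_per_sample,
                       uint32_t samples_per_sec,
                       uint32_t seconds)
{
    return (uint64_t)bytes_per_sample * samples_per_sec * seconds;
}
```

With 5 bytes per sample, 1.7 million samples per second and 15 seconds, this comes to 127,500,000 bytes (8.5MB per second before rounding up to 9MB). Either way the total sits right at the 128MB limit once code and other data are accounted for, so the conclusion about adjusting the capture time still holds.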

Therefore, due to the amount of storage we need, we're forced to use SDRAM for storing the data and not the faster SRAM.

One option we might like to try, in a hope to improve memory usage, is to only capture on an external pin change which triggers an interrupt - this would stop us capturing the same pin state for a number of cycles of the loop, wasting memory.
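
The change-only idea could be sketched as below, whether it ends up being called from an interrupt handler or from the polling loop. The change_log type and change_log_record function are hypothetical names for illustration. The two separate arrays mirror the timer and pin-state arrays used in the benchmark loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log of pin-state changes, stored as two parallel arrays. */
typedef struct {
    uint32_t *times;      /* timer value at each recorded change */
    uint8_t  *states;     /* pin state at each recorded change */
    size_t    capacity;   /* number of entries the arrays can hold */
    size_t    count;      /* entries recorded so far */
    uint8_t   last_state; /* most recently recorded state */
    int       have_last;  /* 0 until the first sample is stored */
} change_log;

/* Store (timestamp, state) only when the state differs from the last
 * recorded one. Returns 1 if stored, 0 if skipped or the log is full. */
int change_log_record(change_log *log, uint32_t timestamp, uint8_t state)
{
    if (log->have_last && state == log->last_state)
        return 0;                       /* unchanged: nothing to store */
    if (log->count >= log->capacity)
        return 0;                       /* out of space */

    log->times[log->count]  = timestamp;
    log->states[log->count] = state;
    log->count++;
    log->last_state = state;
    log->have_last  = 1;
    return 1;
}
```

The comparison adds a branch to every sample, so it trades a little time per iteration for memory saved on unchanged samples.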

What I've found, though, is that doing this exact same capture process (5 bytes) but triggered by a GPT expiring (counter set to around 12; any less and I have problems) appears noticeably slower. I guess that's the problem with interrupts: they have an overhead, whereas a simple while(1) loop has little.

Message Edited by David Hearn on 2006-06-09 09:55 AM
