MCF54450 V4e code optimisation

SMRSteve · ‎11-03-2010

Hello everyone.

I'm trying to optimise some video-compression code, and there seems to be information missing from the datasheet. The V4e can execute more than one instruction per clock cycle, but it seems to be blind luck if you can find a pair that can be run together.

A few quotes from the datasheet:

"Decode and select (DS/secDS) — Decodes and selects for two sequential instructions"

"there are certain, heavily-used instruction constructs that support multiple-instruction dispatch"
"folding two consecutive instructions into a single pipeline issue effectively creates zero-cycle execution times for certain instructions."

"Instruction folding involving MOVE instructions allows two instructions to be issued in one cycle."

This really is everything there is on instruction folding. The last quote hints at using "MOVE" instructions, but is it one MOVE, and one other instruction? A pair of MOVEs? Certain addressing modes only?

I've written a benchmarking routine to time different sequences of instructions in an attempt to work out the logic. By trial and error, I've achieved a figure of 440 MIPS on a 240MHz MCF54450 (datasheet says "up to 370 Dhrystone 2.1 MIPS"). However, it's useless code, mostly consisting of MOVEQ instructions - although I can state for a fact that two MOVEQs can execute simultaneously.

I've also discovered that when every instruction (all single-cycle instructions) depends on the result of the previous one, 60 MIPS is the best you get!

From counting the instructions in my code, I can see that I'm running at about 200 MIPS for real-world software (up from 100 MIPS before I discovered just how slow the SRAM is!).
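To put those MIPS figures in terms of the pipeline, they can be converted to instructions per clock. This is only a back-of-envelope sketch: the 240 MHz clock and the MIPS numbers come from the post, while the function name is made up for illustration.

```python
# Convert the measured MIPS figures into instructions per core clock.
CLOCK_MHZ = 240.0  # core clock quoted in the post


def instr_per_clock(mips, clock_mhz=CLOCK_MHZ):
    """Average number of instructions retired per clock cycle."""
    return mips / clock_mhz


measured = {
    "MOVEQ-heavy benchmark": 440,   # best case found by trial and error
    "fully dependent chain": 60,    # every instruction needs the previous result
    "application code": 200,        # estimate for real-world software
}

for label, mips in measured.items():
    ipc = instr_per_clock(mips)
    print(f"{label}: {ipc:.2f} instr/clock, {1 / ipc:.2f} clocks/instr")
```

On these numbers the best case is a little under two instructions per clock, and the fully dependent chain costs four clocks per instruction, which suggests the result of one instruction is not available to the next for several pipeline stages.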

There must be some V4e core documentation that details instruction parallelism and dependencies that would enable me to write faster code. If anyone knows of such a document, some information would be greatly appreciated! Alternatively, maybe a copy of the code used to get the 370 Dhrystone 2.1 MIPS rating would give some clues?

Many thanks,

Steve.

TomE · ‎11-04-2010

> (up from 100 MIPS before I discovered just how slow the SRAM is!).

-- From the Reference Manual:

-- 7.1.1 Overview
-- The SRAM module provides a general-purpose memory block that the ColdFire processor can access in a single cycle.

So it SHOULD be fast enough. Unless you've forgotten to set the "V" bit in RAMBAR. If you haven't set it then the CPU is getting to the SRAM via the "Back Door" in the Crossbar Switch and that throws a lot of wait states.

The SRAM is normally so fast that you should put the CPU Stack in it (at least). You can put other structures in it that you don't want getting flushed out of the data cache. You can put small DMA buffers in it too and not have to worry about cache invalidations and flushes.

I can't find anything more on the Coldfire 4 Core than you have. Quite disappointing. There should be a "Coldfire 4 User Manual" like there is for the V3 core. Except that the only "coldfire3um.pdf" copy on Freescale's web site is corrupted. When I pointed this out in a Service Request the response was:

Sorry, the pdf you mentioned is quite old and we do not have a new version of it.

It looks like it was damaged when someone stamped "Freescale" on the Motorola manual on or about 1998.

https://community.freescale.com/message/59948#59948

Motorola documented the V3 core. It doesn't look like Freescale has documented the V4 one. You might be better off with a PPC core. Or an ARM core.

TomE · ‎11-04-2010

Google found the following information on the V4 core. From 1998. From Joe Circello at Motorola:

http://www.freescale.com/files/abstract/article/CFV4TURLEY.pdf

Maybe we could ask Joe to elaborate :smileyhappy:

http://www.linkedin.com/pub/joe-circello/7/65/59a

SMRSteve · ‎11-04-2010

Hello TomE,

First, I must thank you for prompting me to double-check the RAMBAR to make sure I wasn't accessing through the backdoor. I tried to get a backdoor access benchmark for comparison, and it came out the same speed. Long story short, my assembler is writing $C04 instead of $C05 for the RAMBAR, so after looking up the MOVEC opcode, I now have reasonable SRAM performance:

SRAM to SRAM copy (backdoor) = 129.6 MB/s

SRAM to SRAM copy (direct) = 351.6 MB/s
Cache to cache copy = 399.2 MB/s

So, still not quite as fast as the cache, but I now see no performance difference in my application code whether I use SRAM or cached DDR for the data.

This all actually relates better to my message on SRAM performance, so I'll post the solution there too.

Regarding core instruction performance, I do have the V2 manual, and I hadn't even tried the V3 manual. I've never seen a V4 manual either. I've read Joe's paper, and there's a bit more tantalising information, but no real documentation. Maybe I need to write something that tries all possible combinations of instructions and documents the results?!

I do now have the performance I need, we required 10fps VGA, and I can create a VGA-sized JPEG in 95ms, so I'm just about ok! It's a bit late for a different CPU as the boards are already built, and I think PPC would have been too expensive anyway (plus I'm no expert in PPC assembler).

I am still interested in code optimisation though and would welcome any further suggestions.

Many thanks,

Steve.

TomE · ‎11-04-2010

> I am still interested in code optimisation though and would welcome any further suggestions.

I just did some testing where I had to "rotate" a 240 by 400 pixel (16 bit) bitmap by 90 degrees in SDRAM.

The trick is to READ in columns (against the cache line direction) but to WRITE "with the cache" and with movem.l instructions to get the bursting.

The other trick is to make sure the column-address-step isn't a multiple of the cache size - specifically the column isn't long enough to start mirroring in the cache (it was). With jpeg I don't think you'll need to step that far, but watch out for things like that.
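The access pattern described above can be sketched in Python. This only models the loop order and the cache-set arithmetic, not the real movem.l code. The function names are invented for the example, and the cache geometry (16-byte lines, 4 KB per way) is an assumption to be checked against the reference manual.

```python
def rotate_cw(src, width, height):
    """Rotate a row-major bitmap 90 degrees clockwise.

    The destination is written strictly sequentially (burst friendly);
    the source is read in columns, bottom to top.
    """
    dst = [0] * (width * height)
    out = 0
    for col in range(width):                    # one source column -> one dest row
        for row in range(height - 1, -1, -1):   # walk the column bottom-up
            dst[out] = src[row * width + col]
            out += 1
    return dst


def distinct_sets(stride_bytes, steps, line_bytes=16, way_bytes=4096):
    """Count the cache sets touched when stepping `stride_bytes` at a time.

    line_bytes and way_bytes are assumed values, not taken from the thread.
    A small result means the column walk keeps evicting its own lines.
    """
    n_sets = way_bytes // line_bytes
    return len({(i * stride_bytes // line_bytes) % n_sets for i in range(steps)})


# 2 rows of 3 pixels, rotated into 3 rows of 2 pixels.
print(rotate_cw([1, 2, 3, 4, 5, 6], width=3, height=2))

# Stride equal to the way size: every access lands in the same set.
print(distinct_sets(stride_bytes=4096, steps=100))
```

If `distinct_sets` comes out small for the real row stride, padding each row by one cache line is a common way to spread the column walk over more sets.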

The best use of the CPU and memory would be to program the DMA Controller to burst to/from SDRAM and SRAM overlapped with the CPU processing the previous block in SRAM. The DMA controller is very flexible.

If you're DMA'ing a full VGA picture into RAM that will impact your throughput as it'll steal bandwidth from the CPU. The MCF5329 has a nice video controller (simple DMA from memory to LCD) built in and I've generated VGA frames on it.

SMRSteve · ‎11-04-2010

Hello TomE,

Thanks for the tips. Could you not do that rotate with the DMA? The modulo could be used to write the rows out as columns (or vice-versa), but I think you'd need to start a new transfer for every line. I don't know if it would be faster, but probably less CPU overhead.

I am DMA'ing a VGA picture to RAM, but it's from a camera chip, and Bayer-encoded (8 bits per pixel), so only 300kB. It's about 22% of the bandwidth of the slow Flexbus, but only 1.2% on the DDR at 10fps (~3 MB/s). It's not a problem as my JPEG encoder only reads the data once (2.2%), and writes out the JPEG at about 40kB (0.16%?!). Anyway, if I 'scope the DDR chip select, there's hardly any activity there - less than 10% I would think.
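The data rates above follow from the frame size. In this rough check the frame geometry, frame rate, JPEG size and percentages are taken from the post, while the bus bandwidths are back-calculated from those percentages and are not datasheet figures. VGA is taken to be 640x480.

```python
WIDTH, HEIGHT = 640, 480   # VGA
BYTES_PER_PIXEL = 1        # Bayer-encoded, 8 bits per pixel
FPS = 10
JPEG_BYTES = 40_000        # "about 40kB" per compressed frame

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL   # 307200 bytes = 300 KiB
camera_rate = frame_bytes * FPS                  # raw bytes per second
jpeg_rate = JPEG_BYTES * FPS                     # compressed bytes per second


def implied_bus_rate(stream_rate, fraction):
    """Bus bandwidth implied by a stream using `fraction` of it."""
    return stream_rate / fraction


flexbus_rate = implied_bus_rate(camera_rate, 0.22)   # "about 22%" of Flexbus
ddr_rate = implied_bus_rate(camera_rate, 0.012)      # "only 1.2%" of DDR

print(f"frame: {frame_bytes} bytes, camera stream: {camera_rate / 1e6:.2f} MB/s")
print(f"implied Flexbus: {flexbus_rate / 1e6:.1f} MB/s, implied DDR: {ddr_rate / 1e6:.0f} MB/s")
print(f"JPEG output share of DDR: {100 * jpeg_rate / ddr_rate:.2f}%")
```

The JPEG share printed at the end lands at about 0.16%, so the figures in the post are at least self-consistent.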

I have looked at output of video, but the Flexbus is the bottleneck at 15M pixels/second (if 16-bit pixels & 16-bit bus). VGA at 60fps alone needs 18.4M pixels/second...

Steve.

TomE · ‎11-05-2010

> I have looked at output of video, but the Flexbus is the bottleneck at 15M pixels/second

You should be able to burst 16 bytes (8 16-bit words) at one word per FB clock (fsys/4 or /8). That's 120MB/s burst speed derated by what looks like 3 clocks per 8 transfers (8/11 of 120MB/s is 87MB/s).
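Spelled out, that estimate looks like this. It assumes the fsys/4 option (a 60 MHz FlexBus clock from a 240 MHz core) because that is the one that gives 120 MB/s on a 16-bit port. The 3 clocks of overhead per burst is the guess from the post, not a measured value.

```python
CORE_CLOCK_HZ = 240e6
FB_DIVIDER = 4              # assumption: fsys/4 rather than fsys/8
PORT_BYTES = 2              # 16-bit port, one word per FB clock
BEATS_PER_BURST = 8         # 16-byte burst = 8 x 16-bit words
OVERHEAD_CLOCKS = 3         # estimated turnaround per burst

fb_clock_hz = CORE_CLOCK_HZ / FB_DIVIDER
peak_rate = fb_clock_hz * PORT_BYTES                       # bytes/s inside a burst
efficiency = BEATS_PER_BURST / (BEATS_PER_BURST + OVERHEAD_CLOCKS)
derated_rate = peak_rate * efficiency                      # bytes/s sustained

print(f"peak {peak_rate / 1e6:.0f} MB/s, sustained {derated_rate / 1e6:.1f} MB/s")
```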

Or you could spit your video out through the USB, Ethernet, PCI or ATA ports :smileyhappy:

If you could make your code more efficient by a factor of 2, 3 or more you could use an MCF5329 which has an internal LCD controller. We're using an external video DAC and can generate PC-standard analog VGA from it.

JPEG is DCT. Can the EMAC help at all with this? The manual claims "The first ColdFire MAC supported signed and unsigned integer operands and was optimized for 16x16 operations, such as those found in applications including servo control and image compression."

Here's a classic DSP library using the EMAC:

http://www.freescale.com/webapp/sps/site/homepage.jsp?code=CFDSPLIBRARY&fsrch=1&sr=10

AN3038 documents the EMAC and RSA.

This one documents code for DCT on the MAC (from the 1990s written by Joe again):

http://cache.freescale.com/files/dsp/doc/white_paper/MCF5XXXDSPWP.pdf?fsrch=1&sr=10

SMRSteve · ‎11-08-2010

Hello TomE,

You're right about the burst speed on the Flexbus, as I've just done some experiments:

8-bit port, no burst is 16/67 = 13.66 MB/s

8-bit port, burst is 16/22 = 41.61 MB/s

16-bit port, burst is 8/14 = 65.39 MB/s

32-bit port, burst is 4/10 = 91.55 MB/s

So easily fast enough for video (720x576p @50fps, 32bpp = 79.1MB/s)
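For anyone wanting to redo the sums, the figures above come out if each measurement is read as "16 bytes per N FlexBus clocks", the FlexBus clock is 60 MHz (240 MHz / 4), and MB means 2^20 bytes. Those three readings are inferred from the numbers rather than stated anywhere, so treat them as assumptions.

```python
FB_CLOCK_HZ = 60e6     # assumption: FlexBus clock = core clock / 4
BURST_BYTES = 16       # bytes moved per measured group of clocks
MIB = 2 ** 20          # assumption: "MB" here means 2^20 bytes


def copy_rate_mib(clocks_per_burst):
    """Copy rate in MiB/s for 16 bytes moved every `clocks_per_burst` clocks."""
    return BURST_BYTES / clocks_per_burst * FB_CLOCK_HZ / MIB


# Clocks per 16 bytes, from the measurements in the post.
measured_clocks = {
    "8-bit port, no burst": 67,
    "8-bit port, burst": 22,
    "16-bit port, burst": 14,
    "32-bit port, burst": 10,
}
rates = {name: copy_rate_mib(clocks) for name, clocks in measured_clocks.items()}

# 720x576 progressive at 50 fps with 4 bytes per pixel.
video_mib = 720 * 576 * 50 * 4 / MIB

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} MB/s")
print(f"video requirement: {video_mib:.1f} MB/s")
```

On these assumptions only the 32-bit burst case clears the 79.1 MB/s video requirement, with the 16-bit burst case falling a little short.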

I was just looking at the pictures on the datasheet, which show nothing longer than a 4 beat burst on an 8-bit port (ie. a longword). Now I've read the writing as well, it does say 16-byte bursts are supported on all port widths.

The turnaround time between bursts is consistently 6 clocks. This is with a DMA copy, so it could be the other burst to the DDR happening in this 100 ns window. The pure Flexbus speed is probably a bit higher, but copy speed is a more useful benchmark.

The code is about as efficient as I can get it (without knowing the precise details of how each instruction affects the others around it). I'm already absolutely hammering the eMAC unit in the colour-space conversion and DCT routines.

As far as I can see, that DSP library doesn't have a DCT routine in it. And the DCT example code from Joe can't be serious for real-time performance - 189us for a DCT? - I've already got it down below 4us! I don't think looking for faster code is going to help much, and besides, it is now fast enough for what I need it to do.

Steve.

TomE · ‎11-09-2010

> I'm already absolutely hammering the eMAC unit in the colour-space conversion and DCT routines.

Impressive! Is the emac able to sustain 240MMAC in practical use?

How have you profiled your code? Do you have tools that show per-instruction contribution across the whole code base (like I have)?

Check your Private Email on this board.

SMRSteve · ‎11-10-2010

The eMAC will do one MAC instruction per clock tick, but even with four accumulators, you have to read the results out eventually, and this incurs a three-cycle stall (in which you can put other non-MAC instructions). So I suppose 240MMAC/s is not entirely practical; I'd give it about 167MMAC/s assuming 16 MACs, 3 miscellaneous instructions (maybe to implement a loop), then 4 MOVCLRs to read out the results. The figure would be higher if you had a lot of coefficients to accumulate, but I find four (per accumulator) is a reasonable average.
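As a worked version of that estimate, here is a simplified cycle model. It assumes every instruction takes one clock and that the three-cycle stall is hidden only if at least three non-MAC instructions are available to fill it. The function and parameter names are made up for the example.

```python
CLOCK_MHZ = 240.0
STALL_CYCLES = 3   # stall before the accumulators can be read out


def sustained_mmacs(macs_per_acc=4, accumulators=4, misc_instr=3):
    """Estimated MMAC/s for one pass of accumulate-then-read-out."""
    macs = macs_per_acc * accumulators
    # Misc instructions fill the stall; any shortfall is lost as dead cycles.
    gap = max(misc_instr, STALL_CYCLES)
    cycles = macs + gap + accumulators   # MACs + gap + one MOVCLR per accumulator
    return CLOCK_MHZ * macs / cycles


print(f"4 coefficients per accumulator: {sustained_mmacs():.0f} MMAC/s")
print(f"8 coefficients per accumulator: {sustained_mmacs(macs_per_acc=8):.0f} MMAC/s")
```

With 16 MACs in 23 cycles this gives the 167 MMAC/s quoted above, and doubling the coefficients per accumulator pushes it towards 200 MMAC/s.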

Steve.
