I'm trying to optimise some video-compression code, and there seems to be information missing from the datasheet. The V4e can execute more than one instruction per clock cycle, but finding a pair of instructions that will actually issue together seems to be blind luck.
A few quotes from the datasheet:
"Decode and select (DS/secDS) — Decodes and selects for two sequential instructions"
"there are certain, heavily-used instruction constructs that support multiple-instruction dispatch"
"folding two consecutive instructions into a single pipeline issue effectively creates zero-cycle execution times for certain instructions."
"Instruction folding involving MOVE instructions allows two instructions to be issued in one cycle."
This really is everything there is on instruction folding. The last quote hints at using "MOVE" instructions, but is it one MOVE, and one other instruction? A pair of MOVEs? Certain addressing modes only?
I've written a benchmarking routine to time different sequences of instructions in an attempt to work out the logic. By trial and error, I've achieved a figure of 440 MIPS on a 240 MHz MCF54450 (datasheet says "up to 370 Dhrystone 2.1 MIPS"). However, it's useless code, mostly consisting of MOVEQ instructions - although I can state for a fact that two MOVEQs can execute simultaneously.
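The reduction step of a timing routine like this can be sketched in C roughly as follows. It is only an illustration, not the benchmark behind the figures in this post: the function names and the idea of collecting raw timer tick counts into arrays first are assumptions. It keeps the fastest of several timed runs, so that runs disturbed by interrupts or cache misses drop out, and subtracts the fastest run of an empty baseline to remove the fixed measurement overhead.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest value in samples[0..n-1]; returns UINT32_MAX when n is 0. */
static uint32_t min_ticks(const uint32_t *samples, size_t n)
{
    uint32_t best = UINT32_MAX;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < best)
            best = samples[i];
    }
    return best;
}

/*
 * Ticks attributable to the instruction sequence alone (hypothetical helper).
 * timed[]    : elapsed ticks for each run of the sequence under test
 * baseline[] : elapsed ticks for each run of an empty sequence
 * Taking the minimum of each set discards slow outlier runs; the
 * subtraction removes the call and timer-read overhead. The result is
 * clamped to 0 if the baseline is not faster than the timed runs.
 */
uint32_t net_sequence_ticks(const uint32_t *timed,
                            const uint32_t *baseline,
                            size_t n)
{
    uint32_t t = min_ticks(timed, n);
    uint32_t b = min_ticks(baseline, n);
    return (t > b) ? (t - b) : 0;
}
```

For example, if the best timed run took 101 ticks and the best baseline run took 5, the sequence itself accounts for 96 ticks.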
I've also discovered that when every instruction (all single-cycle instructions) depends on the result of the previous one, 60 MIPS is the best you get!
From counting the instructions in my code, I can see that I'm running at about 200 MIPS for real-world software (up from 100 MIPS before I discovered just how slow the SRAM is!).
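To put those MIPS figures in per-clock terms, here is a small C sketch of the arithmetic. It assumes the core really is clocked at 240 MHz and that each MIPS figure is a plain count of instructions executed per microsecond; the function names are made up for illustration.

```c
/* Instructions completed per core clock, from a MIPS figure and the
   core clock in MHz (both are "per microsecond" quantities). */
double instructions_per_clock(double mips, double core_mhz)
{
    return mips / core_mhz;
}

/* Average core clocks spent per instruction: the reciprocal of the above. */
double clocks_per_instruction(double mips, double core_mhz)
{
    return core_mhz / mips;
}
```

At 240 MHz, 440 MIPS works out to about 1.83 instructions per clock, 200 MIPS to about 0.83, and 60 MIPS to exactly 0.25, i.e. 4 clocks for each instruction in the fully dependent chain.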
There must be some V4e core documentation that details instruction parallelism and dependencies that would enable me to write faster code. If anyone knows of such a document, some information would be greatly appreciated! Alternatively, maybe a copy of the code used to get the 370 Dhrystone 2.1 MIPS rating would give some clues?