Coldfire V1 runtime LongLongCF.C - slow & bug

howardheid · ‎10-24-2012

Has anyone found a usable and faster runtime library, or user-includable code,
generally available for the Coldfire V1?

Conversion of an old 08GP32 project to a Coldfire V1 (the MCF51JM128VLK)
has resulted in disappointingly slow performance (the GP32, not to mention the
GT60, outperforms the V1). The V1 runtime library is optimized for the smaller
memory, higher speed, Coldfires and is just way too inefficient for the V1s.
Searching the web found no other runtime library or user code available to
replace these libraries, which were originally written for the 68K.

The simplest runtime module to rewrite was the LongLongCF.C as it is one third
asm CINT64 routines, one third C CINT64 division routines, and one third C
FP convert routines. Especially as the compiler seems to predominantly do the
32-bit math using the LongLongCF.C CINT64 routines. Also, the compiler allows
overloading of the LongLongCF.C routines with the user's just by having them in
the project files.

The result of replacing just the asm routines (none of the C routines) was a
33% improvement in the project's critically timed code section (760+ to 510+
µs). Also, the __rt_rotr64() routine was discovered to be in error (the first
BRA.S instruction goes to the third loop label when it should have gone to the
fourth label). (Searching the web indicates that it possibly occurred as far
back as 1996 in the original 68K code.)
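
For anyone checking a replacement rotate routine against known-good results, a plain C reference can help. The following is only a sketch, not the library code or the proprietary replacement: `Pair64` and `ref_rotr64` are hypothetical names, and the hi/lo layout is an assumption about how a CInt64-like structure might be arranged.

```c
#include <stdint.h>

/* Hypothetical stand-in for the library's CInt64; the real layout may differ. */
typedef struct { uint32_t hi; uint32_t lo; } Pair64;

/* Rotate a 64-bit value right by n bits (n taken modulo 64) using only
   32-bit, register-sized shifts. */
static Pair64 ref_rotr64(Pair64 v, unsigned n)
{
    Pair64 r;
    n &= 63u;
    if (n >= 32u) {            /* rotating by 32 just swaps the halves */
        uint32_t t = v.hi;
        v.hi = v.lo;
        v.lo = t;
        n -= 32u;
    }
    if (n == 0u) {
        return v;              /* avoids an undefined shift by 32 below */
    }
    r.hi = (v.hi >> n) | (v.lo << (32u - n));
    r.lo = (v.lo >> n) | (v.hi << (32u - n));
    return r;
}
```

Running every count from 0 to 63 through both the reference and the routine under test would exercise each branch target at least once.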

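The new code is proprietary but the performance benefits are as follows (left column: original library routine; right column: new code; size in bytes, speed in cycles):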

extern asm ABI_SPEC short __rt_cmpu64( CInt64, CInt64 );
    Size : 36                30
    Speed : 15,23; 22; 13,22    20; 18; 18
    Rating: 828               600

extern asm ABI_SPEC CInt64* __rt_eor64( CInt64*, CInt64, CInt64 );
    Size : 30                32
    Speed : 21                19
    Rating: 630               608

extern asm ABI_SPEC CInt64* __rt_mul64( CInt64*, CInt64, CInt64 );
    Size : 80                160
    Speed : 608               167
    Rating: 48640             26720

extern asm ABI_SPEC CInt64* __rt_neg64( CInt64*, CInt64 );
    Size : 26                24
    Speed : 17                15
    Rating: 442               360

extern asm ABI_SPEC CInt64* __rt_rotl64( CInt64*, CInt64, short );
    Size : 50                76
    Speed : 592               38
    Rating: 29600             2888

extern asm ABI_SPEC CInt64* __rt_rotr64( CInt64*, CInt64, short );
    Size : 68 (ERROR)        76
    Speed : 655 (ERROR)       38
    Rating: 44540 (ERROR)     2888

extern asm ABI_SPEC CInt64* __rt_shl64( CInt64*, CInt64, short );
    Size : 46                64
    Speed : 403               31
    Rating: 18538             1984

extern asm ABI_SPEC CInt64* __rt_shrs64( CInt64*, CInt64, short );
    Size : 52                66
    Speed : 592               32
    Rating: 30784             2112

extern asm ABI_SPEC CInt64* __rt_shru64( CInt64*, CInt64, short );
    Size : 52                64
    Speed : 592               31
    Rating: 30784             1984

extern asm ABI_SPEC CInt64* __rt_sltoi64( CInt64*, signed long );
    Size : 30                20
    Speed : 16                13
    Rating: 480               260

extern asm ABI_SPEC CInt64* __rt_ultoi64( CInt64*, unsigned long );
    Size : 16                16
    Speed : 12                11
    Rating: 192               176

Speed is the worst case of a routine's possibly multiple timings. Rating is the
product of size and speed for equally critical resources (as in Golf, the lower
the score the better).
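
The rating arithmetic can be checked with a one-line helper. This is only an illustration of the formula described above; the function name `rating` is made up for this sketch.

```c
#include <stdint.h>

/* Rating as described above: code size in bytes times worst-case cycles. */
static uint32_t rating(uint32_t size_bytes, uint32_t worst_cycles)
{
    return size_bytes * worst_cycles;
}
```

For example, the original __rt_mul64() figures give rating(80, 608) = 48640 and the replacement gives rating(160, 167) = 26720, matching the list.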


The __rt_mul64() is NOT the best (it can be further improved by possibly 20
bytes and 10 cycles).

The above is presented as a challenge to get needed improved Coldfire V1
runtime libraries since no alternatives seem to be available. (NO
IMPROVEMENTS SINCE 1996?)

TomE · ‎11-04-2012

The offer still stands to compile some of your code with gcc.

Are you using floating point? Was the old 8-bit 8MHz CPU able to emulate floating point faster than the 50MHz and 32-bit CPU could do the same thing? There's something very wrong there.

I would have sent you a private message like the old forum would, but this new one doesn't allow private messages any more (unless you're "connected", whatever that is).

Tom

howardheid · ‎11-09-2012

The 08s used 16-bit integer over-all with just one calculation done in an optimized custom 40-bit floating point. The CFV1 critically timed code section uses 32-bit integer with 3 calculations done in 32-bit floating point. The non-critical timed sections use a mix of integer and float (mostly for formatting by printf).

The posted original sizes and cycles are for LongLongCF.C CINT64 asm routines in the CodeWarrior V6.2 Coldfire Support runtime library. The other sizes and cycles are for the new CINT64 asm (integer only) code. Examining the generated listing showed that the 64-bit CINT64 routines were being called to perform the 32-bit calcs (add, subtract, and multiply (integer divide was avoided)). The LongLongCF.C CINT64 multiply routine does both signed and unsigned multiply by extending the 32-bit value into the CINT64 structure (zero for unsigned and zeroes/ones for signed) then calling the LongLongCF.C CINT64 multiply routine with the truncated CINT64 result giving the "right" answer. The multiply, divide, rotates, and shifts all operate as bit-bangers (one bit at a time); they do not use the Coldfire instructions to perform register sized operations.
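
To make the difference between a bit-banger and a register-sized approach concrete, here is a rough C sketch of a 64-bit left shift done both ways. It is not the library source or the replacement code; `Pair64`, `shl64_bitwise` and `shl64_fast` are hypothetical names and the hi/lo layout is assumed.

```c
#include <stdint.h>

typedef struct { uint32_t hi; uint32_t lo; } Pair64;  /* hypothetical layout */

/* One bit per loop iteration, in the style described above. */
static Pair64 shl64_bitwise(Pair64 v, unsigned n)
{
    while (n-- > 0u) {
        v.hi = (v.hi << 1) | (v.lo >> 31);
        v.lo <<= 1;
    }
    return v;
}

/* Same result using whole-register shifts. Counts of 64 or more give 0. */
static Pair64 shl64_fast(Pair64 v, unsigned n)
{
    Pair64 r = {0u, 0u};
    if (n == 0u) {
        return v;
    }
    if (n >= 64u) {
        return r;
    }
    if (n >= 32u) {                 /* low word moves entirely into high word */
        r.hi = v.lo << (n - 32u);
        return r;
    }
    r.hi = (v.hi << n) | (v.lo >> (32u - n));
    r.lo = v.lo << n;
    return r;
}
```

The loop version costs time proportional to the shift count, while the second version has a fixed, short path for every count, which is consistent with the large gap between the two columns of cycle counts posted earlier.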

As for the LongLongCF.C floating point conversions, the code is also a bit-banger using two float operations per bit. The cycle count could be as high as 1000-2000 or more. As a guess only, the unsigned int to f32 conversion may be doable in 75 ±25 cycles when written for the CFV1.

The integer code is unavailable and I am currently trying to evaluate the floating point converts.

Due to a conflict with the hardware designer (who chose the CFV1 some ? 7 ? years after others had rejected it (myself included)), I am detached from that project (and probably out of a job by year's end). I am pursuing this on my own time and with little or no hardware for testing. Has anyone seen/found/built a runtime library replacement for CFV1 usable in or with CodeWarrior ???

TomE · ‎11-10-2012

You'd be better off posting in the CodeWarrior forum, as that's really your problem. This forum is more hardware related.

Have you checked the compiler optimisation level? It may be calling all those slow functions because it is optimising for size over speed.

A quick browse of the CodeWarrior forum finds:

CodeWarrior and ColdFire get optimized!

Which points to:

http://www.freescale.com/files/soft_dev_tools/doc/app_note/AN4316.pdf

Another point. Highly optimised code is usually very hard to debug as the source lines lose correspondence with the assembly code. If the IDE has defaulted to settings to make the code easier to debug it may explain the poor code you're seeing.

I think one of the advantages of the CPU family you're using is that it can be swapped for the ARM-based Kinetis chips. So your project might be able to change to one of those without having to change the hardware.

Tom

howardheid · ‎11-13-2012

To see the Coldfire Support Runtime library code use

DIR /S C:\LongLongCF.C

From what I have been finding on the Web concerning CW & GCC neither is considered "best". I have not seen the gcc library so I don't know if it would be better to use. However, that is not my decision. I was limited to including code that is CW overloadable - meaning it had to be type, name, and parameter identical to the CW Runtime library so the linker would use it in place of the library code.

The one floating point calculation was doable, but as fixed point it would have required multiple 128-bit values for multiply and divide (way too much RAM, code, and time).

The LongLongCF.C CINT64 multiply routine does both signed and unsigned multiply by extending the 32-bit value into the CINT64 structure (zero for unsigned and zeroes/ones for signed) then calling the LongLongCF.C CINT64 multiply routine with the truncated CINT64 result giving the "right" answer. -- One multiply, not signed + unsigned multiplies.
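
The reason one multiply routine can serve both cases is that the low 32 bits of a product do not depend on how the operands were extended. A small C sketch of that property follows; the function names are hypothetical and it relies on a host compiler's 64-bit types rather than on the library routines.

```c
#include <stdint.h>

/* Low 32 bits of the product after zero-extending both operands to 64 bits. */
static uint32_t mul_lo_zero_ext(uint32_t a, uint32_t b)
{
    uint64_t wide = (uint64_t)a * (uint64_t)b;
    return (uint32_t)wide;
}

/* Low 32 bits of the product after sign-extending both operands to 64 bits.
   The multiply is done on unsigned values to avoid signed overflow; the
   conversion of a negative int64_t to uint64_t is well defined (modulo 2^64). */
static uint32_t mul_lo_sign_ext(int32_t a, int32_t b)
{
    uint64_t wide = (uint64_t)(int64_t)a * (uint64_t)(int64_t)b;
    return (uint32_t)wide;
}
```

Feeding the same bit patterns to both functions, for instance -2 and 3 versus 0xFFFFFFFE and 3, yields the same truncated result, 0xFFFFFFFA.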

Maybe, but the one in charge insisted he could do better. Besides, this Coldfire library IS derived from the 68K library. And where that came from I don't know.

We used to use others (including P&E Micro's CASM), but the company wanted CodeWarrior and it did have the macro capability to allow absolute code from relative source files, some safer & flexible memory object definitions, instruction fixes, and byte-bit object binding (once I figured out how to code them - simple, usable, and caught errors during code generation).

The hardware designer had assumed control of the 32 & 8 bit projects and shut down the 8-bitters as his would replace everything and supposedly be easier to program and make changes (BEFORE the 32-bit was operational). It did not help that he kept Processor Expert always active AND was using SVN for version control (SVN is a character oriented system and PE was making changes to ALL code files in response to his edits -- I got the blame).

Thank you - I was just interested in improving just the V1 libraries. I will browse through that forum.

The previous hardware designer has been trying to get the company to go with the Kinetis but the new designer has been against it.

TomE · ‎11-13-2012

> To see the Coldfire Support Runtime library code use

> DIR /S C:\LongLongCF.C

On your computer, not mine. I don't have any copies of CW. Your examples are enough to show it seems to be pretty bad.

> I have not seen the gcc library so I don't know if it would be better to use.

Why not follow the link I sent you, download it and have a look?

> The LongLongCF.C CINT64 multiply routine

Gcc doesn't seem to need such foolishness.

Here's an example of a constant multiply performed as a multi-bit shift, something you've previously said CW doesn't seem to do.

       avPtr->accum = avPtr->raw * FILT_LEN;
       avPtr->avrg = avPtr->raw;
       2012            movel %a2@,%d0
       2200            movel %d0,%d1
       eb89            lsll #5,%d1
       2680            movel %d0,%a3@
       2881            movel %d1,%a4@
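
The listing above suggests the multiply by FILT_LEN became a left shift by 5. A minimal C sketch of that equivalence follows, assuming FILT_LEN is 32 (inferred from the `lsll #5`, not stated in the post); the function names are placeholders.

```c
#include <stdint.h>

#define FILT_LEN 32u   /* assumed value, inferred from the "lsll #5" above */

/* What the source line asks for. */
static uint32_t scale_mul(uint32_t raw)
{
    return raw * FILT_LEN;
}

/* What the compiler emitted: a single 5-bit shift. */
static uint32_t scale_shift(uint32_t raw)
{
    return raw << 5;
}
```

Both functions return the same value for every 32-bit input, including inputs where the result wraps.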

Here's a Multiply performed as an in-line (five CPU clock) multiply:

size_t  A2uMemFwrite(const void * buf, size_t size, size_t items,
                     FILE_STRUCT *filehandle __attribute__ ((unused)))
{
401075de:       2f03            movel %d3,%sp@-
401075e0:       262e 0010       movel %fp@(16),%d3
401075e4:       2f02            movel %d2,%sp@-
        int32_t nLength = size * items;
401075e6:       242e 000c       movel %fp@(12),%d2
401075ea:       4c03 2800       mulsl %d3,%d2

Here's a bunch of Divides being performed by in-line REMSL instructions in 35 CPU clocks.

        rx.chan32[x] = val32/scale;
401080b2:       2a02            movel %d2,%d5
401080b4:       4c41 5805       remsl %d1,%d5,%d5
401080b8:       2585 9c00       movel %d5,%a2@(00000000,%a1:l:4)
        r16.chan[x] = (int)((mod16)?(val16/scale)%mod16:val16/scale);
401080bc:       4a83            tstl %d3
401080be:       6710            beqs 401080d0
401080c0:       2004            movel %d4,%d0
401080c2:       4c41 0800       remsl %d1,%d0,%d0
401080c6:       2200            movel %d0,%d1
401080c8:       4c43 1805       remsl %d3,%d5,%d1
401080cc:       2005            movel %d5,%d0
401080ce:       6006            bras 401080d6
401080d0:       2004            movel %d4,%d0
401080d2:       4c41 0800       remsl %d1,%d0,%d0
401080d6:       41f9 4027 d6c0  lea 4027d6c0 ,%a0
401080dc:       3200            movew %d0,%d1
401080de:       2009            movel %a1,%d0
401080e0:       0680 0000 0058  addil #88,%d0
401080e6:       3181 0a00       movew %d1,%a0@(00000000,%d0:l:2)

You're running the MCF51JM128 which supports the ISA_C instruction set. This is a superset of ISA_A which supports DIV and REM, except for the note in the manual that mentions "6.3.3.11 Unsupported Instruction Exception: For this device, attempted execution of valid integer divide opcodes and all MAC and EMAC instructions result in the unsupported instruction exception". Ouch! This subject has come up a few times here before, but your CW is configured properly and is calling software routines. You should still try to replace divides with multiplies (as I think you're doing).
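
As one illustration of replacing a divide with a multiply, a division by a constant can be done with a scaled reciprocal. The sketch below divides a 16-bit value by 10; the divisor and the name `div10_u16` are arbitrary examples, not taken from the project being discussed.

```c
#include <stdint.h>

/* Divide a 16-bit unsigned value by 10 without a divide instruction.
   52429 is 2^19 / 10 rounded up; the product fits in 32 bits for any
   16-bit x, so a single 32-bit multiply and a shift are enough. */
static uint16_t div10_u16(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 52429u) >> 19);
}
```

Because the input range is only 65536 values, the result can be compared exhaustively against `x / 10` on a desktop compiler before the routine is trusted on the target.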

Re: DIVU/DIVS Forcing a Reset

Re: Unsupported instruction Exception on AC256 with DIVS

> AND was using SVN for version control

Bad workman who can't use the tools properly. SVN is meant to be used with BRANCHES for separate threads of code development, separate projects, separate releases. I hope you made good use of the "Blame" facility, it seems appropriately named in your case.

> get the company to go with the Kinetis but the new designer has been against it.

Unless Freescale has written their own version of the ARM documentation I'd avoid it too. If you're not extremely familiar with the ARM chip, it is very difficult to find out anything about the specific chip you're using. If you thought the PPC documentation was a sprawling minefield, the ARM is even more so. I'm sure all the information is there - somewhere, it is just never in the document you've just downloaded, but somewhere else.

Tom

howardheid · ‎11-15-2012

Tom,

I could post the LongLongCF.C file but I don't think FREESCALE/CODEWARRIOR would be happy about it so I won't.

Have been browsing the CodeWarrior forum but did not see a Coldfire section. Have not decided if I will repost this question there.

Have downloaded Sourcery CodeBench Lite 2 but have not looked at the libraries or PDFs yet (hope I can open the trg files).

As you know, the CFV1 does not have the HW divide but does have the HW multiply and I have used it in the replacement code (that is how I got it down to 167 cycles).
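
For readers wondering how a 64-bit multiply can be built on a 32-bit hardware multiply, here is a rough C sketch. It is not the replacement code referred to above; it assumes each multiply keeps only a 32-bit result, and `Pair64`, `mul32x32` and `mul64` are hypothetical names.

```c
#include <stdint.h>

typedef struct { uint32_t hi; uint32_t lo; } Pair64;  /* hypothetical layout */

/* Full 32x32 -> 64 product built from four 16x16 -> 32 multiplies, since
   each multiply here is assumed to keep only a 32-bit result. */
static Pair64 mul32x32(uint32_t a, uint32_t b)
{
    uint32_t a0 = a & 0xFFFFu, a1 = a >> 16;
    uint32_t b0 = b & 0xFFFFu, b1 = b >> 16;
    uint32_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    /* Sum of three 16-bit quantities: cannot overflow 32 bits. */
    uint32_t mid = (p00 >> 16) + (p01 & 0xFFFFu) + (p10 & 0xFFFFu);
    Pair64 r;
    r.lo = (mid << 16) | (p00 & 0xFFFFu);
    r.hi = p11 + (p01 >> 16) + (p10 >> 16) + (mid >> 16);
    return r;
}

/* 64x64 -> low 64 bits: the cross terms only affect the upper word. */
static Pair64 mul64(Pair64 a, Pair64 b)
{
    Pair64 r = mul32x32(a.lo, b.lo);
    r.hi += a.lo * b.hi + a.hi * b.lo;
    return r;
}
```

This takes six multiplies and a handful of adds and shifts, with no loop over bits, which is the general shape one would expect from a routine that dropped from hundreds of cycles to well under two hundred.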

The LongLongCF.C runtime file is written to handle f32, f64, f80, & f96 floating point structures but the library code generated only shows f32 & f64 (uses f64 in place of f80 & f96). I can reduce the byte count while maintaining source compatibility by implementing the two bit-banger FOR loops as a separate routine. If a bit/size offset is passed then the routine can be faster but may become dependent on the float structure. The fastest would be the logical manipulation method I was planning on with FF1.
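
A logical-manipulation conversion of the kind mentioned above might look roughly like the following C sketch for unsigned 32-bit to f32. It is a guess at the general technique, not the planned code: `leading_zeros32` is a portable stand-in for a find-first-one instruction (its loop exists only so the sketch runs anywhere), and the function names are hypothetical. It assumes `float` is IEEE-754 single precision.

```c
#include <stdint.h>
#include <string.h>

/* Portable stand-in for a find-first-one instruction: number of zero bits
   above the most significant set bit (32 when x is 0). */
static unsigned leading_zeros32(uint32_t x)
{
    unsigned n = 0u;
    if (x == 0u) {
        return 32u;
    }
    while ((x & 0x80000000u) == 0u) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Convert an unsigned 32-bit integer to IEEE-754 single precision,
   rounding to nearest with ties to even, using no float arithmetic. */
static float u32_to_f32(uint32_t x)
{
    uint32_t bits = 0u;
    float f;
    if (x != 0u) {
        unsigned lz = leading_zeros32(x);
        uint32_t norm = x << lz;                    /* leading 1 now in bit 31 */
        uint32_t frac = (norm >> 8) & 0x007FFFFFu;  /* 23 stored mantissa bits */
        uint32_t rest = norm & 0xFFu;               /* bits dropped by the shift */
        bits = ((uint32_t)(127u + 31u - lz) << 23) | frac;
        /* Round up if above half, or exactly half with an odd mantissa;
           a carry out of the mantissa correctly bumps the exponent. */
        if (rest > 0x80u || (rest == 0x80u && (frac & 1u) != 0u)) {
            bits++;
        }
    }
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With a single-instruction leading-zero count in place of the stand-in loop, the whole conversion is a short straight-line sequence of shifts, masks and one add.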

Howard

TomE · ‎11-15-2012

> I could post the LongLongCF.C file

Why? I never asked for that. What I asked for was:

I wasn't after your exact custom code. I'm just after a function that performs similar arithmetic (to what you're doing) so I can compile it with gcc and the gcc library, then run it to see how fast it is, as well as allowing you to inspect the generated assembly code to see if it is better or worse than CodeWarrior.

> Have been browsing the CodeWarrior forum but did not see a Coldfire section.

This new "Forum" is a "Maze of twisty little passages". Nothing is where it should be or where you would reasonably expect to find it. Everything takes at least three (or four or more) clicks when it should take one (and was one click on the previous Forum).

For instance, there is a forum for "help about the forum", but it is hidden. You have to click on the "View All Tutorials" link on the Home Page to find it and THEN you have to click on the "Contents" tab.

Or you can click HERE:

About Communities

So to get to CodeWarrior ColdFire you have to click on "Home", "Code Warrior", "Subspaces and Projects", "CodeWarrior for MCU" and then "Content" to get here:

CodeWarrior for MCU

> Have downloaded Sourcery CodeBench Lite 2

Don't bother trying to read the files. Just install and run it from a command line. It should be as simple as:

m68k-elf-gcc -O2 -c -o hello.o hello.c

m68k-elf-objdump -S hello.o

That should compile your program and then give you an assembly dump. You may need to enter "m68k-elf-gcc --target-help" to find out how to specify the right machine type.

But it would be WAY easier if you wrote a simple "hello.c" that just has some arithmetic in it similar to what you're using and then sent it to me so I could compile it for you.
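
As an idea of what such a "hello.c" might contain, here is a small sketch with a few 32-bit integer and float operations. Every function name and constant in it is a made-up placeholder, not code from the project in question; the point is only to give the compiler some representative arithmetic to translate.

```c
#include <stdint.h>

/* Hypothetical workload: names and operations are placeholders only. */

/* 32-bit multiply and add. */
int32_t scale_and_offset(int32_t raw, int32_t gain, int32_t offset)
{
    return raw * gain + offset;
}

/* 32-bit adds followed by a constant shift; the sum may wrap for large inputs. */
uint32_t average4(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return (a + b + c + d) >> 2;
}

/* Integer-to-float conversion followed by a float multiply. */
float to_volts(uint32_t counts)
{
    return (float)counts * (3.3f / 4096.0f);
}
```

Compiling this with the two commands quoted above and reading the dump would show which operations are done in-line and which become library calls.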

Tom


TomE · ‎11-09-2012

> The integer code is unavailable and I am currently trying to evaluate the floating point converts.

I wasn't after your exact custom code. I'm just after a function that performs similar arithmetic (to what you're doing) so I can compile it with gcc and the gcc library, then run it to see how fast it is, as well as allowing you to inspect the generated assembly code to see if it is better or worse than CodeWarrior. See later in this post for a possibly better suggestion.

> Has anyone seen/found/built a runtime library replacement for CFV1 usable in or with CodeWarrior ???

I'm suggesting gcc. It is not a "direct replacement" in CodeWarrior, but you can always use any C compiler to compile down to assembly and then use that assembly file in another development system. That's why I keep asking you to post some code (or send as a private message, or email me) so we can get a direct comparison.

> one calculation done in an optimized custom 40-bit floating point.

If the calculations are sufficiently constrained (in dynamic range) you might be able to use fixed-point. Especially if there's only "one place" for the complicated stuff.
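
A minimal sketch of what fixed-point arithmetic looks like in C, assuming a Q8.8 format is wide enough (the real calculation may well need a different format). The type and function names are hypothetical.

```c
#include <stdint.h>

/* Q8.8 fixed point: value = raw / 256, range about -128 to +127.996. */
typedef int16_t q8_8;

/* Multiply two Q8.8 values using a single 32-bit intermediate.
   The result is truncated toward zero and overflow is not checked. */
static q8_8 q8_8_mul(q8_8 a, q8_8 b)
{
    int32_t wide = (int32_t)a * (int32_t)b;   /* Q16.16 intermediate */
    return (q8_8)(wide / 256);                /* back to Q8.8 */
}
```

For example, 1.5 is stored as 384 and 2.25 as 576; their product comes back as 864, which is 3.375 in Q8.8. The intermediate stays within 32 bits, so no 64-bit helper routine is needed.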

> the 64-bit CINT64 routines were being called to perform the 32-bit calcs

> (add, subtract, and multiply (integer divide was avoided))

32-bit adds, subtracts and multiplies will overflow at 32 bits anyway, so why is it using 64-bit intermediates? Especially when the 8-bitter could do all this in 16 bits. If 16 bits will handle the numeric range, would using "short" (int16_t) variables in the Coldfire code make it use 32-bit inline operations instead of calling the 64-bit functions?

> The multiply, divide, rotates, and shifts all operate as bit-bangers (one bit at a time),

> they do not use the Coldfire instructions to perform register sized operations.

That's horrible! Even the original 68000 from 1979 (33 YEARS ago) had multiple-bit shifts and rotates, both immediate and register. I have to go back to the 68HC11 to find a Motorola CPU that only has single-bit shifts. Maybe that library is a "converted HC11" one, and then they never (in 33 years) made it any better. If the library is that bad you could almost run a GP32 emulator on the ColdFire and have it run the old optimum code faster than it can run native. :-)

> Has anyone seen/found/built a runtime library replacement for CFV1 usable in or with CodeWarrior ???

How deep are you locked into CodeWarrior? Why not replace the lot if it is that bad. Download the following (Linux or Windows hosted) free compiler and use it to compile your code. Given your descriptions I'd expect it to be capable of way better results:

Sourcery CodeBench Lite Edition - Mentor Graphics

> Due to a conflict with the hardware designer ... I am detached from that project (and probably out of a job by years end)

Sorry to hear. The main purpose of Software is to correct mistakes by Hardware engineers. Either ones in the company I work for or the ones that designed the chips I'm working with (Freescale mainly). That's always been a large part of my employment.

Tom

TomE · ‎10-28-2012

Can you send me a simple sample of your code that you're seeing run slower than you like and I'll compile and run it on my ColdFire with my compiler and library.

Tom

TomE · ‎10-26-2012

I wouldn't expect there to be ongoing development for these chips. They're an old and established product, and that also means they've moved from "Development" to "Support" long ago. All the development guys are on the "bleeding edge" which is probably the i.MX chips these days.

Which libraries are you using? Are they the CodeWarrior ones? The gcc ones might be better (or vice versa).

As far as I can tell, the 08GP32 (MC68HC908GP32) runs at a maximum speed of 8MHz and takes 2 to 6 clock cycles per instruction, most of which are 8 bits. The MCF51JM128VLK should be able to run at up to 50.33MHz, and can execute one instruction per clock. That should be at least 12 times faster on simple byte operations and a lot faster on wider operations.

If you're getting the 08 to outrun the MCF51 then something's either wrong or wildly inefficient. Are you sure the MCF51 is running at the right clock speed or are you running it slow to save power? I've usually found that in many cases the power consumption is INDEPENDENT of clock speed if you're using a "low power stop" instruction when there's nothing to do, as otherwise the CPU takes twice as long at half the clock speed and half the power to do the same workload, so it all cancels out.

> compiler seems to predominantly do the 32-bit math using the LongLongCF.C CINT64 routines.

What "32-bit maths" are you performing? Is this 32-bit floating point or Mul/Div or just adds and subtracts? Most simple stuff the compiler should convert to in-line efficient code.

Tom
