Todd Lindberg

Coldfire: Poor optimization

Discussion created by Todd Lindberg on Oct 24, 2006
Latest reply on Oct 30, 2006 by Nouchi
I'm using CodeWarrior version 6.3 build 14 for the MCF5206e processor, and I am confounded as to why the compiler produces such poor code for a loop as simple as the following:
   do {
      *pDst++ = *pSrc++;
   } while (--i != 0);

The following is the assembly produced. This is not an efficient way to copy memory: the compiler doesn't use the post-increment addressing mode, so the loop takes 7 instructions for a simple memory copy when it could have used only three.
0x00000024  0x2A48                   movea.l  a0,a5
0x00000026  0x2C49                   movea.l  a1,a6
0x00000028  0x2C95                   move.l   (a5),(a6)
0x0000002A  0x43E90004               lea      4(a1),a1
0x0000002E  0x41E80004               lea      4(a0),a0
0x00000032  0x5382                   subq.l   #1,d2
0x00000034  0x66EE                   bne.s    *-16                  ; 0x00000024

However, if I re-write my code as follows, separating the ++ operators from the assignment, the compiler is able to produce more efficient code:
   do {
      // Separate access and ++ for better optimization on this compiler
      *pDst = *pSrc; pDst++; pSrc++;
   } while (--i != 0);

 The assembly produced here is much more efficient, and uses post-increment addressing, as I would have expected in the first example.
0x00000020  0x22D8                   move.l   (a0)+,(a1)+
0x00000022  0x5382                   subq.l   #1,d2
0x00000024  0x66FA                   bne.s    *-4                   ; 0x00000020

Finally, I wanted to unroll the loop by hand to reduce the number of branches. In either case (with the ++ operators separate, or folded into the assignment), the optimizer does a poor job. It appears to hoist the increments to once per iteration and then use offsets in each assignment, but that only reduces efficiency, because the offset MOVE instructions are larger and slower than the simple MOVE (A0)+,(A1)+.
Is there a way to prevent the compiler from performing these anti-optimizations? I have optimization set to level 4 and "fast code execution" checked in the IDE.
   i /= 4;
   do {
      *pDst = *pSrc; pDst++; pSrc++; // Separate ++ for better optimization
      *pDst = *pSrc; pDst++; pSrc++; // Separate ++ for better optimization
      *pDst = *pSrc; pDst++; pSrc++; // Separate ++ for better optimization
      *pDst = *pSrc; pDst++; pSrc++; // Separate ++ for better optimization
   } while (--i != 0);

0x00000028  0xE480                   asr.l    #2,d0
0x0000002A  0x2290                   move.l   (a0),(a1)
0x0000002C  0x41E80010               lea      16(a0),a0
0x00000030  0x43E90010               lea      16(a1),a1
0x00000034  0x2368FFF4FFF4           move.l   -12(a0),-12(a1)
0x0000003A  0x2368FFF8FFF8           move.l   -8(a0),-8(a1)
0x00000040  0x2368FFFCFFFC           move.l   -4(a0),-4(a1)
0x00000046  0x5380                   subq.l   #1,d0
0x00000048  0x66E0                   bne.s    *-30                  ; 0x0000002a

And this is even worse:
   i /= 4;
   do {
      *pDst++ = *pSrc++;
      *pDst++ = *pSrc++;
      *pDst++ = *pSrc++;
      *pDst++ = *pSrc++;
   } while (--i != 0);

0x0000002C  0xE480                   asr.l    #2,d0
0x0000002E  0x2A48                   movea.l  a0,a5
0x00000030  0x2C49                   movea.l  a1,a6
0x00000032  0x2215                   move.l   (a5),d1
0x00000034  0x43E90004               lea      4(a1),a1
0x00000038  0x2A49                   movea.l  a1,a5
0x0000003A  0x2C81                   move.l   d1,(a6)
0x0000003C  0x41E80004               lea      4(a0),a0
0x00000040  0x2C48                   movea.l  a0,a6
0x00000042  0x2216                   move.l   (a6),d1
0x00000044  0x41E80004               lea      4(a0),a0
0x00000048  0x2C48                   movea.l  a0,a6
0x0000004A  0x2A81                   move.l   d1,(a5)
0x0000004C  0x43E90004               lea      4(a1),a1
0x00000050  0x2A49                   movea.l  a1,a5
0x00000052  0x2216                   move.l   (a6),d1
0x00000054  0x41E80004               lea      4(a0),a0
0x00000058  0x2C48                   movea.l  a0,a6
0x0000005A  0x2A81                   move.l   d1,(a5)
0x0000005C  0x43E90004               lea      4(a1),a1
0x00000060  0x2A49                   movea.l  a1,a5
0x00000062  0x2A96                   move.l   (a6),(a5)
0x00000064  0x43E90004               lea      4(a1),a1
0x00000068  0x41E80004               lea      4(a0),a0
0x0000006C  0x5380                   subq.l   #1,d0
0x0000006E  0x66BE                   bne.s    *-64                  ; 0x0000002e

Is there a way to get better optimization of post-increment memory accesses, or am I resigned to hand-optimizing in assembly?
This is what I would expect to see (my hand-optimized code):
     asr.l    #2,d0
1H   move.l   (a0)+,(a1)+
     move.l   (a0)+,(a1)+
     move.l   (a0)+,(a1)+
     move.l   (a0)+,(a1)+
     subq.l   #1,d0
     bne.s    1B


Message Edited by CrasyCat on 2007-04-13 01:28 PM