I'm using CodeWarrior version 6.3 build 14 for the MCF5206e processor, and I am confounded as to why the compiler would optimize such a simple loop as the following so poorly:
Code: do { *pDst++ = *pSrc++; } while (--i != 0);
The following is the assembly produced. This is not the most efficient way to copy memory. The compiler doesn't make use of the post-increment assembly instructions. This loop uses 7 instructions for the simple memory copy when it could have used only three.
Code:
0x00000024 0x2A48       movea.l a0,a5
0x00000026 0x2C49       movea.l a1,a6
0x00000028 0x2C95       move.l  (a5),(a6)
0x0000002A 0x43E90004   lea     4(a1),a1
0x0000002E 0x41E80004   lea     4(a0),a0
0x00000032 0x5382       subq.l  #1,d2
0x00000034 0x66EE       bne.s   *-16 ; 0x00000024
However, if I rewrite my code by separating the ++ operators from the assignment, the compiler is able to produce more efficient code:
Code:
do {
    // Separate access and ++ for better optimization on this compiler
    *pDst = *pSrc;
    pDst++;
    pSrc++;
} while (--i != 0);

The assembly produced here is much more efficient, and uses post-increment addressing, as I would have expected in the first example.
Code:
0x00000020 0x22D8       move.l  (a0)+,(a1)+
0x00000022 0x5382       subq.l  #1,d2
0x00000024 0x66FA       bne.s   *-4 ; 0x00000020
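For anyone who wants to reproduce the comparison, here is a minimal sketch of the two loop forms as standalone functions. The function names (copy_fused, copy_split), the uint32_t element type (suggested by the move.l in the listings) and the unsigned counter type are assumptions, since the original declarations of pDst, pSrc and i are not shown. Both forms should behave identically, so any difference in the generated code comes from the optimizer rather than from the source semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Fused form: the ++ operators sit inside the assignment expression. */
void copy_fused(uint32_t *pDst, const uint32_t *pSrc, unsigned long i)
{
    assert(i != 0);            /* a do/while body always runs at least once */
    do {
        *pDst++ = *pSrc++;
    } while (--i != 0);
}

/* Split form: access first, then increment each pointer separately. */
void copy_split(uint32_t *pDst, const uint32_t *pSrc, unsigned long i)
{
    assert(i != 0);            /* same precondition as above */
    do {
        *pDst = *pSrc;
        pDst++;
        pSrc++;
    } while (--i != 0);
}
```

Compiling both functions with the same settings and comparing the disassembly should show whether the difference is tied to the source form alone.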
Finally, I wanted to unroll my loop by hand to reduce the number of branches. In either case (putting the ++ operators separately, or in the assignment), the optimizer does not do a good job. It appears to be hoisting the increments so that they happen once per loop iteration and then using offsets in each assignment. But this just reduces efficiency, because these MOVE instructions with displacements are larger and slower than the simple MOVE (A0)+,(A1)+.
Is there a way to prevent the compiler from performing these anti-optimizations? I have optimizations set to level 4 and "fast code execution" checked in the IDE.
Code:
i /= 4;
do {
    *pDst = *pSrc; pDst++; pSrc++; // Separate ++ for better optimization
    *pDst = *pSrc; pDst++; pSrc++; // Separate ++ for better optimization
    *pDst = *pSrc; pDst++; pSrc++; // Separate ++ for better optimization
    *pDst = *pSrc; pDst++; pSrc++; // Separate ++ for better optimization
} while (--i != 0);

Code:
0x00000028 0xE480           asr.l   #2,d0
0x0000002A 0x2290           move.l  (a0),(a1)
0x0000002C 0x41E80010       lea     16(a0),a0
0x00000030 0x43E90010       lea     16(a1),a1
0x00000034 0x2368FFF4FFF4   move.l  -12(a0),-12(a1)
0x0000003A 0x2368FFF8FFF8   move.l  -8(a0),-8(a1)
0x00000040 0x2368FFFCFFFC   move.l  -4(a0),-4(a1)
0x00000046 0x5380           subq.l  #1,d0
0x00000048 0x66E0           bne.s   *-30 ; 0x0000002a
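Read back into C, the listing above behaves roughly like the sketch below: one plain copy, a single 16-byte bump of each pointer, then three copies through negative offsets. Going by the opcodes in the listings, each displacement form (for example 0x2368FFF4FFF4) takes 6 bytes, against 2 bytes for move.l (a0)+,(a1)+ (0x22D8). The function name copy_hoisted, the uint32_t element type and the signed long counter (suggested by the asr.l) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C rendering of what the optimizer appears to emit. */
void copy_hoisted(uint32_t *pDst, const uint32_t *pSrc, long i)
{
    assert(i >= 4);            /* the unrolled do/while needs at least one block */
    i /= 4;                    /* asr.l  #2,d0 */
    do {
        pDst[0] = pSrc[0];     /* move.l (a0),(a1) */
        pSrc += 4;             /* lea    16(a0),a0 */
        pDst += 4;             /* lea    16(a1),a1 */
        pDst[-3] = pSrc[-3];   /* move.l -12(a0),-12(a1) */
        pDst[-2] = pSrc[-2];   /* move.l -8(a0),-8(a1) */
        pDst[-1] = pSrc[-1];   /* move.l -4(a0),-4(a1) */
    } while (--i != 0);        /* subq.l #1,d0 / bne.s */
}
```

The result is the same as four post-increment copies; only the addressing modes differ.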
And this is even worse....
Code:
i /= 4;
do {
    *pDst++ = *pSrc++;
    *pDst++ = *pSrc++;
    *pDst++ = *pSrc++;
    *pDst++ = *pSrc++;
} while (--i != 0);

Code:
0x0000002C 0xE480       asr.l   #2,d0
0x0000002E 0x2A48       movea.l a0,a5
0x00000030 0x2C49       movea.l a1,a6
0x00000032 0x2215       move.l  (a5),d1
0x00000034 0x43E90004   lea     4(a1),a1
0x00000038 0x2A49       movea.l a1,a5
0x0000003A 0x2C81       move.l  d1,(a6)
0x0000003C 0x41E80004   lea     4(a0),a0
0x00000040 0x2C48       movea.l a0,a6
0x00000042 0x2216       move.l  (a6),d1
0x00000044 0x41E80004   lea     4(a0),a0
0x00000048 0x2C48       movea.l a0,a6
0x0000004A 0x2A81       move.l  d1,(a5)
0x0000004C 0x43E90004   lea     4(a1),a1
0x00000050 0x2A49       movea.l a1,a5
0x00000052 0x2216       move.l  (a6),d1
0x00000054 0x41E80004   lea     4(a0),a0
0x00000058 0x2C48       movea.l a0,a6
0x0000005A 0x2A81       move.l  d1,(a5)
0x0000005C 0x43E90004   lea     4(a1),a1
0x00000060 0x2A49       movea.l a1,a5
0x00000062 0x2A96       move.l  (a6),(a5)
0x00000064 0x43E90004   lea     4(a1),a1
0x00000068 0x41E80004   lea     4(a0),a0
0x0000006C 0x5380       subq.l  #1,d0
0x0000006E 0x66BE       bne.s   *-64 ; 0x0000002e
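Both hand-unrolled loops above rely on i being a nonzero multiple of 4: the division discards any remainder, and a do/while entered with i equal to 0 would decrement past zero and keep going. If the caller cannot guarantee that, a variant along the following lines handles the leftover elements. This is only a sketch; the function name copy_unrolled4 and the parameter types are assumptions, and it keeps the split access/increment style that produced the better code for the simple loop above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unrolled copy that also handles counts not divisible by 4. */
void copy_unrolled4(uint32_t *pDst, const uint32_t *pSrc, unsigned long count)
{
    unsigned long blocks = count / 4;   /* full groups of four elements */
    unsigned long tail   = count % 4;   /* 0..3 leftover elements */

    assert(pDst && pSrc);               /* both pointers must be valid */

    while (blocks != 0) {
        *pDst = *pSrc; pDst++; pSrc++;
        *pDst = *pSrc; pDst++; pSrc++;
        *pDst = *pSrc; pDst++; pSrc++;
        *pDst = *pSrc; pDst++; pSrc++;
        blocks--;
    }
    while (tail != 0) {                 /* copy the remainder one at a time */
        *pDst = *pSrc; pDst++; pSrc++;
        tail--;
    }
}
```

Using while instead of do/while costs one extra test on entry but makes a count of 0 safe.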
Is there a way to get better optimization of post-increment memory accesses, or am I relegated to hand-optimizing directly in assembly?
This is what I would expect to see (my hand-optimized code):
Code:
    asr.l   #2,d0
1H  move.l  (a0)+,(a1)+
    move.l  (a0)+,(a1)+
    move.l  (a0)+,(a1)+
    move.l  (a0)+,(a1)+
    subq.l  #1,d0
    bne.s   1B
Thanks,
-Todd
Message Edited by CrasyCat on 2007-04-13 01:28 PM