Hello,
I'm working with CodeWarrior Version: 10.1.8 Build Id:158.
I've tried to write an optimized function to calculate output = A' x A, i.e. output = DotProduct(A, A):
Word40 DOT_XX_40Bits(Word16 * restrict a_pV1, int N)
{
#pragma opt_level = "O3"
    int i;
    Word40 Results, Sum1 = X_extend(0), Sum2 = X_extend(0);
    int* pV32_1 = (int*)a_pV1;
    int __SR__ = readSR(); setnosat();
    cw_assert((int)a_pV1 % 8 == 0);
    cw_assert(N >= 64);
    cw_assert(N % 8 == 0);
    for (i = 0; i < N/2; i += 2)
    {
        Sum1 = X_macd(Sum1, pV32_1[i+0], pV32_1[i+0]);
        Sum2 = X_macd(Sum2, pV32_1[i+1], pV32_1[i+1]);
    }
    Sum1 = X_add (Sum1, Sum2);
    Results = X_asr (Sum1);
    writeSR(__SR__);
    return Results;
}
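For readers without the StarCore intrinsics at hand, the computation the function is meant to perform can be sketched in portable C. This is only a rough reference under stated assumptions: the name dot_xx_ref is hypothetical, it uses plain integer arithmetic in a 64-bit accumulator, and it does not model the 40-bit accumulator, the saturation mode, or any scaling done by X_macd and X_asr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable reference: sum of squares of n 16-bit samples.
 * Assumption: plain integer arithmetic; the 40-bit accumulator and any
 * scaling applied by the StarCore intrinsics are not modeled here. */
int64_t dot_xx_ref(const int16_t *a, size_t n)
{
    int64_t sum = 0;

    /* An empty vector is allowed; otherwise the pointer must be valid. */
    assert(a != NULL || n == 0);

    for (size_t i = 0; i < n; ++i) {
        /* Widen before multiplying so the 16x16 product cannot overflow. */
        sum += (int64_t)a[i] * a[i];
    }
    return sum;
}
```

A reference like this is mainly useful for checking the numerical result of an optimized version on a small test vector, after accounting for whatever scaling the intrinsic version applies.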
The generated assembly for the inner loop is as follows:
LOOPSTART3
[
macd d2,d2,d4
macd d3,d3,d6
move.2l (r0)+,d2:d3
]
LOOPEND3
That is only 50% efficiency!
I've tried to unroll the C loop further:
for (i = 0; i < N/2; i += 4)
{
    Sum1 = X_macd(Sum1, pV32_1[i+0], pV32_1[i+0]);
    Sum2 = X_macd(Sum2, pV32_1[i+1], pV32_1[i+1]);
    Sum3 = X_macd(Sum3, pV32_1[i+2], pV32_1[i+2]);
    Sum4 = X_macd(Sum4, pV32_1[i+3], pV32_1[i+3]);
}
But the result was worse (a lack of registers in the inner loop?).
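To make the intended unrolling pattern explicit, here is a minimal portable C sketch with four independent accumulators reading consecutive offsets i+0 through i+3. The name dot_xx_unrolled4 is hypothetical, the code works on individual 16-bit samples rather than packed 32-bit words, and it uses 64-bit integer arithmetic instead of the 40-bit intrinsics. It only illustrates the data flow; it says nothing about how the StarCore compiler will schedule the loop or allocate registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: sum of squares with four independent accumulators.
 * Assumption: n is a multiple of 4, mirroring the N % 8 == 0 style
 * precondition of the original function. */
int64_t dot_xx_unrolled4(const int16_t *a, size_t n)
{
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;

    assert(n % 4 == 0);

    for (size_t i = 0; i < n; i += 4) {
        /* Each accumulator depends only on its own previous value, so the
         * four multiply-accumulates carry no dependency on each other. */
        s0 += (int64_t)a[i + 0] * a[i + 0];
        s1 += (int64_t)a[i + 1] * a[i + 1];
        s2 += (int64_t)a[i + 2] * a[i + 2];
        s3 += (int64_t)a[i + 3] * a[i + 3];
    }

    /* Combine the partial sums once, outside the loop. */
    return (s0 + s1) + (s2 + s3);
}
```

Keeping the partial sums separate until after the loop is what removes the loop-carried dependency between the multiply-accumulates; the cost is one live register per accumulator plus the registers holding the loaded operands.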
- Note that the unroll pragma doesn't work (and I'd be glad to understand why).
Please help me understand how I should do this correctly, so that the inner loop reaches 100% efficiency.
Thanks,
Perry Shoham