(DSC) (Codewarrior 11) 32bit variable register optimization bug in DSC Compiler

Lorenzo_Mch_IT · ‎01-24-2019

I'm using Codewarrior 11 for MCU with the DSC toolchain.

The DSC compiler and/or inline assembly optimizer for the MC56F84789VLL (DSP 56800EX core) does not correctly handle access to the LSP (lower 16bit word) of a 32bit variable when it is mapped to an accumulator register.

When using maximum optimization options ( -DOPTION_CORE_V3=1 -opt level=4 -opt speed -inline level=8 -inline auto -sprog -v3 -requireprotos -v3 ) the following byte-swapping code:

inline UINT32 swapGetUINT32(register const UINT8 *R2Reg)
{
register UINT32 AReg;
// N.B. by declaring "register UINT32 XReg;"
// inside assembly instructions you can use
// AReg --> 32bit register (either dst A,B,C,D or src A10,B10,C10,D10)
// AReg.0 --> lower 16bit word (either A0,B0,C0,D0)
// AReg.1 --> upper 16bit word (either A1,B1,C1,D1)
__asm{
.optimize_iasm on
moveu.bp X:(R2Reg)+,Y1 // 1 1
moveu.bp X:(R2Reg)+,AReg // 1 1 // *(p+1) into AReg.1 , clear AReg.2,AReg.0
moveu.bp X:(R2Reg)+,Y0 // 1 1
moveu.bp X:(R2Reg)+,X0 // 1 1
asll.l #8,Y // 2 1
move.w X0,AReg.0 // 1 1
or.l Y,AReg // 1 1
.optimize_iasm off
}
return AReg;
}

INSTEAD of generating the following code (in the example below AReg gets mapped to A, but the compiler can generate the same inline sequence using A,B,C or D):

moveu.bp X:(R1)+,Y1
moveu.bp X:(R1)+,A // writes to A1 and clears A0
moveu.bp X:(R1)+,Y0
moveu.bp X:(R1)+,X0 // DSC does not have "moveu.bp X:(R1)+,A0", so we copy to X0
asll.l #0x000008,Y // shift lower bytes in Y1,Y0 to upper bytes
move.w X0,A0 // and then we copy X0 to A0
or.l Y,A // now we merge the byte-swapped upper and lower bytes
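
To check the expected result off-target, the intended sequence can be modelled in portable C. This is only a host-side sketch: the name swap_get_u32_model and the <stdint.h> types are assumptions made for illustration and are not part of the attached project. Each local variable stands in for the register named in its comment, and the accumulator extension (A2) is not modelled.

```c
#include <stdint.h>

/* Host-side model of the intended instruction sequence.
   Hypothetical helper, not taken from the attached project. */
static uint32_t swap_get_u32_model(const uint8_t *p)
{
    uint32_t y  = ((uint32_t)p[0] << 16) | p[2]; /* Y1 = p[0], Y0 = p[2]   */
    uint32_t a  = (uint32_t)p[1] << 16;          /* A1 = p[1], A0 cleared  */
    uint16_t x0 = p[3];                          /* X0 = p[3]              */

    y <<= 8;                                     /* asll.l #8,Y            */
    a = (a & 0xFFFF0000u) | x0;                  /* move.w X0,A0           */
    return a | y;                                /* or.l Y,A               */
}
```

For the input bytes 12 34 56 78 (hex) the model returns 0x12345678, which is the value the correctly compiled function should return.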

SOMETIMES the function swapGetUINT32 gets compiled as:

moveu.bp X:(R1)+,Y1
moveu.bp X:(R1)+,A // WRITES (R1) to (A1), CLEAR A2, A0
adda #-3,SP,R4
move.l A10,X:(R4) // WRITES A0 to (SP-3), A1 to (SP-2)
moveu.bp X:(R1)+,Y0
moveu.bp X:(R1)+,X0
asll.l #0x000008,Y // shift lower bytes in Y1,Y0 to upper bytes
move.w X0,X:(SP-2) // WRITES X0 to (SP-2) (where A1 was stored, overwriting it)
move.l X:(R4),A // reload A (it contains A1 = X0, A0 = 0 )
or.l Y,A // now we merge, but the result is not a correct 32bit byte-swap

The resulting code is less efficient and the returned value is WRONG!
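
To see what the faulty sequence actually returns, it can be traced with a small host-side model. This is a sketch under assumptions: the name swap_get_u32_faulty_model and the two-element slot array are hypothetical and only mirror the listing above, with slot[0] standing for X:(SP-3) and slot[1] for X:(SP-2). The accumulator extension (A2) is not modelled.

```c
#include <stdint.h>

/* Host-side model of the faulty sequence (hypothetical helper).
   slot[0] models X:(SP-3), slot[1] models X:(SP-2). */
static uint32_t swap_get_u32_faulty_model(const uint8_t *p)
{
    uint16_t slot[2];
    uint32_t y, a;

    y = ((uint32_t)p[0] << 16) | p[2];       /* Y1 = p[0], Y0 = p[2]          */
    a = (uint32_t)p[1] << 16;                /* moveu.bp: A1 = p[1], A0 = 0   */

    slot[0] = (uint16_t)(a & 0xFFFFu);       /* move.l A10,X:(R4): A0 -> SP-3 */
    slot[1] = (uint16_t)(a >> 16);           /*                    A1 -> SP-2 */

    y <<= 8;                                 /* asll.l #8,Y                   */
    slot[1] = p[3];                          /* move.w X0,X:(SP-2): A1 lost   */

    a = ((uint32_t)slot[1] << 16) | slot[0]; /* move.l X:(R4),A               */
    return a | y;                            /* or.l Y,A                      */
}
```

For the input bytes 12 34 56 78 (hex) this model returns 0x12785600 instead of 0x12345678: the second byte is lost and the fourth byte lands in its place.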

It seems like the compiler or the code optimizer assumes "moveu.bp X:(R1)+,A" writes (R1) to A0, while on 56800EX cores a 16bit write to an accumulator register (either A,B,C or D) writes to the MSP (either A1,B1,C1,D1) and CLEARS the associated LSP and EXT registers.

Also, the optimizer chooses to write A to external memory and then read it back when it is absolutely not necessary.

The attached files contain the source code to reproduce the bug and a disassembly of the output.

In the attached file xxxx.c I also included two solutions to this problem (both are based on forcing the usage of A register instead of letting the compiler choose the best optimization).

I solved the issue in my program by modifying the assembly code, but the root cause of this bug needs to be identified and fixed, because it is likely to have more widespread manifestations than just this one (it wrongly handles register mapping and stack access of 32bit values).

ZhangJennie · ‎01-28-2019

Hi Lorenzo Micheletto,

Can you please send us a demo project for this problem? We need to reproduce it on our side.

Please also specify how to reproduce your problem with the demo.

Thanks

Jun Zhang

Lorenzo_Mch_IT · ‎01-28-2019

Hello Jun Zhang,

The attached file TEST_20190128.zip contains a project reproducing the problem.

When the test code is run, the "outer level" sequence wrongly outputs different values for nominal and nominal2, while the "inside if()" sequence correctly outputs the same value for nominal and nominal2.

In file main.c, the inline function swapGetUINT32() gets wrongly expanded inside the outer level of function APDU_FakeNetConf(), while inside the body of the if(){ ... } statement it gets correctly expanded.
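
Because the fault depends on where the inline function is expanded, a check has to be evaluated at each call site rather than through a function pointer. The following is a minimal sketch of such a check, assuming <stdint.h> types are available; the helper name swap_result_ok is hypothetical and does not appear in the attached project.

```c
#include <stdint.h>

/* Returns 1 if 'got' equals the 4 bytes at p read most significant byte
   first, 0 otherwise. Hypothetical helper for call-site checks. */
static int swap_result_ok(const uint8_t *p, uint32_t got)
{
    uint32_t expected = ((uint32_t)p[0] << 24)
                      | ((uint32_t)p[1] << 16)
                      | ((uint32_t)p[2] << 8)
                      |  (uint32_t)p[3];
    return got == expected;
}
```

On the target it would be called as swap_result_ok(buf, swapGetUINT32(buf)) at both the outer level and inside the if() body, so that each inline expansion is checked separately. Adding the call may itself change register allocation, so a passing check does not prove the original expansion was correct.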

swapGetUINT32() was coded so that the AReg variable could be mapped by the compiler's code optimizer to one of the accumulator registers (A, B, C or D). But instead, in the outer level, it gets written to the stack and its lower word is then accessed in the wrong word order.

So there are two bugs:

1) a register allocation optimization bug (the compiler does not understand that it is not necessary to write AReg to the stack)

2) wrong access to lower word of 32bit variable when mapped on stack.

It would be correct if the WHOLE 32bit variable was accessed, or if the intention was to access the HIGHER 16bit word, but it is WRONG when trying to access the LOWER 16bit word.

That is, the faulty code generated in the "outer level" sequence of instructions is:

adda #-3,SP,R1
move.l A10,X:(R1) // writes lower word to (SP-3), higher word to (SP-2)
......
move.w X0,X:(SP-2) // writes X0 to the higher word of the 32bit variable

while the "correct" code (besides not taking advantage of register caching) should be:

adda #-3,SP,R1
move.l A10,X:(R1) // writes lower word to (SP-3), higher word to (SP-2)
......
move.w X0,X:(SP-3) // writes X0 to the lower word of the 32bit variable

and the "perfect" code (as generated in the "inside if()" sequence of statements) is:

move.w X0,A0
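
The word order described above can be written down as a small host-side model. This is a sketch only: the helper names store_u32, load_u32 and set_low_word are hypothetical, and mem[] simply models consecutive 16bit data-memory words, with the lower word of a 32bit value at the lower address, as in the listings above.

```c
#include <stdint.h>

/* mem[] models consecutive 16bit data-memory words. A 32bit value at word
   index i has its lower word at mem[i] and its higher word at mem[i + 1]
   (hypothetical helpers, matching the SP-3 / SP-2 layout shown above). */
static void store_u32(uint16_t *mem, unsigned i, uint32_t v)
{
    mem[i]     = (uint16_t)(v & 0xFFFFu); /* lower word, e.g. X:(SP-3)  */
    mem[i + 1] = (uint16_t)(v >> 16);     /* higher word, e.g. X:(SP-2) */
}

static uint32_t load_u32(const uint16_t *mem, unsigned i)
{
    return ((uint32_t)mem[i + 1] << 16) | mem[i];
}

/* Correct update of the lower word: same index as the 32bit value itself. */
static void set_low_word(uint16_t *mem, unsigned i, uint16_t w)
{
    mem[i] = w;
}
```

In this model the faulty code corresponds to writing mem[i + 1] where set_low_word() writes mem[i].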

I hope this will be useful to pinpoint the source of the bug.

Best Regards

Lorenzo Micheletto

ZhangJennie · ‎01-30-2019

Hi Lorenzo Micheletto,

Thanks for your reply.

I have escalated this problem to the development team. I will let you know if I get any feedback from them.

Thanks,

Jun Zhang