topic Re: Inline asm GCC vs. MW (e500v2) in P-Series

Inline asm GCC vs. MW (e500v2)

juhalaukkanen — Fri, 11 Sep 2015 12:02:57 GMT

There is following function

static uint32 u32SwapEndianness(register uint32 val)

{

asm volatile("rlwinm %0, %1, 24, 0, 31" : "=r"(ret) : "r"(val));

asm volatile("rlwimi %0, %1, 8, 8, 15" : "=r"(ret) : "r"(val));

asm volatile("rlwimi %0, %1, 8, 24, 31" : "=r"(ret) : "r"(val));

return ret;

}

This generates different code with MW compiler and GCC compiler.

Below are instructions what MW emits. r31=ret, r03=val

li r31,0

rotrwi r31,r3,24

rlwimi r31,r3,8,8,15

inslwi r31,r3,8,24

mr r3,r31

So this very clean; how I excepted output to be.

Now then, below are instructions what GCC emits. r9=ret, r10=val

mr r9,r3

li r30,0

rotrwi r10,r9,24

mr r30,r10

rlwimi r10,r9,8,8,15

mr r30,r10

inslwi r9,r9,8,24

mr r30,r9

Now here first I don't understand why GCC uses weird swap/alias between r10/r30 and r9/r3. Not necessary, but not wrong.

Then the last instruction inslwi operands are wrong. Should be r10,r9...

GCC is 4.8.2 (gcc-4.8.2-Ee500v2-eabispe) and MW compiler is 4.3 build 278. Both supplied with CW 10.4. Gcc parameter '-Wa,-mregnames' used.

Re: Inline asm GCC vs. MW (e500v2)

scottwood — Fri, 11 Sep 2015 17:55:06 GMT

What optimization parameter did you pass to GCC? If you didn't enable any optimizations, it's not surprising that the code generated is far from optimal. I'm able to reproduce this with -O0 but not -O1 or -O2. BTW, with optimizations on, you shouldn't need the register keyword.

Why is using r9 wrong in inslwi? You don't show the code that uses the result. As long as that code uses r9 or r30, it's correct.

Re: Inline asm GCC vs. MW (e500v2)

juhalaukkanen — Tue, 15 Sep 2015 06:58:59 GMT

OK optimize levels indeed remove those unnecessary swaps.

However I still don't understand how GCC generated version can be right? Here's my sample with values.

// MW

r3=0x49fb6, r31=n/a

rotrwi r31,r3,24 //r31=0xb600049f, r3=0x49fb6

rlwimi r31,r3,8,8,15 //r31=0xb69f049f, r3=0x49fb6

inslwi r31,r3,8,24 //r31=0xb69f0400, r3=0x49fb6

mr r3,r31 //r3=ret

//ret -> 0xb69f0400 -> OK!

// GCC

r9=0x49fb6, r10=n/a

rotrwi r10,r9,24 // r10 = 0xb600049f, r9=0x49fb6

rlwimi r10,r9,8,8,15 // r10 = 0xb69f049f, r9=0x49fb6

inslwi r9,r9,8,24 // r9=0x49f00, r10 = 0xb69f049f

mr r3,r9 //r3=ret

//ret -> 0x49f00 -> WRONG!

Also there's single instruction which does the same. Emitted by GCC's __builtin_bswap32()

r9=0x49fb6, r0,r7 memory offset/location where store wanted (&ret).

stwbrx r9,r0,r7

//*ret -> 0xb69f0400 -> OK!

Re: Inline asm GCC vs. MW (e500v2)

mathiasparnaude — Tue, 15 Sep 2015 08:54:55 GMT

Hi Juha,

I also get 0x49f00 with the code generated by gcc. But in my opinion, the compiler does what you asked. The instructions you wrote are right but there is a problem with the use of inline asm (that is error-prone) and how parameters are declared.

At each line, you ask to apply an operation on "ret" with "val" as input but ... nothing says that "ret" is the previously modified "ret" to be reused!

By chance with MW, it does not use an additional register, so luckily reuses the same destination register on the third instruction.

For me, that works with the following code, that says "ret" is modified (what prevents it to be trashed):

static unsigned int u32SwapEndianness(register unsigned int val)

{

asm volatile("rlwinm %0, %1, 24, 0, 31" : "=r"(ret) : "r"(val));

asm volatile("rlwimi %0, %1, 8, 8, 15" : "+r"(ret) : "r"(val));

asm volatile("rlwimi %0, %1, 8, 24, 31" : "+r"(ret) : "r"(val));

return ret;

}

But the generated code is not very good and that is not good either to write independent asm lines like that. So to be cleaner (even if, again, it is difficult with inline asm code), the code should be written in a single block:

static unsigned int u32SwapEndianness(register unsigned int val)

{

asm volatile("rlwinm %0, %1, 24, 0, 31 \n\t"

"rlwimi %0, %1, 8, 8, 15 \n\t"

"rlwimi %0, %1, 8, 24, 31 \n\t" : "+r"(ret) : "r"(val));

return ret;

}

That gives something like that:

mr r10,r3

li r31,0

mr r9,r31

rotrwi r9,r10,24

rlwimi r9,r10,8,8,15

inslwi r9,r10,8,24

mr r31,r9

mr r9,r31

mr r3,r9

And these days (and with this CPU architecture), I think using the keyword is quite useless. And here, it is even worst because it forces to use registers and adds an overhead. Removing these keywords, the code looks like:

li r9,0

rotrwi r9,r10,24

rlwimi r9,r10,8,8,15

inslwi r9,r10,8,24

mr r3,r9

Mathias

Re: Inline asm GCC vs. MW (e500v2)

juhalaukkanen — Tue, 15 Sep 2015 13:23:32 GMT

Thank you. Indeed the pieces were loose and so then malformed sequence.

ps. and yes indeed this pasted block of asm is kinda useless, but it's obfuscated just as a simpler reproducible example from similar concept where using inline asm was applicable. :smileyhappy:

Re: Inline asm GCC vs. MW (e500v2)

scottwood — Tue, 15 Sep 2015 15:17:43 GMT

Because the first instruction is rlwinm, whose output is a pure output rather than rlwimi's read-modify-write, you don't need the + when you combine them all into one asm statement. Eliminating the + gets rid of the li instruction.

Re: Inline asm GCC vs. MW (e500v2)

mathiasparnaude — Wed, 16 Sep 2015 12:40:45 GMT

Hi Scott,

That looked right to me but I tried and with the single asm statement, if I use "=r" instead of "+r", the result is wrong, because r9 is used as source and destination for each instruction.

The other solution would be to use the early clobber modifier like this:

static unsigned int u32SwapEndianness(unsigned int val)

{

unsigned int ret;

asm volatile("rlwinm %0, %1, 24, 0, 31 \n\t"

"rlwimi %0, %1, 8, 8, 15 \n\t"

"rlwimi %0, %1, 8, 24, 31 \n\t" : "=&r"(ret) : "r"(val));

return ret;

}

In the same time, I removed the useless initialization of ret.