Compiler not using VMLA.F32 FPU instruction. Any suggestions to direct the compiler?

la_dsp · ‎03-26-2014

Using TWR-K70 with CodeWarrior 10.5, mwccarm, and mwasmarm, when I compile the following C code:

output = input * k2;

output = (z * k1) + output;

z = output;

I get the following disassembly:

; 74: output = input * k2;

;

0x00000036 0x8A0AEE29 vmul.F32 s16,s18,s20

;

; 75: output = (z * k1) + output;

;

0x0000003A 0x0AA9EE28 vmul.F32 s0,s17,s19

0x0000003E 0x8A00EE38 vadd.F32 s16,s16,s0

;

; 76: z = output;

;

0x00000042 0x8A48EEF0 vmov.F32 s17,s16

I would've expected the compiler to use a multiply-accumulate for line 75 (like vmla.F32 s16,s17,s19) instead of doing the multiply and add separately. Does anyone have any ideas on how to get the compiler to use the FPU more efficiently?

trytohelp · ‎04-02-2014

Hi Angelo,

I've contacted the compiler team and got their feedback.

The compiler does not optimize FPU instructions well; in this case it does not generate the vmla instruction.

The only way to emit this instruction is through inline assembly.

At this time, the compiler does not emit vmla.F32 on its own.
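As a rough illustration of the inline-assembly route, here is a minimal sketch of a multiply-accumulate wrapper. It is written in GCC extended-asm syntax, which is an assumption: the inline assembler syntax of the Freescale compiler is not shown in this thread and will likely differ. The helper name mac_f32 is hypothetical, and the plain C fallback is only there so the file still builds on targets without a VFP unit.

```c
/* Hypothetical helper (not from the thread): returns acc + a * b. */
static inline float mac_f32(float acc, float a, float b)
{
#if defined(__GNUC__) && defined(__arm__) && defined(__VFP_FP__) && !defined(__SOFTFP__)
    /* GCC-style extended asm. "t" selects a single-precision VFP
       register (s0-s31); "+t" marks acc as both read and written,
       which matches the accumulate behavior of vmla. */
    __asm__("vmla.f32 %0, %1, %2" : "+t"(acc) : "t"(a), "t"(b));
    return acc;
#else
    /* Portable fallback for non-ARM or soft-float builds. */
    return acc + a * b;
#endif
}
```

With this helper, the second statement of the original snippet could be written as output = mac_f32(output, z, k1);.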

Regards

Pascal

la_dsp · ‎04-02-2014

Hi Pascal,

Thanks for checking with the compiler team.

Is there any C code for which the compiler would emit vmla.f32?

Does the compiler team plan to support efficient use of vmla.f32?

It is a crucial instruction, and is a big part of what makes the FPU hardware attractive on Kinetis parts.

I would rather not have to use inline assembly.

Thanks again,

Angelo

trytohelp · ‎04-07-2014

Hi Angelo,

I was in touch with the compiler team.

As explained in my previous post, CW for MCU V10.x supports two compilers.

The Freescale compiler cannot generate the vmla.f32 instruction.

However, the GCC compiler generates it at optimization level -O1 and above.

I've checked on my side and got the following code:

+++++++++++++++++++++++++++++

Disassembling 'main.c'...

"C:\Freescale\CW MCU v10.5\eclipse\../Cross_Tools/arm-none-eabi-gcc-4_7_3/bin/arm-none-eabi-gcc" "..\Sources\main.c" @"Sources/main.args" -o"Sources\main.o"

Sources/main.args : -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -g3 -gdwarf-2 -gstrict-dwarf -I"C:/Temp/Community/321478/MCU_10.5/GCC_K70/Project_Headers" -I"C:/Temp/Community/321478/MCU_10.5/GCC_K70/Project_Settings/Startup_Code" -I"C:/Freescale/CW MCU v10.5/MCU/ARM_GCC_Support/ewl/EWL_C/include" -I"C:/Freescale/CW MCU v10.5/MCU/ARM_GCC_Support/ewl/EWL_Runtime/include" -O1 -ffunction-sections -fdata-sections -Wall -c -fmessage-length=0 -D__VFPV4__ -specs=ewl_c.specs

"C:\Freescale\CW MCU v10.5\eclipse\../Cross_Tools/arm-none-eabi-gcc-4_7_3/bin/arm-none-eabi-objdump" "Sources\main.o" @"Sources/main.args"

Sources/main.args : -d -S -x

Sources\main.o: file format elf32-littlearm

Sources\main.o

architecture: arm, flags 0x00000011:

HAS_RELOC, HAS_SYMS

start address 0x00000000

private flags = 5000000: [Version5 EABI]

Sections:

Idx Name Size VMA LMA File off Algn

0 .text 00000000 00000000 00000000 00000034 2**1

CONTENTS, ALLOC, LOAD, READONLY, CODE

1 .data 00000000 00000000 00000000 00000034 2**0

CONTENTS, ALLOC, LOAD, DATA

2 .bss 00000000 00000000 00000000 00000034 2**0

ALLOC

3 .text.main 00000054 00000000 00000000 00000034 2**2

CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE

4 .debug_info 000000f2 00000000 00000000 00000088 2**0

CONTENTS, RELOC, READONLY, DEBUGGING

5 .debug_abbrev 00000077 00000000 00000000 0000017a 2**0

CONTENTS, READONLY, DEBUGGING

6 .debug_aranges 00000020 00000000 00000000 000001f1 2**0

CONTENTS, RELOC, READONLY, DEBUGGING

7 .debug_macinfo 0008778c 00000000 00000000 00000211 2**0

CONTENTS, READONLY, DEBUGGING

8 .debug_line 000001d2 00000000 00000000 0008799d 2**0

CONTENTS, RELOC, READONLY, DEBUGGING

9 .debug_str 00000132 00000000 00000000 00087b6f 2**0

CONTENTS, READONLY, DEBUGGING

10 .comment 0000007a 00000000 00000000 00087ca1 2**0

CONTENTS, READONLY

11 .ARM.attributes 0000003b 00000000 00000000 00087d1b 2**0

CONTENTS, READONLY

12 .debug_frame 00000020 00000000 00000000 00087d58 2**2

CONTENTS, RELOC, READONLY, DEBUGGING

SYMBOL TABLE:

00000000 l df *ABS* 00000000 main.c

00000000 l d .text 00000000 .text

00000000 l d .data 00000000 .data

00000000 l d .bss 00000000 .bss

00000000 l d .text.main 00000000 .text.main

00000000 l d .debug_info 00000000 .debug_info

00000000 l d .debug_abbrev 00000000 .debug_abbrev

00000000 l d .debug_aranges 00000000 .debug_aranges

00000000 l d .debug_macinfo 00000000 .debug_macinfo

00000000 l d .debug_line 00000000 .debug_line

00000000 l d .debug_str 00000000 .debug_str

00000000 l d .debug_frame 00000000 .debug_frame

00000000 l d .comment 00000000 .comment

00000000 l d .ARM.attributes 00000000 .ARM.attributes

00000000 g F .text.main 00000052 main

00000004 O *COM* 00000004 input

00000004 O *COM* 00000004 k2

00000004 O *COM* 00000004 output

00000004 O *COM* 00000004 z

00000004 O *COM* 00000004 k1

Disassembly of section .text.main:

00000000 <main>:

{

int counter = 0;

output = input * k2;

0: f240 0300 movw r3, #0

0: R_ARM_THM_MOVW_ABS_NC input

4: f2c0 0300 movt r3, #0

4: R_ARM_THM_MOVT_ABS input

8: ed93 7a00 vldr s14, [r3]

c: f240 0000 movw r0, #0

c: R_ARM_THM_MOVW_ABS_NC k2

10: f2c0 0000 movt r0, #0

10: R_ARM_THM_MOVT_ABS k2

14: edd0 7a00 vldr s15, [r0]

18: ee27 0a27 vmul.f32 s0, s14, s15

1c: f240 0300 movw r3, #0

1c: R_ARM_THM_MOVW_ABS_NC output

20: f2c0 0300 movt r3, #0

20: R_ARM_THM_MOVT_ABS output

24: ed83 0a00 vstr s0, [r3]

output = (z * k1) + output;

28: f240 0200 movw r2, #0

28: R_ARM_THM_MOVW_ABS_NC z

2c: f2c0 0200 movt r2, #0

2c: R_ARM_THM_MOVT_ABS z

30: edd2 6a00 vldr s13, [r2]

34: f240 0100 movw r1, #0

34: R_ARM_THM_MOVW_ABS_NC k1

38: f2c0 0100 movt r1, #0

38: R_ARM_THM_MOVT_ABS k1

3c: edd1 0a00 vldr s1, [r1]

40: ed93 1a00 vldr s2, [r3]

44: ee06 1aa0 vmla.f32 s2, s13, s1

48: ed83 1a00 vstr s2, [r3]

z = output;

4c: 6818 ldr r0, [r3, #0]

4e: 6010 str r0, [r2, #0]

50: e7fe b.n 50 <main+0x50>

52: bf00 nop

+++++++++++++++++++++++++++++

I've attached the project I used.
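The attachment is not reproduced in this thread. A minimal source consistent with the symbol table and the interleaved source lines in the listing might look like the sketch below. The function name filter_step is hypothetical (in the listing the statements sit directly in main), and the float types follow Angelo's statement that all five variables are float.

```c
/* Uninitialized globals, matching the *COM* symbols in the listing. */
float input, k2, output, z, k1;

/* Hypothetical wrapper around the three statements from the question. */
void filter_step(void)
{
    output = input * k2;
    output = (z * k1) + output;  /* the listing shows vmla.f32 here */
    z = output;
}
```

Built with the flags shown in main.args (notably -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O1), the second statement is the one for which the listing shows a single vmla.f32.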

Regards

Pascal

la_dsp · ‎04-08-2014

Thanks Pascal,

The example project worked for me. I'll try it out on my project.

Thanks again,

Angelo

egoodii · ‎04-03-2014

Turns out IAR 6.30 doesn't seem to give me that (directly) either. A 'scale/summation' loop of 'float' types nets this 'eight at a time' block (total taking 1.4ms at 120MHz running in RAM):

for( uint32_t j=10000;j>0;j--)

accum += Farray[j] * 1.07f;

0x1fff1964: 0xea4f 0x0008 MOV.W R0, R8

??main_3:

0x1fff1968: 0xf1a0 0x021c SUB.W R2, R0, #28 ; 0x1c

0x1fff196c: 0xed92 0x0a00 VLDR S0, [R2]

0x1fff1970: 0xf1a0 0x0218 SUB.W R2, R0, #24 ; 0x18

0x1fff1974: 0xedd2 0x0a00 VLDR S1, [R2]

0x1fff1978: 0xf1a0 0x0214 SUB.W R2, R0, #20 ; 0x14

0x1fff197c: 0xed92 0x1a00 VLDR S2, [R2]

0x1fff1980: 0xf1a0 0x0210 SUB.W R2, R0, #16 ; 0x10

0x1fff1984: 0xedd2 0x1a00 VLDR S3, [R2]

0x1fff1988: 0xf1a0 0x020c SUB.W R2, R0, #12 ; 0xc

0x1fff198c: 0xedd0 0x3a00 VLDR S7, [R0]

0x1fff1990: 0xed92 0x2a00 VLDR S4, [R2]

0x1fff1994: 0xee63 0x3a88 VMUL.F32 S7, S7, S16

0x1fff1998: 0xf1a0 0x0208 SUB.W R2, R0, #8

0x1fff199c: 0xedd2 0x2a00 VLDR S5, [R2]

0x1fff19a0: 0x1f02 SUBS R2, R0, #4

0x1fff19a2: 0xed92 0x3a00 VLDR S6, [R2]

0x1fff19a6: 0x19aa ADDS R2, R5, R6

0x1fff19a8: 0xed92 0x4a00 VLDR S8, [R2]

0x1fff19ac: 0xee23 0x3a08 VMUL.F32 S6, S6, S16

0x1fff19b0: 0xee73 0x3a84 VADD.F32 S7, S7, S8

0x1fff19b4: 0xee62 0x2a88 VMUL.F32 S5, S5, S16

0x1fff19b8: 0xee22 0x2a08 VMUL.F32 S4, S4, S16

0x1fff19bc: 0xee61 0x1a88 VMUL.F32 S3, S3, S16

0x1fff19c0: 0xee21 0x1a08 VMUL.F32 S2, S2, S16

0x1fff19c4: 0xee60 0x0a88 VMUL.F32 S1, S1, S16

0x1fff19c8: 0xee20 0x0a08 VMUL.F32 S0, S0, S16

0x1fff19cc: 0xee33 0x3a23 VADD.F32 S6, S6, S7

for( uint32_t j=10000;j>0;j--)

0x1fff19d0: 0x3820 SUBS R0, R0, #32 ; 0x20

0x1fff19d2: 0x1e49 SUBS R1, R1, #1

0x1fff19d4: 0xee72 0x2a83 VADD.F32 S5, S5, S6

0x1fff19d8: 0xee32 0x2a22 VADD.F32 S4, S4, S5

0x1fff19dc: 0xee71 0x1a82 VADD.F32 S3, S3, S4

0x1fff19e0: 0xee31 0x1a21 VADD.F32 S2, S2, S3

0x1fff19e4: 0xee70 0x0a81 VADD.F32 S1, S1, S2

0x1fff19e8: 0xee30 0x0a20 VADD.F32 S0, S0, S1

0x1fff19ec: 0xed82 0x0a00 VSTR S0, [R2, #0]

for( uint32_t j=10000;j>0;j--)

0x1fff19f0: 0xd1ba BNE.N ??main_3 ; 0x1fff1968

printf("%d",accum);

0x1fff19f2: 0xee10 0x0a10 VMOV R0, S0
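For anyone who wants to try the same loop on another toolchain, it can be sketched as a self-contained function. The function name scale_sum and the parameterized length are assumptions; the post only shows the loop body and the global names accum and Farray.

```c
#include <stdint.h>

/* Hypothetical reconstruction of the loop: sums farray[n] down to
   farray[1], each scaled by 1.07f. Index 0 is never read, so the
   array must hold at least n + 1 elements. */
float scale_sum(const float *farray, uint32_t n)
{
    float accum = 0.0f;
    for (uint32_t j = n; j > 0; j--)
        accum += farray[j] * 1.07f;
    return accum;
}
```

With n = 10000 as in the post, the array needs 10001 elements. Also note that a float result should be printed with %f rather than %d, since passing a floating-point argument for %d is undefined behavior in C.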

trytohelp · ‎04-03-2014

Hi Angelo,

The compiler doesn't generate the vmla.f32 instruction.

CW for MCU V10.6 was released yesterday.

Like the previous version, the tool supports the Freescale and GCC compilers.

The Freescale compiler will not support this instruction.

As for GCC, I don't know.

On the Freescale side, nothing is currently planned to support vmla.f32.

Regards

Pascal

trytohelp · ‎03-31-2014

Hi Angelo,

We tried to reproduce the behavior on our side without success.

Can you please provide an example that reproduces it?

Regards

Pascal

la_dsp · ‎03-31-2014

Hi Pascal,

What disassembly did your test generate? Did your test result in assembly using the vmla.f32 instruction?

Thanks,

Angelo

trytohelp · ‎04-01-2014

Angelo,

The problem is how to define all the objects used in your test code:

output, input, k2, k1, z

The types used in the definitions are really important for reproducing the behavior.

Pascal

la_dsp · ‎04-01-2014

Hi Pascal,

I have those all defined as float.

Angelo

trytohelp · ‎04-01-2014

Hi Angelo,

I think I've reproduced the behavior on my side.

I will investigate it and will keep you informed ASAP.

Regards

Pascal
