Compiler not using the VMLA.F32 FPU instruction. Any suggestions to direct the compiler?


4,098 Views
la_dsp
Contributor I

Using TWR-K70 with CodeWarrior 10.5, mwccarm, and mwasmarm, when I compile the following C code:

 

        output = input * k2;

        output = (z * k1) + output;

        z = output;       

 

I get the following disassembly:

 

;   74:         output = input * k2;

;

0x00000036  0x8A0AEE29             vmul.F32         s16,s18,s20

;

;   75:         output = (z * k1) + output;

;

0x0000003A  0x0AA9EE28             vmul.F32         s0,s17,s19

0x0000003E  0x8A00EE38             vadd.F32         s16,s16,s0

;

;   76:         z = output;        

;

0x00000042  0x8A48EEF0             vmov.F32         s17,s16

 

I would have expected the compiler to use a multiply-accumulate for line 75 (like vmla.F32 s16,s17,s19) instead of doing the multiply and add separately. Does anyone have any ideas on how to get the compiler to use the FPU more efficiently?

11 Replies

3,231 Views
trytohelp
NXP Employee

Hi Angelo,

I've contacted the compiler team and got their feedback.

The compiler does not optimize FPU instructions efficiently; in this case, it does not generate the vmla instruction.

The only way to emit this instruction is through inline assembly.

At this time, the compiler does not emit vmla.F32 on its own.
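For readers who do go the inline-assembly route, a minimal sketch follows. It uses GCC-style extended asm, which is an assumption: the inline assembler in mwccarm may use a different syntax, so treat this as an illustration of the idea rather than code for the Freescale compiler. The helper name mac_f32 is hypothetical. Since vmla.f32 Sd, Sn, Sm computes Sd = Sd + Sn * Sm, the accumulator is passed as a read-write operand. On targets without a VFP unit the sketch falls back to plain C.

```c
/* Hypothetical helper: returns acc + a * b. */
static inline float mac_f32(float acc, float a, float b)
{
#if defined(__GNUC__) && defined(__arm__) && defined(__VFP_FP__) && !defined(__SOFTFP__)
    /* "t" asks GCC for a single-precision VFP register (s0-s31);
       "+t" marks acc as both an input and the output. */
    __asm__ ("vmla.f32 %0, %1, %2" : "+t" (acc) : "t" (a), "t" (b));
    return acc;
#else
    /* Fallback for other targets: ordinary multiply and add. */
    return acc + a * b;
#endif
}
```

With such a helper, line 75 of the original listing could be written as output = mac_f32(output, z, k1);.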


Regards

Pascal


3,231 Views
la_dsp
Contributor I

Hi Pascal,

Thanks for checking with the compiler team.

Is there any C code for which the compiler would emit vmla.f32?

Does the compiler team plan to support efficient use of vmla.f32?

It is a crucial instruction, and is a big part of what makes the FPU hardware attractive on Kinetis parts.

I would rather not have to use inline assembly.

Thanks again,

Angelo


3,231 Views
trytohelp
NXP Employee

Hi Angelo,

I was in touch with the compiler team.

As explained in my previous post, CW for MCU V10.x supports two compilers.

The Freescale compiler cannot generate the vmla.f32 instruction.

However, the GCC compiler generates it at optimization level -O1 and above.

I checked this on my side and got the following code:

+++++++++++++++++++++++++++++

Disassembling 'main.c'...

"C:\Freescale\CW MCU v10.5\eclipse\../Cross_Tools/arm-none-eabi-gcc-4_7_3/bin/arm-none-eabi-gcc" "..\Sources\main.c" @"Sources/main.args" -o"Sources\main.o"

Sources/main.args : -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -g3 -gdwarf-2 -gstrict-dwarf -I"C:/Temp/Community/321478/MCU_10.5/GCC_K70/Project_Headers" -I"C:/Temp/Community/321478/MCU_10.5/GCC_K70/Project_Settings/Startup_Code" -I"C:/Freescale/CW MCU v10.5/MCU/ARM_GCC_Support/ewl/EWL_C/include" -I"C:/Freescale/CW MCU v10.5/MCU/ARM_GCC_Support/ewl/EWL_Runtime/include" -O1 -ffunction-sections -fdata-sections -Wall -c -fmessage-length=0 -D__VFPV4__ -specs=ewl_c.specs

"C:\Freescale\CW MCU v10.5\eclipse\../Cross_Tools/arm-none-eabi-gcc-4_7_3/bin/arm-none-eabi-objdump" "Sources\main.o" @"Sources/main.args"

Sources/main.args : -d -S -x

Sources\main.o:    file format elf32-littlearm

Sources\main.o

architecture: arm, flags 0x00000011:

HAS_RELOC, HAS_SYMS

start address 0x00000000

private flags = 5000000: [Version5 EABI]

Sections:

Idx Name          Size      VMA      LMA      File off  Algn

  0 .text        00000000  00000000  00000000  00000034  2**1

                  CONTENTS, ALLOC, LOAD, READONLY, CODE

  1 .data        00000000  00000000  00000000  00000034  2**0

                  CONTENTS, ALLOC, LOAD, DATA

  2 .bss          00000000  00000000  00000000  00000034  2**0

                  ALLOC

  3 .text.main    00000054  00000000  00000000  00000034  2**2

                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE

  4 .debug_info  000000f2  00000000  00000000  00000088  2**0

                  CONTENTS, RELOC, READONLY, DEBUGGING

  5 .debug_abbrev 00000077  00000000  00000000  0000017a  2**0

                  CONTENTS, READONLY, DEBUGGING

  6 .debug_aranges 00000020  00000000  00000000  000001f1  2**0

                  CONTENTS, RELOC, READONLY, DEBUGGING

  7 .debug_macinfo 0008778c  00000000  00000000  00000211  2**0

                  CONTENTS, READONLY, DEBUGGING

  8 .debug_line  000001d2  00000000  00000000  0008799d  2**0

                  CONTENTS, RELOC, READONLY, DEBUGGING

  9 .debug_str    00000132  00000000  00000000  00087b6f  2**0

                  CONTENTS, READONLY, DEBUGGING

10 .comment      0000007a  00000000  00000000  00087ca1  2**0

                  CONTENTS, READONLY

11 .ARM.attributes 0000003b  00000000  00000000  00087d1b  2**0

                  CONTENTS, READONLY

12 .debug_frame  00000020  00000000  00000000  00087d58  2**2

                  CONTENTS, RELOC, READONLY, DEBUGGING

SYMBOL TABLE:

00000000 l    df *ABS*    00000000 main.c

00000000 l    d  .text    00000000 .text

00000000 l    d  .data    00000000 .data

00000000 l    d  .bss    00000000 .bss

00000000 l    d  .text.main    00000000 .text.main

00000000 l    d  .debug_info    00000000 .debug_info

00000000 l    d  .debug_abbrev    00000000 .debug_abbrev

00000000 l    d  .debug_aranges    00000000 .debug_aranges

00000000 l    d  .debug_macinfo    00000000 .debug_macinfo

00000000 l    d  .debug_line    00000000 .debug_line

00000000 l    d  .debug_str    00000000 .debug_str

00000000 l    d  .debug_frame    00000000 .debug_frame

00000000 l    d  .comment    00000000 .comment

00000000 l    d  .ARM.attributes    00000000 .ARM.attributes

00000000 g    F .text.main    00000052 main

00000004      O *COM*    00000004 input

00000004      O *COM*    00000004 k2

00000004      O *COM*    00000004 output

00000004      O *COM*    00000004 z

00000004      O *COM*    00000004 k1

Disassembly of section .text.main:

00000000 <main>:

{

    int counter = 0;

 

 

 

    output = input * k2;

  0:    f240 0300    movw    r3, #0

            0: R_ARM_THM_MOVW_ABS_NC    input

  4:    f2c0 0300    movt    r3, #0

            4: R_ARM_THM_MOVT_ABS    input

  8:    ed93 7a00    vldr    s14, [r3]

  c:    f240 0000    movw    r0, #0

            c: R_ARM_THM_MOVW_ABS_NC    k2

  10:    f2c0 0000    movt    r0, #0

            10: R_ARM_THM_MOVT_ABS    k2

  14:    edd0 7a00    vldr    s15, [r0]

  18:    ee27 0a27    vmul.f32    s0, s14, s15

  1c:    f240 0300    movw    r3, #0

            1c: R_ARM_THM_MOVW_ABS_NC    output

  20:    f2c0 0300    movt    r3, #0

            20: R_ARM_THM_MOVT_ABS    output

  24:    ed83 0a00    vstr    s0, [r3]

    output = (z * k1) + output;

  28:    f240 0200    movw    r2, #0

            28: R_ARM_THM_MOVW_ABS_NC    z

  2c:    f2c0 0200    movt    r2, #0

            2c: R_ARM_THM_MOVT_ABS    z

  30:    edd2 6a00    vldr    s13, [r2]

  34:    f240 0100    movw    r1, #0

            34: R_ARM_THM_MOVW_ABS_NC    k1

  38:    f2c0 0100    movt    r1, #0

            38: R_ARM_THM_MOVT_ABS    k1

  3c:    edd1 0a00    vldr    s1, [r1]

  40:    ed93 1a00    vldr    s2, [r3]

  44:    ee06 1aa0    vmla.f32    s2, s13, s1

  48:    ed83 1a00    vstr    s2, [r3]

    z = output;     

  4c:    6818          ldr    r0, [r3, #0]

  4e:    6010          str    r0, [r2, #0]

  50:    e7fe          b.n    50 <main+0x50>

  52:    bf00          nop

+++++++++++++++++++++++++++++

I have attached the project I used.

Regards

Pascal


3,231 Views
la_dsp
Contributor I

Thanks Pascal,

The example project worked for me.   I'll try it out on my project.

Thanks again,

Angelo


3,231 Views
egoodii
Senior Contributor III

Turns out IAR 6.30 doesn't seem to give me that (directly) either.  A 'scale/summation' loop of 'float' types nets this 'eight at a time' block (total taking 1.4ms at 120MHz running in RAM):

    for( uint32_t j=10000;j>0;j--)

            accum += Farray[j] * 1.07f;

  0x1fff1964: 0xea4f 0x0008  MOV.W    R0, R8

??main_3:

  0x1fff1968: 0xf1a0 0x021c  SUB.W    R2, R0, #28            ; 0x1c

  0x1fff196c: 0xed92 0x0a00  VLDR      S0, [R2]

  0x1fff1970: 0xf1a0 0x0218  SUB.W    R2, R0, #24            ; 0x18

  0x1fff1974: 0xedd2 0x0a00  VLDR      S1, [R2]

  0x1fff1978: 0xf1a0 0x0214  SUB.W    R2, R0, #20            ; 0x14

  0x1fff197c: 0xed92 0x1a00  VLDR      S2, [R2]

  0x1fff1980: 0xf1a0 0x0210  SUB.W    R2, R0, #16            ; 0x10

  0x1fff1984: 0xedd2 0x1a00  VLDR      S3, [R2]

  0x1fff1988: 0xf1a0 0x020c  SUB.W    R2, R0, #12            ; 0xc

  0x1fff198c: 0xedd0 0x3a00  VLDR      S7, [R0]

  0x1fff1990: 0xed92 0x2a00  VLDR      S4, [R2]

  0x1fff1994: 0xee63 0x3a88  VMUL.F32  S7, S7, S16

  0x1fff1998: 0xf1a0 0x0208  SUB.W    R2, R0, #8

  0x1fff199c: 0xedd2 0x2a00  VLDR      S5, [R2]

  0x1fff19a0: 0x1f02        SUBS      R2, R0, #4

  0x1fff19a2: 0xed92 0x3a00  VLDR      S6, [R2]

  0x1fff19a6: 0x19aa        ADDS      R2, R5, R6

  0x1fff19a8: 0xed92 0x4a00  VLDR      S8, [R2]

  0x1fff19ac: 0xee23 0x3a08  VMUL.F32  S6, S6, S16

  0x1fff19b0: 0xee73 0x3a84  VADD.F32  S7, S7, S8

  0x1fff19b4: 0xee62 0x2a88  VMUL.F32  S5, S5, S16

  0x1fff19b8: 0xee22 0x2a08  VMUL.F32  S4, S4, S16

  0x1fff19bc: 0xee61 0x1a88  VMUL.F32  S3, S3, S16

  0x1fff19c0: 0xee21 0x1a08  VMUL.F32  S2, S2, S16

  0x1fff19c4: 0xee60 0x0a88  VMUL.F32  S1, S1, S16

  0x1fff19c8: 0xee20 0x0a08  VMUL.F32  S0, S0, S16

  0x1fff19cc: 0xee33 0x3a23  VADD.F32  S6, S6, S7

    for( uint32_t j=10000;j>0;j--)

  0x1fff19d0: 0x3820        SUBS      R0, R0, #32            ; 0x20

  0x1fff19d2: 0x1e49        SUBS      R1, R1, #1

  0x1fff19d4: 0xee72 0x2a83  VADD.F32  S5, S5, S6

  0x1fff19d8: 0xee32 0x2a22  VADD.F32  S4, S4, S5

  0x1fff19dc: 0xee71 0x1a82  VADD.F32  S3, S3, S4

  0x1fff19e0: 0xee31 0x1a21  VADD.F32  S2, S2, S3

  0x1fff19e4: 0xee70 0x0a81  VADD.F32  S1, S1, S2

  0x1fff19e8: 0xee30 0x0a20  VADD.F32  S0, S0, S1

  0x1fff19ec: 0xed82 0x0a00  VSTR      S0, [R2, #0]

    for( uint32_t j=10000;j>0;j--)

  0x1fff19f0: 0xd1ba        BNE.N    ??main_3                ; 0x1fff1968

    printf("%d",accum);

  0x1fff19f2: 0xee10 0x0a10  VMOV      R0, S0
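
For reference, the loop above can be written as a small self-contained function. This is only a sketch: the name scale_sum and its parameters are made up for illustration, and it walks indices 0 to n-1 instead of counting down from 10000. Each iteration is a multiply followed by an add into the accumulator, which is the pattern a compiler could in principle map to vmla.f32; whether it does depends on the compiler and its options.

```c
#include <stddef.h>

/* Hypothetical helper: sums x[i] * k over n elements. */
float scale_sum(const float *x, size_t n, float k)
{
    float accum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        accum += x[i] * k;   /* candidate for a multiply-accumulate */
    }
    return accum;
}
```

When printing the result, printf needs a floating-point conversion such as %f; %d expects an int.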


3,231 Views
trytohelp
NXP Employee

Hi Angelo,

The compiler doesn't generate the vmla.f32 instruction.

CW for MCU V10.6 was released yesterday.

Like the previous version, the tool supports the Freescale and GCC compilers.

The Freescale compiler will not support this instruction.

As for GCC, I don't know.

On the Freescale side, nothing is currently planned to support vmla.f32.

Regards

Pascal


3,231 Views
trytohelp
NXP Employee

Hi Angelo,

We tried to reproduce the behavior on our side without success.

Can you please provide an example that reproduces the behavior?

Regards

Pascal


3,231 Views
la_dsp
Contributor I

Hi Pascal,

What disassembly did your test generate? Did your test result in assembly using the vmla.f32 instruction?

Thanks,

Angelo


3,231 Views
trytohelp
NXP Employee

Angelo,

The problem is how to define all the objects used in your test code:

  output, input, k2, k1, z

The types used in these definitions are really important for reproducing the behavior.

Pascal


3,231 Views
la_dsp
Contributor I

Hi Pascal,

I have those all defined as float.

Angelo
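
A minimal test case along these lines might look like the sketch below. The variable names and the three statements come from the original post, and all five objects are plain float globals as stated above. The function name filter_step is hypothetical and only wraps the statements so they can be compiled on their own.

```c
/* All objects are file-scope floats, as in the original question. */
float input, output, z, k1, k2;

/* Hypothetical wrapper around the three statements from the post. */
void filter_step(void)
{
    output = input * k2;
    output = (z * k1) + output;   /* the line expected to become vmla.f32 */
    z = output;
}
```

Compiling this with each compiler and comparing the disassembly of filter_step should show whether the multiply and add are merged into one instruction.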


3,231 Views
trytohelp
NXP Employee

Hi Angelo,

I think I've reproduced the behavior on my side.

I will investigate it and will keep you informed ASAP.

Regards

Pascal
