MX6Q hardfp performance not as good as expected

Question asked by Robbie Jiang on Nov 30, 2015
Latest reply on Dec 2, 2015 by igorpadykov

Hi all,

 

 

We use the MX6Q in our project to implement some computer vision algorithms,

so we are very focused on its floating-point performance.

 

The platform is our own custom board with 1GB of DDR3 memory.

The test benchmark is nbench-byte-2.2.3.tar.gz (Linux/Unix nbench).

For the hardfp tests, we use a hardfp rootfs from Debian.

 

(1) First, we evaluated hardfp performance on the Debian-hf rootfs, building nbench natively with gcc 4.6.3 and the following CFLAGS:

 

CFLAGS = -s -static -Wall -O3 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mfloat-abi=hard

 

Here is the version information for the gcc used to build the nbench application:

root@debian-armhf: apt-get install binutils gcc

 

root@debian-armhf: gcc -v

Using built-in specs.

COLLECT_GCC=gcc

COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/4.6/lto-wrapper

Target: arm-linux-gnueabihf

Configured with: ../src/configure -v --with-pkgversion='Debian 4.6.3-14' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf

Thread model: posix

gcc version 4.6.3 (Debian 4.6.3-14)

 

 

Here is the score/result of the native hardfp version of nbench:

 

 

root@debian-armhf:/home/float/nbench-byte-pc# ./nbench 

 

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

 

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :           526.8  :      13.51  :       4.44

STRING SORT         :           60.88  :      27.20  :       4.21

BITFIELD            :      1.6582e+08  :      28.44  :       5.94

FP EMULATION        :          67.413  :      32.35  :       7.46

FOURIER             :          6146.9  :       6.99  :       3.93

ASSIGNMENT          :          7.6712  :      29.19  :       7.57

IDEA                :          1490.5  :      22.80  :       6.77

HUFFMAN             :          771.06  :      21.38  :       6.83

NEURAL NET          :          8.4347  :      13.55  :       5.70

LU DECOMPOSITION    :          295.52  :      15.31  :      11.05

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 24.164

FLOATING-POINT INDEX: 11.319

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

CPU                 : 4 CPU

L2 Cache            : 

OS                  : Linux 3.0.35sensor

C compiler          : gcc version 4.6.3 (Debian 4.6.3-14) 

libc                : libc-2.13.so

MEMORY INDEX        : 5.743

INTEGER INDEX       : 6.255

FLOATING-POINT INDEX: 6.278

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder
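
For reference when reading the summary lines, the three Linux indices appear to be geometric means of the per-test "New Index" column. The sketch below checks this against the run above. The grouping of tests into indices is an assumption based on how nbench is usually described, not something the output states, and `geometric_mean` is just an illustrative helper.

```python
import math

# "New Index" column (AMD K6/233 baseline) copied from the Debian native run above.
new_index = {
    "NUMERIC SORT": 4.44,
    "STRING SORT": 4.21,
    "BITFIELD": 5.94,
    "FP EMULATION": 7.46,
    "FOURIER": 3.93,
    "ASSIGNMENT": 7.57,
    "IDEA": 6.77,
    "HUFFMAN": 6.83,
    "NEURAL NET": 5.70,
    "LU DECOMPOSITION": 11.05,
}

# Assumed grouping of the ten tests into the three Linux indices.
GROUPS = {
    "MEMORY INDEX": ["STRING SORT", "BITFIELD", "ASSIGNMENT"],
    "INTEGER INDEX": ["NUMERIC SORT", "FP EMULATION", "IDEA", "HUFFMAN"],
    "FLOATING-POINT INDEX": ["FOURIER", "NEURAL NET", "LU DECOMPOSITION"],
}


def geometric_mean(values):
    """Geometric mean computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


indices = {
    name: geometric_mean([new_index[test] for test in tests])
    for name, tests in GROUPS.items()
}

for name, value in indices.items():
    print(f"{name:<20}: {value:.3f}")
```

The computed values agree with the reported 5.743 / 6.255 / 6.278 to within rounding of the table entries. Under this grouping, FP EMULATION (software floating-point emulation) counts toward the integer index, so only FOURIER, NEURAL NET and LU DECOMPOSITION reflect FPU performance.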

 

(2) With an Ubuntu armhf rootfs and its native gcc:

 

ubuntu@ubuntu-armhf:~/test/nbench-byte-2.2.3$ uname -a

Linux ubuntu-armhf 3.14.28-rt25-1.0.0_ga-132797-g4da02de-dirty #28 SMP PREEMPT RT Sat Oct 17 17:35:31 CST 2015 armv7l armv7l armv7l GNU/Linux

ubuntu@ubuntu-armhf:~/test/nbench-byte-2.2.3$ gcc -v

Using built-in specs.

COLLECT_GCC=gcc

COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/4.6/lto-wrapper

Target: arm-linux-gnueabihf

Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.6.3-1ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-float=hard --with-fpu=vfpv3-d16 --with-mode=thumb --disable-werror --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf

Thread model: posix

gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

 

CFLAGS = -s -static -Wall -O3 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mfloat-abi=hard

 

ubuntu@ubuntu-armhf:~/test/nbench-byte-2.2.3$ ./nbench

 

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

 

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :           479.2  :      12.29  :       4.04

STRING SORT         :          62.055  :      27.73  :       4.29

BITFIELD            :      1.6514e+08  :      28.33  :       5.92

FP EMULATION        :          72.022  :      34.56  :       7.97

FOURIER             :          6580.5  :       7.48  :       4.20

ASSIGNMENT          :          7.2465  :      27.57  :       7.15

IDEA                :          1507.9  :      23.06  :       6.85

HUFFMAN             :          816.73  :      22.65  :       7.23

NEURAL NET          :          8.7615  :      14.07  :       5.92

LU DECOMPOSITION    :          296.56  :      15.36  :      11.09

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 24.160

FLOATING-POINT INDEX: 11.740

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

CPU                 : 4 CPU ARMv7 Processor rev 10 (v7l)

L2 Cache            :

OS                  : Linux 3.14.28-rt25-1.0.0_ga-132797-g4da02de-dirty

C compiler          : gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

libc                : libc-2.15.so

MEMORY INDEX        : 5.663

INTEGER INDEX       : 6.319

FLOATING-POINT INDEX: 6.511

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

 

ubuntu@ubuntu-armhf:~/test/nbench-byte-2.2.3$ readelf -A nbench

Attribute Section: aeabi

File Attributes

  Tag_CPU_name: "7-A"

  Tag_CPU_arch: v7

  Tag_CPU_arch_profile: Application

  Tag_ARM_ISA_use: Yes

  Tag_THUMB_ISA_use: Thumb-2

  Tag_FP_arch: VFPv3

  Tag_Advanced_SIMD_arch: NEONv1

  Tag_ABI_PCS_wchar_t: 4

  Tag_ABI_FP_denormal: Needed

  Tag_ABI_FP_exceptions: Needed

  Tag_ABI_FP_number_model: IEEE 754

  Tag_ABI_align_needed: 8-byte

  Tag_ABI_align_preserved: 8-byte, except leaf SP

  Tag_ABI_enum_size: int

  Tag_ABI_HardFP_use: SP and DP

  Tag_ABI_VFP_args: VFP registers

  Tag_CPU_unaligned_access: v6

  Tag_DIV_use: Not allowed

 

"  Tag_ABI_VFP_args: VFP registers" shows that hardfp instructions are used.

 

 

(3) Freescale also provides a cross-compiling toolchain (gcc-linaro-arm-linux-gnueabihf-4.7-2013.04-20130415), which we used to build a hardfp version of nbench:

root@debian-armhf :

/opt/freescale/gcc-linaro-arm-linux-gnueabihf-4.7-2013.04-20130415_linux/bin/arm-linux-gnueabihf-gcc -v

 

Using built-in specs.

COLLECT_GCC=/home/percy/project/tools/gcc-linaro-arm-linux-gnueabihf-4.7-2013.04-20130415_linux/bin/arm-linux-gnueabihf-gcc

COLLECT_LTO_WRAPPER=/home/percy/project/tools/gcc-linaro-arm-linux-gnueabihf-4.7-2013.04-20130415_linux/bin/../libexec/gcc/arm-linux-gnueabihf/4.7.3/lto-wrapper

Target: arm-linux-gnueabihf

Configured with: /cbuild/slaves/oorts/crosstool-ng/builds/arm-linux-gnueabihf-linux/.build/src/gcc-linaro-4.7-2013.04/configure --build=i686-build_pc-linux-gnu --host=i686-build_pc-linux-gnu --target=arm-linux-gnueabihf --prefix=/cbuild/slaves/oorts/crosstool-ng/builds/arm-linux-gnueabihf-linux/install --with-sysroot=/cbuild/slaves/oorts/crosstool-ng/builds/arm-linux-gnueabihf-linux/install/arm-linux-gnueabihf/libc --enable-languages=c,c++,fortran --enable-multilib --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 --with-float=hard --with-pkgversion='crosstool-NG linaro-1.13.1-4.7-2013.04-20130415 - Linaro GCC 2013.04' --with-bugurl=https://bugs.launchpad.net/gcc-linaro --enable-__cxa_atexit --enable-libmudflap --enable-libgomp --enable-libssp --with-gmp=/cbuild/slaves/oorts/crosstool-ng/builds/arm-linux-gnueabihf-linux/.build/arm-linux-gnueabihf/build/static --with-mpfr=/cbuild/slaves/oorts/crosstool-ng/builds/arm-linux-gnueabihf-linux/.build/arm-linux-gnueabihf/build/static --with-mpc=/cbuild/slaves/oorts/crosstool-ng/builds/arm-linux-gnueabihf-linux/.build/arm-linux-gnueabihf/build/static --with-ppl=/cbuild/slaves/oorts/crosstool-ng/builds/arm-linux-gnueabihf-linux/.build/arm-linux-gnueabihf/build/static --with-cloog=/cbuild/slaves/oorts/crosstool-ng/builds/arm-linux-gnueabihf-linux/.build/arm-linux-gnueabihf/build/static --with-libelf=/cbuild/slaves/oorts/crosstool-ng/builds/arm-linux-gnueabihf-linux/.build/arm-linux-gnueabihf/build/static --with-host-libstdcxx='-L/cbuild/slaves/oorts/crosstool-ng/builds/arm-linux-gnueabihf-linux/.build/arm-linux-gnueabihf/build/static/lib -lpwl' --enable-threads=posix --disable-libstdcxx-pch --enable-linker-build-id --enable-gold --with-local-prefix=/cbuild/slaves/oorts/crosstool-ng/builds/arm-linux-gnueabihf-linux/install/arm-linux-gnueabihf/libc --enable-c99 --enable-long-long --with-mode=thumb

Thread model: posix

gcc version 4.7.3 20130328 (prerelease) (crosstool-NG linaro-1.13.1-4.7-2013.04-20130415 - Linaro GCC 2013.04)

 

The CFLAGS used to build nbench are:

CFLAGS = -s -static -Wall -O3 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mfloat-abi=hard

 

Here is the nbench score:

 

root@debian-armhf:/home/float/nbench-byte-hf# ./nbench 

 

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

 

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          507.28  :      13.01  :       4.27

STRING SORT         :          63.044  :      28.17  :       4.36

BITFIELD            :      1.2428e+08  :      21.32  :       4.45

FP EMULATION        :           68.49  :      32.86  :       7.58

FOURIER             :          6720.8  :       7.64  :       4.29

ASSIGNMENT          :          7.1967  :      27.38  :       7.10

IDEA                :          1591.1  :      24.34  :       7.23

HUFFMAN             :             780  :      21.63  :       6.91

NEURAL NET          :          9.0302  :      14.51  :       6.10

LU DECOMPOSITION    :          306.96  :      15.90  :      11.48

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 23.276

FLOATING-POINT INDEX: 12.081

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

CPU                 : 4 CPU

L2 Cache            : 

OS                  : Linux 3.0.35sensor

C compiler          : /home/percy/project/tools/gcc-linaro-arm-linux-gnueabihf-4.7-2013.04-20130415_linux/bin/arm-linux-gnueabihf-gcc

libc                : static

MEMORY INDEX        : 5.166

INTEGER INDEX       : 6.341

FLOATING-POINT INDEX: 6.700

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

 

(4) Using Freescale's cross-compiling toolchain to build a softfp version of nbench:

Here the rootfs is built from LTIB-3.0.35.

 

root@freescale:

/opt/freescale/usr/local/gcc-4.6.2-glibc-2.13-linaro-multilib-2011.12/fsl-linaro-toolchain/bin/arm-linux-gcc -v

 

Using built-in specs.

COLLECT_GCC=/opt/freescale/usr/local/gcc-4.6.2-glibc-2.13-linaro-multilib-2011.12/fsl-linaro-toolchain/bin/arm-linux-gcc

COLLECT_LTO_WRAPPER=/opt/freescale/usr/local/gcc-4.6.2-glibc-2.13-linaro-multilib-2011.12/fsl-linaro-toolchain/bin/../libexec/gcc/arm-fsl-linux-gnueabi/4.6.2/lto-wrapper

Target: arm-fsl-linux-gnueabi

Configured with: /work/build/.build/src/gcc-linaro-4.6-2011.06-0/configure --build=i686-build_pc-linux-gnu --host=i686-build_pc-linux-gnu --target=arm-fsl-linux-gnueabi --prefix=/work/fsl-linaro-toolchain-2.13 --with-sysroot=/work/fsl-linaro-toolchain-2.13/arm-fsl-linux-gnueabi/multi-libs --enable-languages=c,c++ --with-pkgversion='Freescale MAD -- Linaro 2011.07 -- Built at 2011/08/10 09:20' --enable-__cxa_atexit --disable-libmudflap --disable-libgomp --disable-libssp --with-gmp=/work/build/.build/arm-fsl-linux-gnueabi/build/static --with-mpfr=/work/build/.build/arm-fsl-linux-gnueabi/build/static --with-mpc=/work/build/.build/arm-fsl-linux-gnueabi/build/static --with-ppl=/work/build/.build/arm-fsl-linux-gnueabi/build/static --with-cloog=/work/build/.build/arm-fsl-linux-gnueabi/build/static --with-libelf=/work/build/.build/arm-fsl-linux-gnueabi/build/static --with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm -L/work/build/.build/arm-fsl-linux-gnueabi/build/static/lib -lpwl' --enable-threads=posix --enable-target-optspace --enable-plugin --enable-multilib --with-local-prefix=/work/fsl-linaro-toolchain-2.13/arm-fsl-linux-gnueabi/multi-libs --disable-nls --enable-c99 --enable-long-long --with-system-zlib

Thread model: posix

gcc version 4.6.2 20110630 (prerelease) (Freescale MAD -- Linaro 2011.07 -- Built at 2011/08/10 09:20) 

 

nbench CFLAGS = -s -static -Wall -O3 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mfloat-abi=softfp

 

Here is the nbench score:

 

root@freescale:/home/float/nbench-byte-sf# ./nbench

 

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

 

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          532.08  :      13.65  :       4.48

STRING SORT         :           62.19  :      27.79  :       4.30

BITFIELD            :      1.9527e+08  :      33.50  :       7.00

FP EMULATION        :          88.005  :      42.23  :       9.74

FOURIER             :          6905.9  :       7.85  :       4.41

ASSIGNMENT          :          7.6802  :      29.22  :       7.58

IDEA                :          1301.7  :      19.91  :       5.91

HUFFMAN             :          878.24  :      24.35  :       7.78

NEURAL NET          :          8.8688  :      14.25  :       5.99

LU DECOMPOSITION    :          305.28  :      15.82  :      11.42

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 25.796

FLOATING-POINT INDEX: 12.095

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

CPU                 : 4 CPU

L2 Cache            : 

OS                  : Linux 3.0.35sensor

C compiler          : /opt/freescale/usr/local/gcc-4.6.2-glibc-2.13-linaro-multilib-2011.12/fsl-linaro-toolchain/bin/arm-linux-gcc

libc                : static

MEMORY INDEX        : 6.110

INTEGER INDEX       : 6.694

FLOATING-POINT INDEX: 6.708

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

 

 

From the above results, the nbench hardfp performance is almost the same as softfp.

In theory, hardfp performance should be much better (about 20%) than softfp.
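
To put numbers on "almost the same", the sketch below compares the Linux FLOATING-POINT INDEX of each hardfp run against the softfp run. The values are copied from the four outputs above and the labels are shorthand for the four setups. The runs also differ in compiler version, libc and (for run 2) kernel, so this is not a controlled comparison of the float ABI alone.

```python
# Linux FLOATING-POINT INDEX values copied from the four runs above.
fp_index = {
    "(1) Debian native gcc 4.6.3, hard": 6.278,
    "(2) Ubuntu native gcc 4.6.3, hard": 6.511,
    "(3) Linaro cross gcc 4.7.3, hard": 6.700,
    "(4) Freescale cross gcc 4.6.2, softfp": 6.708,
}

SOFTFP = "(4) Freescale cross gcc 4.6.2, softfp"


def percent_vs_baseline(value, baseline):
    """Relative difference of value against baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


deltas = {
    label: percent_vs_baseline(value, fp_index[SOFTFP])
    for label, value in fp_index.items()
    if label != SOFTFP
}

for label, delta in deltas.items():
    print(f"{label:<40} {delta:+.2f}% vs softfp")
```

All three hardfp builds come out slightly below the softfp build, by roughly 0.1% to 6.4%, rather than the expected 20% above it.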

 

 

How can these test results be explained?

Are these the correct gcc CFLAGS for enabling hardfp?

 

What is the peak hardfp/softfp performance of the MX6Q?

 

 

Robbie
