[Vybrid]  Coremark baremetal porting

Contributor II

Hello,

I'm currently porting the CoreMark benchmark to the Vybrid VF6 to evaluate the performance of this SoC. I'm using the twr-vf65gs10 board. I first built it on Timesys Linux and got a result of around 835 iterations per second, but when I run it baremetal on either the Cortex-M4 or the Cortex-A5, I get much lower results than I would expect. Does anyone know why I'm not able to get the same level of performance with the baremetal version of the test?

Here's the CoreMark log from the Linux run:

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 13165
Total time (secs): 13.165000
Iterations/Sec   : 835.548804
Iterations       : 11000
Compiler version : GCC4.8.2
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt
Memory location  : Please put data memory location here
            (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x33ff
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 835.548804 / GCC4.8.2 -O2 -DPERFORMANCE_RUN=1  -lrt / Heap

And here are some results obtained with the baremetal benchmark, using different clock and memory settings:

***Core CA5 freq 500MHz, mem SRAM, bus 83.5MHz

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 9680
Total time (secs): 77
Iterations/Sec   : 129
Iterations       : 10000
Compiler version : ARMCC 5
Compiler flags   : -O3 -Otime --cpu=Cortex-A5
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x988c
Correct operation validated. See readme.txt for run and reporting rules.

***Core CA5 freq 396MHz, mem SRAM, bus 66MHz

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 12099
Total time (secs): 96
Iterations/Sec   : 104
Iterations       : 10000
Compiler version : ARMCC 5
Compiler flags   : -O3 -Otime --cpu=Cortex-A5
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x988c
Correct operation validated. See readme.txt for run and reporting rules.

***Core CA5 freq 396MHz, mem DDR, bus 66MHz

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 26220
Total time (secs): 209
Iterations/Sec   : 47
Iterations       : 10000
Compiler version : ARMCC 5
Compiler flags   : -O3 -Otime --cpu=Cortex-A5
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x988c
Correct operation validated. See readme.txt for run and reporting rules.

***Core CA5 freq 500MHz, mem DDR, bus 83.5MHz

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 23673
Total time (secs): 189
Iterations/Sec   : 52
Iterations       : 10000
Compiler version : ARMCC 5
Compiler flags   : -O3 -Otime --cpu=Cortex-A5
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x988c
Correct operation validated. See readme.txt for run and reporting rules.

My workspace is attached to this post.

Thank you for your help,

Brieuc

Original Attachment has been moved to: workspace.zip
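
The logs in the question can be cross-checked with a few lines of arithmetic. The sketch below is a host-side Python helper, not part of CoreMark: it recomputes each score as iterations divided by total time and compares the SRAM and DDR runs at the same core clock. All numbers are copied from the logs above; the function and variable names are made up for this example. The baremetal figures match the recomputed values truncated to integers, and the SRAM runs come out roughly 2.2x to 2.5x faster than the DDR runs. One possible reading, offered as an assumption rather than a diagnosis, is that the core is waiting on memory, since a 2K data set served from working L1 caches would be expected to depend much less on where the memory sits.

```python
# Values copied from the logs in the question.
LINUX = {"iterations": 11000, "seconds": 13.165, "reported": 835.548804}

# (memory, core MHz) -> (total time in seconds, reported iterations/sec)
BAREMETAL = {
    ("SRAM", 500): (77, 129),
    ("SRAM", 396): (96, 104),
    ("DDR", 396): (209, 47),
    ("DDR", 500): (189, 52),
}
BAREMETAL_ITERATIONS = 10000


def iterations_per_second(iterations, seconds):
    """CoreMark score: iterations divided by elapsed time in seconds."""
    return iterations / seconds


def sram_over_ddr(core_mhz):
    """Ratio of the reported SRAM score to the reported DDR score at one core clock."""
    return BAREMETAL[("SRAM", core_mhz)][1] / BAREMETAL[("DDR", core_mhz)][1]


linux_score = iterations_per_second(LINUX["iterations"], LINUX["seconds"])
print(f"Linux: {linux_score:.2f} it/s")

for (memory, mhz), (seconds, reported) in sorted(BAREMETAL.items()):
    recomputed = iterations_per_second(BAREMETAL_ITERATIONS, seconds)
    print(f"{memory:4s} @ {mhz} MHz: reported {reported}, recomputed {recomputed:.1f}, "
          f"Linux is {linux_score / reported:.1f}x faster")

for mhz in (396, 500):
    print(f"SRAM/DDR ratio @ {mhz} MHz: {sram_over_ddr(mhz):.2f}")
```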

1 Solution
Senior Contributor I

Please note that there are *two* caches on the A5. Are they both turned on?

11 Replies
Senior Contributor I

In your baremetal environment, make sure all caches are turned on. In the Linux environment, the kernel did that for you at boot.

Also, you're using two different compilers. However, I wouldn't expect to see *that* much difference between compilers. Those are the only two things that I see, off the bat....

Contributor II

That's a good hint, Jack: my cache wasn't enabled. With the cache enabled my score improved a little, but I'm still far from the one I got on Linux:

***Core CA5 freq 500MHz, mem SRAM, bus 83.5MHz plus cache

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 7424
Total time (secs): 59
Iterations/Sec   : 169
Iterations       : 10000
Compiler version : ARMCC 5
Compiler flags   : -O3 -Otime --cpu=Cortex-A5
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x988c
Correct operation validated. See readme.txt for run and reporting rules.

I also ran a trace to investigate where this underperformance comes from. It shows that the CRC function is running around 80% of the time; I don't know whether that is usual. I'm currently looking into that on another processor.

Senior Contributor I

Other things to note:

Linux is probably running its external DRAM at 400 MHz. Is your baremetal code doing the same?

I suspect your bus speed in the baremetal case is a bit slow. I thought it operated at 1x the M4 core clock.

I'm thinking there are other memory/bus settings that are incorrect in the baremetal case.

Contributor II

OK, that was the cache. On my first try I only added _a5_dcache_enable() to the sysinit() function, so only one cache was enabled.

The right function is enable_caches(), at the end of the startup.s file. It's much the same story for the Cortex-M4: you have to add the calls cache_init(CODE_CACHE); and cache_init(SYS_CACHE);

Thank you !

Brieuc
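
On ARMv7-A cores such as the Cortex-A5, the instruction cache and the data cache are enabled by separate bits of the CP15 System Control Register (SCTLR): C (bit 2) for the data cache and I (bit 12) for the instruction cache. That is why enabling only one of them, as in the first attempt described above, is easy to miss. The sketch below is a host-side Python helper that decodes an SCTLR value, assuming you can read the register from your debugger. The sample register values are made up for illustration, and the helper names are not part of any NXP or ARM library. Note also that on ARMv7-A, data caching generally requires the MMU to be enabled with cacheable memory attributes, so the M bit is decoded as well.

```python
# SCTLR bit positions as defined by the ARMv7-A architecture.
SCTLR_M = 1 << 0    # MMU enable
SCTLR_C = 1 << 2    # data/unified cache enable
SCTLR_Z = 1 << 11   # branch prediction enable
SCTLR_I = 1 << 12   # instruction cache enable


def decode_sctlr(value):
    """Return the cache-related enable bits of an SCTLR value as booleans."""
    return {
        "mmu": bool(value & SCTLR_M),
        "dcache": bool(value & SCTLR_C),
        "branch_prediction": bool(value & SCTLR_Z),
        "icache": bool(value & SCTLR_I),
    }


def both_caches_enabled(value):
    """True only if both the instruction and the data cache bits are set."""
    flags = decode_sctlr(value)
    return flags["dcache"] and flags["icache"]


# Hypothetical register values, for illustration only.
only_dcache = SCTLR_M | SCTLR_C
everything = SCTLR_M | SCTLR_C | SCTLR_Z | SCTLR_I

for name, value in (("only_dcache", only_dcache), ("everything", everything)):
    print(f"{name} = {value:#010x}: {decode_sctlr(value)}, "
          f"both caches on: {both_caches_enabled(value)}")
```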

Senior Contributor I

With both caches turned on, do you get similar numbers for the baremetal case as you do for Linux on the A5?

Contributor II

Once the caches are enabled, I get around twice the score I got on Linux. This makes sense to me, because Linux is only running at 399 MHz while my baremetal port runs the Cortex-A5 at 500 MHz.

For the Cortex-M4 I get results of around 300 iterations per second. That is pretty low compared to the scores obtained by competitors, and I don't know whether I can improve it.

Brieuc

Senior Contributor I

Hello Brieuc!

I have also been running CoreMark on the same module, using the A5 core. My board is a pcm052, using Timesys Linux and tools.

Nice to see I got similar results:

CoreMark (-O2)
A5@400MHz   831.877548
A5@500MHz  1048.119150

CoreMark (-O3)
A5@400MHz   886.524823
A5@500MHz  1122.460433

But I'm curious: you wrote that you get twice the Linux CoreMark result when you run baremetal? Do you get ~1600 CoreMark in your baremetal port when running the A5 at 500 MHz?

(I'll just add that my clock speeds were: 400 = (396,396,132,66), 500 = (500,396,166,83).)
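
Dividing each score by the core clock makes results taken at different frequencies easier to compare. The sketch below is a minimal Python example using the four Linux scores from this post. It assumes that the first value in each clock tuple is the A5 core clock in MHz (396 for the "400 MHz" setting), which the post does not state explicitly; the function name is made up for this example. Under that assumption the -O2 runs come out at roughly 2.1 CoreMark/MHz and the -O3 runs at roughly 2.2 CoreMark/MHz at both clock settings, which suggests the Linux score scales almost linearly with the core clock.

```python
# (label, CoreMark score, assumed A5 core clock in MHz)
RESULTS = [
    ("-O2, A5@400MHz", 831.877548, 396),
    ("-O2, A5@500MHz", 1048.119150, 500),
    ("-O3, A5@400MHz", 886.524823, 396),
    ("-O3, A5@500MHz", 1122.460433, 500),
]


def coremark_per_mhz(score, core_mhz):
    """Normalize a CoreMark score by the core clock frequency in MHz."""
    if core_mhz <= 0:
        raise ValueError("core_mhz must be positive")
    return score / core_mhz


for label, score, mhz in RESULTS:
    print(f"{label}: {coremark_per_mhz(score, mhz):.3f} CoreMark/MHz")
```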

Senior Contributor I

Note that the Cortex-M4 has two buses with specific memory aliases. You should make sure that your code and your data are using these aliases (check your linker file). The Reference Manual states in chapter 3.2.2 (Cortex-M4 Instruction Fetches on the System Bus):

Instruction fetch requests to the code bus are not registered. It is recommended that performance critical code be located such that it fetches from the ICode bus interface as defined by addresses < 0x2000_0000 (the system bus interface includes the addresses >= 0x2000_0000 and < 0xE000_0000 and the Private Peripheral Bus is used for addresses >= 0xE000_0000).

For fun, I just did some measurements while I was running some baremetal code anyway:

http://falstaff.agner.ch/2014/07/09/vybrid-bare-metal-fun/

I don't see any reason why the Cortex-M4 inside Vybrid should perform differently than any other Cortex-M4 clocked at 166MHz.

--

Stefan
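
The address ranges in the Reference Manual excerpt quoted above can be turned into a quick check for addresses taken from a linker map. The sketch below is a minimal Python helper that classifies a 32-bit address by the bus the Cortex-M4 would use for it, following only the boundaries given in that quote. The function name and the sample addresses are made up for illustration; check the Vybrid memory map for the real locations and aliases of your code and data sections.

```python
CODE_BUS_END = 0x2000_0000     # addresses below this use the code bus
SYSTEM_BUS_END = 0xE000_0000   # addresses from here up use the Private Peripheral Bus


def cortex_m4_bus(address):
    """Classify a 32-bit address by the Cortex-M4 bus that serves it."""
    if not 0 <= address <= 0xFFFF_FFFF:
        raise ValueError("address must fit in 32 bits")
    if address < CODE_BUS_END:
        return "code bus"
    if address < SYSTEM_BUS_END:
        return "system bus"
    return "private peripheral bus"


# Hypothetical addresses, for illustration only.
for address in (0x1000_0000, 0x2000_0000, 0x3000_0000, 0xE000_0000):
    print(f"{address:#010x}: {cortex_m4_bus(address)}")
```

Performance-critical code whose link address falls in the "system bus" range is a candidate for relocation to an alias below 0x2000_0000, as the quoted recommendation describes.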

NXP Employee

Hi Stefan,

There is a reason. Aliases can help when using OCRAM (SRAM). In Vybrid, unlike in Kinetis, the SRAM is connected to the Cortex-M4 core through the NIC, and it takes some time to go through it (latency); see AN4947.pdf for more. For use cases where lower latency is needed, Vybrid has TCM. This memory is connected directly to the core, so there is no added latency: access takes 1 clock. An application running from TCM can be about 3-4 times faster than one running from OCRAM without cache (L1 in the case of the Cortex-M4); see the nice results at the link above. Once cached, you can get the same results as from TCM.

/Jiri
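
The effect described above, where cached OCRAM approaches TCM speed, can be illustrated with the textbook average-access-time formula. The sketch below is a simplified Python model, not a measurement: the 1-cycle hit cost corresponds to the single-clock TCM access mentioned in the post, while the 4-cycle miss cost and the hit rates are hypothetical values chosen only for illustration (AN4947 is the place to look for real latencies).

```python
def average_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Average cycles per access for a given cache hit rate (0.0 to 1.0)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0.0 and 1.0")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


HIT_CYCLES = 1    # single-clock access, as described for TCM
MISS_CYCLES = 4   # hypothetical cost of an uncached access through the NIC

for hit_rate in (0.0, 0.5, 0.9, 1.0):
    cycles = average_access_cycles(hit_rate, HIT_CYCLES, MISS_CYCLES)
    print(f"hit rate {hit_rate:.0%}: {cycles:.2f} cycles per access")
```

In this model a hit rate of 0% corresponds to running from OCRAM without cache, and a hit rate near 100% brings the average close to the 1-cycle figure.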

Senior Contributor I

Hi Jiri,

Yes, that is what I meant with my comment above. Sorry for the confusion. Of course, when your core doesn't get its instructions in time (stalls), your system will perform differently. But the Cortex-M4 core itself should really perform the same...

--

Stefan
