Hello
I'm currently porting the CoreMark benchmark to the Vybrid VF6 to evaluate the performance of this SoC, using the TWR-VF65GS10 board. I first built it on Timesys Linux and got about 835 iterations per second, but when I run it bare-metal on either the Cortex-M4 or the Cortex-A5 I get far lower results than I would expect. Does anyone know why I can't reach the same level of performance with the bare-metal version of the test?
Here's the log of CoreMark run on Linux:
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 13165
Total time (secs): 13.165000
Iterations/Sec : 835.548804
Iterations : 11000
Compiler version : GCC4.8.2
Compiler flags : -O2 -DPERFORMANCE_RUN=1 -lrt
Memory location : Please put data memory location here
(e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x33ff
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 835.548804 / GCC4.8.2 -O2 -DPERFORMANCE_RUN=1 -lrt / Heap
And here are some results from the bare-metal benchmark with different clock and memory settings:
***Core CA5 freq 500MHz, mem SRAM, bus 83.5MHz
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 9680
Total time (secs): 77
Iterations/Sec : 129
Iterations : 10000
Compiler version : ARMCC 5
Compiler flags : -O3 -Otime --cpu=Cortex-A5
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x988c
Correct operation validated. See readme.txt for run and reporting rules.
***Core CA5 freq 396MHz, mem SRAM, bus 66MHz
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 12099
Total time (secs): 96
Iterations/Sec : 104
Iterations : 10000
Compiler version : ARMCC 5
Compiler flags : -O3 -Otime --cpu=Cortex-A5
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x988c
Correct operation validated. See readme.txt for run and reporting rules.
***Core CA5 freq 396MHz, mem DDR , bus 66MHz
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 26220
Total time (secs): 209
Iterations/Sec : 47
Iterations : 10000
Compiler version : ARMCC 5
Compiler flags : -O3 -Otime --cpu=Cortex-A5
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x988c
Correct operation validated. See readme.txt for run and reporting rules.
***Core CA5 freq 500MHz, mem DDR, bus 83.5MHz
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 23673
Total time (secs): 189
Iterations/Sec : 52
Iterations : 10000
Compiler version : ARMCC 5
Compiler flags : -O3 -Otime --cpu=Cortex-A5
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x988c
Correct operation validated. See readme.txt for run and reporting rules.
My workspace is attached to this post.
Thank you for your help
Brieuc
Original Attachment has been moved to: workspace.zip
Please note that there are *two* caches on the A5. Are they both turned on?
In your baremetal environment, make sure all caches are turned on. In the Linux environment, the kernel did that for you at boot.
Also, you're using two different compilers. However, I wouldn't expect to see *that* much difference between compilers. Those are the only two things that I see, off the bat....
That's a good hint Jack, my cache wasn't enabled. With the cache enabled my score improved a little, but I'm still far from the one I got on Linux:
***Core CA5 freq 500MHz, mem SRAM, bus 83.5MHz plus cache
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 7424
Total time (secs): 59
Iterations/Sec : 169
Iterations : 10000
Compiler version : ARMCC 5
Compiler flags : -O3 -Otime --cpu=Cortex-A5
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x988c
Correct operation validated. See readme.txt for run and reporting rules.
I also ran a trace to investigate where the underperformance comes from. It shows that the CRC computation function accounts for around 80% of the run time; I don't know whether that is usual or not.
I'm currently studying that on another processor.
Other things to note:
Linux is probably running its external DRAM at 400 MHz. Is your bare-metal code doing the same?
I suspect your bus speed for the bare-metal case is a bit slow. I thought it operated at 1x the M4 core clock.
I'm thinking there are other memory/bus settings that are incorrect in the bare-metal case.
OK, that was the cache. On my first try I just added _a5_dcache_enable() to the sysinit() function, so only one cache was enabled.
The right function is the one at the end of the startup.s file, called enable_caches(). It's much the same story for the Cortex-M4: you have to add the calls cache_init(CODE_CACHE); and cache_init(SYS_CACHE);
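For anyone hitting the same problem on the A5, here is a minimal sketch of how to check that both L1 caches are on. It only uses the architectural ARMv7-A SCTLR bit positions (C is bit 2, I is bit 12); the function names are made up for illustration and are not part of the Vybrid sample code. On the target you would read SCTLR with `mrc p15, 0, <Rt>, c1, c0, 0` and pass the value in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-A SCTLR enable bits (architectural, not Vybrid-specific). */
#define SCTLR_C (1u << 2)   /* data/unified cache enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable  */

/* Hypothetical helper: true only when BOTH L1 caches are enabled. */
bool both_l1_caches_enabled(uint32_t sctlr)
{
    return (sctlr & (SCTLR_C | SCTLR_I)) == (SCTLR_C | SCTLR_I);
}

/* Hypothetical helper: SCTLR value with both cache enable bits set. */
uint32_t sctlr_with_caches(uint32_t sctlr)
{
    return sctlr | SCTLR_C | SCTLR_I;
}

/* Usage: enabling only the data cache (my first attempt) is not enough. */
void check_cache_bits(void)
{
    assert(!both_l1_caches_enabled(SCTLR_C));
    assert(both_l1_caches_enabled(sctlr_with_caches(0)));
}
```

This only inspects the enable bits; it does not replace the invalidate and MMU setup steps that a real startup routine has to perform before turning the caches on.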
Thank you !
Brieuc
With both caches turned on, do you get similar numbers for the bare-metal case as you do for Linux on the A5?
Once the caches are enabled I get around twice the score I got on Linux. Part of that gap is expected, because Linux is only running at 399 MHz while my bare-metal port runs the Cortex-A5 at 500 MHz.
For the Cortex-M4 I get results around 300 iterations per second. That's pretty low compared to the scores obtained by competitors, and I don't know if I can improve it.
Brieuc
Hello Brieuc!
I was also Coremark'ing the same module, using A5 core. My board is pcm052, using TimeSys Linux & tools.
Nice to see I got similar results:
            | CoreMark (-O2) | CoreMark (-O3)
A5@400MHz   | 831.877548     | 886.524823
A5@500MHz   | 1048.119150    | 1122.460433
But I'm curious: you wrote that you get twice the Linux CoreMark result when running bare-metal? So you get ~1600 CoreMark in your bare-metal port with the A5 at 500 MHz?
(I'll just add that my clock speeds were: 400 = (396,396,132,66), 500 = (500,396,166,83))
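To compare runs at different clocks it helps to normalize to CoreMark/MHz. A small sketch using the -O2 numbers above, assuming the first value of each clock tuple (396 and 500) is the A5 core clock in MHz; the function names are only for illustration.

```c
#include <assert.h>

/* CoreMark/MHz: iterations per second divided by the core clock in MHz. */
double coremark_per_mhz(double iterations_per_sec, double core_mhz)
{
    return iterations_per_sec / core_mhz;
}

/* Usage: the two -O2 Linux results scale almost linearly with the clock. */
void check_scaling(void)
{
    double at_396 = coremark_per_mhz(831.877548, 396.0);
    double at_500 = coremark_per_mhz(1048.119150, 500.0);
    double diff = at_396 - at_500;

    /* Both land near 2.1 CoreMark/MHz. */
    assert(diff > -0.01 && diff < 0.01);
}
```

By the same arithmetic, a score of ~1600 at 500 MHz would be about 3.2 CoreMark/MHz, noticeably above the roughly 2.1 measured under Linux, which is why the bare-metal figure is worth double-checking.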
Note that the Cortex-M4 has two buses with specific memory aliases. You should make sure that your code and your data are using these aliases (check your linker file). The Reference Manual states in chapter 3.2.2 (Cortex-M4 Instruction Fetches on the System Bus):
Instruction fetch requests to the code bus are not registered. It is recommended that
performance critical code be located such that it fetches from the ICode bus interface as
defined by addresses < 0x2000_0000 (the system bus interface includes the addresses >=
0x2000_0000 and < 0xE000_0000 and the Private Peripheral Bus is used for addresses
>= 0xE000_0000).
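The address ranges in that quote can be sketched as a small classifier, handy for checking the addresses in a linker map. The enum and function names are invented for this example, and the addresses used in the check are arbitrary values picked to fall in each range.

```c
#include <assert.h>
#include <stdint.h>

/* Bus an access is routed to, per the ranges quoted above. */
enum m4_bus { M4_CODE_BUS, M4_SYSTEM_BUS, M4_PPB };

/* Hypothetical helper: classify a Cortex-M4 address by bus interface. */
enum m4_bus m4_bus_for_address(uint32_t addr)
{
    if (addr < 0x20000000u)
        return M4_CODE_BUS;      /* addresses < 0x2000_0000 */
    if (addr < 0xE0000000u)
        return M4_SYSTEM_BUS;    /* >= 0x2000_0000 and < 0xE000_0000 */
    return M4_PPB;               /* >= 0xE000_0000 */
}

/* Usage: performance-critical code should resolve to M4_CODE_BUS. */
void check_bus_ranges(void)
{
    assert(m4_bus_for_address(0x1F000000u) == M4_CODE_BUS);
    assert(m4_bus_for_address(0x3F000000u) == M4_SYSTEM_BUS);
    assert(m4_bus_for_address(0xE000E000u) == M4_PPB);
}
```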
For fun, I just did some measurement while I was running some bare metal code anyway:
http://falstaff.agner.ch/2014/07/09/vybrid-bare-metal-fun/
I don't see any reason why the Cortex-M4 inside Vybrid should perform differently than any other Cortex-M4 clocked at 166MHz.
--
Stefan
Hi Stefan,
There is a reason. Aliases can help when using OCRAM (SRAM). In Vybrid, unlike Kinetis, the SRAM is connected to the Cortex-M4 core through the NIC, and going through it adds latency (more in AN4947.pdf). For use cases that need lower latency, Vybrid has TCM. This memory is connected directly to the core, so there is no added latency: access takes 1 clock. An application running from TCM can be about 3-4 times faster than from OCRAM without cache (L1 in the case of the Cortex-M4); see the nice results at the link above. Once the code is cached, you can get the same results as with TCM.
/Jiri
Hi Jiri,
Yes, that is what I meant with my comment above. Sorry for the confusion. Of course, when your core doesn't get its instructions in time (it stalls), your system will perform differently. But the Cortex-M4 core itself should really perform the same...
--
Stefan