Simple LPC43xx benchmark with disappointing results


2,134 views
lpcware
NXP Employee
Content originally posted in LPCWare by jokn on Sun Dec 09 13:19:22 MST 2012
I made a simple benchmark to measure the CPU performance when executing from SDRAM.
My example is based on the BOOTFAST example from the PDL, and I tested it on a Hitex board. The CPU frequency is 204 MHz and the SDRAM clock is 102 MHz.
The MIPS benchmark simply counts the CPU instructions executed in an inner delay loop and saves the result every second.

My results are:
1) program + data in internal RAM:          127 MIPS
2) program in SDRAM, data in internal RAM:   34 MIPS
3) program + data in SDRAM:                  28 MIPS

For the first test, with program and data in different internal RAMs, I would have expected something like 200 MIPS or more.
Where is the error in my reasoning?
The MIPS figures when executing from SDRAM seem very disappointing to me.
Does anyone have experience with executing a program from SDRAM?

My short benchmark looks like this.
I have also attached the example project.
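//----------------------------------------------
// The declarations are not shown in the original post; the ones below are assumed.
static volatile int  msec;         // milliseconds left, decremented by the SysTick ISR
static volatile long tmips_cnt;    // instructions counted inside the ISR
static long          mips_cnt;     // instructions counted in the delay loop
static volatile long mips_result;  // total for the last measurement period

//----------------------------------------------
void SysTick_Handler (void) {
    tmips_cnt += 14;               // instructions counted per ISR invocation
    if (msec) msec--;
}

//----------------------------------------------
void delay_ms (int ms)
{
    register long cnt = mips_cnt + 4;  // +4: presumably the function entry instructions
    msec = ms;
    while (msec)                       // msec must be volatile, or the optimizer may read it only once
        cnt += 5;                      // instructions counted per loop iteration
    mips_cnt = cnt + 3;                // +3: presumably the function exit instructions
}

//----------------------------------------------
void my_benchmark (void)
{
    SysTick_Config(CGU_GetPCLKFrequency(CGU_PERIPHERAL_M4CORE)/1000);  // Generate interrupt @ 1000 Hz

    while (1) {
        mips_result = mips_cnt + tmips_cnt;  // instructions counted during the last second
        tmips_cnt = 0;
        mips_cnt = 0;
        delay_ms (500);
        GPIO_ClearValue(LED1_PORT,(1<<LED1_BIT));
        delay_ms (500);
        GPIO_SetValue(LED1_PORT,(1<<LED1_BIT));
    }
}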

Best Regards
Josef

Original Attachment has been moved to: BOOTFAST.zip
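
Editor's note: to put the three figures from the post above in perspective, they can be expressed as counted instructions per core clock cycle. The following is only a minimal sketch. The helper names `mips_per_mhz` and `print_ratios` are hypothetical; the 204 MHz core clock and the MIPS values are the ones reported in the post.

```c
#include <stdio.h>

/* Hypothetical helper: average counted instructions per core clock cycle.
 * mips     : measured millions of instructions per second
 * core_mhz : core clock in MHz (204 in the post above) */
double mips_per_mhz(double mips, double core_mhz)
{
    return mips / core_mhz;
}

/* Prints the ratio for the three configurations reported in the post. */
void print_ratios(void)
{
    const double core_mhz = 204.0;

    printf("program + data in internal RAM: %.2f\n", mips_per_mhz(127.0, core_mhz));
    printf("program in SDRAM, data in RAM:  %.2f\n", mips_per_mhz(34.0, core_mhz));
    printf("program + data in SDRAM:        %.2f\n", mips_per_mhz(28.0, core_mhz));
}
```

Calling `print_ratios()` gives roughly 0.62, 0.17 and 0.14 instructions per cycle, so even the internal RAM case stays well below one instruction per clock for this particular loop.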

10 replies

Content originally posted in LPCWare by jokn on Fri Feb 22 12:02:26 MST 2013
Hi Dave
I finally received some samples of the LPC4357 with internal flash and therefore ran some more benchmark tests.
I found out that I had not enabled time optimization before. With it enabled, the results roughly doubled.
My results now are:
225 DMIPS from internal SRAM (1.1 DMIPS/MHz)
192 DMIPS from internal flash
47 DMIPS from SDRAM
18 DMIPS from SPI flash

That is indeed what I expected.
But I have also learned to use time optimization very carefully.
For example, my SDRAM initialization routine failed when time optimization was switched on.
So I excluded this and some other initialization modules from time optimization.

I am surprised that running from internal flash is slower than running from internal RAM.
It seems NXP's flash is not very fast.

Regards
Josef
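
Editor's note: the post does not say why the SDRAM initialization broke under time optimization. One common cause in such routines (an assumption here, not confirmed anywhere in the thread) is a software delay loop without side effects, which the optimizer is allowed to remove. A minimal sketch of a loop that survives optimization, using a hypothetical helper name `delay_loops`:

```c
#include <stdint.h>

/* Busy-wait for `loops` iterations (hypothetical helper).
 * The counter is volatile, so every read and write of it is an observable
 * access and the compiler may not delete the loop at any optimization level.
 * Returns the number of iterations actually executed. */
uint32_t delay_loops(uint32_t loops)
{
    volatile uint32_t i;

    for (i = 0; i < loops; ++i) {
        /* intentionally empty */
    }
    return i;
}
```

The wall-clock duration of such a loop still changes with the optimization level and with the memory the code runs from, so a hardware timer is the more robust choice where the delay has to meet a datasheet minimum.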


Content originally posted in LPCWare by nxp21346 on Tue Feb 12 11:48:09 MST 2013
I tried this and got only 1.09 DMIPS/MHz but after turning off Keil MicroLib I got 1.48 DMIPS/MHz. This is probably because Dhrystone is very dependent on C standard library function speed.

-Dave @ NXP

Content originally posted in LPCWare by DF9DQ on Thu Feb 07 10:28:48 MST 2013
Hi Josef,

Running the test here with the ARM compiler set to speed optimization (<code>-Otime</code>) yields the following result when running from internal SRAM @ 204 MHz:

<code>

main() is at address 0x10000273

Dhrystone Benchmark, Version 2.1 (Language: C)

Program compiled without 'register' attribute

Please give the number of runs through the benchmark:
Execution starts, 10000000 runs through Dhrystone
Execution ends

Final values of the variables used in the benchmark:

Int_Glob:            5
        should be:   5
Bool_Glob:           1
        should be:   1
Ch_1_Glob:           A
        should be:   A
Ch_2_Glob:           B
        should be:   B

[...]

Enum_Loc:            1
        should be:   1
Str_1_Loc:           DHRYSTONE PROGRAM, 1'ST STRING
        should be:   DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc:           DHRYSTONE PROGRAM, 2'ND STRING
        should be:   DHRYSTONE PROGRAM, 2'ND STRING

Microseconds for one run through Dhrystone:      1.9
Dhrystones per Second:                      521920.7
DMIPS:                                         297.1
DMIPS/MHz:                                       1.46
</code>

Are you setting up the timer correctly for measuring the execution time?
The result above (10 million test runs) took exactly 19 seconds to execute, so that matches the 1.9 µs/run.

Are you sure about the maximum optimization of your compiler?
When I remove the <code>-Otime</code> switch, the result is dramatically worse:
<code>
Microseconds for one run through Dhrystone:      3.1
Dhrystones per Second:                      322788.9
DMIPS:                                         183.7
DMIPS/MHz:                                       0.90
</code>

Actually with this benchmark we are measuring the capabilities of the compiler...

Regards,
Rolf

Content originally posted in LPCWare by nxp21346 on Wed Feb 06 12:08:03 MST 2013
There are buffers (a small cache) in the EMC and also something in the QSPI, but you are right, it shouldn't affect execution from SRAM (which has no buffers or cache).

-Dave @ NXP

Content originally posted in LPCWare by 42BS on Mon Feb 04 00:27:39 MST 2013
I do not think that code-unrolling will have any effect on a non-cached system.

Content originally posted in LPCWare by nxp21346 on Tue Jan 15 14:54:00 MST 2013
Hi!

There are a few things to check in the benchmark.

1. The M3/M4 cores have multiple buses, and the LPC43xx architecture with its SRAM banks and bus matrix allows you to make use of them. Make sure that your benchmark build stores the code and the data in different sections of SRAM so that both the data and code buses can be used.
2. You may have to try different compilers to get the best results. You might also need to experiment with optimization settings (and not just choose "the best"). For example, if you use GCC with -funroll-loops you may get worse performance because the code size is increased.

-Dave @ NXP
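
Editor's note: point 1 above can be sketched with the GCC/Clang `section` attribute. This is only an illustration under stated assumptions: the section names `.bench_code` and `.bench_data` and the function `bench_fill_and_sum` are hypothetical, and the actual mapping of each section to a particular SRAM bank has to be done in the linker script (or scatter file for the ARM compiler), which is not shown here.

```c
#include <stdint.h>

/* Hypothetical section names; the linker script must map each one to a
 * different SRAM bank for the placement to have any effect. */
#if defined(__GNUC__) && defined(__ELF__)
#define BENCH_CODE __attribute__((section(".bench_code"), noinline))
#define BENCH_DATA __attribute__((section(".bench_data")))
#else
#define BENCH_CODE
#define BENCH_DATA
#endif

#define BENCH_WORDS 64u

/* Benchmark data, intended for one SRAM bank. */
BENCH_DATA uint32_t bench_buffer[BENCH_WORDS] = {0};

/* Benchmark code, intended for a different SRAM bank.
 * Writes seed, seed+1, ... into the buffer and returns the sum. */
BENCH_CODE uint32_t bench_fill_and_sum(uint32_t seed)
{
    uint32_t sum = 0;

    for (uint32_t i = 0; i < BENCH_WORDS; ++i) {
        bench_buffer[i] = seed + i;
        sum += bench_buffer[i];
    }
    return sum;
}
```

Whether the two sections really ended up in different banks is best confirmed in the linker map file.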

Content originally posted in LPCWare by jokn on Thu Jan 03 05:08:31 MST 2013
Hi Dave

In the meantime I ran the original Dhrystone test on my LPC4300 board.
Running at 204 MHz from internal RAM with maximum compiler optimization, I get
129 DMIPS in total.
I don't want to split hairs, but the ARM specifications state 1.25 DMIPS/MHz, as does this paper:
http://www.nxp.com/documents/brochure/75017243.pdf

I am wondering where my thinking goes wrong.

My result is as follows:

Dhrystone Benchmark, Version 2.1 (Language: C)
Program compiled without 'register' attribute
Please give the number of runs through the benchmark: 1000000
Execution starts, 1000000 runs through Dhrystone
Execution ends

Final values of the variables used in the benchmark:
Int_Glob:            5
        should be:   5
Bool_Glob:           1
        should be:   1
Ch_1_Glob:           A
        should be:   A
Ch_2_Glob:           B
        should be:   B
Arr_1_Glob[8]:       7
        should be:   7
Arr_2_Glob[8][7]:    1000010
        should be:   Number_Of_Runs + 10
Ptr_Glob->
  Ptr_Comp:          268970624
        should be:   (implementation-dependent)
  Discr:             0
        should be:   0
  Enum_Comp:         2
        should be:   2
  Int_Comp:          17
        should be:   17
  Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
        should be:   DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob->
  Ptr_Comp:          268970624
        should be:   (implementation-dependent), same as above
  Discr:             0
        should be:   0
  Enum_Comp:         1
        should be:   1
  Int_Comp:          18
        should be:   18
  Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
        should be:   DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc:           5
        should be:   5
Int_2_Loc:           13
        should be:   13
Int_3_Loc:           7
        should be:   7
Enum_Loc:            1
        should be:   1
Str_1_Loc:           DHRYSTONE PROGRAM, 1'ST STRING
        should be:   DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc:           DHRYSTONE PROGRAM, 2'ND STRING
        should be:   DHRYSTONE PROGRAM, 2'ND STRING

Microseconds for one run through Dhrystone:    4.4
Dhrystones per Second:                      227790.4
                                     DMIPS: 129

Content originally posted in LPCWare by nxp21346 on Wed Dec 26 11:16:07 MST 2012
MIPS in this sense is usually a comparison between a specific CPU and a VAX 11/780 running the Dhrystone benchmark, with the VAX counted as a 1 MIPS machine, but you are not running the Dhrystone benchmark (see http://en.wikipedia.org/wiki/Dhrystone). A benchmark is needed to create the proper weighting for a correct average instruction execution time, since instruction execution times vary. For example, on Cortex-M3/M4 cores branches take 2 cycles, so a benchmark with too many branches would run more slowly than one with many multiplies (which take only 1 cycle).

Also, as with anything written in C, the code performance will depend on the compiler vendor, compiler version, and the optimizer options. It can be difficult to write a good benchmark because compilers can remove portions of the benchmark when optimizing the code. But if the optimizer is not engaged, the full performance of the MCU cannot be reached. If you are curious how many cycles ARM instructions take to execute, take a look at the ARM Technical Reference Manual: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/CHDDIGAC.html
Also, check out the CoreMark benchmark at http://www.coremark.org/ for a benchmark that tries to avoid compiler optimization effects.

-Dave @ NXP
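
Editor's note: the comparison with the VAX 11/780 described above comes down to one division. The VAX 11/780 reference score is 1757 Dhrystones per second, which is defined as 1 DMIPS. A minimal sketch, with hypothetical function names:

```c
/* Dhrystones per second of the VAX 11/780, the 1 DMIPS reference machine. */
#define VAX_11_780_DHRYSTONES_PER_SEC 1757.0

/* Convert a raw Dhrystone score into DMIPS. */
double dmips_from_dhrystones(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC;
}

/* Normalize DMIPS to the core clock, giving DMIPS/MHz. */
double dmips_per_mhz(double dmips, double core_mhz)
{
    return dmips / core_mhz;
}
```

Applied to the 227790.4 Dhrystones per second reported elsewhere in this thread, this gives about 129.6 DMIPS, or about 0.64 DMIPS/MHz at 204 MHz.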

Content originally posted in LPCWare by jokn on Wed Dec 19 05:11:40 MST 2012
Hi Phil

Of course you are right. Executing code from external SDRAM is not the best solution. The reason is that I have some very early prototype PCBs on my desk with the LPC4330, which has no internal flash. We could not get the flash version some months ago. So I ran some tests booting from SPI flash and running code from external SDRAM. I have decided not to spend more time on it and to wait for newer PCBs with the LPC4357.

But what about the other issue, running code from internal RAM? Is 127 MIPS what you would expect?
In my project the M0 core will run some time-critical code, and I have planned to place the M0 code in internal RAM.
Under which circumstances can I expect the 200 MIPS given in the data sheet?

Regards
Josef

Content originally posted in LPCWare by PhilYoung on Tue Dec 18 09:43:35 MST 2012
I don't know why you expected to get reasonable execution speeds from SDRAM on an uncached CPU.
These figures are not at all surprising and are what you would expect from any vendor's micro configured in such a manner.

If you have so much code that you need to execute from SDRAM, then you have chosen the wrong CPU; you need to look for something with I / D caches.

regards

Phil.