How to operate MPC5744P faster

yoshitaka_abe_j · ‎01-15-2020

I cannot get the MPC5744P to run as fast as I expected.

My setup is as follows:

1) Target-Board : DEVKIT MPC5744P (Rev B)

2) Project : SPI_DMA_MPC5744P from the S32DS Example Projects.
Note: I modified startup.S so that the cache is enabled.

I also added LED4=1/0 code to the init_edma_tcd_15() function in edma.c to measure the execution time.

3) The situation :
The MPC5744P executes only about 10.7M instructions/sec: 101 instructions take 9.465us at SYS-Clock=160MHz, as shown in the figure below.

It does not execute one instruction per clock.
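
As a quick check of the figures quoted above, the following sketch converts the measured pulse width into an instruction rate and into average clock cycles per instruction. The function names are illustrative only and are not part of the example project.

```c
/* Average instruction rate: instructions executed divided by elapsed time. */
double instructions_per_second(double instr_count, double elapsed_s)
{
    return instr_count / elapsed_s;
}

/* Average number of clock cycles spent per instruction:
   cycles elapsed (time * clock frequency) divided by instruction count. */
double cycles_per_instruction(double instr_count, double elapsed_s,
                              double clock_hz)
{
    return (elapsed_s * clock_hz) / instr_count;
}
```

Calling these with 101 instructions, 9.465e-6 s and 160e6 Hz gives about 10.7M instructions/sec, which corresponds to roughly 15 SYS_CLK cycles per instruction on average.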

I may be misunderstanding the operating speed of the MPC5744P.

Please tell me how to make the MPC5744P run faster using S32DS for Power Architecture.

stanish · ‎01-17-2020

Hello Yoshitaka-san,

The routine init_dma_tcd_15() is written in a form that is convenient for learning: each individual register bit/field is set by a separate C statement. Since peripheral registers such as the DMA registers are declared volatile, the compiler cannot optimize (group) the bit accesses into a single 16/32-bit load/store.

Therefore I'd recommend replacing multiple accesses to the same peripheral register with a single register assignment, e.g.:

...
DMA_0.TCD[15].CSR.B.MAJORELINK = 0;
DMA_0.TCD[15].CSR.B.ESG = 0;
DMA_0.TCD[15].CSR.B.BWC = 0;
DMA_0.TCD[15].CSR.B.INTHALF = 0;
...

which you can replace with

DMA_0.TCD[15].CSR.R = 0;

Instead of many individual load / bit-clear / store sequences, you get a single write to the register. Note that this clears every field in CSR, so it is only equivalent when all of the fields should be 0.

This should significantly improve the speed of this specific routine.
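
To illustrate the difference, here is a minimal sketch using a mock register union in the same .B / .R style. The type mock_csr_t, its bit positions, the RESERVED and OTHER fields and the function names are assumptions made for this example; they are not the real DMA header layout.

```c
#include <stdint.h>

/* Mock control register. Field names follow the post above; the bit
   positions and the RESERVED/OTHER fields are illustrative only. */
typedef union {
    volatile uint16_t R;                 /* whole-register access */
    struct {
        volatile uint16_t BWC:2;
        volatile uint16_t RESERVED:8;
        volatile uint16_t MAJORELINK:1;
        volatile uint16_t ESG:1;
        volatile uint16_t INTHALF:1;
        volatile uint16_t OTHER:3;
    } B;                                 /* bit-field access */
} mock_csr_t;

/* Field by field: every statement is its own volatile read-modify-write,
   and fields that are not named keep their old value. */
void clear_fieldwise(mock_csr_t *csr)
{
    csr->B.MAJORELINK = 0;
    csr->B.ESG = 0;
    csr->B.BWC = 0;
    csr->B.INTHALF = 0;
}

/* Whole register: one volatile store that clears every field. */
void clear_whole(mock_csr_t *csr)
{
    csr->R = 0;
}

/* For non-zero settings: build the value in a local copy in ordinary
   memory, then touch the register once with the final assignment. */
void write_fields_once(mock_csr_t *csr, uint16_t bwc, uint16_t inthalf)
{
    mock_csr_t shadow;
    shadow.R = 0;
    shadow.B.BWC = bwc;
    shadow.B.INTHALF = inthalf;
    csr->R = shadow.R;
}
```

On the real part the pointer would refer to the peripheral register itself, so clear_fieldwise() costs four slow peripheral read-modify-write sequences while the other two functions cost a single peripheral write each.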

Hope it helps.

Stan

yoshitaka_abe_j · ‎01-17-2020

Hi, Stan-san,

Thank you for your reply.

I will change the C source code as you recommend.
In addition, I am considering using inline assembler code, too.

However, I still don't understand why the MPC5744P executes only about 10.7M instructions per second.

My understanding of the clocks in the SPI_DMA_MPC5744P project (see the figure below) is as follows:
a) SYS_CLK : 160MHz
b) HALFSYS_CLK : 80MHz
c) BRIDGEx_CLK : 40MHz

and I think the MPC5744P executes one instruction per clock (= SYS_CLK).

What am I missing?

If you know, please tell me.

stanish · ‎01-17-2020

Hi Yoshitaka-san,

Thanks for the update.

I don't think this routine is a good candidate for benchmarking.

First of all, your code runs from flash. There is a performance penalty caused by flash memory wait states, so you cannot expect 1 instruction per cycle.
In your routine you are accessing peripheral registers, which run on a significantly slower clock (HALFSYS_CLK).
The pulse measurement error is high because you are benchmarking only about 100 instructions, and the measured pulse also contains the pin toggle delay. Please increase the number of instructions by a factor of 10 at least. Try to measure the time required for the pin toggle and subtract it from the total measured time to get a more precise result.
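
The correction described above can be sketched as follows. The function name and the idea of measuring an "empty" pulse (pin set and cleared with no code in between) are assumptions for illustration; the overhead value has to be measured on the actual board.

```c
/* Average time per instruction after removing the pin toggle overhead.
   pulse_s:       measured pulse width with the code under test, in seconds
   empty_pulse_s: pulse width with nothing between the two pin writes
   instr_count:   number of instructions executed between the pin writes */
double time_per_instruction_s(double pulse_s, double empty_pulse_s,
                              double instr_count)
{
    return (pulse_s - empty_pulse_s) / instr_count;
}
```

Because the toggle overhead and the scope reading error are roughly fixed, they make up a smaller share of the result when the measured code is 10 times longer.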

NOTE:

See the application note below for MPC5744P optimization tips:

https://www.nxp.com/docs/en/application-note/AN4939.pdf

Hope it helps.

Stan

yoshitaka_abe_j · ‎01-19-2020

Hi Stan-san,

Thank you for your reply.

Sorry, I should have described the details more.

I want to make accesses to the MPC5744P's peripheral registers faster.

I also tried the Debug-RAM mode, again with the cache enabled.

The result was 9.625us, which is slower than the 9.465us measured in Debug mode (flash based).

Therefore, I think the flash memory wait states are not the cause.

My current understanding is that the only option is to reduce the number of peripheral register access instructions.

Is there any other way?