iMXRT1176: poor OCRAM read performance

udoeb · ‎10-22-2021

Hello,

It looks like non-cached reads from OCRAM are ~12x slower than writes.
In our application non-cached read/write performance is important because we have many DMA buffers.

Here are some figures:

CM7 core:
SystemCoreClock = 996000000 Hz, MPU disabled, DCache disabled
                  READ           WRITE
DTCM:       1762252602       963536823    words/s
OCRAM1:       41535756       480395668    words/s

CM4 core:
SystemCoreClock = 392727258 Hz, MPU disabled, DCache disabled
                  READ           WRITE
DTCM:        321593532       348561397    words/s
OCRAM1:       21156189        17457387    words/s

Notes:
1 word = 4 bytes = 32 bits
Test code executes from ITCM.

In either case, based on the respective CPU clock, DTCM performance makes sense to me.
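As a quick sanity check on these figures, the rates can be converted into core clock cycles per word. The following is a small host-side Python sketch, not part of the original test code; the numbers are copied from the tables above and the function names are made up for illustration.

```python
# Figures copied from the post (words/s, 1 word = 4 bytes).
CLOCKS = {"CM7": 996_000_000, "CM4": 392_727_258}  # SystemCoreClock in Hz

RATES = {
    ("CM7", "DTCM"):   {"read": 1_762_252_602, "write": 963_536_823},
    ("CM7", "OCRAM1"): {"read": 41_535_756,    "write": 480_395_668},
    ("CM4", "DTCM"):   {"read": 321_593_532,   "write": 348_561_397},
    ("CM4", "OCRAM1"): {"read": 21_156_189,    "write": 17_457_387},
}


def cycles_per_word(core, mem, op):
    """Core clock cycles spent per 32-bit word transferred."""
    return CLOCKS[core] / RATES[(core, mem)][op]


def mbytes_per_s(core, mem, op):
    """Throughput in MB/s (10**6 bytes per second)."""
    return RATES[(core, mem)][op] * 4 / 1e6


if __name__ == "__main__":
    for (core, mem), ops in RATES.items():
        for op in ops:
            print(f"{core} {mem:6s} {op:5s}: "
                  f"{cycles_per_word(core, mem, op):6.2f} cycles/word, "
                  f"{mbytes_per_s(core, mem, op):8.1f} MB/s")
```

On the CM7 this works out to roughly 24 core cycles per OCRAM1 read against about 2 per write, while the DTCM figures stay near or below one cycle per word.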

Questions:
1) Why is read access much slower than write access?
2) Why is access to OCRAM from the CM4 core much slower than from the CM7?

Thanks in advance for any comments.
Udo

jingpan · ‎10-26-2021

Hi @udoeb ,

1. The M7 core accesses OCRAM1 and OCRAM2 via the AXI bus, which is controlled by the NIC-301 AXI arbiter IP. The ARM core accesses the AXI bus using a pipeline mechanism. Writes are pipelined: completing a write instruction does not mean the data has been put on the bus immediately, so the core can carry on. A read instruction, however, must wait until the data comes back, so the pipeline does not help. This is why writing to OCRAM is much faster than reading.

2. This is because the CM4 and CM7 have different paths to OCRAM1 and OCRAM2. Please see Figure 2-2 in the reference manual. The CM4 requests data from OCRAM through the XB (LPSR domain, AHB protocol) and then through the NIC (WAKEUPMIX domain, AXI protocol), and the clock limitation is BUS / BUS_LPSR. Both OCRAMs are accessible only via the SYSTEM bus (so in this case no Harvard-style access is possible). If any other bus masters are accessing the same memory (OCRAM1 or OCRAM2), performance degrades further due to arbitration (on the XB or NIC).
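The read/write asymmetry in point 1 can be illustrated with a toy cycle-count model. This is only a sketch under simplifying assumptions, not a description of how the NIC-301 or the Cortex-M7 write path really behaves: `round_trip` and `issue_cost` are hypothetical parameters, set here to about 24 and 2 core cycles because 996 MHz divided by the posted CM7 OCRAM1 rates (about 41.5 M reads/s and 480 M writes/s) gives roughly those values.

```python
def blocking_read_cycles(n_words, round_trip):
    """Each load waits for its data before the next load can issue,
    so the full bus round trip is paid for every word."""
    return n_words * round_trip


def posted_write_cycles(n_words, issue_cost, round_trip):
    """Stores are handed off every `issue_cost` cycles and complete in the
    background; only the final store's latency remains visible.
    Assumes the write path never stalls (an idealisation)."""
    return n_words * issue_cost + round_trip


if __name__ == "__main__":
    n = 1000
    reads = blocking_read_cycles(n, round_trip=24)
    writes = posted_write_cycles(n, issue_cost=2, round_trip=24)
    print(f"reads:  {reads} cycles")
    print(f"writes: {writes} cycles")
    print(f"ratio:  {reads / writes:.1f}x")
```

With these assumed values the model lands close to the ~12x gap reported at the top of the thread, which is consistent with reads being latency-bound and writes being issue-bound.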

Regards,

Jing

udoeb · ‎10-26-2021

Hi @jingpan,

Thanks for your feedback. I understand. More questions:

3) Is there anything we can do to improve M4 access to OCRAM?

4) Is our clock setup optimal? BUS_CLK, BUS_LPSR_CLK and M4_CLK come from SYS_PLL3 (480MHz) while M7_CLK and AXI_CLK come from ARM_PLL. Some values from the generated clock_config.c are shown below.

...
- {id: ARM_PLL_CLK.outFreq, value: 996 MHz}
- {id: AXI_CLK_ROOT.outFreq, value: 996 MHz}
- {id: M7_CLK_ROOT.outFreq, value: 996 MHz}
- {id: BUS_CLK_ROOT.outFreq, value: 240 MHz}
- {id: BUS_LPSR_CLK_ROOT.outFreq, value: 160 MHz}
- {id: M4_CLK_ROOT.outFreq, value: 4320/11 MHz}
...
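To relate these clock values to the measurements, the short sketch below checks that 4320/11 MHz matches the SystemCoreClock reported for the CM4 in the first post, and expresses the CM4 OCRAM1 rates in BUS_LPSR clock cycles. Treating BUS_LPSR as the relevant clock is an assumption taken from the reply above; the division only shows how many 160 MHz cycles each access corresponds to, not where the time is actually spent.

```python
from fractions import Fraction

# Values from the clock_config.c excerpt.
M4_CLK_ROOT_HZ = Fraction(4320, 11) * 1_000_000   # "4320/11 MHz"
BUS_LPSR_HZ = 160_000_000
BUS_HZ = 240_000_000

# Values from the first post.
CM4_SYSTEM_CORE_CLOCK = 392_727_258               # Hz
CM4_OCRAM1 = {"read": 21_156_189, "write": 17_457_387}  # words/s


def bus_cycles_per_word(bus_hz, words_per_s):
    """Number of bus clock cycles that elapse per 32-bit word transferred."""
    return bus_hz / words_per_s


if __name__ == "__main__":
    print(f"M4_CLK_ROOT = {float(M4_CLK_ROOT_HZ):.0f} Hz, "
          f"SystemCoreClock = {CM4_SYSTEM_CORE_CLOCK} Hz")
    for op, rate in CM4_OCRAM1.items():
        print(f"CM4 OCRAM1 {op:5s}: "
              f"{bus_cycles_per_word(BUS_LPSR_HZ, rate):.2f} BUS_LPSR cycles/word")
```

Under that assumption each CM4 access to OCRAM1 spans roughly 7.5 to 9 BUS_LPSR cycles.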

jingpan · ‎10-26-2021

Hi @udoeb ,

There is little you can do in hardware to improve read speed. Please refer to AN12437 to see whether anything can be done in software.

Regards,

Jing