RT1170: question about memcpy benchmark


1,200 views
zixunli
Contributor I

Hello,

I ran a memcpy benchmark on the MIMXRT1170-EVKB to compare copy speed between DTCM, cached OCRAM, and non-cached OCRAM, but the results are confusing.

I expected the cached region to perform like DTCM, but the cache does not seem to provide any benefit.

Each test is run multiple times to make sure the cache is filled after the first pass.
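For readers without the attachment, here is a minimal sketch of how a repeated, per-pass measurement like this could be structured. It is not the code from the attached project: `bench_memcpy`, `cycle_fn`, and `fake_cycle_counter` are hypothetical names, and the cycle source is passed in as a function pointer so the sketch also builds on a host. On the target, the cycle source would typically read the DWT cycle counter.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cycle source. On the target this would typically return
 * DWT->CYCCNT; here it is a parameter so the sketch is host-buildable. */
typedef uint32_t (*cycle_fn)(void);

/* Host stand-in for a cycle counter: advances by 100 on every read. */
static uint32_t fake_cycle_counter(void)
{
    static uint32_t ticks;
    ticks += 100u;
    return ticks;
}

/* Copy src to dst `loops` times and record the cycle count of each pass.
 * The first pass includes any cold-cache cost; later passes show the
 * steady state. */
static void bench_memcpy(void *dst, const void *src, size_t len,
                         unsigned loops, cycle_fn now, uint32_t *cycles_out)
{
    assert(dst != NULL && src != NULL && now != NULL && cycles_out != NULL);
    for (unsigned i = 0; i < loops; i++) {
        uint32_t start = now();
        memcpy(dst, src, len);
        uint32_t end = now();
        /* Unsigned subtraction stays correct across one counter wrap. */
        cycles_out[i] = end - start;
    }
}
```

Calling `bench_memcpy(dst, src, 0x400, 8, fake_cycle_counter, cycles)` fills `cycles[0..7]` with one entry per pass, matching the "Loop:n Cycle:x" layout of the output below.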

 

memcpy benchmark
DTCM - DTCM
Loop:0   Cycle:3258
Loop:1   Cycle:432
Loop:2   Cycle:431
Loop:3   Cycle:431
Loop:4   Cycle:431
Loop:5   Cycle:431
Loop:6   Cycle:431
Loop:7   Cycle:431
DTCM - NonCache
Loop:0   Cycle:534
Loop:1   Cycle:528
Loop:2   Cycle:528
Loop:3   Cycle:528
Loop:4   Cycle:528
Loop:5   Cycle:527
Loop:6   Cycle:528
Loop:7   Cycle:528
DTCM - Cache
Loop:0   Cycle:538
Loop:1   Cycle:532
Loop:2   Cycle:533
Loop:3   Cycle:532
Loop:4   Cycle:532
Loop:5   Cycle:532
Loop:6   Cycle:532
Loop:7   Cycle:532
DTCM - Cache+Flush
Loop:0   Cycle:863
Loop:1   Cycle:857
Loop:2   Cycle:865
Loop:3   Cycle:857
Loop:4   Cycle:865
Loop:5   Cycle:856
Loop:6   Cycle:865
Loop:7   Cycle:857
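To put the numbers above in perspective, the following sketch converts steady-state cycle counts into bytes per cycle and into a percentage of the DTCM-to-DTCM baseline. It assumes each pass copies a full 0x400-byte buffer (the buffer size shown in the linker output); the post does not state the copy length explicitly, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes copied per CPU cycle, scaled by 1000 to avoid floating point. */
static uint32_t milli_bytes_per_cycle(uint32_t bytes, uint32_t cycles)
{
    assert(cycles != 0u);
    return (uint32_t)(((uint64_t)bytes * 1000u) / cycles);
}

/* Cycle count relative to a baseline, in percent (100 = same speed,
 * larger = slower). */
static uint32_t percent_of_baseline(uint32_t cycles, uint32_t baseline_cycles)
{
    assert(baseline_cycles != 0u);
    return (uint32_t)(((uint64_t)cycles * 100u) / baseline_cycles);
}
```

Under that assumption, `milli_bytes_per_cycle(1024, 431)` gives 2375 (about 2.4 bytes per cycle) for DTCM-to-DTCM, while `percent_of_baseline(528, 431)` and `percent_of_baseline(532, 431)` give 122 and 123. In other words, the non-cached and cached destinations both take roughly 22 to 23 percent more cycles than the baseline and are nearly indistinguishable from each other.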

 

 

The attached example can be placed into SDKROOT\boards\evkbmimxrt1170.

I've modified the linker scripts so that the non-cached region is set up correctly by BOARD_ConfigMPU, since __NCACHE_REGION_SIZE is 0 by default.

Only main.c, MIMXRT1176xxxxx_cm7_flexspi_nor.icf, and MIMXRT1176xxxxx_cm7_ram.icf are modified; all other files are the kSDK defaults.

The linker output also looks correct:

 

buffer1                 0x2000'0020  0x400  Data  Gb  main.o [5]
buffer2                 0x2000'0420  0x400  Data  Gb  main.o [5]
buffer_cached           0x202c'0000  0x400  Data  Gb  main.o [5]
buffer_ncache           0x2032'0000  0x400  Data  Gb  main.o [5]

 

 

0 Kudos
5 Replies

1,136 views
Omar_Anguiano
NXP TechSupport

The cache has no effect on TCM regions. The TCM interfaces are synchronous to the Cortex-M7 core and run at the same frequency, so accesses to the xTCM memories are expected to be single-cycle.
With the cache enabled, OCRAM performs close to TCM because cache accesses are single-cycle; with the cache disabled, OCRAM performance is very low compared to TCM.

Best regards,
Omar

0 Kudos

1,132 views
zixunli
Contributor I

Hi Omar,

 

Thanks for your reply, but it doesn't answer the question at all.

> The cache has no effect on TCM regions. The TCM interfaces are synchronous to the Cortex-M7 core and run at the same frequency, so accesses to the xTCM memories are expected to be single-cycle.

Yes, that's true. In my benchmark I use the DTCM-to-DTCM transfer as the baseline for comparing the other memory types.

> With the cache enabled, OCRAM performs close to TCM because cache accesses are single-cycle; with the cache disabled, OCRAM performance is very low compared to TCM.

As you can see from DTCM-NonCache vs. DTCM-Cache in my benchmark, there is no performance improvement at all when transferring to a cached region.

The question is why the cache doesn't improve OCRAM performance.

 

 

 

0 Kudos

1,059 views
Omar_Anguiano
NXP TechSupport

If OCRAM performance does not improve with the cache enabled, the cache is probably not set up correctly. Please make sure the MPU region covering OCRAM is non-shareable, because on i.MX RT a shareable region is non-cacheable by default.

Also, please perform a clean operation before using the OCRAM area.

Best regards,
Omar

0 Kudos

1,017 views
zixunli
Contributor I

> If OCRAM performance does not improve with the cache enabled, the cache is probably not set up correctly. Please make sure the MPU region covering OCRAM is non-shareable, because on i.MX RT a shareable region is non-cacheable by default.

That's exactly what the example I attached does. The MPU is configured by BOARD_ConfigMPU(), which is provided by kSDK, and as far as I can see the function does configure OCRAM as non-shareable:

 

    /* Region 6 setting: Memory with Normal type, not shareable, outer/inner write back */
    MPU->RBAR = ARM_MPU_RBAR(6, 0x20200000U);
    MPU->RASR = ARM_MPU_RASR(0, ARM_MPU_AP_FULL, 0, 0, 1, 1, 0, ARM_MPU_REGION_SIZE_1MB);

...
...

    /* Enable I cache and D cache */
#if defined(__DCACHE_PRESENT) && __DCACHE_PRESENT
    SCB_EnableDCache();
#endif
#if defined(__ICACHE_PRESENT) && __ICACHE_PRESENT
    SCB_EnableICache();
#endif
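As a cross-check of the region 6 line above, this sketch re-encodes the same `ARM_MPU_RASR(...)` arguments by hand and tests whether an address falls inside the region. The bit positions and the values of `ARM_MPU_AP_FULL` and `ARM_MPU_REGION_SIZE_1MB` are restated locally so the code builds without CMSIS headers; `rasr_encode`, `rasr_is_shareable`, and `region_contains` are illustrative names, not SDK functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-M MPU_RASR field positions, restated so no CMSIS header is needed. */
#define RASR_XN_POS    28u
#define RASR_AP_POS    24u
#define RASR_TEX_POS   19u
#define RASR_S_POS     18u
#define RASR_C_POS     17u
#define RASR_B_POS     16u
#define RASR_SRD_POS    8u
#define RASR_SIZE_POS   1u
#define RASR_ENABLE     1u

#define AP_FULL          3u     /* value of ARM_MPU_AP_FULL */
#define REGION_SIZE_1MB  0x13u  /* value of ARM_MPU_REGION_SIZE_1MB */

/* Same argument order as ARM_MPU_RASR(XN, AP, TEX, S, C, B, SRD, SIZE). */
static uint32_t rasr_encode(uint32_t xn, uint32_t ap, uint32_t tex,
                            uint32_t s, uint32_t c, uint32_t b,
                            uint32_t srd, uint32_t size)
{
    return ((xn   & 0x1u)  << RASR_XN_POS)   |
           ((ap   & 0x7u)  << RASR_AP_POS)   |
           ((tex  & 0x7u)  << RASR_TEX_POS)  |
           ((s    & 0x1u)  << RASR_S_POS)    |
           ((c    & 0x1u)  << RASR_C_POS)    |
           ((b    & 0x1u)  << RASR_B_POS)    |
           ((srd  & 0xFFu) << RASR_SRD_POS)  |
           ((size & 0x1Fu) << RASR_SIZE_POS) |
           RASR_ENABLE;
}

/* True if the S (shareable) bit is set in a RASR value. */
static bool rasr_is_shareable(uint32_t rasr)
{
    return ((rasr >> RASR_S_POS) & 1u) != 0u;
}

/* True if addr lies inside a region of 2^(size_field + 1) bytes at base. */
static bool region_contains(uint32_t base, uint32_t size_field, uint32_t addr)
{
    assert(size_field < 31u);   /* keep the shift below 32 bits */
    uint32_t bytes = 1u << (size_field + 1u);
    return addr >= base && (addr - base) < bytes;
}
```

With the arguments from the post, `rasr_encode(0, AP_FULL, 0, 0, 1, 1, 0, REGION_SIZE_1MB)` yields 0x03030027: the S bit is clear, so the region is indeed non-shareable, with C=1 and B=1. `region_contains(0x20200000, REGION_SIZE_1MB, 0x202C0000)` is true, so buffer_cached sits inside region 6, while buffer_ncache at 0x20320000 lies above the 1 MB window and gets its attributes from a region that is elided in the snippet. Note that in the ARMv7-M encoding, TEX=0 with C=1 and B=1 denotes write-back without write-allocate; the thread does not establish whether that is related to the measured results.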

 

 

> Also, please perform a clean operation before using the OCRAM area.

I believe you mean a cache invalidate; that is already done inside the CMSIS function SCB_EnableDCache().
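Since "clean" and "invalidate" are easy to mix up, here is a toy model of the difference, reduced to one cache line in front of one word of memory. It only illustrates the terminology and is not how the Cortex-M7 cache is implemented; all names are hypothetical. On the target, the corresponding CMSIS calls would be `SCB_CleanDCache()` and `SCB_InvalidateDCache()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: a single write-back cache line backed by a single memory word. */
typedef struct {
    uint32_t memory;  /* backing store, e.g. OCRAM */
    uint32_t line;    /* cached copy of the word */
    bool valid;       /* line holds usable data */
    bool dirty;       /* line is newer than memory */
} toy_cache;

/* CPU write: lands in the cache only and marks the line dirty. */
static void toy_write(toy_cache *c, uint32_t value)
{
    assert(c != NULL);
    c->line = value;
    c->valid = true;
    c->dirty = true;
}

/* CPU read: refills from memory on a miss, then returns the cached copy. */
static uint32_t toy_read(toy_cache *c)
{
    assert(c != NULL);
    if (!c->valid) {
        c->line = c->memory;
        c->valid = true;
        c->dirty = false;
    }
    return c->line;
}

/* Clean: write dirty data back to memory; the line stays valid. */
static void toy_clean(toy_cache *c)
{
    assert(c != NULL);
    if (c->valid && c->dirty) {
        c->memory = c->line;
        c->dirty = false;
    }
}

/* Invalidate: drop the line without writing it back; dirty data is lost. */
static void toy_invalidate(toy_cache *c)
{
    assert(c != NULL);
    c->valid = false;
    c->dirty = false;
}
```

In this model, a write followed by `toy_clean()` makes the new value visible in memory, whereas a write followed by `toy_invalidate()` discards it and the next read returns the old memory contents. An invalidate at enable time, as SCB_EnableDCache() performs, makes sense because the cache contains no meaningful data yet.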

0 Kudos

1,199 views
zixunli
Contributor I

Here is the project.

0 Kudos