Hello,
I did a memcpy benchmark on MIMXRT1170-EVKB to compare speed between DTCM, cached OCRAM and non-cached OCRAM but the result is confusing.
I expected the cached region to perform like DTCM, but it looks like the cache provides no benefit.
Each test is run multiple times to make sure the cache is filled.
memcpy benchmark
DTCM - DTCM
Loop:0 Cycle:3258
Loop:1 Cycle:432
Loop:2 Cycle:431
Loop:3 Cycle:431
Loop:4 Cycle:431
Loop:5 Cycle:431
Loop:6 Cycle:431
Loop:7 Cycle:431
DTCM - NonCache
Loop:0 Cycle:534
Loop:1 Cycle:528
Loop:2 Cycle:528
Loop:3 Cycle:528
Loop:4 Cycle:528
Loop:5 Cycle:527
Loop:6 Cycle:528
Loop:7 Cycle:528
DTCM - Cache
Loop:0 Cycle:538
Loop:1 Cycle:532
Loop:2 Cycle:533
Loop:3 Cycle:532
Loop:4 Cycle:532
Loop:5 Cycle:532
Loop:6 Cycle:532
Loop:7 Cycle:532
DTCM - Cache+Flush
Loop:0 Cycle:863
Loop:1 Cycle:857
Loop:2 Cycle:865
Loop:3 Cycle:857
Loop:4 Cycle:865
Loop:5 Cycle:856
Loop:6 Cycle:865
Loop:7 Cycle:857
The attached example can be placed into SDKROOT\boards\evkbmimxrt1170.
I've modified the linker script to ensure the non-cached region is correctly set up by BOARD_ConfigMPU, since by default __NCACHE_REGION_SIZE is 0.
Only main.c, MIMXRT1176xxxxx_cm7_flexspi_nor.icf and MIMXRT1176xxxxx_cm7_ram.icf are modified; all other files are the kSDK defaults.
The linker output also seems correct:
buffer1          0x2000'0020   0x400  Data  Gb  main.o [5]
buffer2          0x2000'0420   0x400  Data  Gb  main.o [5]
buffer_cached    0x202c'0000   0x400  Data  Gb  main.o [5]
buffer_ncache    0x2032'0000   0x400  Data  Gb  main.o [5]
Cache doesn’t have effect on TCM fields. TCM interfaces are synchronous to the Cortex M7 and run at the same frequency. Hence it is expected that the access to the xTCM memories is single cycle.
OCRAM performance with cache performs closely to TCM as cache is single access, with cache disable the OCRAM performance is very low compared to TCM.
Best regards,
Omar
Hi Omar,
Thanks for your reply; however, it doesn't answer the question at all.
Cache doesn’t have effect on TCM fields. TCM interfaces are synchronous to the Cortex M7 and run at the same frequency. Hence it is expected that the access to the xTCM memories is single cycle.
Yes it's true, in my benchmark I use DTCM-to-DTCM transfer as baseline to compare other memory types.
OCRAM performance with cache performs closely to TCM as cache is single access, with cache disable the OCRAM performance is very low compared to TCM.
As you can see from DTCM-NonCache vs DTCM-Cache in my benchmark, there is no performance improvement at all when transferring to a cached region.
The question is why cache doesn't improve OCRAM performance.
If OCRAM performance is not improving with cache it means that cache is not well implemented. Please make sure that MPU on OCRAM is non-shareable as shareable in i.MXRT means non-cacheable by default.
Also before using the OCRAM area please perform a clean operation.
Best regards,
Omar
If OCRAM performance is not improving with cache it means that cache is not well implemented. Please make sure that MPU on OCRAM is non-shareable as shareable in i.MXRT means non-cacheable by default.
That's exactly what is done in the example I attached. The MPU is configured by BOARD_ConfigMPU() provided by kSDK. As far as I can see, the function does configure OCRAM as non-shareable:
/* Region 6 setting: Memory with Normal type, not shareable, outer/inner write back */
MPU->RBAR = ARM_MPU_RBAR(6, 0x20200000U);
MPU->RASR = ARM_MPU_RASR(0, ARM_MPU_AP_FULL, 0, 0, 1, 1, 0, ARM_MPU_REGION_SIZE_1MB);
...
...
/* Enable I cache and D cache */
#if defined(__DCACHE_PRESENT) && __DCACHE_PRESENT
SCB_EnableDCache();
#endif
#if defined(__ICACHE_PRESENT) && __ICACHE_PRESENT
SCB_EnableICache();
#endif
Also before using the OCRAM area please perform a clean operation.
I believe you mean a cache invalidate; that is already done inside the CMSIS function SCB_EnableDCache().