RT1170: question about memcpy benchmark

zixunli · ‎11-29-2024

Hello,

I ran a memcpy benchmark on the MIMXRT1170-EVKB to compare speed between DTCM, cached OCRAM, and non-cached OCRAM, but the results are confusing.

I expected the cached region to perform like DTCM, but it looks like the cache doesn't provide any benefit.

Each test is run multiple times so that the cache is filled.
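
For readers without the attachment, the measurement loop described above might look roughly like the sketch below. It is not the code from the attached project: `cycle_reader_t`, `bench_memcpy_once`, `bench_memcpy` and `fake_cycle_reader` are hypothetical names, and the counter is passed in as a function pointer so the same code can be dry-run on a host. On the target, the reader would typically return the Cortex-M7 cycle counter (DWT->CYCCNT).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hook: returns a free-running 32-bit cycle count.
 * On a Cortex-M7 target this would typically read DWT->CYCCNT. */
typedef uint32_t (*cycle_reader_t)(void);

/* Time one memcpy call and return the elapsed count.
 * Unsigned subtraction stays correct across a single counter wrap. */
static uint32_t bench_memcpy_once(void *dst, const void *src, size_t len,
                                  cycle_reader_t read_cycles)
{
    uint32_t start = read_cycles();
    memcpy(dst, src, len);
    uint32_t end = read_cycles();
    return end - start;
}

/* Repeat the measurement and keep every result, so the first (cold)
 * pass can be compared with the later (warm) passes. */
static void bench_memcpy(void *dst, const void *src, size_t len,
                         cycle_reader_t read_cycles,
                         uint32_t *results, size_t loops)
{
    for (size_t i = 0; i < loops; i++) {
        results[i] = bench_memcpy_once(dst, src, len, read_cycles);
    }
}

/* Host-side stand-in for a real counter (assumption, for dry runs only):
 * advances by a fixed step on every read. */
static uint32_t fake_counter;

static uint32_t fake_cycle_reader(void)
{
    fake_counter += 7u;
    return fake_counter;
}
```

Storing every pass rather than an average keeps the cold first iteration visible, as in the Loop:0 line of the DTCM - DTCM results.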

memcpy benchmark
DTCM - DTCM
Loop:0   Cycle:3258
Loop:1   Cycle:432
Loop:2   Cycle:431
Loop:3   Cycle:431
Loop:4   Cycle:431
Loop:5   Cycle:431
Loop:6   Cycle:431
Loop:7   Cycle:431
DTCM - NonCache
Loop:0   Cycle:534
Loop:1   Cycle:528
Loop:2   Cycle:528
Loop:3   Cycle:528
Loop:4   Cycle:528
Loop:5   Cycle:527
Loop:6   Cycle:528
Loop:7   Cycle:528
DTCM - Cache
Loop:0   Cycle:538
Loop:1   Cycle:532
Loop:2   Cycle:533
Loop:3   Cycle:532
Loop:4   Cycle:532
Loop:5   Cycle:532
Loop:6   Cycle:532
Loop:7   Cycle:532
DTCM - Cache+Flush
Loop:0   Cycle:863
Loop:1   Cycle:857
Loop:2   Cycle:865
Loop:3   Cycle:857
Loop:4   Cycle:865
Loop:5   Cycle:856
Loop:6   Cycle:865
Loop:7   Cycle:857
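
To put the numbers above in relative terms, here is a small sketch that expresses each case as a slowdown against the DTCM - DTCM baseline. The cycle counts are copied from the output (using one representative warm value per case); the enum names and `slowdown_percent` are made up for illustration.

```c
#include <stdint.h>

/* Representative warm-loop cycle counts copied from the benchmark output. */
enum {
    CYCLES_DTCM_TO_DTCM        = 431,
    CYCLES_DTCM_TO_NONCACHE    = 528,
    CYCLES_DTCM_TO_CACHE       = 532,
    CYCLES_DTCM_TO_CACHE_FLUSH = 857
};

/* Slowdown relative to a baseline, in percent, rounded to nearest.
 * A negative result would mean the case is faster than the baseline. */
static int32_t slowdown_percent(uint32_t cycles, uint32_t baseline)
{
    uint32_t ratio_percent = (cycles * 100u + baseline / 2u) / baseline;
    return (int32_t)ratio_percent - 100;
}
```

With these inputs, the non-cached and cached destinations both come out at about 23% slower than DTCM, and the Cache+Flush case at about 99% slower.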

The attached example can be placed into SDKROOT\boards\evkbmimxrt1170.

I've modified the linker script to ensure the non-cached region is correctly set up by BOARD_ConfigMPU(), since by default __NCACHE_REGION_SIZE is 0.

Only main.c, MIMXRT1176xxxxx_cm7_flexspi_nor.icf, and MIMXRT1176xxxxx_cm7_ram.icf are modified; all other files use the kSDK defaults.

The linker output also seems correct:

buffer1                 0x2000'0020  0x400  Data  Gb  main.o [5]
buffer2                 0x2000'0420  0x400  Data  Gb  main.o [5]
buffer_cached           0x202c'0000  0x400  Data  Gb  main.o [5]
buffer_ncache           0x2032'0000  0x400  Data  Gb  main.o [5]
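
One way to cross-check these addresses is to test them against the cacheable OCRAM region that the BOARD_ConfigMPU() excerpt quoted later in this thread sets up (Region 6, base 0x20200000, size 1 MB). The helper below is a hypothetical sketch, not part of the SDK, and it only covers that one region; the MPU regions not quoted in the thread are not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Region 6 parameters as quoted from BOARD_ConfigMPU() in this thread. */
#define REGION6_BASE 0x20200000u
#define REGION6_SIZE 0x00100000u /* 1 MB */

/* True if [addr, addr + len) lies entirely inside [base, base + size).
 * Written with subtractions so that it cannot overflow. */
static bool range_in_region(uint32_t addr, uint32_t len,
                            uint32_t base, uint32_t size)
{
    if (addr < base || len > size) {
        return false;
    }
    return (addr - base) <= (size - len);
}
```

Applied to the table above, buffer_cached (0x202c'0000, 0x400 bytes) falls inside that 1 MB region, while buffer_ncache (0x2032'0000) and the two DTCM buffers fall outside it.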

Omar_Anguiano · ‎12-02-2024

Cache has no effect on the TCM regions. The TCM interfaces are synchronous to the Cortex-M7 and run at the same frequency, so accesses to the xTCM memories are expected to be single-cycle.
With cache enabled, OCRAM performs close to TCM because cache accesses are single-cycle; with cache disabled, OCRAM performance is much lower than TCM.

Best regards,
Omar

zixunli · ‎12-02-2024

Hi Omar,

Thanks for your reply; however, it doesn't answer the question at all.

> Cache has no effect on the TCM regions. The TCM interfaces are synchronous to the Cortex-M7 and run at the same frequency, so accesses to the xTCM memories are expected to be single-cycle.

Yes, that's true; in my benchmark I use the DTCM-to-DTCM transfer as the baseline for comparing the other memory types.

> With cache enabled, OCRAM performs close to TCM because cache accesses are single-cycle; with cache disabled, OCRAM performance is much lower than TCM.

As you can see by comparing DTCM - NonCache with DTCM - Cache in my benchmark, there is no performance improvement at all when transferring to a cached region.

The question is why the cache doesn't improve OCRAM performance.

Omar_Anguiano · ‎12-05-2024

If OCRAM performance does not improve with cache, the cache is probably not set up correctly. Please make sure the MPU region for OCRAM is non-shareable, because on i.MX RT shareable memory is treated as non-cacheable by default.

Also, please perform a clean operation before using the OCRAM area.

Best regards,
Omar

zixunli · ‎12-06-2024

> If OCRAM performance does not improve with cache, the cache is probably not set up correctly. Please make sure the MPU region for OCRAM is non-shareable, because on i.MX RT shareable memory is treated as non-cacheable by default.

That's exactly what the example I attached does. The MPU is configured by BOARD_ConfigMPU(), provided by the kSDK, and as far as I can see the function does configure OCRAM as non-shareable:

    /* Region 6 setting: Memory with Normal type, not shareable, outer/inner write back */
    MPU->RBAR = ARM_MPU_RBAR(6, 0x20200000U);
    MPU->RASR = ARM_MPU_RASR(0, ARM_MPU_AP_FULL, 0, 0, 1, 1, 0, ARM_MPU_REGION_SIZE_1MB);

...
...

    /* Enable I cache and D cache */
#if defined(__DCACHE_PRESENT) && __DCACHE_PRESENT
    SCB_EnableDCache();
#endif
#if defined(__ICACHE_PRESENT) && __ICACHE_PRESENT
    SCB_EnableICache();
#endif
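
As a way to double-check what the Region 6 arguments above encode, the sketch below rebuilds the RASR value on a host and decodes its TEX/S/C/B bits according to the ARMv7-M attribute table. `make_rasr` and `rasr_memory_type` are hypothetical helpers; the bit positions and the constants (AP full access = 3, 1 MB size field = 0x13) are assumed to match the CMSIS definitions.

```c
#include <stdint.h>

/* ARMv7-M MPU_RASR attribute fields. */
#define RASR_TEX(r) (((r) >> 19) & 0x7u)
#define RASR_S(r)   (((r) >> 18) & 0x1u)
#define RASR_C(r)   (((r) >> 17) & 0x1u)
#define RASR_B(r)   (((r) >> 16) & 0x1u)

/* Values assumed to match ARM_MPU_AP_FULL and ARM_MPU_REGION_SIZE_1MB. */
#define AP_FULL        3u
#define REGION_SIZE_1M 0x13u

/* Hypothetical helper taking the same argument order as ARM_MPU_RASR:
 * XN, AP, TEX, S, C, B, SRD, SIZE. Bit 0 is the region enable bit. */
static uint32_t make_rasr(uint32_t xn, uint32_t ap, uint32_t tex,
                          uint32_t s, uint32_t c, uint32_t b,
                          uint32_t srd, uint32_t size)
{
    return (xn << 28) | (ap << 24) | (tex << 19) | (s << 18) |
           (c << 17) | (b << 16) | (srd << 8) | (size << 1) | 1u;
}

/* Decode the common TEX/C/B combinations; everything else is reported
 * as "other" rather than guessed. */
static const char *rasr_memory_type(uint32_t rasr)
{
    uint32_t tex = RASR_TEX(rasr);
    uint32_t c = RASR_C(rasr);
    uint32_t b = RASR_B(rasr);

    if (tex == 0u) {
        if (!c && !b) return "Strongly-ordered";
        if (!c && b)  return "Device, shareable";
        if (c && !b)  return "Normal, write-through, no write-allocate";
        return "Normal, write-back, no write-allocate";
    }
    if (tex == 1u && !c && !b) return "Normal, non-cacheable";
    if (tex == 1u && c && b)   return "Normal, write-back, write-allocate";
    return "other (see the ARMv7-M Architecture Reference Manual)";
}
```

For the arguments quoted above (TEX=0, S=0, C=1, B=1), the decode gives non-shareable, write-back, no write-allocate. If that is the policy actually in effect, stores that miss in the cache would not allocate lines, which might be one reason a cached buffer used only as the memcpy destination shows little gain; this is an observation about the encoding, not a confirmed diagnosis.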

> Also, please perform a clean operation before using the OCRAM area.

I believe you mean a cache invalidate; that is done in the CMSIS function SCB_EnableDCache().
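
Since clean and invalidate are easy to mix up: a clean writes dirty cache lines back to memory, while an invalidate discards lines so that the next read fetches from memory. The by-address CMSIS variants (SCB_CleanDCache_by_Addr, SCB_InvalidateDCache_by_Addr) work on whole cache lines, which are 32 bytes on the Cortex-M7. The sketch below, using hypothetical helper names, shows how a buffer range maps onto whole lines; it runs on a host and does not touch the cache itself.

```c
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u /* Cortex-M7 data cache line size in bytes */

/* Round an address down to the start of its cache line. */
static uint32_t dcache_align_down(uint32_t addr)
{
    return addr & ~(DCACHE_LINE_SIZE - 1u);
}

/* Number of bytes, in whole cache lines, that cover [addr, addr + len). */
static uint32_t dcache_span(uint32_t addr, uint32_t len)
{
    uint32_t start = dcache_align_down(addr);
    uint32_t end = (addr + len + DCACHE_LINE_SIZE - 1u) &
                   ~(DCACHE_LINE_SIZE - 1u);
    return end - start;
}
```

buffer_cached at 0x202c'0000 with size 0x400 is already line-aligned, so a by-address operation on it would cover exactly the buffer. A buffer that is not line-aligned shares its first or last line with neighbouring data, which matters most for invalidate.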

zixunli · ‎11-29-2024

Here is the project.