RT1170: question about memcpy benchmark

zixunli — Fri, 29 Nov 2024 12:58:46 GMT

Hello,

I ran a memcpy benchmark on the MIMXRT1170-EVKB to compare the speed of DTCM, cached OCRAM and non-cached OCRAM, but the results are confusing.

I expected the cached region to perform like DTCM, but it looks like the cache provides no benefit.

Each test is run multiple times to make sure the cache is filled.

memcpy benchmark

DTCM - DTCM
Loop:0 Cycle:3258
Loop:1 Cycle:432
Loop:2 Cycle:431
Loop:3 Cycle:431
Loop:4 Cycle:431
Loop:5 Cycle:431
Loop:6 Cycle:431
Loop:7 Cycle:431

DTCM - NonCache
Loop:0 Cycle:534
Loop:1 Cycle:528
Loop:2 Cycle:528
Loop:3 Cycle:528
Loop:4 Cycle:528
Loop:5 Cycle:527
Loop:6 Cycle:528
Loop:7 Cycle:528

DTCM - Cache
Loop:0 Cycle:538
Loop:1 Cycle:532
Loop:2 Cycle:533
Loop:3 Cycle:532
Loop:4 Cycle:532
Loop:5 Cycle:532
Loop:6 Cycle:532
Loop:7 Cycle:532

DTCM - Cache+Flush
Loop:0 Cycle:863
Loop:1 Cycle:857
Loop:2 Cycle:865
Loop:3 Cycle:857
Loop:4 Cycle:865
Loop:5 Cycle:856
Loop:6 Cycle:865
Loop:7 Cycle:857

The attached example can be placed into SDKROOT\boards\evkbmimxrt1170.

I've modified the linker scripts so that the non-cached region is correctly set up by BOARD_ConfigMPU, since by default __NCACHE_REGION_SIZE is 0.

Only main.c, MIMXRT1176xxxxx_cm7_flexspi_nor.icf and MIMXRT1176xxxxx_cm7_ram.icf are modified; all other files use the kSDK defaults.

The linker output also seems correct:

buffer1        0x2000'0020  0x400  Data  Gb  main.o [5]
buffer2        0x2000'0420  0x400  Data  Gb  main.o [5]
buffer_cached  0x202c'0000  0x400  Data  Gb  main.o [5]
buffer_ncache  0x2032'0000  0x400  Data  Gb  main.o [5]

Re: RT1170: question about memcpy benchmark

zixunli — Fri, 29 Nov 2024 12:59:44 GMT

Here is the project.

Re: RT1170: question about memcpy benchmark

Omar_Anguiano — Mon, 02 Dec 2024 22:17:15 GMT

Cache has no effect on the TCM memories. The TCM interfaces are synchronous to the Cortex-M7 and run at the same frequency, so access to the xTCM memories is expected to be single-cycle.
With cache enabled, OCRAM performs close to TCM, since cache is single access; with cache disabled, OCRAM performance is very low compared to TCM.

Best regards,
Omar

Re: RT1170: question about memcpy benchmark

zixunli — Mon, 02 Dec 2024 22:32:04 GMT

Hi Omar,

Thanks for your reply, however it doesn't answer the question at all.

> Cache has no effect on the TCM memories. The TCM interfaces are synchronous to the Cortex-M7 and run at the same frequency, so access to the xTCM memories is expected to be single-cycle.

Yes, that's true. In my benchmark I use the DTCM-to-DTCM transfer as the baseline against which the other memory types are compared.

> With cache enabled, OCRAM performs close to TCM, since cache is single access; with cache disabled, OCRAM performance is very low compared to TCM.

As you can see from DTCM-NonCache vs DTCM-Cache in my benchmark, there is no performance improvement at all when transferring to a cached region.

The question is why the cache doesn't improve OCRAM performance.

Re: RT1170: question about memcpy benchmark

Omar_Anguiano — Thu, 05 Dec 2024 23:24:57 GMT

If OCRAM performance is not improving with cache, it means that the cache is not configured correctly. Please make sure that the MPU region for OCRAM is non-shareable, as shareable on i.MX RT means non-cacheable by default.

Also, before using the OCRAM area, please perform a clean operation.

Best regards,
Omar

Re: RT1170: question about memcpy benchmark

zixunli — Fri, 06 Dec 2024 07:28:36 GMT

> If OCRAM performance is not improving with cache, it means that the cache is not configured correctly. Please make sure that the MPU region for OCRAM is non-shareable, as shareable on i.MX RT means non-cacheable by default.

That's exactly what is done in the example I attached. The MPU is configured by BOARD_ConfigMPU(), which is provided by the kSDK, and as far as I can see the function does configure OCRAM as non-shareable:

    /* Region 6 setting: Memory with Normal type, not shareable, outer/inner write back */
    MPU->RBAR = ARM_MPU_RBAR(6, 0x20200000U);
    MPU->RASR = ARM_MPU_RASR(0, ARM_MPU_AP_FULL, 0, 0, 1, 1, 0, ARM_MPU_REGION_SIZE_1MB);
    ...
    ...
    /* Enable I cache and D cache */
    #if defined(__DCACHE_PRESENT) && __DCACHE_PRESENT
        SCB_EnableDCache();
    #endif
    #if defined(__ICACHE_PRESENT) && __ICACHE_PRESENT
        SCB_EnableICache();
    #endif

> Also, before using the OCRAM area, please perform a clean operation.

I believe you mean a cache invalidate; that is already done in the CMSIS function SCB_EnableDCache().