Shared memory data update in ALIAS region between M7 and M4

GA154 · ‎05-23-2025

Dear Team,

We are seeing memory updates not being reflected in the alias region between M7 and M4 on the IMXRT1176-EVK, using MCUXpresso IDE with bare-metal USB MSD and FATFS code.

We are developing an application with the following requirements:

Three buffers, shared_buf[3][64KB], are shared between M7 and M4, where M7 is the master and M4 is the slave.
M7 reads 64KB of data from USB into one of the ‘shared_buf’ buffers and sends the buffer index to M4 using an MU interrupt. M7 then starts reading another 64KB from USB into another buffer.
On receiving the interrupt, M4 processes the data in ‘shared_buf’ (say, increments each byte in the buffer by 1) and sends the index of the processed buffer to M7 using an MU interrupt.
On receiving the interrupt, M7 writes the processed data to another USB MSD on Host1. The process then repeats.
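The handshake in the steps above only has to carry a buffer index and a direction. A minimal sketch of one way to pack both into the single 32-bit word that an MU message carries is shown here. The command codes, the bit layout and all function names are hypothetical; none of them come from the post or from the SDK.

```c
#include <assert.h>
#include <stdint.h>

#define SHARED_BUF_COUNT 3u

/* Hypothetical command codes; the post does not define any. */
typedef enum {
    MU_CMD_DATA_READY = 1u, /* M7 -> M4: buffer has been filled from USB */
    MU_CMD_DATA_DONE  = 2u  /* M4 -> M7: buffer has been processed       */
} mu_cmd_t;

/* Pack the command into bits 15:8 and the buffer index into bits 7:0. */
static inline uint32_t mu_msg_pack(mu_cmd_t cmd, uint32_t index)
{
    assert(index < SHARED_BUF_COUNT);
    return ((uint32_t)cmd << 8) | (index & 0xFFu);
}

/* Extract the command from a packed message. */
static inline mu_cmd_t mu_msg_cmd(uint32_t msg)
{
    return (mu_cmd_t)((msg >> 8) & 0xFFu);
}

/* Extract the buffer index from a packed message. */
static inline uint32_t mu_msg_index(uint32_t msg)
{
    return msg & 0xFFu;
}

/* Round-robin choice of the next buffer for M7 to fill. */
static inline uint32_t next_buf(uint32_t index)
{
    return (index + 1u) % SHARED_BUF_COUNT;
}
```

On the target, the packed word would be written to an MU transmit register on one core and decoded in the MU interrupt handler on the other.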

We have made the memory configuration on M7 as shown below:

The memory configuration on M4 is as shown in the attached file and below:

We have placed shared_buf[3][64KB], along with another 32 bytes of size and processing-status information, in the ‘*(.shared_memory)’ section, which is located in the ‘SRAM_CM4_ALIAS_SHARED’ region on M7 and the ‘SRAM_DTC_cm4’ region on M4. ‘shared_buf’ is 32-byte aligned and starts at 0x2020e000 on both cores M7 and M4.
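As a rough illustration of the layout described above, the following sketch declares three 64KB buffers plus 32 bytes of bookkeeping and checks the resulting size and end address. The field names, the split of the 32 bytes and the assumed end of the alias window (0x20240000) are hypothetical and should be checked against the reference manual and the actual linker files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SHARED_BUF_COUNT 3u
#define SHARED_BUF_SIZE  (64u * 1024u)
#define SHARED_BASE_ADDR 0x2020E000u /* start address given in the post */
#define CM4_ALIAS_END    0x20240000u /* assumption: end of the alias window */

/* Hypothetical layout: three 64KB buffers followed by 32 bytes of status. */
typedef struct {
    uint8_t  buf[SHARED_BUF_COUNT][SHARED_BUF_SIZE];
    uint32_t size[SHARED_BUF_COUNT];      /* valid bytes in each buffer */
    uint32_t processed[SHARED_BUF_COUNT]; /* set by M4 when it is done  */
    uint32_t reserved[2];                 /* pads the status to 32 bytes */
} shared_mem_t;

_Static_assert(sizeof(shared_mem_t) == 3u * 65536u + 32u, "unexpected size");
_Static_assert(offsetof(shared_mem_t, size) % 32u == 0u, "status not 32B aligned");

/* On the target, the linker script places this section at 0x2020e000. */
#if defined(__arm__)
__attribute__((section(".shared_memory"), aligned(32)))
#endif
shared_mem_t g_shared;

/* First address after the shared block, if it starts at SHARED_BASE_ADDR. */
static uint32_t shared_end_addr(void)
{
    return SHARED_BASE_ADDR + (uint32_t)sizeof(shared_mem_t);
}

/* True if the whole block fits below the assumed end of the alias window. */
static bool shared_fits_alias(void)
{
    return shared_end_addr() <= CM4_ALIAS_END;
}
```

With these numbers the block is 196640 bytes long and ends at 0x2023E020, which is below the assumed window end.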

We are using the default ‘BOARD_ConfigMPU()’ function with no modifications to the memory region settings, as shown below:

We are able to access the shared memory on both cores successfully, but data changes made on one core are not reflected on the other with the memory configuration described above.

We tried the cache maintenance functions ‘SCB_CleanDCache_by_Addr (volatile void *addr, int32_t dsize)’ on M7 before raising the interrupt to M4, and SCB_CleanInvalidateDCache_by_Addr (volatile void *addr, int32_t dsize) on M7 after receiving the interrupt from M4, but execution becomes nearly 8 times slower.
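The ordering described above (clean before notifying M4, discard stale lines after M4 replies) can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, it uses the CMSIS function SCB_InvalidateDCache_by_Addr for the read side rather than the clean-and-invalidate variant named in the post, and it limits maintenance to the bytes in use in one buffer. The thread does not establish whether this changes the measured slowdown. On a host build the cache calls compile to nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u /* Cortex-M7 L1 data cache line size in bytes */

/* Expand [addr, addr + len) outwards to whole cache lines. */
static void cache_span(uintptr_t addr, size_t len, uintptr_t *start, size_t *size)
{
    assert(len > 0u);
    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE - 1u);
    uintptr_t first = addr & mask;
    uintptr_t last  = (addr + len + CACHE_LINE - 1u) & mask;
    *start = first;
    *size  = (size_t)(last - first);
}

/* On the target, include the device header (which pulls in CMSIS) first. */
#if defined(__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U)
#define DCACHE_CLEAN(a, n)      SCB_CleanDCache_by_Addr((void *)(a), (int32_t)(n))
#define DCACHE_INVALIDATE(a, n) SCB_InvalidateDCache_by_Addr((void *)(a), (int32_t)(n))
#else
/* Host build: there is no cache to maintain. */
#define DCACHE_CLEAN(a, n)      ((void)(a), (void)(n))
#define DCACHE_INVALIDATE(a, n) ((void)(a), (void)(n))
#endif

/* M7, before signalling M4: push the bytes just written out to RAM. */
static void publish_to_m4(const void *buf, size_t used)
{
    uintptr_t start;
    size_t size;
    cache_span((uintptr_t)buf, used, &start, &size);
    DCACHE_CLEAN(start, size);
}

/* M7, after M4 signals completion: drop stale lines before reading. */
static void acquire_from_m4(const void *buf, size_t used)
{
    uintptr_t start;
    size_t size;
    cache_span((uintptr_t)buf, used, &start, &size);
    DCACHE_INVALIDATE(start, size);
}
```

Invalidating a range that is not aligned to cache lines can discard unrelated data that shares a line, which is one reason the 32-byte alignment of ‘shared_buf’ matters.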

If we change the above memory configuration to any of the settings shown below, without using the cache cleaning functions, data changes are reflected properly but execution becomes nearly 8 times slower, which is not desirable.

Setting-1:

Setting-2:

Setting-3:

How can we get data changes reflected properly with shared memory (approx. 196KB) at 0x2020e000 on both M7 and M4?

Please help us to resolve the problem.

GA154 · ‎06-02-2025

Hi @Habib_MS

We have planned another approach on M7 as mentioned below:

Define 3 buffers of 64KB each, buf[3][65536], in ITCM/DTCM and use them for USB file I/O operations
Use another 3 buffers of 64KB each in ALIAS REGION
Define ALIAS REGION as non-cacheable
Read 64KB data from USB0 to buf[0][] in ITCM/DTCM
Use a DMA channel for Tx for memory to memory transfer from ITCM/DTCM to ALIAS REGION
Use MU to inform M4 about data sent on DMA_callback and update the buffer status for processing
Go on to read another 64KB data from USB0 to buf[1][]
Meanwhile Get data processing done by M4
Receive MU from M4 about processing done
Use another DMA channel for Rx for memory to memory transfer from ALIAS REGION to ITCM/DTCM on MU Interrupt
Update buffer status for writing on completion of Rx DMA_callback
Write Processed data to USB1 if buffer status is updated for writing otherwise go on to read another 64KB data from USB0 to another buffer if free
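The buffer status tracking in the steps above can be sketched as a small per-buffer state machine. The state and event names are hypothetical and only mirror the order of the listed steps. It is not the poster's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    BUF_FREE = 0,       /* available for a USB0 read              */
    BUF_FILLING,        /* USB0 read into the TCM buffer running  */
    BUF_TX_DMA,         /* copy TCM -> ALIAS REGION running       */
    BUF_ON_M4,          /* M4 notified through MU, processing     */
    BUF_RX_DMA,         /* copy ALIAS REGION -> TCM running       */
    BUF_READY_TO_WRITE, /* processed data waiting for USB1 write  */
    BUF_STATE_COUNT
} buf_state_t;

typedef enum {
    EV_READ_START = 0, /* USB0 read started                 */
    EV_READ_DONE,      /* USB0 read finished, Tx DMA starts */
    EV_TX_DMA_DONE,    /* Tx DMA callback, MU sent to M4    */
    EV_M4_DONE,        /* MU received from M4, Rx DMA starts */
    EV_RX_DMA_DONE,    /* Rx DMA callback                   */
    EV_WRITE_DONE      /* USB1 write finished               */
} buf_event_t;

/* The only event that is legal in each state. */
static const buf_event_t k_expected[BUF_STATE_COUNT] = {
    EV_READ_START, EV_READ_DONE, EV_TX_DMA_DONE,
    EV_M4_DONE, EV_RX_DMA_DONE, EV_WRITE_DONE
};

/* Advance one buffer; returns false and leaves the state alone on an
 * event that does not belong to the current state. */
static bool buf_step(buf_state_t *state, buf_event_t ev)
{
    assert(state != NULL && *state < BUF_STATE_COUNT);
    if (k_expected[*state] != ev) {
        return false;
    }
    *state = (buf_state_t)((*state + 1) % BUF_STATE_COUNT);
    return true;
}
```

Keeping one such state per buffer lets the main loop pick any BUF_FREE buffer for the next read and any BUF_READY_TO_WRITE buffer for the next write.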

But in this approach, the DMA transfer time is longer than the USB read time on M7, which is again a bottleneck for throughput.

We would like to know:

1. Is this a limitation of the architecture with TCM and the ALIAS region regarding cache coherence?

2. Are we misunderstanding the architecture and therefore not using it to its full extent for our application?

Regards,

@GA154

Habib_MS · ‎05-27-2025

Hello @GA154,

To help determine whether this resolves the issue, could you please declare the buffer using the AT_NONCACHEABLE_SECTION_INIT macro? For example:
AT_NONCACHEABLE_SECTION_INIT (uint32_t srcAddr[20])

The reason for this request is that I suspect the CM7 core may currently be writing to its cache rather than directly updating the RAM, which could be causing inconsistencies.

Additionally, for the purposes of your application, is it strictly necessary for the shared buffer to reside at address 0x2020e000? Or would it be acceptable to use address 0x202C0000, specifically within the OCRAM2 region?

I recommend taking a look at the rpmsg_lite_pingpong_cm7 example included in SDK version 25.03. It demonstrates how to initialize a shared buffer using a user-defined length, which can be very helpful for your implementation. To better understand how the example works, I also highly recommend reading the README file that comes with it—it provides detailed explanations and guidance.

BR
Habib

GA154 · ‎06-02-2025

Hi @Habib_MS
We have gone through various posts on NXP Community and found one with a memory performance comparison. We then decided to use the ALIAS REGION for shared memory, with the intention of zero copy between the two cores.

AT_NONCACHEABLE_SECTION_INIT also makes execution slower.

Regards
@GA154

GA154 · ‎06-02-2025

@Habib_MS

Thank you for your quick reply.

We started with the example provided in i.MX RT1170 Dual Core Application.

We have gone through the video link you provided. We have also gone through previous NXP Community posts; specifically, we found one titled iMXRTxxxx-Memory-Performance-ITCM-DTCM-L1-CACHE-LMEM-CACHE-OCRAM.

We then decided to use the ALIAS REGION as shared memory for our application. We successfully configured both cores with the memories described in my original post, and the code executes as expected. However, we are facing a cache coherence problem between the two cores. The cache cleaning functions helped in this scenario, but execution is very slow.

Our understanding is that the alias region is part of TCM and that TCM is single-cycle memory. Then why is the cache intervening here?

We could find SDK 2.16 as the latest one for RT1176-EVK board.

AT_NONCACHEABLE_SECTION_INIT did not improve execution speed.

What else can be done? Are we missing something here?

Habib_MS · ‎06-04-2025

Hello @GA154,

In order to support you better, could you provide me more details about how you came to the following conclusion?
"We are facing the problem with cache coherence between two cores."

The cache does not intervene with the TCM (Tightly Coupled Memory); the cache is directly connected to the core, as shown in the following image from Chapter 33.2.1 of the RM:

Also, chapter 2.1 of the app note "Using the i.MXRT L1 Cache" states the following:
"The I/DTCM (FlexRAM banks configured as TCM) is accessed directly by CPU core, bypass the L1 cache."

BR
Habib

GA154 · ‎06-26-2025

Dear @Habib_MS

Sorry for my late reply.

As you have pointed out, we are also not sure how the cache is intervening when we use TCM with the cores. However, we set the memory region as cacheable in 'BOARD_ConfigMPU()'. When we set the memory region as non-cacheable, application execution is much slower; when we set it as cacheable, execution is faster but data is not updated for both cores. The details of the memory region settings are given below:

Hence, we were seeking support from NXP Community on a memory region configuration that gives both faster execution and data updates visible to both cores.

Now we are left with running the entire logic on core M7 rather than running it partly on core M7 and partly on core M4.

Running the entire logic on core M7 is faster than running it on two cores (in theory, two cores should be faster than a single core, as the processing logic is offloaded from M7 to M4).

Please do the needful.

Regards,

GA154

GA154 · ‎06-28-2025

Hi @Habib_MS

Thank you for your quick reply.

Your comment: "However, cache memory is private to each core and not visible to others. This means that if multiple cores need to share data, using cacheable memory can lead to inconsistencies, since one core's cache may not reflect the most recent changes made by another."

Yes. This causes data inconsistency between the two cores with a shared region.

Hence, we are now left with the choice of running the entire logic on core M7.

Thank you for your support.

Regards,

GA154

Habib_MS · ‎06-27-2025

Hello @GA154,
Regarding your comments:

"As you have pointed out, we are also not sure how cache is intervening when we use TCM with cores. But we set memory region as cacheable in 'BOARD_ConfigMPU()'."
This parameter is present as it is required by the API, even if it does not affect TCM behavior. As it is ignored by the hardware (mentioned in my previous post) it does not have an impact.
"When we set memory region as non-cacheable, application execution time is much slower and when we set memory region as cacheable application execution is faster, but data is not updated for both cores."

When memory is marked as non-cacheable, the processor must fetch data directly from external memory, which is significantly slower than accessing data from the cache. In contrast, cacheable memory allows the processor to store frequently accessed data in its local cache, which is much faster due to its proximity to the core.

However, cache memory is private to each core and not visible to others. This means that if multiple cores need to share data, using cacheable memory can lead to inconsistencies, since one core's cache may not reflect the most recent changes made by another.
BR
Habib
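The point about private caches in the reply above can be illustrated with a toy model: one byte of shared RAM and a one-entry write-back cache per core. This is a deliberately simplified sketch with hypothetical names. It does not model the real M7 or M4 cache hardware. It only shows why a write by one core can stay invisible to the other until the writer cleans and the reader invalidates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy one-entry private write-back cache. */
typedef struct {
    bool    valid;
    bool    dirty;
    uint8_t data;
} toy_cache_t;

/* The single byte of "shared RAM" both toy cores can reach. */
static uint8_t toy_ram;

/* Read through the cache: fill from RAM on a miss, otherwise hit. */
static uint8_t toy_read(toy_cache_t *c)
{
    if (!c->valid) {
        c->data  = toy_ram;
        c->valid = true;
        c->dirty = false;
    }
    return c->data;
}

/* Write-back policy: the write stays in the cache and RAM is untouched. */
static void toy_write(toy_cache_t *c, uint8_t value)
{
    c->data  = value;
    c->valid = true;
    c->dirty = true;
}

/* Clean: copy a dirty entry out to RAM. */
static void toy_clean(toy_cache_t *c)
{
    if (c->valid && c->dirty) {
        toy_ram  = c->data;
        c->dirty = false;
    }
}

/* Invalidate: forget the cached copy so the next read goes to RAM. */
static void toy_invalidate(toy_cache_t *c)
{
    c->valid = false;
    c->dirty = false;
}
```

In this model a reader keeps seeing its old cached value after the writer's clean, and only sees the new value after its own invalidate.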

GA154 · ‎05-27-2025

@Habib_MS

Thank You for your quick reply.

We have already gone through AN13264.pdf mentioned in link i.MX RT1170 Dual Core Application.

We have gone through the video shared in link Multicore Processing with the i.MX RT 1170 CPU, NXP's MQX RTOS, and PHYTEC’s phyCORE-RT1170 SOM | NX...

We configured both Cores M7 as master and M4 as slave with bare-metal code and both cores are working as expected.

We need to work with a shared-memory triple_buffer where each buffer is 64KB, and we assumed zero-copy would save CPU cycles and give faster execution. Hence we used the alias region for shared memory. Here triple_buffer is used for USB MSD I/O on M7 and data processing on M4.

However, we are experiencing data coherence issues when the alias region is made cacheable. Execution becomes very slow when we make the alias region non-cacheable.

Our understanding is that the alias region is also part of TCM, and that TCM is single-cycle memory running at the core clock.

1. Then why is execution slow when the alias region is made non-cacheable?

2. Any other settings to be done in BOARD_ConfigMPU() for alias region?

3. We could not find SDK (version 25.03) for MIMXRT1170-EVK. We could find only (SDK_2.16.000_MIMXRT1170-EVK) though we are currently using SDK_2.13
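For reference when reading the questions above, the cacheable versus non-cacheable choice in an Armv7-M MPU comes down to the TEX, C and B bits of the region attribute register (MPU_RASR). The sketch below builds a RASR value for a Normal, non-cacheable, shareable, full-access region from the architectural bit positions. The helper names are hypothetical, the 256KB size used in the note after the code is only an example, and this is not taken from the SDK's 'BOARD_ConfigMPU()'.

```c
#include <assert.h>
#include <stdint.h>

/* Armv7-M MPU_RASR field positions. */
#define RASR_AP_POS   24u /* access permission, bits 26:24 */
#define RASR_TEX_POS  19u /* type extension, bits 21:19    */
#define RASR_S_POS    18u /* shareable                     */
#define RASR_C_POS    17u /* cacheable                     */
#define RASR_B_POS    16u /* bufferable                    */
#define RASR_SIZE_POS 1u  /* region size, bits 5:1         */
#define RASR_ENABLE   1u  /* region enable, bit 0          */

#define AP_FULL_ACCESS 3u

/* SIZE encodes a region of 2^(SIZE+1) bytes; input must be a power of two. */
static uint32_t rasr_size_field(uint32_t size_bytes)
{
    assert(size_bytes >= 32u && (size_bytes & (size_bytes - 1u)) == 0u);
    uint32_t log2 = 0u;
    while ((1u << log2) < size_bytes) {
        log2++;
    }
    return log2 - 1u;
}

/* Normal memory, non-cacheable (TEX=1, C=0, B=0), shareable, full access. */
static uint32_t rasr_noncacheable_shareable(uint32_t size_bytes)
{
    return (AP_FULL_ACCESS << RASR_AP_POS)
         | (1u << RASR_TEX_POS)
         | (1u << RASR_S_POS)
         | (0u << RASR_C_POS)
         | (0u << RASR_B_POS)
         | (rasr_size_field(size_bytes) << RASR_SIZE_POS)
         | RASR_ENABLE;
}
```

For a 256KB region this yields 0x030C0023. The region base address written to MPU_RBAR would need to be aligned to the region size.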

Hoping for your best solution to resolve the problem.

Habib_MS · ‎05-23-2025

Hello @GA154,

I will check your memory configurations, and I will get back to you as soon as possible.
I highly recommend looking at these links for more information:

Multicore Processing with the i.MX RT 1170 CPU, NXP's MQX RTOS, and PHYTEC’s phyCORE-RT1170 SOM | NX...

i.MX RT1170 Dual Core Application

SDK (version 25.03) example called "sema4_dualcore_primary_core"

BR
Habib

GA154 · ‎05-27-2025

@Habib_MS

Thank you for your quick reply.

We have referred to the document AN13264.pdf mentioned in the link i.MX RT1170 Dual Core Application.

We have gone through the video given in the link Multicore Processing with the i.MX RT 1170 CPU, NXP's MQX RTOS, and PHYTEC’s phyCORE-RT1170 SOM | NX...

We need to work with a triple_buffer where each buffer is 64KB. We want zero-copy to save CPU cycles. Hence we chose the alias region as shared memory for triple_buffer. M7 is the master and does USB MSD I/O on triple_buffer, and M4 processes triple_buffer. Synchronization between the two cores for data reads and writes is taken care of.

Our application works with both cores as expected, but we face a data coherence problem when the alias region is made cacheable. There is no data coherence problem when we make the alias region non-cacheable, but execution is very slow.

Our understanding is that TCM sits with the core, runs at the core clock and is single-cycle memory. Hence we chose the alias region for shared memory instead of OCRAM, which runs at 1/4th of the M7 core clock.

1. Why does the cacheable setting have such a large impact on the alias region even though the alias region is part of TCM?

2. Are any memory configuration changes needed in 'BOARD_ConfigMPU()' for the alias region so that execution is faster and data coherence is maintained?

3. We are yet to try SDK (version 25.03) example called "sema4_dualcore_primary_core" as we are currently with SDK (version 2.13)

Hoping for your best possible solution to resolve the problem.