Question about the memcpy performance on the m7 core.


746 Views
seungtake
Contributor II

I have a question about the memcpy speed on the M7 core.

For data sharing between the M7 and A53 cores, I implemented a simple example based on the IPCF demo. The code is straightforward: every second, the M7 copies about 3 MB of data into a non-cacheable memory region using memcpy, then triggers an MSI interrupt to let the Linux side (running on the A53) read the data. The data size is fixed, and there are no management structures like “managed/unmanaged” buffers — the M7 simply writes to a fixed memory address, and the A53 only reads after receiving the interrupt.

However, the memcpy operation on the M7 side seems significantly slower than expected. When measured with TRACE32, the memcpy function alone takes about 180 ms. I understand the measurement might include some debugger latency, but this still feels unusually long.

Is this memcpy speed to non-cacheable memory typical on the M7 core?
Currently, DMA is not used, and the addresses and data size are 8‑byte aligned.

I would appreciate any advice on how to improve the memcpy performance in this scenario.

12 Replies

486 Views
chenyin_h
NXP Employee

Hello, @seungtake 

I understand that you measured the performance with the debugger. Would it be possible to measure it with a timer instead, to obtain more accurate results?

BR

Chenyin


168 Views
seungtake
Contributor II
Hi, @chenyin_h

I measured the SRAM to SRAM copy time using the STM timer.

The measurement code is as follows, and the STM module clock was checked in Linux.

STM module clock = 132206143

<test code>

stm->CR = 0x00000D00; // Divider=13 for ~98ns resolution
stm->CNT = 0; // Init counter
stm->CR |= 0x00000001; // start timer

memcpy((void*)(0x35000000), (void*)(buf_ptr), total_send_size);

stm->CR &= ~(0x00000001); // stop timer
end_time = stm->CNT;
elapsed_ticks = end_time;
// STM Module Clock: 132,206,143Hz, Divider=13 → Timer freq = 10,169,703Hz
// Period = 1,000,000,000ns ÷ 10,169,703 = 98.33ns/tick
elapsed_time_ns = elapsed_ticks * 98;
elapsed_time_us = elapsed_time_ns / 1000;
elapsed_time_ms = elapsed_time_us / 1000;

elapsed_ticks = 1761109
elapsed_time_ms = 172

The calculation is approximate, but the measured time is about 171 ms, which differs from the debugger measurement by only about 10 ms.
Would using eDMA make the copy faster? As I understand it, DMA mainly frees up the core rather than speeding up the transfer itself, so I am not sure how much improvement to expect. If eDMA transfers data in bursts, though, I am considering trying it.
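For reference, the tick-to-time conversion in the test code rounds the tick period down to 98 ns, which slightly understates the result. Below is a minimal sketch of the same conversion in 64-bit integer math. The function name `stm_ticks_to_ns` is hypothetical, and the divider is passed in as a parameter because it is worth confirming in the reference manual whether the prescaler field value 0x0D selects divide-by-13 or divide-by-14.

```c
#include <stdint.h>

/* Hypothetical helper: convert timer ticks to nanoseconds.
 * module_clk_hz is the timer module clock (132206143 Hz in the post) and
 * divider is the effective prescaler divide value (13 is assumed in the
 * post; check the reference manual for the actual encoding). */
static uint64_t stm_ticks_to_ns(uint32_t ticks, uint32_t module_clk_hz, uint32_t divider)
{
    /* Module clock cycles that elapsed while the counter was running. */
    uint64_t cycles = (uint64_t)ticks * divider;

    /* Split into whole seconds and a remainder so that the
     * multiplication by 1e9 cannot overflow 64 bits. */
    uint64_t whole_s = cycles / module_clk_hz;
    uint64_t rem = cycles % module_clk_hz;

    return whole_s * 1000000000ULL + (rem * 1000000000ULL) / module_clk_hz;
}
```

With the values from the post, `stm_ticks_to_ns(1761109u, 132206143u, 13u)` gives roughly 173 ms. With a divider of 14 it would give roughly 186 ms.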

123 Views
chenyin_h
NXP Employee

Thanks for sharing more information. @seungtake 

1. Yes, you may try it with DMA to see if there are any improvements.

2. I could not find formal benchmark data of this kind in the performance data available to me. I have raised it internally to look for further resources, and I will reply here if any useful data can be shared.

Thanks

BR

Chenyin


463 Views
seungtake
Contributor II
Hi, @chenyin_h

Yes, I will test with the OS timer or a hardware timer instead of the debugger and then let you know the result.

Thank you.

700 Views
chenyin_h
NXP Employee

Hello, @seungtake 

Thanks for your reply.

If you did not set the frequency from the M core application, the M7 core usually runs at 400 MHz.

Could you share more details about your test function? Is it a memcpy from SRAM to SRAM, from SRAM to DRAM, or something else?

After checking the related documents, I could not find formal performance data of this kind that could be shared for reference.

BR

Chenyin


661 Views
seungtake
Contributor II
Hi, @chenyin_h

Yes, I did not change the frequency in my M7 application.

My test code copies data chunks from SRAM to SRAM.

My code is below.

static int32_t priv_svc_lipcf_cqueue_enqueue(struct lipcf_shm_cqueue_t* cqueue, uint8_t* buf, uint32_t size)
{
    int32_t ret = 0;

    uint32_t queue_end = 0;
    uint32_t first_chunk_size = 0;
    uint32_t second_chunk_size = 0;

    ret = priv_svc_lipcf_check_ptr(cqueue);

    if (ret == 0) {
        ret = priv_svc_lipcf_check_ptr(buf);
    }

    if (ret == 0) {
        queue_end = cqueue->queue_base + cqueue->queue_size;
        if ((cqueue->queue_head + size) <= queue_end) {
            /* The data fits before the end of the queue: single copy. */
            memcpy((void*)(cqueue->queue_head), (void*)(buf), size);
            cqueue->queue_head += size;
            if (cqueue->queue_head >= queue_end) {
                cqueue->queue_head = cqueue->queue_base;
            }
        }
        else {
            /* The data wraps: copy up to the end, then continue from the base. */
            first_chunk_size = (queue_end - cqueue->queue_head);
            second_chunk_size = (size - first_chunk_size);
            if (first_chunk_size > 0) {
                memcpy((void*)(cqueue->queue_head), (void*)(buf), first_chunk_size);
            }

            if (second_chunk_size > 0) {
                memcpy((void*)(cqueue->queue_base), (void*)(buf + first_chunk_size), second_chunk_size);
            }
            cqueue->queue_head = cqueue->queue_base + second_chunk_size;
        }

        /* Head caught up with tail: mark the queue as full. */
        if (cqueue->queue_head == cqueue->queue_tail) {
            cqueue->queue_full = 1;
        }

        ret = size;
    }

    return ret;
}
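
The wrap-around arithmetic in the enqueue function can be checked on its own, away from the target. The following is a minimal sketch, assuming the queue is described by integer addresses as in the code above. The names `chunk_split` and `split_at_wrap` are hypothetical and not part of the IPCF code.

```c
#include <stdint.h>

/* Hypothetical helper type: sizes of the two copies needed for one enqueue. */
struct chunk_split {
    uint32_t first;   /* bytes copied at the current head */
    uint32_t second;  /* bytes copied at the queue base after wrapping */
};

/* Split a write of `size` bytes at the end of a circular queue that
 * starts at `base` and is `queue_size` bytes long. */
static struct chunk_split split_at_wrap(uint32_t base, uint32_t queue_size,
                                        uint32_t head, uint32_t size)
{
    struct chunk_split s;
    uint32_t room = (base + queue_size) - head;  /* bytes left before the end */

    s.first = (size <= room) ? size : room;
    s.second = size - s.first;
    return s;
}
```

For example, a 0x20-byte write with the head 0x10 bytes before the end of the queue splits into two 0x10-byte copies, which matches the two memcpy calls in the else branch.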

Thank you.


619 Views
chenyin_h
NXP Employee

Hello, @seungtake 

Thanks for your reply

Which memcpy implementation are you using? Is it from a library, or did you implement it yourself?

What optimization level did you use to build the libraries and/or your test application?

BR

Chenyin

594 Views
seungtake
Contributor II
Hi, @chenyin_h

I'm using the memcpy implementation from the GCC library.
GCC version: gcc-10.2.0-Earmv7GCC-eabi (downloaded from the NXP site)
Optimization level: -Os (optimize for code size)

Thank you.
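
One experiment that might be worth trying, since the addresses and size are 8-byte aligned, is to compare the library memcpy against a plain 64-bit word copy. The sketch below is only an illustration: `copy_words64` is a hypothetical name, and whether it is any faster depends on how the toolchain's memcpy was built and on the memory attributes of the destination, neither of which is confirmed in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n_words 64-bit words from src to dst.
 * Assumes both pointers are 8-byte aligned and the regions do not overlap. */
static void copy_words64(uint64_t *dst, const uint64_t *src, size_t n_words)
{
    size_t i = 0;

    /* Main loop, unrolled by four to reduce loop overhead. */
    for (; i + 4 <= n_words; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }

    /* Remaining zero to three words. */
    for (; i < n_words; i++) {
        dst[i] = src[i];
    }
}
```

Note that GCC can recognize loops like this and replace them with a call to memcpy. If the generated code shows that happening, building the file with `-fno-tree-loop-distribute-patterns` should keep the loop as written.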

578 Views
chenyin_h
NXP Employee

Hello, @seungtake 

Thanks for your information.

I will check whether any additional benchmarks have been done, or could be done, for this SRAM performance query. It may take some time, and I will come back if I find any updates.

Thanks for your understanding.

BR

Chenyin  


716 Views
chenyin_h
NXP Employee

Hello, @seungtake 

Thanks for your post.

Are you using S32G2 or S32G3? What frequency is your M7 set to?

BR

Chenyin


710 Views
seungtake
Contributor II
Hi, Chenyin

I am currently using the S32G399ARDB3 reference board. I am using the device tree provided in BSP44.0 and starting the M7 core through startm7 in u-boot. Could you please advise how I can check the M7 clock frequency?

Thank you.

601 Views
seungtake
Contributor II
Hi, @chenyin_h

Here is my clock configuration in U-Boot.
The M7 clock appears to be set to 396618429 Hz.

Thank you.
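
To put the numbers from this thread in perspective, the copy time can be expressed as core cycles per byte. The sketch below assumes the "about 3 MB" payload is exactly 3 * 1024 * 1024 bytes and uses the 172 ms timer result together with the 396618429 Hz M7 clock. The function names are hypothetical.

```c
#include <stdint.h>

/* Core cycles spent per byte copied, scaled by 100 so that integer math
 * keeps two decimal places (2168 means 21.68 cycles per byte). */
static uint64_t cycles_per_byte_x100(uint64_t elapsed_us, uint64_t core_hz, uint64_t bytes)
{
    uint64_t cycles = elapsed_us * core_hz / 1000000ULL;
    return cycles * 100ULL / bytes;
}

/* Average copy throughput in bytes per second. */
static uint64_t bytes_per_second(uint64_t elapsed_us, uint64_t bytes)
{
    return bytes * 1000000ULL / elapsed_us;
}
```

Under those assumptions, `cycles_per_byte_x100(172000, 396618429, 3145728)` returns 2168, that is about 21.7 cycles per byte or roughly 18 MB/s. A figure of that size would be consistent with narrow, non-burst accesses to non-cacheable memory, although that is an inference and not something confirmed in this thread.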

[uboot result]
=> clk dump
Rate Usecnt Name
------------------------------------------
40000000 0 |-- fxosc@40050000
51000000 0 |-- firc
32000 0 |-- sirc
20000000 0 |-- ftm0_ext
20000000 0 |-- ftm1_ext
125000000 0 |-- gmac0_ext_rx
125000000 0 |-- gmac0_ext_tx
50000000 0 |-- gmac0_rmii_ref
200000000 0 |-- gmac0_ext_ts
100000000 0 |-- serdes_100_ext
125000000 0 |-- serdes_125_ext
125000000 0 |-- serdes0_lane0_ext_cdr
125000000 0 |-- serdes0_lane0_ext_tx
125000000 0 |-- serdes0_lane1_ext_cdr
125000000 0 |-- serdes0_lane1_ext_tx
125000000 0 |-- serdes1_lane0_ext_cdr
125000000 0 |-- serdes1_lane0_ext_tx
125000000 0 |-- serdes1_lane1_ext_cdr
125000000 0 |-- serdes1_lane1_ext_tx
1 0 |-- pfe_mac0_rmii
1 0 |-- pfe_mac1_rmii
1 0 |-- pfe_mac2_rmii
1300000000 0 |-- a53
396618429 2 |-- serdes_axi
51000000 2 |-- serdes_aux
132206143 2 |-- serdes_apb
100000000 2 |-- serdes_ref
80000000 0 |-- ftm0_sys
40000000 0 |-- ftm0_ext
80000000 0 |-- ftm1_sys
40000000 0 |-- ftm1_ext
132206143 0 |-- flexcan_reg
132206143 0 |-- flexcan_sys
80000000 0 |-- flexcan_can
198309214 0 |-- flexcan_ts
62500000 0 |-- linflex_xbar
125000000 1 |-- linflex_lin
200000000 0 |-- gmac0_ts
125000000 0 |-- gmac0_rx_sgmii
125000000 0 |-- gmac0_tx_sgmii
125000000 1 |-- gmac0_rx_rgmii
125000000 1 |-- gmac0_tx_rgmii
125000000 0 |-- gmac0_rx_rmii
125000000 0 |-- gmac0_tx_rmii
125000000 0 |-- gmac0_rx_mii
125000000 0 |-- gmac0_tx_mii
396618429 1 |-- gmac0_axi
100000000 0 |-- spi_reg
100000000 0 |-- spi_module
132206143 0 |-- qspi_reg
132206143 0 |-- qspi_ahb
400000000 0 |-- qspi_flash2x
200000000 0 |-- qspi_flash1x
396618429 0 |-- usdhc_ahb
132206143 0 |-- usdhc_module
400000000 1 |-- usdhc_core
32000 0 |-- usdhc_mod32k
132206143 0 |-- ddr_reg
800000000 0 |-- ddr_pll_ref
800000000 0 |-- ddr_axi
396618429 0 |-- sram_axi
132206143 0 |-- sram_reg
132206143 0 |-- i2c_reg
132206143 0 |-- i2c_module
66103071 0 |-- siul2_reg
51000000 0 |-- siul2_filter
132206143 0 |-- crc_reg
132206143 0 |-- crc_module
130000000 0 |-- eim0_reg
130000000 0 |-- eim0_module
66103071 0 |-- eim123_reg
66103071 0 |-- eim123_module
66103071 0 |-- eim_reg
66103071 0 |-- eim_module
66103071 0 |-- fccu_module
51000000 0 |-- fccu_safe
66103071 0 |-- rtc_reg
32000 0 |-- rtc_sirc
51000000 0 |-- rtc_firc
132206143 0 |-- swt_module
51000000 0 |-- swt_counter
132206143 0 |-- stm_module
132206143 0 |-- stm_reg
132206143 0 |-- pit_module
132206143 0 |-- pit_reg
396618429 0 |-- edma_module
396618429 0 |-- edma_ahb
80000000 1 |-- sar_adc_bus
66103071 0 |-- cmu_module
66103071 0 |-- cmu_reg
132206143 0 |-- tmu_module
132206143 0 |-- tmu_reg
132206143 0 |-- flexray_reg
0 0 |-- flexray_pe
66103071 0 |-- wkpu_module
66103071 0 |-- wkpu_reg
66103071 0 |-- src_module
66103071 0 |-- src_reg
66103071 0 |-- src_top_module
66103071 0 |-- src_top_reg
132206143 0 |-- ctu_module
80000000 0 |-- ctu_ctu
198309214 0 |-- dbg_sys4
396618429 0 |-- dbg_sys2
396618429 0 |-- m7
132206143 0 |-- dmamux_module
132206143 0 |-- dmamux_reg
650000000 0 |-- gic_module
132206143 0 |-- mscm_module
132206143 0 |-- mscm_reg
132206143 0 |-- sema42_module
132206143 0 |-- sema42_reg
66103071 0 |-- xrdc_module
66103071 0 |-- xrdc_reg
0 0 |-- clkout0
0 0 |-- clkout1
99154607 0 |-- usb_mem
32000 0 |-- usb_low
0 0 |-- pfe0_rx_sgmii
0 0 |-- pfe0_tx_sgmii
0 0 |-- pfe0_rx_rgmii
0 0 |-- pfe0_tx_rgmii
0 0 |-- pfe0_rx_rmii
0 0 |-- pfe0_tx_rmii
0 0 |-- pfe0_rx_mii
0 0 |-- pfe0_tx_mii
0 0 |-- pfe1_rx_sgmii
0 0 |-- pfe1_tx_sgmii
0 0 |-- pfe1_rx_rgmii
0 0 |-- pfe1_tx_rgmii
0 0 |-- pfe1_rx_rmii
0 0 |-- pfe1_tx_rmii
0 0 |-- pfe1_rx_mii
0 0 |-- pfe1_tx_mii
0 0 |-- pfe2_rx_sgmii
0 0 |-- pfe2_tx_sgmii
0 0 |-- pfe2_rx_rgmii
0 0 |-- pfe2_tx_rgmii
0 0 |-- pfe2_rx_rmii
0 0 |-- pfe2_tx_rmii
0 0 |-- pfe2_rx_mii
0 0 |-- pfe2_tx_mii
300000000 1 |-- pfe_axi
300000000 0 |-- pfe_apb
600000000 1 |-- pfe_pe
200000000 0 |-- pfe_ts
80000000 0 |-- llce_can_pe
198309214 0 |-- llce_sys
80000000 0 `-- llce_per