Arm specific implementation of memcpy

Solved


4,711 Views
gertvb
Contributor III

Good Day All!!

I am working on the LPC55S69 and MCUXpresso to program in C.

In my application I need to do quite a bit of block copying between arrays, e.g. copying around 893 uint32_t values from a source address to a destination address.

I am using the standard memcpy that comes with MCUXpresso, and it works fine, except that it takes quite a few clock cycles to do the action.

Is there a faster implementation or algorithm for block copying available than what is used in MCUXpresso?

My understanding is that the standard memcpy should be heavily optimised and is the fastest available.

Kind regards

Gert van Biljon

A TechExplorer working with Embedded Software and Electronics in Agriculture and Alternative Energy
1 Solution
4,705 Views
frank_m
Senior Contributor III

As long as the source and target addresses are 32-bit aligned and you copy 32-bit values, there is not much room for beating the compiler/optimizer. This is at least my experience.

This was the whole point of RISC architectures, which include ARM. The designers realised that memory access is relatively costly in terms of time, so the instruction set revolves around doing as much as possible in registers and making as few actual memory accesses as possible.

On another level, hardware-specific details come into play. The RAM of modern MCUs is often segmented and located on different busses with different access characteristics (i.e. DMA possible or not). You could place frequently accessed data on faster busses with fewer competing masters. You would need to consult the datasheet / reference manual of the MCU for details, and such optimisations are not very portable.

One could move larger blocks via DMA, if the setup effort does not negate the gains.

 


3 Replies
4,670 Views
loudpotato
Contributor I

Hi. Arm cores support the LDM/STM instructions to load/store a group of registers (e.g. 8). Loading 8 registers takes about 9 cycles, compared to 16 cycles with individual LDR or STR instructions, and the instructions can update the base address too. If your memcpy uses those, I don't think you can do any better on the CPU.

4,691 Views
converse
Senior Contributor V

The version of memcpy in newlib is pretty well optimised and is provided by ARM (just search for "arm memcpy newlib"). The only thing to be aware of is that (I think) newlib is often built optimised for size rather than speed. So, if performance is really that critical, use an optimised-for-speed version.
