Arm specific implementation of memcpy

Solved


4,711 Views
gertvb
Contributor III

Good Day All!!

I am working on the LPC55S69 and MCUXpresso to program in C.

In my application I need to do quite a bit of block copying between arrays, e.g. copying around 893 uint32_t values from a source address to a destination address.

I am using the standard memcpy that comes with MCUXpresso, and it works fine, except that it takes quite a few clock cycles to do the action.

Is there a faster implementation or algorithm for block copying available than what is used in MCUXpresso?

My understanding is that the standard memcpy should be heavily optimised and is the fastest available.

Kind regards

Gert van Biljon

A TechExplorer working with Embedded Software and Electronics in Agriculture and Alternative Energy
1 Solution
4,705 Views
frank_m
Senior Contributor III

As long as the source and target addresses are 32-bit aligned and you copy 32-bit values, there is not much room for beating the compiler/optimizer. This is at least my experience.

This was the whole point of RISC architectures, which include ARM. The designers realised that memory access is relatively costly in terms of time, so the instruction set revolves around doing as much as possible in registers and making as few actual memory accesses as possible.

On another level, hardware-specific details come into play. The RAM of modern MCUs is often segmented and located on different busses with different access characteristics (i.e. DMA possible or not). You could place frequently accessed data on faster busses with fewer competing masters. You would need to consult the datasheet / reference manual of the MCU for details, and such optimisations are not very portable.

One could move larger blocks via DMA, if the setup effort does not negate the gains.

 


3 Replies
4,670 Views
loudpotato
Contributor I

Hi. Arm cores support the LDM/STM instructions to load/store a group of registers (e.g. 8). Loading 8 registers takes about 9 cycles, compared to 16 cycles with individual LDR or STR instructions, and the instructions can update the base address too. If your memcpy uses those, I don't think you can do any better on the CPU.

4,691 Views
converse
Senior Contributor V

The version of memcpy in newlib is pretty well optimised and is provided by ARM (just search for "arm memcpy newlib"). The only thing to be aware of is that (I think) newlib is often built optimised for size rather than speed. So, if performance is really that critical, use an optimised-for-speed version.
