Memory to Memory DMA on i.MX RT 1020

mjbcswitzerland · ‎11-26-2019

Hi All

I have been doing some memory-to-memory DMA tests on the i.MX RT 1020 and found that I could transfer between OCRAM, ITC and DTC in any combination.

To see what would happen, I also tried transferring from QSPI Flash memory to one of these internal RAM areas, expecting that there would be an error. However, it also worked.

How is this possible? Does the DMA transfer attempt trigger a read from the QSPI Flash, stall until the data is ready, and then perform the transfer? Is it really as clever as that or is there another explanation?

I measured the time it took to perform a DMA transfer of 1024 bytes (long word transfer units) between the 4 areas:

OCRAM -> OCRAM                  26.36us
OCRAM -> ITC                    26.8us
OCRAM -> DTC                    24.36us

ITC -> OCRAM                    28.44us
ITC -> ITC                      26.36us
ITC -> DTC                      26.36us

DTC -> OCRAM                    28.44us
DTC -> ITC                      26.36us
DTC -> DTC                      26.36us

QSPI-Flash (125MHz) -> OCRAM    40.95us
QSPI-Flash (125MHz) -> ITC      38.93us
QSPI-Flash (125MHz) -> DTC      38.93us

QSPI-Flash (125MHz) -> QSPI-Flash (125MHz) 159.5us

All transfers resulted in an accurate copy of the source image at the destination after the transfer had completed, apart from the final case, where nothing changed in the QSPI-Flash; however, the DMA transfer did not terminate with an error.

Can anyone comment on the transfer speeds, and in particular on the QSPI Flash results (how does it even work?)? When I repeat the test the results are always exactly the same (the times include the time to set up the DMA transfer [around 15us] as well as the actual data copy). I have the Cortex-M7 core clocked at 500MHz, IPG at 125MHz and FlexSPI at 125MHz. Cache is disabled and code runs from QSPI-Flash.
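Since the quoted times include roughly 15us of DMA setup, one rough way to read the table is to subtract that overhead and divide by the number of long-word transfers. The Python sketch below does this for three of the rows. The 15us value is only the approximate figure quoted above, a long word is assumed to be 4 bytes, and the helper name `net_ns_per_long_word` is made up for illustration.

```python
# Rough estimate of the copy time per long word once setup is removed.
# Assumptions: setup is ~15 us (approximate figure from the post) and a
# long word is 4 bytes, so 1024 bytes = 256 transfers.

SETUP_US = 15.0          # approximate, not measured separately here
BLOCK_BYTES = 1024
LONG_WORD_BYTES = 4


def net_ns_per_long_word(total_us, setup_us=SETUP_US):
    """Return the estimated time per long-word transfer in nanoseconds."""
    transfers = BLOCK_BYTES // LONG_WORD_BYTES
    return (total_us - setup_us) * 1000.0 / transfers


# Total times (setup + copy) taken from the table above.
measured_us = {
    "OCRAM -> OCRAM": 26.36,
    "DTC -> OCRAM": 28.44,
    "QSPI-Flash -> OCRAM": 40.95,
}

for name, total in measured_us.items():
    print(f"{name}: {net_ns_per_long_word(total):.1f} ns per long word")
```

Because the setup time is only an estimate, these per-transfer figures are indicative at best; they are mainly useful for comparing rows against each other.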

Regards

Mark

Yuri · ‎11-28-2019

Hello,

Please refer to the app note "i.MX RT Series Performance Optimization", in particular section 3.2 (FlexSPI performance):

The FlexSPI supports eXecute-In-Place (XIP) on the connected NOR flash.

The following enhanced features of FlexSPI help to improve the performance.
• System cache (32 KB DCACHE and 32 KB ICACHE)
• AHB buffers: 8 x 64-bit TX AHB buffer and 128 x 64-bit RX AHB buffer

https://www.nxp.com/docs/en/application-note/AN12437.pdf

Have a great day,

Yuri


mjbcswitzerland · ‎11-27-2019

Hi All

Previously I was testing without cache, so I have now enabled caching (so that there are generally no, or very few, accesses to QSPI Flash for program code during general operation).

Also I have measured only the DMA transfer time (without the DMA setup).

I am intrigued by the DMA from QSPI Flash still, as shown by these two recordings:

1. I transfer 1024 bytes (as long word units) from DTC to ITC and measure 16.64us for the transfer (one long word transfer every 65ns, although the code to toggle the test output may be degrading the measured performance a bit).

On the QSPI bus I see chip selects (top line) and clocks (middle line) around the DMA transfer. Presumably the code that is executed needs to be fetched since it is not in cache.

2. Now I do the same from QSPI Flash to ITC (after the transfer I can verify that the ITC destination has the data that is in the QSPI Flash). I choose a location in QSPI Flash that has not been used by code or read before so that it is (hopefully) not cached.

This time the DMA transfer takes slightly longer at 21.36us but the QSPI Flash access via the FlexSPI is not really any different.
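To put the two measurements side by side, the short Python sketch below converts each 1024-byte transfer time into an average interval per long word. It assumes 4-byte long words (256 transfers per block); the function name `interval_ns` is hypothetical and only used for this illustration.

```python
# Average interval per long-word transfer for the two 1024-byte runs.
# Assumption: a long word is 4 bytes, so each run is 256 transfers.

LONG_WORD_BYTES = 4


def interval_ns(block_bytes, duration_us):
    """Return the mean time per long-word transfer in nanoseconds."""
    transfers = block_bytes // LONG_WORD_BYTES
    return duration_us * 1000.0 / transfers


dtc_to_itc = interval_ns(1024, 16.64)    # DTC -> ITC measurement
qspi_to_itc = interval_ns(1024, 21.36)   # QSPI Flash -> ITC measurement

print(f"DTC -> ITC:  {dtc_to_itc:.1f} ns per long word")
print(f"QSPI -> ITC: {qspi_to_itc:.1f} ns per long word")
print(f"QSPI source is slower by {(qspi_to_itc / dtc_to_itc - 1) * 100:.0f}%")
```

The first figure reproduces the 65ns per long word mentioned in point 1; the second works out to roughly 83ns.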

As noted, the content in that area of QSPI Flash is now in ITC but what actually happens during the DMA process? How can the QSPI Flash content be moved there if it hasn't been read via the FlexSPI bus?

I was originally expecting an attempted DMA from QSPI memory space to cause an error, but it has worked. Therefore I expected that it must be reading in the QSPI flash content via FlexSPI during the process, which would be visible and would slow down the transfer to at least an order of magnitude longer than the internal transfers. But again I can't see any of these accesses happening.

What is the explanation for this operation/behavior?

Regards

Mark

mjbcswitzerland · ‎11-27-2019

Hi All

New update.

I switched from transferring 1k blocks to 32kByte blocks and now I am seeing a difference:

DTC to OCRAM transfer of 32k taking about 513us.
I notice that shortly after the transfer starts the FlexSPI activity stops; the code is spinning waiting for the DMA controller to flag that it has completed, and that loop will be cached at this point, with no reason to access the QSPI Flash chip.

QSPI Flash to OCRAM transfer of 32k taking about 657us (some 144us, or +28% longer).
The interesting difference is that during the wait the QSPI Flash is being accessed, which presumably means that the data is being retrieved so that the DMA controller can then perform/complete the transfer.

This is more in line with expectations but the picture is not visible with short transfers. Also the speed reduction is not great.

In the memory-to-memory transfer case a data rate of around 65MByte/s is seen, and in the QSPI-to-memory case it drops to about 48MByte/s.
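The quoted rates can be cross-checked from the block size and the measured times. The Python sketch below does this under two assumptions, since the post does not say which applies: that "32k" means 32768 bytes, or that it means 32000 bytes. The helper `rate_mbyte_per_s` is a hypothetical name, and MByte/s is taken as 10^6 bytes per second.

```python
# Cross-check of DMA data rates for the 32k block transfers.
# Assumption: "32k" is either 32 * 1024 or 32000 bytes (not stated in the post).


def rate_mbyte_per_s(block_bytes, duration_us):
    """Bytes per microsecond equals MByte/s when 1 MByte = 10**6 bytes."""
    return block_bytes / duration_us


measurements_us = (
    ("DTC -> OCRAM (DMA)", 513.0),
    ("QSPI Flash -> OCRAM (DMA)", 657.0),
)

for label, duration_us in measurements_us:
    for block_bytes in (32 * 1024, 32_000):
        rate = rate_mbyte_per_s(block_bytes, duration_us)
        print(f"{label}, {block_bytes} bytes: {rate:.1f} MByte/s")
```

Either reading of "32k" lands within a few MByte/s of the rounded figures above, so the exact block size does not change the conclusion.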

Now I am questioning the speed of memory-to-memory DMA transfers, so I repeated the test, but instead of doing long word DMA transfers I did byte-for-byte CPU (memcpy()) transfers:

OCRAM to OCRAM takes 240us (133MByte/s)

QSPI Flash to OCRAM takes 527us (60MByte/s)

Again it can be seen that data from the QSPI Flash is accessed during the process (via the FlexSPI bus).

Since I have 125MHz FlexSPI speed this doesn't seem unrealistic.

But it does show that it is faster to do a simple memcpy() than to do a memory-to-memory transfer via DMA. At the moment I use the same code that was used in Kinetis projects, where DMA had a good speed advantage over memcpy(), so this looks like it needs to be investigated: is it a restriction that can be overcome in the i.MX RT?
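The gap between the two methods can be expressed as a simple ratio of the measured 32k times. The Python sketch below is only an illustration of that arithmetic; `dma_to_memcpy_ratio` is a made-up helper name, and note that the internal-RAM DMA figure was measured from DTC while the memcpy() figure was measured from OCRAM, so that pair is not a strictly like-for-like comparison.

```python
# Ratio of DMA time to memcpy() time for the 32k transfers reported above.
# A value above 1.0 means memcpy() was faster than DMA in that measurement.


def dma_to_memcpy_ratio(dma_us, memcpy_us):
    """Return how many times longer the DMA transfer took than memcpy()."""
    return dma_us / memcpy_us


cases = (
    # (label, DMA time in us, memcpy() time in us)
    ("internal RAM -> OCRAM", 513.0, 240.0),  # DMA from DTC, memcpy from OCRAM
    ("QSPI Flash -> OCRAM", 657.0, 527.0),
)

for label, dma_us, memcpy_us in cases:
    ratio = dma_to_memcpy_ratio(dma_us, memcpy_us)
    print(f"{label}: DMA took {ratio:.2f}x the memcpy() time")
```

On these numbers the DMA copy takes a little over twice as long as memcpy() between internal RAMs, and about a quarter longer when the source is QSPI Flash.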

Regards

Mark
