MCF5234 QSPI via DMA

sirlenzelot · ‎09-27-2022

Hello,

has anyone managed to get QSPI via DMA running? I am somewhat stuck.

This is what I need to do:

My QSPI runs on 6 addresses and sends and receives 16 bits of data on each.
Additionally I need to transfer (tx & rx) 128 bytes of data en bloc. I modified the endq and newq to address 7, which is set up for the right chip select. When I start the transfer I change the bit width from 16 bit to 8 bit.
I tried to send the data using a loop (which is way too slow) and using the QSPI IRQ to reload the data (also too slow). Now I'd like to switch to DMA transfer. Unfortunately the reference manual is rather silent on the combination of QSPI and DMA.

Any help or hint is highly appreciated.

Thanks in advance!

TomE · ‎09-28-2022

Devices which support DMA have DMA request bits. These are defined in "Table 14-2. DMAREQC Field Description (Continued)". The only devices that can request DMA are the four DMA Timers and the three UARTs. Nothing else can.

However, there's nothing stopping you from starting a "Dual Address DMA Transfer" to copy a block of data (the QSPI registers) to and from memory, triggered by software. But be aware that anything the DMA Controller can do, the CPU can do a lot faster. It takes longer to set up a DMA transfer (of say 16 words in the QSPI) than to have the CPU do the transfer itself, so it isn't worthwhile.

The QSPI supports a maximum of 16 bit transfers. That's limited by the "Receive RAM" being 16 bits. You can have longer "external" transfers like 32-bits by having the external chip-select remain asserted during two 16-bit transfers.

Does your "external 128 byte RX and TX transfer" have to be 8 bit transfers? Does the external device need chip-select deasserted every 8 bits? If it can handle longer transfers, you can do better, but assuming it can't...

The best you can do is to load up the Command and Transmit RAM for your 6-word transfer, and then do that. After that has finished, load up all 16 with 16 transfers (command and transmit data) to your block device and have it run once. Then interrupt, read the 16 bytes of data, write a new 16 bytes and kick it off again. Repeat 8 times. Or 4 times if you can send 16-bit SPI sequences. Then load up for the 6-word transfer devices and run that.
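The reload scheme above boils down to a simple chunking loop. Here is a minimal host-side sketch of it, not driver code: `run_queue_fn` is a hypothetical callback standing in for "write up to 16 Transmit RAM entries, start the QSPI, wait for completion, read the Receive RAM", and `loopback_queue` is only a stand-in for the hardware. On the real part each pass would be kicked off from the QSPI completion interrupt rather than from a blocking call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QSPI_QUEUE_DEPTH 16u /* the QSPI RAM holds 16 entries */

/* Hypothetical hook: run one queue of n (<= 16) 8-bit transfers. */
typedef void (*run_queue_fn)(const uint8_t *tx, uint8_t *rx, size_t n, void *ctx);

/* Move len bytes in queue-sized chunks; returns the number of queue runs. */
unsigned qspi_block_transfer(const uint8_t *tx, uint8_t *rx, size_t len,
                             run_queue_fn run_queue, void *ctx)
{
    unsigned runs = 0;
    assert(run_queue != NULL);
    while (len > 0) {
        size_t n = (len < QSPI_QUEUE_DEPTH) ? len : QSPI_QUEUE_DEPTH;
        run_queue(tx, rx, n, ctx); /* load, burst, read back */
        tx += n;
        rx += n;
        len -= n;
        runs++;
    }
    return runs;
}

/* Stand-in for the hardware: returns the inverted transmit data. */
void loopback_queue(const uint8_t *tx, uint8_t *rx, size_t n, void *ctx)
{
    (void)ctx;
    for (size_t i = 0; i < n; i++)
        rx[i] = (uint8_t)~tx[i];
}
```

With 8-bit transfers a 128-byte block takes eight passes through this loop; with 16-bit transfers the same loop over 64 words would take four.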

Before you say "too slow", this is a 150MHz CPU. It can perform those interrupts and reloads pretty quickly (as long as you've got the cache enabled). The slowest part of the whole thing is the unexpectedly long time it takes to read and write the individual RAM locations in the QSPI. They probably take 20 clocks each or so.

By "too slow" do you mean your "six devices" need servicing at something like 100kHz?

I've got code here that is using the QSPI to control three 8-channel ADCs, performing 96 conversions (interleaved data and a zero channel) at 4kHz. It has to reload the QSPI six times for each conversion. So that's "load/burst/load/burst..." and so on. Here's what that looks like:

You can see from the cursors that the 16-word burst took 23us. All six of them together (from another capture I have) took 153us. In the small gap between each of the six bursts, the CPU was interrupted, read 16 results, loaded 16 new transmit commands and started the next one. In about 3us. Meaning a 13% overhead over what a "perfect DMA transfer" could do. The SPI is running at 12.5MHz.
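The figures above can be cross-checked with a little arithmetic. This sketch assumes 16-bit words (not stated explicitly above) and treats the "about 3us" reload gap as exactly 3us; the function names are made up for illustration.

```c
#include <assert.h>

/* Time on the wire, in microseconds, for `words` transfers of `bits` bits each. */
double spi_wire_time_us(unsigned words, unsigned bits, double spi_hz)
{
    assert(spi_hz > 0.0);
    return (double)words * (double)bits / spi_hz * 1e6;
}

/* Total time for `bursts` bursts with one CPU reload gap between consecutive bursts. */
double total_time_us(unsigned bursts, double burst_us, double gap_us)
{
    assert(bursts > 0);
    return bursts * burst_us + (bursts - 1) * gap_us;
}
```

`spi_wire_time_us(16, 16, 12.5e6)` gives 20.48us of pure clocking, so the measured 23us burst presumably includes a little per-transfer delay on top. `total_time_us(6, 23.0, 3.0)` gives 153us, matching the capture, and 3us of reload per 23us burst is the 13% overhead quoted.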

On another product I had the QSPI handling one ADC device plus three MCP2515 CAN controllers. Those things need a huge amount of SPI transfers to handle the CAN protocol. It all worked fine. This CPU is really fast if you're programming it properly.

Tom


sirlenzelot · ‎09-29-2022

Hello Mike, hello Tom,
thanks for your replies.
After reading the manual over and over again, I finally decided not to use DMA, which (according to your answer, Tom) seems to be the right decision.

The target device can only handle 8-bit transfers, and it needs the chip select asserted across the whole 128 bytes.
I have now set up the QSPI to use 4 ports. Unfortunately the CS is deasserted after the 4 transfers. I still have to figure out which is the best way to handle the QSPI CS lines. Maybe I will disconnect them from the QSPI and handle them manually. I am also thinking of extending to all 16 ports.

That "too slow" came from my observation that I was not able to reload the QSPI RAM early enough in wrap mode before the next transfer started. So there was always old data on the bus.

Historically we have never dealt with cache. It was said that cache is very complicated and faulty, so it was kind of forbidden magic. Maybe I should have a look at that, too. Besides that, we run our software from SDRAM, which is also a handbrake. We need to handle the 128 bytes every 1 ms, in parallel with the other six 16-bit ports, which just run around in wrap mode and are evaluated on demand, which is also every 1 ms.

I apologize for my lack of knowledge. This is the first time I have to deal with the QSPI.

So thanks for help and the good suggestions.

best regards.

TomE · ‎09-29-2022

> Historically we have never dealt with cache.

If you don't have the cache on you completely cripple the CPU. The first time I programmed on a CPU with cache (50MHz MPC860) it ran SLOWER THAN A 16MHz MC68302 until we turned the instruction cache on.

Without the cache you're dropping from 150MHz to whatever the DRAM read time is. Which on these CPUs is probably about 9MHz (8 clocks to do one read at 75MHz). Do you really want to run the CPU 16 times slower (8 times slower on a 32-bit DRAM bus)?

The next (huge) problem is that the DRAM controller on these chips WILL NOT BURST unless you're reading into Cache [1]. So the CPU instruction fetches are probably one-at-a-time. With the instruction cache on it does "burst", so your instruction read speed is nearly 4 times faster (4 words or 16 bytes in 11 clocks). That's even more important if you only have a 16 bit wide RAM on the thing (I hope it is 32 bits wide).

Yes, managing a Data Cache is a little tricky. There are three ways to do it, "Small", "Properly" and "Cheating". The "proper" way needs all the cache flushing and invalidating for the DMA devices to work. Basically that's USB and Ethernet. The "Cheat" is to declare one memory region to be uncached in the cache registers, and declare all of the USB and Ethernet buffers to be in there. The "Small" way is to do all of your DMA to and from the internal 64k SRAM, which isn't cached but is really fast. That's REALLY where your CPU Stack should be too. The CPU runs a lot faster with the stack in the SRAM, even with the data cache enabled.

But you don't need any of this to use the Instruction Cache. You just load your code into DRAM and turn the cache on. Or turn the cache on and then load the code. That gets you a huge boost. I really hope you're doing that already.

There are also two ways to get the "Chip select asserted for the whole thing". The easiest is to run SPI the brain-dead way a lot of Linux drivers do. Using the hardware chip selects is "too hard to do generally" so the drivers default to using a GPIO as the chip-select. So it is "turn the GPIO on, perform the transfer through the SPI peripheral (in your case for 128 bytes) and then turn it off again". Perfectly fine for your case, but abysmal for a Linux driver when you're needing to perform transfers with the chip-select bouncing up and down for each byte. It runs terribly slowly then.
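As a rough illustration of the GPIO chip-select approach, here is a minimal sketch. Everything in it is hypothetical: `struct spi_port` and its two hooks are placeholders for whatever your board support code provides, and `struct trace` with its two functions is only a test double that records the call order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical board hooks; none of these names come from a real SDK. */
struct spi_port {
    void (*cs_set)(int asserted, void *ctx);   /* drive the GPIO chip select */
    void (*transfer)(const uint8_t *tx, uint8_t *rx,
                     size_t len, void *ctx);   /* clock len bytes, CS untouched */
    void *ctx;
};

/* Hold the GPIO chip select asserted across the whole block. */
void spi_block_with_gpio_cs(const struct spi_port *p,
                            const uint8_t *tx, uint8_t *rx, size_t len)
{
    assert(p && p->cs_set && p->transfer);
    p->cs_set(1, p->ctx);              /* turn the GPIO on */
    p->transfer(tx, rx, len, p->ctx);  /* e.g. eight 16-byte QSPI queue runs */
    p->cs_set(0, p->ctx);              /* turn it off again */
}

/* Test double: logs 'A' (assert), 'T' (transfer), 'D' (deassert). */
struct trace { char log[8]; size_t n; };

void trace_cs(int asserted, void *ctx)
{
    struct trace *t = ctx;
    if (t->n < sizeof t->log)
        t->log[t->n++] = asserted ? 'A' : 'D';
}

void trace_transfer(const uint8_t *tx, uint8_t *rx, size_t len, void *ctx)
{
    struct trace *t = ctx;
    (void)tx; (void)rx; (void)len;
    if (t->n < sizeof t->log)
        t->log[t->n++] = 'T';
}
```

The QSPI's own chip-select outputs have to be kept away from the device in this arrangement, since they would still toggle per transfer.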

The PROPER way to do the transfer is to read up on the "QCR:CONT" bit in the command RAM. If the CONT bit is set the chip select remains asserted. That's REALLY good as you can perform 16-bit transfers. So you only need to perform four 16-word transfers to get 128 bytes. You set CONT in all command locations for the first three, and all but the last Command RAM location for the last one. That will perform the transfer the way you want it done.
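A minimal sketch of filling the Command RAM that way follows. The bit positions (CONT in bit 15, BITSE in bit 14, chip-select levels in bits 11:8) are written from memory of the MCF523x command word layout, so treat them as assumptions and check them against the reference manual. `build_commands` and its parameters are made up for illustration, and the chip-select value is board-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed command-word layout; verify against the reference manual. */
#define QCR_CONT   0x8000u  /* keep chip select asserted after this transfer */
#define QCR_BITSE  0x4000u  /* use the size from QMR[BITS] instead of 8 bits */
#define QCR_CS(x)  ((uint16_t)(((x) & 0xFu) << 8))  /* chip-select pin levels */

/* Fill one queue of n command words. CONT is cleared only on the final
   entry of the final queue, so the chip select drops once, at the end. */
void build_commands(uint16_t cmd[16], size_t n, unsigned cs, int last_queue)
{
    assert(n >= 1 && n <= 16);
    for (size_t i = 0; i < n; i++) {
        uint16_t w = (uint16_t)(QCR_BITSE | QCR_CS(cs));
        if (!(last_queue && i == n - 1))
            w |= QCR_CONT;
        cmd[i] = w;
    }
}
```

For the 128-byte block this would be called with `last_queue` set to 0 for the first three 16-word queues and to 1 for the fourth.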

You can still run it "continuously converting" for the other 6 things if you like, but it would be easier to just run that once when you need it, then run the 128-byte transfer once. Repeat every millisecond.

Note 1: You should be able to leave the SDRAM Page open when bursting, reading one word per bus clock if reading sequentially. This CPU can't do that. It can only perform a maximum of 16-byte bursts (4 32-bit cycles). That drops the maximum read rate from what the SDRAM can support (300MB/s) to about 100 MB/s. That's with the instruction cache working.
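The two rates in Note 1 follow from the numbers quoted earlier in this post (75MHz bus, 32-bit wide SDRAM, 16 bytes in 11 clocks). A small sketch of the arithmetic, with a made-up helper name:

```c
#include <assert.h>

/* MB/s for an access pattern that moves `bytes` every `clocks` bus clocks. */
double mem_rate_mb_s(double bus_hz, unsigned bytes, unsigned clocks)
{
    assert(clocks > 0);
    return bus_hz * (double)bytes / (double)clocks / 1e6;
}
```

`mem_rate_mb_s(75e6, 4, 1)` is the open-page ideal of 300 MB/s, and `mem_rate_mb_s(75e6, 16, 11)` comes to roughly 109 MB/s, which is the "about 100 MB/s" figure.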

Tom

sirlenzelot · ‎10-04-2022

Hi Tom,

thanks for your reply.

Indeed we never enabled cache. Neither instruction nor data. Maybe I should have a look at the instruction cache at least.

For my SPI problem, I have switched to "braindead" mode for testing.

best regards

TomE · ‎10-04-2022

> Indeed we never enabled cache. Neither instruction nor data. Maybe I should
> have a look at the instruction cache at least.

What did you use as the basis for your code? What example projects did you start from, and why didn't they enable the instruction cache? Without that, you're running at about the same speed as a 1980's 68000.

I'd suggest you write some test code to see how fast the CPU can perform some basic operations, preferably in assembly so the compiler doesn't optimise your test code away. Then you'll have something to compare with the cache on and off. DON'T write code to toggle a GPIO pin as this takes a long time on this CPU (as the GPIO module runs on a slower clock). Do some "arithmetic" or spin in a loop or something.
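If assembly isn't convenient, something along these lines in C is a reasonable starting point. It is only a sketch: `read_ticks_fn` is a hypothetical hook for whatever free-running counter you have (a DMA timer, for example), and `fake_ticks` exists purely so the code can be exercised on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic-only busy loop. The volatile accumulator keeps the compiler
   from deleting the work; no peripheral such as GPIO is touched. */
uint32_t spin_arith(uint32_t iterations)
{
    volatile uint32_t acc = 0;
    for (uint32_t i = 0; i < iterations; i++)
        acc = acc * 3u + i;
    return acc;
}

/* Hypothetical tick source, e.g. a read of a free-running timer. */
typedef uint32_t (*read_ticks_fn)(void);

/* Ticks taken by spin_arith(iterations); unsigned subtraction survives
   a single counter wraparound. */
uint32_t time_spin(read_ticks_fn read_ticks, uint32_t iterations)
{
    assert(read_ticks != 0);
    uint32_t start = read_ticks();
    (void)spin_arith(iterations);
    return read_ticks() - start;
}

/* Fake tick source for host-side checks: advances by 10 per read. */
uint32_t fake_ticks(void)
{
    static uint32_t t;
    t += 10;
    return t;
}
```

Run `time_spin` with the cache off and then on and compare the tick counts. Note that the volatile accumulator forces loads and stores to the stack, so the result also depends on where the stack lives; an assembly loop that stays in registers isolates instruction fetch better.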

We're using gcc to compile our code, and are running our own "bare metal" code. The initial startup in our case is in the file "crt0.s". It is responsible for:

Set up the stack pointer
Set up RAMBAR (and make it dual ported)
Set up basic GPIOs
Set up FLASH Chip Selects properly (they start off really slow)
Configure the clock (set up the PLL and enable it)
Configure the SDRAM controller and the SDRAM control register
Enable the instruction cache
Zero BSS and FAST BSS (in SRAM)
Initialise the C Library
Jump to main()

You have to have code there somewhere that does most of the above. It should be initialising the Instruction Cache. Check to see that it isn't doing that already somewhere.

Here's all it takes to turn the cache on. I have no idea why that "wait a bit" section is in there. Our SDRAM starts at 0x80000000, hence that address being written to ACR1.

4:
  /* Setup cache */
  move.l #0x01000000,%d0
  movec %d0,%CACR /* invalidate cache */

  /* wait a bit */
  move.l #2000,%d0
10:
  nop
  sub.l #1,%d0
  jbne 10b

  move.l #0x00000000,%d0 /* set flash cache off */
  movec %d0,%ACR0
  nop
  move.l #0x8000C000,%d0 /* set sdram cache on */
  movec %d0,%ACR1
  nop
/*  move.l #0x80000602,%d0 */ /* I and D, default off, allow non-cache burst*/
  move.l #0x80400602,%d0 /* I only, default off, allow non-cache burst*/
  movec %d0,%CACR /* enable cache */
  nop

sirlenzelot · ‎10-05-2022

Hi Tom,

thanks for explanation.
Historically we come from the 68000. When I started working at the company they had just switched to ColdFire. In those days I asked about cache and it was "forbidden" to use. Over the years I did not even think about it. Shame on me.

We use Mentor Graphics' Nucleus as RTOS and 99% of our code is written in c++. The compiler is made by Mentor, too.

I was digging through the base init of the cpu and found some code for enabling the cache. After some modification the instruction cache is now enabled. As a result the code part I am actually working on now runs at double speed. I really need to do some deeper inspection and measurements.

As this is going slightly off-topic, I think we can close here.
QSPI via DMA is not possible.

Thanks again and best regards.

TomE · ‎10-06-2022

> We use Mentor Graphics' Nucleus as RTOS

That's good. They should handle the caches automatically and give documentation on how to use them. There may even be a "project based option" to build with cache support or not. If the OS loads executables into RAM there's a small risk of loading code "under" the instruction cache, so anything that does that should invalidate the instruction cache on code load. Nucleus should do that (as a configurable option).

They also run tasks/threads with separate stacks. So the stacks would be in SDRAM, probably in the Heap. That halves execution speed (versus the internal SRAM). But you're running a complicated OS that probably can't give you that option. I've measured my code running 1.8 times faster with the stack in SRAM versus being in CACHED SDRAM! So it makes a big difference.

> After some modification the instruction cache is now enabled. As a result the code
> part I am actually working on now runs at double speed.

I'd expect a factor of 10 at least. Double speed is what I'd expect if the Instruction Cache was already enabled and you turned on the Data Cache.

> I really need to do some deeper inspection and measurements.

I'd agree. Is your product using Ethernet? That (and the 4 internal DMA controllers) are the only things that would be affected by enabling the Data Cache. I'd expect Nucleus should be able to handle this, or should clearly document that they don't on this platform.

Tom
