RT1176 LPSPI QSPI delay time in between transfers

rhsalced · 03-05-2024

Hello,

I’m working with an RT1176 that interfaces with an FPGA. The ARM and the FPGA are connected via a QSPI interface, with the FPGA as the master and the ARM as the slave. The ARM side uses LPSPI1. I started from the SDK example project lpspi/polling_b2b_transfer/slave/cm7 and modified it to configure the LPSPI1 driver for QSPI.

I’ve successfully sent data between these devices and have moved on to evaluating the effective bandwidth.

The test I’m currently running sends one 32-bit word and then receives one 32-bit word, back to back, continuously. The effective bandwidth is 40 Mbps, so there is obviously downtime between transfers. I’m curious whether that downtime is variable. Since the FPGA is the master, it has no idea when the ARM is ready for the next write or read, so the FPGA could start the next transfer before the ARM is ready for it. For example, if the FPGA sends a write immediately followed by a read and the delay between the two is too short, the ARM might not be ready for the read.

The FPGA waits 600 ns after sending a write before sending the clock cycles for a read, then waits 500 ns before sending the next write.
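As a rough check on these numbers, the effective rate of one write/read pair is the 64 payload bits divided by the clocked time plus the two idle gaps. This is a minimal sketch assuming 4 data lines and 32-bit words, and it ignores CS setup/hold and any command, dummy, or turnaround clocks. The post does not state the SCK frequency, so `sck_hz` is a hypothetical parameter; only the 600 ns and 500 ns gaps come from the text above.

```c
/* Effective throughput (Mbit/s) of one write word followed by one read word
 * on a half-duplex bus. Assumes every SCK cycle moves `lanes` bits and that
 * the only overhead is the two idle gaps between transfers (no CS setup/hold,
 * no command or dummy cycles). sck_hz is a hypothetical input. */
double effective_mbps(double sck_hz, unsigned lanes, unsigned word_bits,
                      double gap_after_write_ns, double gap_after_read_ns)
{
    double clocks_per_word = (double)word_bits / lanes;
    double word_time_ns = clocks_per_word * 1e9 / sck_hz;

    /* One full cycle: write word, gap, read word, gap. */
    double cycle_ns = 2.0 * word_time_ns + gap_after_write_ns + gap_after_read_ns;
    double payload_bits = 2.0 * word_bits;

    return payload_bits / cycle_ns * 1e3; /* bits per ns -> Mbit/s */
}
```

For example, `effective_mbps(32e6, 4, 32, 600.0, 500.0)` returns about 40, and raising the hypothetical SCK to 100 MHz only lifts it to about 50.8 Mbit/s. A 32 MHz SCK is just one clock consistent with the measured 40 Mbps; the more useful takeaway is that the 1.1 µs of gaps dominates each cycle, so the gaps matter more than the clock rate.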

My question: is the downtime between transfers variable? For example, if the LPSPI drivers share resources and several of them are firing at once, could that lead to longer downtime between transfers? Again, the FPGA doesn’t know when the ARM is ready for the next transfer, so a deterministic delay between transfers is crucial for reliable data.

Thanks,

Ricardo

MultipleMonomials · 03-05-2024

Well, yes, the turnaround time needed is going to depend quite a lot on your application. The polling b2b transfer example works like this: it calls LPSPI_SlaveTransferNonBlocking(), which configures the LPSPI peripheral with the correct direction (Tx/Rx) and data length and then fills the FIFO with the data to send (in this case just one byte for Tx, or 0 bytes for Rx). Once the transfer happens, the LPSPI interrupt fires and sets the transfer-complete flag, which lets the main loop continue.

This means that the time until the next transfer depends on how quickly the CPU can take the interrupt and advance to the next LPSPI_SlaveTransferNonBlocking() call. Unfortunately, the Cortex-M7 is not a deterministic CPU, and its data and instruction caches add further non-determinism. So it's difficult to rely on the CPU to execute this code within any specific amount of time.

If you won't be using an RTOS and the code will only ever run this one loop, then you can kinda sorta get away with it. Use, e.g., a GPIO to measure how quickly the code gets to the next "while (!isTransferCompleted)" loop after a byte is transferred. If you measure this many times and take the maximum, that should be a reasonably good number.
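Once you have a set of latency samples (from scope captures of the GPIO or, as an alternative not mentioned above, from the Cortex-M7 DWT cycle counter converted to nanoseconds), the check itself is simple: take the worst case, pad it with a safety margin, and compare it against the shortest gap the FPGA provides (500 ns here). A minimal sketch; the function names, the margin, and the sample values in the usage note are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Worst-case (maximum) latency among n samples, in ns. Returns 0 if n == 0. */
uint32_t worst_case_ns(const uint32_t *samples_ns, size_t n)
{
    uint32_t worst = 0;
    for (size_t i = 0; i < n; i++) {
        if (samples_ns[i] > worst) {
            worst = samples_ns[i];
        }
    }
    return worst;
}

/* True if the measured worst case, inflated by margin_percent, still fits
 * inside the smallest inter-transfer gap the FPGA guarantees. The 64-bit
 * intermediate avoids overflow when padding large samples. */
bool fits_gap(const uint32_t *samples_ns, size_t n,
              uint32_t min_gap_ns, uint32_t margin_percent)
{
    uint64_t worst = worst_case_ns(samples_ns, n);
    uint64_t padded = worst * (100u + margin_percent) / 100u;
    return padded <= min_gap_ns;
}
```

With hypothetical samples {310, 295, 402, 330} ns, the worst case is 402 ns; a 20 % margin (482 ns) fits the 500 ns gap, but a 30 % margin (522 ns) does not. Keep in mind that a maximum over a finite number of runs is an estimate, not a guaranteed bound: a cache miss or an interrupt that never happened during measurement can still push a later transfer past it.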

However, if your MCU is doing literally anything else, such as running other RTOS threads or handling interrupts from any other peripherals, then you can forget about it. You will never be able to make the CPU react deterministically to something that happens in just a few hundred nanoseconds, as it can take a microsecond or more just for an RTOS context switch!

For a more reliable way of doing this kind of thing, I would highly recommend using DMA. The DMA controller can be configured to write data to the SPI Transmit Command FIFO and Transmit Data FIFO (thereby setting up a transaction), and then read the results from the Rx FIFO. If you have the flexibility to change the FPGA-side interface, that would make it easier, because you could (a) use regular SPI instead of QSPI or (b) use two different QSPI busses, one for Tx and one for Rx [assuming there are enough pins on the MIMXRT side, I'm not sure]. That would make the DMA setup way simpler as you could either run one bus as full duplex, or run two busses as half duplex each, instead of having to continually switch the direction of one bus back and forth, which is not so easy.
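To put the bus alternatives in perspective, here is the raw per-direction bit rate of each arrangement at the same SCK. This is a back-of-the-envelope sketch that assumes equal Tx and Rx traffic and ignores idle gaps and protocol overhead; the enum and function names are made up for illustration.

```c
/* Bus arrangements discussed above (names are illustrative only). */
typedef enum {
    BUS_SPI_FULL_DUPLEX,   /* one standard SPI bus: 1 bit out + 1 bit in per SCK */
    BUS_TWO_QSPI_SIMPLEX,  /* two QSPI buses, one dedicated Tx, one dedicated Rx */
    BUS_ONE_QSPI_SWITCHED  /* one QSPI bus alternating direction (current setup) */
} bus_layout_t;

/* Peak bit rate in each direction (Mbit/s) for a given SCK (MHz), assuming
 * balanced Tx/Rx traffic and no idle gaps or turnaround overhead. */
double per_direction_mbps(bus_layout_t layout, double sck_mhz)
{
    switch (layout) {
    case BUS_SPI_FULL_DUPLEX:
        return 1.0 * sck_mhz;
    case BUS_TWO_QSPI_SIMPLEX:
        return 4.0 * sck_mhz;
    case BUS_ONE_QSPI_SWITCHED:
        return 4.0 * sck_mhz / 2.0; /* 4 lanes shared in time between Tx and Rx */
    }
    return 0.0;
}
```

At a hypothetical 32 MHz SCK this gives 32, 128 and 64 Mbit/s per direction respectively. Plain SPI halves the peak rate of the current arrangement and the two-bus option doubles it; in exchange, neither needs the direction switch, and the switched figure still leaves out the turnaround gaps that dominate in practice.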

In my experience, DMA reaction time can generally be relied on to be under one microsecond (though even that can be non-deterministic, because the DMA shares a data bus with the CPU and sometimes has to stall). But if you run SPI as full duplex, the DMA can take advantage of the SPI's Tx and Rx FIFOs, so it won't need to react within a few hundred nanoseconds every time: the peripheral can buffer data in its FIFOs.
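The FIFO argument can be made concrete. If the FIFO holds N words, the DMA only has to keep up on average and can fall behind by up to the time the bus needs to clock N words through. A sketch, assuming a 16-word FIFO (check the RT1176 reference manual, e.g. the FIFO size fields of the LPSPI PARAM register, for the actual depth), 32-bit words, and a hypothetical 32 MHz SCK; watermark settings and inter-word gaps are ignored.

```c
/* Time (ns) for the bus to clock `fifo_words` words of `word_bits` bits
 * through `lanes` data lines at sck_hz. If the Tx FIFO starts full (or the
 * Rx FIFO starts empty), this is roughly how late the DMA can be before an
 * underrun (or overrun), ignoring watermarks and gaps between words. */
double fifo_slack_ns(unsigned fifo_words, unsigned word_bits,
                     unsigned lanes, double sck_hz)
{
    double clocks = (double)fifo_words * word_bits / lanes;
    return clocks * 1e9 / sck_hz;
}
```

Under those assumptions, a full 16-word FIFO on a single-lane full-duplex bus covers about 16 µs (`fifo_slack_ns(16, 32, 1, 32e6)`), whereas servicing one quad-SPI word at a time leaves only one word time, about 250 ns (`fifo_slack_ns(1, 32, 4, 32e6)`). That difference is why buffering relaxes the DMA's reaction-time requirement.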

Anyway, hope that is useful!