SDMA data corrupt under CPU load

tarik-dzananovic-sn · ‎12-19-2022

Hi all,

Could somebody help me with the following problem?

We are using the iMX8M Nano Quad Core as our main board processor.
It is running Linux version imx_5.4.70_2.3.0.

After running for an indefinite period of time, we see that our SPI channels (ecspi1 and ecspi2) receive corrupt RX data (either overwritten or shifted; it fails the checksum check). If we put the CPU under load (simply by running the Linux "yes" command), the errors occur more often.

As an experiment we also tried updating our Linux kernel to imx_5.15.77_2.1.0, but we still see the error. Since it affects both SPI channels, and the logs we added to our kernel drivers show that the data is already corrupt after the spi_sync call, we think that SDMA somehow corrupts the data. Have you had a similar situation, or do you have a suspicion about what could be wrong?

Thanks in advance.

Regards
//Tarik

tarik-dzananovic-sn · ‎12-19-2022

Hi Christian,

First of all, thanks for your reply. Our problems seem similar, but I am not quite sure that they are the same.

Yes, there are multiple peripherals connected (3 UARTs: the 1st for the serial console when connected, the 2nd for programming the auxiliary processor when a firmware update is needed, the 3rd not actively used). We haven't seen issues on UART either (but it is neither as frequently used nor as essential as SPI), so I can't say for sure that there are no problems, but we haven't seen anything that would draw our attention.

I have never received any errors related to SPI timeout or similar; the transfer always returned success, but under certain conditions (mentioned in my previous post) the data gets corrupted. The SPI statistics structure under spi_device looks completely regular: all of the statistics (messages, transfers, timedout...) seem OK. As for the SPI RX data, there can be zeros, but also some random byte that overwrites a very large chunk of the buffer. For example, here is how it looks:

Valid SPI RX buffer:

80 00 01 01 02 00 00 00 00 78 00 FB 97 FF 17 FF FF FF FF 0A 00 07 00 00 00 00 00 78 0A 83 02 3E 03 46 03 06 00 0D 00 08 00 00 00 F8 05 00 00 00 00 0A 00 09 00 0A 00 F7 05 A9 02 0C 0C F8 0B B4 02 D5 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 64 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 01 00 01 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 58 52

We can see a header, some useful data, and checksum bytes at the end.

Examples of invalid RX buffers have the following output:

Example a) - Random byte 7B overwrites correct data, and the data is shifted

7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 7B 84 00 01 07 02 00 00 00 00 7C 00 08 11 00 00 06 46 00 00 75 D6 D6 00 E4 02 5E E1 01 00 A4 00 15 DF 1A 00 D4 05 25 7E 02 00 34 00 16 C9 F5 03 3C 01 00 00 00 00 64 02 F5 02 1E 00 A4 00 BB 0B 07 00 5C 02 7E 12 00 00 EC 00 00 00 00 00 00 00 58 5A 1B 00 59 5A 1B 00 0C 8C 00 00 05 46 00 00 BF F7 AB 00 DE 1A 00 00 09 00 00 00 06 46 00 00 C3 40 36 00 06 46 00 00 C1 40 36 00 0B 3B 02 00 AE 1E 00 00 AA 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Example b) - Zeros overwrite correct data, and the data is shifted

00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c0 00 00 00 80 00 01 01 02 00 00 00 00 78 00 f3 f7 ff 1f ff ff ff ff 09 00 06 00 00 00 00 00 7b 0a 77 02 3b 03 4a 03 06 00 08 00 0c 00 00 00 f6 05 06 00 12 0b 03 00 06 00 0a 00 ec 05 b0 02 09 0c fb 0b 52 06 d8 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 64 00 00 00 01 00 64 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 01 00 01 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

So we can see that the data is shifted, and sometimes a random byte overwrites part of our buffer (7B in this example), but it is a different byte each time.
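
One way to quantify the shift in dumps like these is to measure the leading run of a repeated fill byte and then search for the bytes that open a valid frame. The following is only a minimal sketch: the helper names and the shortened sample buffer are made up for illustration, and it assumes the opening bytes of a valid frame are known in advance.

```python
def leading_fill_run(buf: bytes) -> tuple[int, int]:
    """Return (fill_byte, run_length) for the run of identical bytes at the start."""
    if not buf:
        return (0, 0)
    fill = buf[0]
    run = 1
    while run < len(buf) and buf[run] == fill:
        run += 1
    return (fill, run)


def find_shift(buf: bytes, header: bytes) -> int:
    """Return the offset at which `header` first appears, or -1 if it is absent."""
    return buf.find(header)


# Hypothetical, shortened sample: four fill bytes followed by a frame start.
sample = bytes.fromhex("7B 7B 7B 7B 80 00 01 01 02 00")
header = bytes.fromhex("80 00 01 01")  # assumed start of a valid frame

fill, run = leading_fill_run(sample)
shift = find_shift(sample, header)
print(f"fill=0x{fill:02X} run={run} shift={shift}")
```

The two numbers need not agree. In example b) a few non-zero bytes (c0 00 00 00) sit between the zero run and the 80 00 01 01 start, so logging both values per corrupt frame may help to see whether the shift length follows any pattern.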

Regards
//Tarik

Sanket_Parekh · ‎01-04-2023

Hi @tarik-dzananovic-sn

I hope you are doing well.

Please accept my apologies for the delayed answer.

Please check that the transfer type is 7 ["MCU domain CSPI"] in the device tree.

=>dmas = <&sdma x 7 x>;

This change may help you in case of data corruption.

Please refer to fsl-imx-sdma.txt for more information.

We would also suggest going through the known limitations of the chip and the given workarounds:

ERR009535, ERR009606 and ERR009165 from Errata IMX8MN_0N14Y.PDF

Thanks & Regards

Sanket Parekh

tarik-dzananovic-sn · ‎01-10-2023

Hi Sanket,

I have looked at the errata and the device tree configuration (our DMA transfer type was set to 7 for ECSPI1, as expected). All of the registers seem to have the expected values.

To mention again: this issue happens under heavy CPU load and it happens sporadically (so it is not something that puts the system in a bad state), but we are still wondering why this data gets corrupted (and who corrupts it). I have also tried to check (and even change) the ECSPI1 registers that Christian mentioned, but all of them still look in order. The one exception is that in a test read the RX register sometimes has a value different from 0 in cases where the buffer gets corrupted. For example, if the last n bytes of our SPI message were this corrupt B7 byte, then the test read would also read B7 from the ECSPI1 RX register.
The SPI clock is set to 6.25 MHz and we are polling data every 5 ms. I have also tried changing these values, but it doesn't help.
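
For orientation, the time one message occupies the bus can be estimated from these numbers. This is a rough sketch only: the 512-byte message length is an assumption (the thread does not state it), and gaps between words and chip-select overhead are ignored.

```python
def transfer_time_us(num_bytes: int, sclk_hz: float) -> float:
    """Ideal time on the wire in microseconds, ignoring gaps and chip-select overhead."""
    return num_bytes * 8 / sclk_hz * 1e6


SCLK_HZ = 6.25e6        # SPI clock, from the post
POLL_PERIOD_US = 5000   # 5 ms polling interval, from the post
MSG_BYTES = 512         # assumption: the message length is not given in the thread

t_us = transfer_time_us(MSG_BYTES, SCLK_HZ)
utilisation = t_us / POLL_PERIOD_US
print(f"{t_us:.2f} us per message, {utilisation:.1%} of the polling period")
```

Under these assumptions a message needs roughly 655 us, about 13% of the 5 ms period, so the bus itself would not be saturated and a polling delay alone would not obviously explain the corruption.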

So my assumption would be that under heavy load our SPI driver can't keep up (can't acquire CPU time) and polling is delayed; that would make sense to me. What doesn't make sense is this occasional data corruption (especially since we are using DMA). Who could corrupt this data?

Just a note after looking at the data: it does seem that our receive buffer is always shifted (when the data is corrupt, of course). So there is always a part with the data we are expecting and a part with useless (garbage) data.

Regards
//Tarik

ceggers · ‎01-13-2023

Hi Tarik,

when testing without (S)DMA, ECSPIx_CONREG::SMC should be zero, that means that the SPI controller is operated in "stop-and-go" mode. A (larger) transfer is divided into chunks of FIFO size. After filling the TX FIFO by the CPU, the transfer is started by setting ECSPIx_CONREG::XCH to 1. After the transfer is completed, the CPU reads the received data from the RX FIFO (the controller doesn't transfer any data at this time).

During my own investigations in the past, I checked ECSPIx_TESTREG::RXCNT after each read from ECSPIx_RXDATA. In case of the "inserted RX bytes error", I noticed that RXCNT didn't decrease for those reads. That means a read from ECSPIx_RXDATA returned a wrong value (zero in my case) and RXCNT wasn't decremented by one. With this finding I was able to develop countermeasures in the ecspi driver and in the SDMA firmware (checking RXCNT after each read of RXDATA, and reading RXDATA again if RXCNT didn't decrease).
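
The re-read idea can be illustrated with a small simulation. This is a toy model, not the actual driver or SDMA firmware change: FlakyRxFifo, its failing_reads parameter and checked_read are hypothetical names, and the sample values are made up.

```python
class FlakyRxFifo:
    """Toy model of an RX FIFO in which selected reads are not executed."""

    def __init__(self, data, failing_reads):
        self._data = list(data)
        self._failing_reads = set(failing_reads)  # indices of read calls that fail
        self._reads = 0

    def rxcnt(self) -> int:
        """Stand-in for reading ECSPIx_TESTREG::RXCNT."""
        return len(self._data)

    def read_rxdata(self) -> int:
        """Stand-in for reading ECSPIx_RXDATA; a failed read returns 0 and pops nothing."""
        index = self._reads
        self._reads += 1
        if index in self._failing_reads or not self._data:
            return 0
        return self._data.pop(0)


def checked_read(fifo, max_retries=8) -> int:
    """Read one entry, repeating the read while RXCNT did not decrease."""
    before = fifo.rxcnt()
    for _ in range(max_retries):
        value = fifo.read_rxdata()
        if fifo.rxcnt() < before:
            return value
    raise RuntimeError("RXCNT did not decrease after repeated reads")


expected = [0x80, 0x11, 0x22, 0x33, 0x44]  # made-up frame

# Naive reads: zeros get inserted and valid bytes stay behind in the FIFO.
naive_fifo = FlakyRxFifo(expected, failing_reads={1, 3})
naive = [naive_fifo.read_rxdata() for _ in expected]

# Checked reads: failed reads are detected via RXCNT and repeated.
fifo = FlakyRxFifo(expected, failing_reads={1, 3})
received = [checked_read(fifo) for _ in expected]
print(naive, naive_fifo.rxcnt(), received == expected)
```

In this model the naive loop yields inserted zeros and leaves correct bytes in the FIFO, while the checked loop recovers the original sequence at the cost of one extra register read per entry.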

Maybe you want to check whether the same happens on your setup.

regards,
Christian

tarik-dzananovic-sn · ‎01-16-2023

Hi Christian,

The complete TESTREG has value 0 both before and after calling the spi_sync function from my SPI kernel driver.
The only thing worth noting is that if I add dev_err kernel prints inside my driver method (to check the TESTREG value periodically), I see an increase in my SPI checksum errors (I don't even have to put the CPU under load). I am not sure whether that is expected; I assume the print takes CPU time, causing wrong data in my SPI driver?

Regards
//Tarik

Sanket_Parekh · ‎01-16-2023

Hi @tarik-dzananovic-sn

I hope you are doing well

Please find the answer below.

Yes, I agree with you. Also there are chances of data corruption due to the below reason.

As we know, with DMA the data goes from the peripheral to memory over the bus, and without DMA the data goes from the peripheral to the CPU and from the CPU to memory over the bus. When the device is sending data slowly, or the bus becomes idle for any reason, the CPU might try to use that bus, as the CPU is under heavy load. If it succeeds, the chance of data corruption increases.

Thanks & Regards

Sanket Parekh

tarik-dzananovic-sn · ‎01-16-2023

Hi Sanket,

Yes, that explanation sounds reasonable enough. Are there any mitigation steps I can try to avoid this and ensure that checksum errors don't happen (that is, to prevent any influence on the SPI data), or is this something we have to live with?

Regards
//Tarik

tarik-dzananovic-sn · ‎01-13-2023

Thanks Christian,

as I remember, all of the registers I was checking (Conreg, ConfigReg, DMAReg, StatReg, TestReg) had normal (expected) values when this error occurred, but I will check once more and let you know if I find anything suspicious.

//Tarik

tarik-dzananovic-sn · ‎01-09-2023

Hi, thanks. Sorry for my late reply, I was out on vacation.

I will try the things you suggested and will let you know as soon as I have some results.

Regards
//Tarik

ceggers · ‎12-19-2022

Hi Tarik,

thanks for providing examples of the corrupted data.

It looks like your problem is different from mine, because I only had single, individual zero bytes inserted in the RX buffer (no groups of many bytes or non-zero values like in your examples). The "SPI Timeout" messages I got (in the kernel log) are only visible in kernels >= 5.10, if I remember correctly, as older kernels don't implement these error messages.

Anyhow, for further analysis I would recommend checking (e.g. with debug messages) whether ECSPIx_CONFIG, ECSPIx_CTRL and ECSPIx_TESTREG contain the correct values before/after each transfer. The FIFO counts in TESTREG should be zero before/after each transfer.
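
When logging raw TESTREG values, a small decoder makes the FIFO counts easier to read. The sketch below assumes TXCNT in bits 6:0, RXCNT in bits 14:8 and the loopback bit at bit 31; these positions are an assumption and should be checked against the reference manual for the specific part. The function names are made up.

```python
def decode_testreg(value: int) -> dict:
    """Split a raw TESTREG value into the assumed TXCNT/RXCNT/LBC fields."""
    return {
        "txcnt": value & 0x7F,         # assumed bits 6:0
        "rxcnt": (value >> 8) & 0x7F,  # assumed bits 14:8
        "lbc": (value >> 31) & 0x1,    # assumed bit 31 (loopback control)
    }


def fifos_empty(value: int) -> bool:
    """True if both FIFO counters read zero, as expected between transfers."""
    fields = decode_testreg(value)
    return fields["txcnt"] == 0 and fields["rxcnt"] == 0


# Made-up raw value: 3 entries in the RX FIFO, 2 in the TX FIFO.
print(decode_testreg(0x00000302), fifos_empty(0x00000302))
```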

regards,
Christian

ceggers · ‎12-19-2022

Hi Tarik,

maybe you hit the same problem I suffer from on the i.MX6ULL. I also have SPI problems but they manifest in SPI timeout messages (on newer kernels). After some investigation I came to the following findings:

- SPI problems can arise when multiple peripherals are used simultaneously
  - I didn't test whether this (also) depends on the CPU load
- I only saw these problems on SPI, but never on other peripherals (e.g. UART)
- Debugging into the "SPI timeout" messages showed that the ECSPI_CTRL and ECSPI_CONFIG registers often had an unexpected content.
- I had also corrupted RX data:
  - There were superfluous zeroes "inserted" into the RX buffer. At the end of the transfer, some (correct) bytes were left in the RX FIFO
- I tested with SDMA and also without SDMA, but the problem never really disappeared.

From my analyses I came to the conclusion that under "load conditions" single accesses to the SPI registers are not executed:

- write accesses are sometimes ignored (SPI register stays at the previous value)
- single read accesses return zero (although the register/FIFO contains a non-zero value)
  - RX FIFO count does not decrement in this case
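
A generic way to cope with ignored writes is to read the register back and repeat the write until the value sticks. The following is a toy simulation of that idea, not the actual driver modification: FlakyRegister and write_verified are hypothetical names, and the model assumes the read-back itself is reliable, which the findings above suggest is not guaranteed on real hardware. Read-back also only works for fields that hold their value, not for self-clearing bits.

```python
class FlakyRegister:
    """Toy model of a register in which selected write accesses are ignored."""

    def __init__(self, initial, ignored_writes):
        self.value = initial
        self._ignored = set(ignored_writes)  # indices of write calls that are dropped
        self._writes = 0

    def write(self, value: int) -> None:
        index = self._writes
        self._writes += 1
        if index not in self._ignored:
            self.value = value

    def read(self) -> int:
        return self.value


def write_verified(reg, value, max_retries=4) -> int:
    """Write, read back, and repeat until the value sticks; return the attempts used."""
    for attempt in range(1, max_retries + 1):
        reg.write(value)
        if reg.read() == value:
            return attempt
    raise RuntimeError("register did not accept the value")


reg = FlakyRegister(initial=0, ignored_writes={0, 1})  # first two writes are dropped
attempts = write_verified(reg, 0x1234)
print(attempts, hex(reg.read()))
```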

I am highly interested in whether you have the same behavior on i.MX8. For my use case I was able to program "workarounds" by modifying the Linux driver and the SDMA firmware (but with some loss of performance).

regards,
Christian