i.MX7D SPI PIO errors in newest kernel

mischo5500 · ‎01-12-2022

Hi,

I'm using i.MX7D with newest kernel 5.10.72. On our board, we have MCP3004 ADC connected to ECSPI2 and MAX98090 audio card connected to I2C3 and SAI1 interface.

There is kernel module present, that reads from ADC using IIO interface each 1s. Everything is working fine, until I play some audio (using aplay) multiple times in a row an then stop.There is a big chance, that SPI starts to throw errors on all reads after last play:

mcp320x spi1.0: I/O Error in PIO
mcp320x spi1.0: SPI transfer failed: -110
spi_master spi1: failed to transfer one message from queue

I was not able to reproduce it other way, but it can related to CPU load caused by playing too. There is no way to get it out of this state, only by causing SPI to go to suspend mode by unloading my module and reloading module after 2s (MXC_RPM_TIMEOUT), which causes disable/enable of SPI clocks, which seems to get SPI working back.

If I decrease MXC_RPM_TIMEOUT in spi-imx.c under ADC read interval in my module, problem never happens, because driver is always going to sleep between reads, but that is not very good solution (same if I revert commit 525c9e5a - "spi: imx: enable runtime pm support" which enables pm instead of clock enable/disable between each transfer).

My DTS:

&ecspi2 {
	pinctrl-names = "default";
	pinctrl-0 = <&pinctrl_ecspi2>;
	cs-gpios = <&gpio4 23 GPIO_ACTIVE_HIGH>;
	status = "okay";

	//ADC converter
	mcp3004: mcp3004@0 {
		compatible = "microchip,mcp3004";
		reg = <0>;
		spi-max-frequency = <22000>;
		spi-cpha;
		spi-cpol;
		#io-channel-cells = <1>;
	};
};

pinctrl_ecspi2: ecspi2grp {
	fsl,pins = <
		MX7D_PAD_ECSPI2_MISO__ECSPI2_MISO	0x2
		MX7D_PAD_ECSPI2_MOSI__ECSPI2_MOSI	0x2
		MX7D_PAD_ECSPI2_SCLK__ECSPI2_SCLK	0x2
		MX7D_PAD_ECSPI2_SS0__GPIO4_IO23		0x59
	>;
};

sound {
	...		
	assigned-clocks = <&clks IMX7D_AUDIO_MCLK_ROOT_SRC>,
			<&clks IMX7D_AUDIO_MCLK_ROOT_CLK>;
	assigned-clock-parents = <&clks IMX7D_PLL_AUDIO_POST_DIV>;
	assigned-clock-rates = <0>, <12288000>;
		
	simple-audio-card,cpu {
		sound-dai = <&sai1>;
	};

	dailink_master: simple-audio-card,codec {
		sound-dai = <&max98090>;
		clocks = <&clks IMX7D_AUDIO_MCLK_ROOT_CLK>;
	};
};

&sai1 {
	pinctrl-names = "default";
	pinctrl-0 = <&pinctrl_sai1>;
	assigned-clocks = <&clks IMX7D_SAI1_ROOT_SRC>,
			  <&clks IMX7D_SAI1_ROOT_CLK>;
	assigned-clock-parents = <&clks IMX7D_PLL_AUDIO_POST_DIV>;
	assigned-clock-rates = <0>, <36864000>;
	status = "okay";
};

pinctrl_sai1: sai1grp {
	fsl,pins = <
		MX7D_PAD_SAI1_MCLK__SAI1_MCLK			0x1f
		MX7D_PAD_SAI1_TX_BCLK__SAI1_TX_BCLK		0x1f
		MX7D_PAD_SAI1_TX_SYNC__SAI1_TX_SYNC		0x1f
		MX7D_PAD_SAI1_TX_DATA__SAI1_TX_DATA0	0x30
		MX7D_PAD_SAI1_RX_DATA__SAI1_RX_DATA0	0x1f
	>;
};

&i2c3 {
	//Audio codec
	max98090: max98090@10 {
		compatible = "maxim,max98090";
		reg = <0x10>;
		clocks = <&clks IMX7D_AUDIO_MCLK_ROOT_CLK>;
		clock-names = "mclk";
		
		pinctrl-names ="default";
		pinctrl-0 = <&pinctrl_max98090>;
		
		interrupt-parent = <&gpio1>;
		interrupts = <14 GPIO_ACTIVE_LOW>;
		#sound-dai-cells = <0>;
	};
};

pinctrl_max98090: max98090grp {
	fsl,pins = <
		MX7D_PAD_GPIO1_IO14__GPIO1_IO14		0x59 //Interrupt
	>;
};

pinctrl_i2c3: i2c3grp {
	fsl,pins = <
		MX7D_PAD_I2C3_SDA__I2C3_SDA			0x4000007f
		MX7D_PAD_I2C3_SCL__I2C3_SCL			0x4000007f
	>;
};

When SPI is stuck, I can see on osciloscope, that there are only chip selects happening (no data on CLK, MISO, MOSI), but it has strange shape. It is either inverted, or really long (can't determine without other signals), because active duration of chip select is longer, than deactivated duration and interval between reads should be 1s.

Have anybody encountered similar problem? It is really strange and I can't find route cause of it.

Thanks in advance,

Michal Špánik

mischo5500 · ‎01-20-2022

I think, that I found source of problem.

ECSPI2 is using clock IMX7D_SAI1_ROOT_CLK and audio card is using clock IMX7D_AUDIO_MCLK_ROOT_CLK for MCLK signal.

When I look into clk-imx7d.c there is

hws[IMX7D_ECSPI2_ROOT_CLK] = imx_clk_hw_gate4("ecspi2_root_clk", "ecspi2_post_div", base + 0x4790, 0);

hws[IMX7D_AUDIO_MCLK_ROOT_CLK] = imx_clk_hw_gate4("audio_mclk_root_clk", "audio_mclk_post_div", base + 0x4790, 0);

As you can see, both are using offset 0x4790 without setting common count, as it is e.g. for

hws[IMX7D_SAI1_ROOT_CLK] = imx_clk_hw_gate2_shared2("sai1_root_clk", "sai1_post_div", base + 0x48c0, 0, &share_count_sai1);
hws[IMX7D_SAI1_IPG_CLK] = imx_clk_hw_gate2_shared2("sai1_ipg_clk", "ipg_root_clk", base + 0x48c0, 0, &share_count_sai1);

I don't know, if IMX7D_AUDIO_MCLK_ROOT_CLK even should be at that address, there is no CCGR register defined for it in Clock tree in Refrence manual (see attached picture). CCGR121 has offset 0x4790, which is defined in clock file.

Problem is, that even from quick look, I can see same unhandled conflict for

hws[IMX7D_ECSPI3_ROOT_CLK] = imx_clk_hw_gate4("ecspi3_root_clk", "ecspi3_post_div", base + 0x47a0, 0);

hws[IMX7D_WRCLK_ROOT_CLK] = imx_clk_hw_gate4("wrclk_root_clk", "wrclk_post_div", base + 0x47a0, 0);

I can't find any usage of IMX7D_WRCLK_ROOT_CLK, but whole file needs to be definitely checked by somebody competent !!!

Is there any place to report bugs like this?

Thanks in advance,

Michal

AldoG · ‎01-20-2022

Hello,

Thank you for sharing your results, I’ll check with internal team and share your findings.

Best regards,
Aldo.

mischo5500 · ‎10-04-2022

Hi @AldoG ,

Thanks for passing data to the team.

I have just updated kernel to the latest Kirkstone version 5.15.52 and see, that dev team tried to fix the problem in https://github.com/nxp-imx/linux-imx/commit/4bd979ee9927ba06f9156eea57c5b9d5e4b748ff.

Problem is, that after update, audio stops to work completely. When I run "aplay audioFile.waw" command, it just hangs forever and does not play anything.

When I return the

hws[IMX7D_AUDIO_MCLK_ROOT_CLK]

gate, it works (with same problem, as before).

Regards,

Michal Špánik

ceggers · ‎01-19-2022

Hi Michal,

I also have "SPI I/O Errors" on my i.MX6ULL system.

What is different:

I don't use runtime power management
I don't need to unload the driver in order to get it working again.

What is similar:

I only get these errors when I/O on other peripherals is performed simultaneously.
I more often see these errors exactly when stopping other I/O.

I would be interested in the contents of the SPI registers at the time when the error occurs. Can you set a breakpoint on the error message and dump the registers? If you have no kernel debugger, printing the registers values directly after the error message using printk() should also work. Please try this several times (I would like to see whether there are different types of errors).

When the error happens on my system, at least one of the SPI control/config registers contains a wrong value. I had also cases where one byte less was written into the TX FIFO.

regards,
Christian

mischo5500 · ‎01-20-2022

Hi Christian,

Thanks for response

I'm attaching reg dump. I have tried it multiple times and regs were always same, only RXDATA is changing between attempts. Transfer length is 1 byte.

So it happens to you e.g. when you read file, or what kind of I/O?

I have figured, that I don't need to unload driver too, disabling/enabling of clk_per and clk_ipg between transfers is enough and PIO error never happens, but I'm afraid, that it can be working correctly only because of small 1 byte - 2 bytes transfers, so there is only small chance of hitting it in the middle of transfer.

For you, next transfer after I/O error is OK?

EDIT: I have tried to dump regs, when transmission is OK and here is result. It makes one transmission of length 1 and one of length 2 immediately. All regs are same as when I/O error is present, except CONREG. When there is no error CONREG = 7383281 for trasmission of 1 and CONREG = 15771889 for trasnsmission of 2. When there is error, CONREG = 15771889 for transmission of 1, But burst length is set correctly at the beginning of transmission!

[ 35.302341][ T157] RXDATA 254
[ 35.306227][ T157] TXDATA 0
[ 35.309126][ T157] CONREG 7383281
[ 35.312535][ T157] CONFIGREG 1048849
[ 35.316745][ T157] INTREG 0
[ 35.319642][ T157] DMAREG 0
[ 35.322529][ T157] STATREG 131
[ 35.326194][ T157] PERIODREG 0
[ 35.329349][ T157] TESTREG 0
[ 35.332324][ T157] MSGDATA 0
[ 35.335888][ T157] Transfer length 1

[ 35.347973][ T157] RXDATA 0
[ 35.350868][ T157] TXDATA 0
[ 35.353866][ T157] CONREG 15771889
[ 35.358167][ T157] CONFIGREG 1048849
[ 35.361844][ T157] INTREG 0
[ 35.364838][ T157] DMAREG 0
[ 35.368669][ T157] STATREG 131
[ 35.371824][ T157] PERIODREG 0
[ 35.375083][ T157] TESTREG 0
[ 35.378834][ T157] MSGDATA 0
[ 35.381814][ T157] Transfer length 2

EDIT2:

From my observations, when SPI because stuck, driver is unable to update CONREG register. That is the reason for wrong CONREG value, it remains set for 2 byte transfer from last working transfer. I can read registers correctly. It points to somehow disabled peripheral clocks maybe (that would explain, why enable/disable gets SPI to work correctly)? I don't know, if I would be able to read registers without clock though. Would print SPI clock registers, but don't know which ones, it is quite hard to find correct clock reg address without connected debugger

Best regards,

Michal

ceggers · ‎01-20-2022

Hi Michal,

thanks for taking the logs. Maybe you would like to use hexadecimal numbers next time...

> For you, next transfer after I/O error is OK?
yes

In spiLog.txt, "Transfer length" is always one, whilst CONREG is always 15771889 (0xF0A8F1), which means "16 bit transfer".

IMHO the "I/O Error" messages are always caused by timeouts waiting for an interrupt. So it is suspicious that INTREG is always 0, which means that interrupts are disabled (so no IRQs will ever happen).

> when SPI because stuck, driver is unable to update CONREG register.
On my system, I saw "unable to update" problems on different registers. Additionally, sometimes registers are read as 0x0, although the actual value is different (e.g. when reading a second time). The lead to wrong "read-modify-write" in my case.

Of course, this "register access problems" could be caused by a (temporarily) disabled clock. On i.MX6, all registers on the SPBA bus are clocked by the common SPBA clock (and not by the individual peripherals' clocks). I think that I already tried to capture possible interruptions of the SPBA clock via a write breakpoint (watchpoint) of the debugger, but without any success.

regards,
Christian

mischo5500 · ‎01-20-2022

Hi Christian,

Thanks for help, printing of registers directed me to the real problem, see my other response. It is caused by bug in clock driver for i.MX7 (audio mclk and ecspi2 shares clock gate without any protection).

You can check, if there is not similar problem with i.MX6 for peripherals, that causes problem to you (file is in drivers/clk/imx/clk-imx6??.c). Easiest way is probably to check CCGR registers for clocks required by peripheral, when problem occurs.

ceggers · ‎01-21-2022

Hi Michal,

You can check, if there is not similar problem with i.MX6 for peripherals, that causes problem to you (file is in drivers/clk/imx/clk-imx6??.c). Easiest way is probably to check CCGR registers for clocks required by peripheral, when problem occurs.

I assume that my problem has a different cause.

My register read/write problems are not permanent, they occur only on single, individual r/w accesses. The next access is successful then. For me it is very unlikely, that a (shared) clock is disabled only for such a short time period.
I didn't find shared clock gates in clk-imx6ul.c. I assume that for register access the SPBA bus clock is required (and not the ECSPI clock). But for both cases, I couldn't find reuse for the individual clock gate bits:

hws[IMX6UL_CLK_SPBA]		= imx_clk_hw_gate2("spba",		"ipg",		base + 0x7c,	12);
...
	hws[IMX6UL_CLK_ECSPI1]		= imx_clk_hw_gate2("ecspi1",	"ecspi_podf",	base + 0x6c,	0);
	hws[IMX6UL_CLK_ECSPI2]		= imx_clk_hw_gate2("ecspi2",	"ecspi_podf",	base + 0x6c,	2);

My assumption is, that one of the several bus bridges on the SOC is not able to handle the amount of traffic which is generated by simultaneous I/O.

regards,
Christian

AldoG · ‎01-18-2022

Hello,

I would like to know if in previous kernel version did this work fine?

Could you share the CPU load when the issue occurs?

Also, did you remove DMA from dts, since you're using PIO mode, but DMA is also enabled in dts, for our board this is enabled at imx7s.dtsi

https://source.codeaurora.org/external/imx/linux-imx/tree/arch/arm/boot/dts/imx7s.dtsi?h=lf-5.10.y#n...

Please try by remove/comment the following lines:

dmas = <&sdma 0 7 1>, <&sdma 1 7 2>;
dma-names = "rx", "tx";

Best regards,
Aldo.

mischo5500 · ‎01-19-2022

Hi Aldo,

thanks for response.

It works correctly in older kernels (probably before 5.10.9) before commit

commit 525c9e5a32bd7951eae3f06d9d077fea51718a6c
Author: Clark Wang <xiaoning.wang@nxp.com>
Date:   Mon Jul 27 14:33:54 2020 +0800
    spi: imx: enable runtime pm support
    Enable runtime pm support for spi-imx driver.
    Signed-off-by: Clark Wang <xiaoning.wang@nxp.com>
    Link: https://lore.kernel.org/r/20200727063354.17031-1-xiaoning.wang@nxp.com
    Signed-off-by: Mark Brown <broonie@kernel.org>

because it implements PM management, which removes enabling/disabling of clocks between each SPI transfer (clock reset alone unstucks spi from bad state), when time between transfers is lower than MXC_RPM_TIMEOUT. It handles it suspend/resume functions, when no transfer is present for 2s. Setting MXC_RPM_TIMEOUT to 0 puts driver to previous functionality and works for me, but I'm afraid, that it only masks real problem.

It happens me now commonly even after only one audio play, and CPU load is quite low, when it happens (image attached).

Driver works in PIO mode for me, because of transfer length check in spi_imx_can_dma function:

if (transfer->len < spi_imx->devtype_data->fifo_size)
    return false;

My transfers are only 1 and 2 bytes long, so it uses PIO (fifo_size is 64).

I have tried to completely remove DMA channels from dts, but it did not help.

I can try to capture SPI signals to logical analyzer, when problem occurs, if it would help.

Thanks in advance,

Michal Špánik

i.MX7D SPI PIO errors in newest kernel

i.MX7D SPI PIO errors in newest kernel

i.MX7Dual

Linux