SDMA channel 0 timeout causes and measures


Tomohiro
Contributor II

Dear NXP team,

I am using an i.MX8M Mini with Linux kernel version 5.4.47.

I am testing suspend and resume on my Linux system.
Sometimes, when Linux resumes, it prints the following message and the system stops.

imx-sdma 30bd0000.dma-controller: Timeout waiting for CH0 ready

My steps are as follows.
(1) Set the wake-up time with the "rtcwake" command.
(2) Suspend with the "systemctl suspend" command.
(3) On resume, the following message may be displayed (SDMA TIMEOUT):

   imx-sdma 30bd0000.dma-controller: Timeout waiting for CH0 ready

I checked the SDMA registers when the TIMEOUT occurred. They were as follows.
(Only the registers that differ between the TIMEOUT case and the NORMAL case, where it did not occur, are listed.)

Channel Stop/Channel Status (SDMAARM1_STOP_STAT), bit0(HE[0]):TIMEOUT=b1 / NORMAL=b0
Schedule Status (SDMAARM1_PSW), bit[15:13](NCP[2:0]):TIMEOUT=b111 / NORMAL=b000
OnCE Status Register (SDMAARM1_ONCE_STAT), bit[15:12](PST[3:0]):TIMEOUT=b0101 / NORMAL=b0110
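
The three fields listed above can be pulled out of raw register dumps with a few shifts and masks. The following is only a minimal sketch: the bit positions are taken from the list above, while the struct and function names are hypothetical and not part of any NXP or kernel API.

```c
#include <stdint.h>

/* Field positions as quoted in the post above. */
#define STOP_STAT_HE0        (1u << 0)  /* SDMAARM1_STOP_STAT, bit 0 */
#define PSW_NCP_SHIFT        13         /* SDMAARM1_PSW, bits [15:13] */
#define PSW_NCP_MASK         0x7u
#define ONCE_STAT_PST_SHIFT  12         /* SDMAARM1_ONCE_STAT, bits [15:12] */
#define ONCE_STAT_PST_MASK   0xFu

/* Hypothetical container for the decoded fields. */
struct sdma_snapshot {
    unsigned he0;  /* HE[0]: channel 0 still enabled/pending */
    unsigned ncp;  /* NCP[2:0] */
    unsigned pst;  /* PST[3:0] */
};

/* Decode the three fields from raw 32-bit register values. */
static struct sdma_snapshot sdma_decode(uint32_t stop_stat, uint32_t psw,
                                        uint32_t once_stat)
{
    struct sdma_snapshot s;

    s.he0 = (stop_stat & STOP_STAT_HE0) ? 1u : 0u;
    s.ncp = (psw >> PSW_NCP_SHIFT) & PSW_NCP_MASK;
    s.pst = (once_stat >> ONCE_STAT_PST_SHIFT) & ONCE_STAT_PST_MASK;
    return s;
}
```

Feeding it the register values read in the TIMEOUT and NORMAL cases and comparing the two resulting structs field by field gives the same difference list as above.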

I tried the following, but the TIMEOUT was not resolved.
I extended the timeout (500 us -> 10 s) in sdma_run_channel0() in the Linux kernel source (drivers/dma/imx-sdma.c).

   ret = readl_relaxed_poll_timeout_atomic(sdma->regs + SDMA_H_STATSTOP,reg, !(reg & 1), 1, 10000000); // default 500
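
Why a longer timeout does not help can be illustrated with a user-space model of the polling loop. This is a sketch only: `fake_reg` and both functions are hypothetical stand-ins, not kernel code, and the model counts polls instead of microseconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a status register: bit 0 clears after
 * `reads_until_clear` reads, or never if that value is negative. */
struct fake_reg {
    int reads_until_clear;
    int reads_done;
};

static uint32_t fake_reg_read(struct fake_reg *r)
{
    r->reads_done++;
    if (r->reads_until_clear >= 0 && r->reads_done > r->reads_until_clear)
        return 0x0;
    return 0x1;
}

/* Poll until bit 0 is clear or the poll budget is used up.
 * Returns true if the bit cleared in time. */
static bool poll_bit0_clear(struct fake_reg *r, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (!(fake_reg_read(r) & 1u))
            return true;
    }
    return false;
}
```

If the bit clears after a few reads, any reasonable budget succeeds. If the bit never clears, the loop fails for every budget, which matches the observation that raising the limit from 500 us to 10 s changed nothing.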

What could be causing this issue?
Also, is there a way to work around it, or a way to recover if it occurs?

Tomohiro
Contributor II

Dear Christian

Thank you for your answer.
This problem occurs even without using the JTAG debugger.

I have a JTAG debugger, so I checked the register values in the debugger to research this problem.

I checked SDMAARM_PSW::CCP/CCR and NCR when the TIMEOUT occurred. They were as follows.
(Comparison of the register values in the TIMEOUT case and the NORMAL case, where it did not occur.)

CCP,TIMEOUT=b1110 / NORMAL=b1110
CCR,TIMEOUT=b0000 / NORMAL=b0000
NCR,TIMEOUT=b0000 / NORMAL=b0000

CCP, CCR, and NCR are in the same state.
SDMA channel 0 appears unused before and after suspend.

I read the register values by setting breakpoints in the sdma_run_channel0 function at the following locations.
(1->: TIMEOUT, 2->: NORMAL)

>static int sdma_run_channel0(struct sdma_engine *sdma)
>{
>    int ret;
>    u32 reg;
>
>    sdma_enable_channel(sdma, 0);
>
>    ret = readl_relaxed_poll_timeout_atomic(sdma->regs + SDMA_H_STATSTOP, reg, !(reg & 1), 1, 500);
>    if (ret)
>1->     dev_err(sdma->dev, "Timeout waiting for CH0 ready\n");
>
>    /* Set bits of CONFIG register with dynamic context switching */
>2-> reg = readl(sdma->regs + SDMA_H_CONFIG);
>    if ((reg & SDMA_H_CONFIG_CSM) == 0) {
>        reg |= SDMA_H_CONFIG_CSM;
>        writel_relaxed(reg, sdma->regs + SDMA_H_CONFIG);
>    }
>
>    return ret;
>}

ceggers
Contributor V

Hi Tomohiro,

> This problem occurs even without using the JTAG debugger.

A normal JTAG debugger wouldn't halt the SDMA CPU. This can only happen if you use a JTAG debugger with special support for the SDMA (like mine).

> CCP,TIMEOUT=b1110 / NORMAL=b1110
> CCR,TIMEOUT=b0000 / NORMAL=b0000

That means that the bootloader channel (0) is currently active. This is not surprising in the NORMAL case, because you had just started this channel. It is interesting, though, that the bootloader is also the currently active SDMA channel in the TIMEOUT case. Note: CCP has only 3 bits.

> SDMA channel 0 appears unused before and after suspend.

CCP != 0 && CCR==0 means that the bootloader channel is active.

You may want to check whether the bootloader channel is already active before you try to run it. I think this cannot happen under normal circumstances.
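
The rule above (CCP != 0 && CCR == 0) can be written as a small predicate. This is a sketch under assumptions: the struct and function names are hypothetical, and the caller is expected to have extracted the CCP and CCR field values from SDMAARM_PSW already, so that no bit positions are assumed here.

```c
#include <stdbool.h>

/* Hypothetical holder for field values already extracted from SDMAARM_PSW. */
struct sdma_psw_fields {
    unsigned ccp;  /* current channel priority */
    unsigned ccr;  /* current channel number */
};

/* True if, by the rule quoted above, the bootloader channel (0) is the
 * currently active channel: CCP != 0 && CCR == 0. */
static bool sdma_ch0_is_current(struct sdma_psw_fields f)
{
    return f.ccp != 0 && f.ccr == 0;
}
```

A check like this could be evaluated on a PSW value read just before channel 0 is started, and a warning logged if it already returns true.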

regards,
Christian

Tomohiro
Contributor II

Dear Christian

> CCP != 0 && CCR==0 means that the bootloader channel is active.
>
> You may want to check whether the bootloader channel is already active before you try to run it. I think this cannot happen under normal circumstances.


I checked SDMAARM_PSW::CCP/CCR before SDMA initialization (during suspend).
SDMA channel 0 is active in both the TIMEOUT and NORMAL cases, with the following values:
CCP,TIMEOUT=b111 / NORMAL=b111
CCR,TIMEOUT=b0000 / NORMAL=b0000

A possible reason for SDMA being active is that I am making changes to the ATF and Linux kernel according to the following page to keep the M4 core running during suspend.
https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/M4-Low-Power-Demo-on-i-MX8MM/ta-p/110110...
(Currently applying 0001-atf-m4-run-for-5.4.24-kernel.patch and 0001-iMX8MM-GIR-wakeup-for-5.4.24-kernel.patch)

In particular, changing the PLL settings in ATF may have an effect.

ceggers
Contributor V

Hi Tomohiro,

> A possible reason for SDMA being active is that I am making changes to the ATF and Linux kernel according to the following page to keep the M4 core running during suspend.

Sorry, but I don't know what ATF is.

My main assumption is that all SDMA channels must be successfully stopped before the system goes into suspend mode. As the SDMA scheduling is cooperative, channels cannot really be stopped from the ARM side. They must give up their processing slice voluntarily.

When leaving suspend mode, it must be ensured that all bus/peripheral clocks are active again before any new DMA events arrive at the SDMA.

regards,
Christian

 

ceggers
Contributor V

Hi Tomohiro,

I have no solution for your problem, but maybe some hints:

I personally get those timeout errors when I stop the SDMA (with the JTAG debugger) without stopping the ARM CPU first. That means the ARM tries to load firmware/channel context into the SDMA, but the SDMA doesn't respond. In your case (I assume that you didn't stop the SDMA with a debugger) that could mean that the SDMA is stuck for some reason (e.g. bus fault or clocking problem).

Another possible reason for the timeout errors is that a previous transfer doesn't finish (within the timeout). This is also an indication that an SDMA transfer got stuck due to bus problems.

A (permanent) PST value of 0b0101 (Functional Unit) is also a strong indication of a bus problem. Unfortunately such errors are difficult to debug, as the SDMA cannot be halted (via JTAG) and investigated in that state.

I guess that such problems appear if an SDMA transfer is active when entering suspend mode. Stopping the internal buses may stall ongoing memory/peripheral accesses by a functional unit (probably forever). Post-mortem reading of SDMAARM_PSW::CCP/CCR could give you an indication of whether an SDMA transfer was running and which channel was in use. The current assignment of SDMA channels can be determined via sysfs.
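
One way to look at channel usage from user space is the dmaengine class directory. The sketch below assumes that the sysfs location meant above is /sys/class/dma and relies on the per-channel `in_use` attribute found there. The function name is hypothetical, and `in_use` only reports how many clients hold a channel, not which driver owns it.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Scan a dmaengine class directory (normally /sys/class/dma) and count
 * how many channels report a non-zero in_use value.
 * Returns the number of channels found, or -1 if the directory
 * cannot be opened. */
static int dma_channels_in_use(const char *class_dir, int *in_use_count)
{
    DIR *dir = opendir(class_dir);
    struct dirent *de;
    int total = 0;

    *in_use_count = 0;
    if (!dir)
        return -1;

    while ((de = readdir(dir)) != NULL) {
        char path[512];
        FILE *f;
        int users = 0;

        /* Channel entries are named like dma0chan0. */
        if (strncmp(de->d_name, "dma", 3) != 0)
            continue;

        snprintf(path, sizeof(path), "%s/%s/in_use", class_dir, de->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;

        if (fscanf(f, "%d", &users) == 1) {
            total++;
            if (users > 0) {
                (*in_use_count)++;
                printf("%s: in_use=%d\n", de->d_name, users);
            }
        }
        fclose(f);
    }

    closedir(dir);
    return total;
}
```

Calling `dma_channels_in_use("/sys/class/dma", &n)` shortly before suspending prints the channels that still have clients attached, which narrows down which transfers might be in flight.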

regards,
Christian