i.mx6sx linux-fslc sema4 bug

kubiznak_petr · ‎10-04-2017

How to reproduce

I hit a problem in linux-fslc 4.1.38 (the Freescale/linux-fslc kernel source tree on GitHub). It is related to the sema4 mutex locking mechanism (drivers/char/imx_amp/imx_sema4.c). The problem only arises when the M4 core is running and its RDC domain is set. I'm running a trivial FreeRTOS program based on the hello_world demo from NXP's official distribution (https://www.nxp.com/webapp/Download?colCode=FreeRTOS_MX6SX_1.0.1_WIN&appType=license&Parent_nodeId=1... ). The program simply sets the RDC domain and loops forever:

RDC_SetDomainID(RDC, rdcMdaM4, BOARD_DOMAIN_ID, false);
while (1) ;

With Yocto (Morty), I built the qt4e-demo-image (fslc-framebuffer distro). When I boot the image, the qtdemoE demo starts. The problem can be reproduced with intensive I2C communication. In my case, I tap the screen rapidly to generate touch events. My interrupt-driven touchscreen driver calls i2c_smbus_read_byte_data() to read the data.

What it does

After a while, the following message starts printing to the terminal endlessly:

drivers/char/imx_amp/imx_sema4.c -> _imx_sema4_mutex_lock 137 already locked, wait! num 6 val 1

The FreeRTOS program on the M4 core continues without problems. If I halt execution in the debugger and check the SEMA4_Gate06 register, its value is 1, indicating it is locked by the A9 core.

Workaround

After spending hours on this problem, I put together a test patch (see attached). In the _imx_sema4_mutex_lock() function, it disables the error log for the first 5 occurrences. As a result, instead of being stuck in the infinite error message loop, I just get a single message and everything seems to work fine:

286 Trying to unlock an unlock mutex.
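The kind of suppression described above can be sketched in plain C as follows. This is only an illustration and not the attached patch: busy_hits, QUIET_HITS, busy_hit_should_log() and report_busy() are hypothetical names, and the sketch ignores the locking a real driver would need around the counter.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical threshold: how many "already locked" hits stay silent. */
#define QUIET_HITS 5

/* Counts how often the "already locked" branch was taken (illustrative). */
static unsigned int busy_hits;

/*
 * Record one hit and report whether it should be logged.
 * Returns false for the first QUIET_HITS calls and true afterwards.
 */
static bool busy_hit_should_log(void)
{
	busy_hits++;
	return busy_hits > QUIET_HITS;
}

/* Stand-in for the branch in the lock path that found the gate held. */
static void report_busy(int gate_num, int gate_val)
{
	if (busy_hit_should_log())
		fprintf(stderr, "already locked, wait! num %d val %d. %u\n",
			gate_num, gate_val, busy_hits);
}
```

With this sketch, the first five calls to report_busy(6, 1) print nothing and the sixth prints a line ending in the hit count.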

Once that message has appeared 5 times (which might take a while), the system enters the infinite message loop. Including my added prints, the output looks as follows:

286 Trying to unlock an unlock mutex.
286 Trying to unlock an unlock mutex.
286 Trying to unlock an unlock mutex.
286 Trying to unlock an unlock mutex.
286 Trying to unlock an unlock mutex.
drivers/char/imx_amp/imx_sema4.c -> _imx_sema4_mutex_lock 145 already locked, wait! num 6 val 1. 6
CPU: 0 PID: 66 Comm: irq/203-crtouch Tainted: G           O    4.1.38-fslc-01914-gb3f76d9ae87e-dirty #12
Hardware name: Freescale i.MX6 SoloX (Device Tree)
[<800184e0>] (unwind_backtrace) from [<80013cfc>] (show_stack+0x20/0x24)
[<80013cfc>] (show_stack) from [<8091b2ec>] (dump_stack+0x80/0x94)
[<8091b2ec>] (dump_stack) from [<80406be8>] (_imx_sema4_mutex_lock+0x11c/0x1c4)
[<80406be8>] (_imx_sema4_mutex_lock) from [<80406d58>] (imx_sema4_mutex_lock+0xa4/0x128)
[<80406d58>] (imx_sema4_mutex_lock) from [<80025738>] (clk_gate2_do_shared_clks+0x78/0xe4)
[<80025738>] (clk_gate2_do_shared_clks) from [<800258b8>] (clk_gate2_disable+0x58/0x7c)
[<800258b8>] (clk_gate2_disable) from [<806b6838>] (clk_core_disable+0x64/0x1e4)
[<806b6838>] (clk_core_disable) from [<806b7630>] (clk_disable+0x34/0x40)
[<806b7630>] (clk_disable) from [<805d2eec>] (i2c_imx_xfer+0x194/0xe28)
[<805d2eec>] (i2c_imx_xfer) from [<805ce5a0>] (__i2c_transfer+0x160/0x678)
[<805ce5a0>] (__i2c_transfer) from [<805ceb2c>] (i2c_transfer+0x74/0xa0)
[<805ceb2c>] (i2c_transfer) from [<805cf1a4>] (i2c_smbus_xfer+0x590/0x9d8)
[<805cf1a4>] (i2c_smbus_xfer) from [<805cf7ec>] (i2c_smbus_read_byte_data+0x3c/0x4c)
[<805cf7ec>] (i2c_smbus_read_byte_data) from [<805c07d0>] (crtouch_ts_interrupt+0x90/0x14c)
[<805c07d0>] (crtouch_ts_interrupt) from [<8007e824>] (irq_thread_fn+0x2c/0x50)
[<8007e824>] (irq_thread_fn) from [<8007eb80>] (irq_thread+0x13c/0x188)
[<8007eb80>] (irq_thread) from [<800576b4>] (kthread+0xec/0x104)
[<800576b4>] (kthread) from [<8000fca8>] (ret_from_fork+0x14/0x2c)
l : 97284, 8852a240, clk_gate2_do_shared_clks+0x78/0xe4
lo: 97284, 8852a240, clk_pllv3_do_shared_clks+0x78/0xe4
k : 97284, 8852a240, clk_gate2_do_shared_clks+0x78/0xe4
ko: 97284, 8852a240, clk_pllv3_do_shared_clks+0x78/0xe4
u : 97282, 8852a240, clk_gate2_do_shared_clks+0xa4/0xe4
uo: 97180, 8852a240, clk_pllv3_do_shared_clks+0xa4/0xe4
drivers/char/imx_amp/imx_sema4.c -> _imx_sema4_mutex_lock 145 already locked, wait! num 6 val 1. 7
CPU: 0 PID: 66 Comm: irq/203-crtouch Tainted: G           O    4.1.38-fslc-01914-gb3f76d9ae87e-dirty #12
Hardware name: Freescale i.MX6 SoloX (Device Tree)
[<800184e0>] (unwind_backtrace) from [<80013cfc>] (show_stack+0x20/0x24)
[<80013cfc>] (show_stack) from [<8091b2ec>] (dump_stack+0x80/0x94)
[<8091b2ec>] (dump_stack) from [<80406be8>] (_imx_sema4_mutex_lock+0x11c/0x1c4)
[<80406be8>] (_imx_sema4_mutex_lock) from [<80406d88>] (imx_sema4_mutex_lock+0xd4/0x128)
[<80406d88>] (imx_sema4_mutex_lock) from [<80025738>] (clk_gate2_do_shared_clks+0x78/0xe4)
[<80025738>] (clk_gate2_do_shared_clks) from [<800258b8>] (clk_gate2_disable+0x58/0x7c)
[<800258b8>] (clk_gate2_disable) from [<806b6838>] (clk_core_disable+0x64/0x1e4)
[<806b6838>] (clk_core_disable) from [<806b7630>] (clk_disable+0x34/0x40)
[<806b7630>] (clk_disable) from [<805d2eec>] (i2c_imx_xfer+0x194/0xe28)
[<805d2eec>] (i2c_imx_xfer) from [<805ce5a0>] (__i2c_transfer+0x160/0x678)
[<805ce5a0>] (__i2c_transfer) from [<805ceb2c>] (i2c_transfer+0x74/0xa0)
[<805ceb2c>] (i2c_transfer) from [<805cf1a4>] (i2c_smbus_xfer+0x590/0x9d8)
[<805cf1a4>] (i2c_smbus_xfer) from [<805cf7ec>] (i2c_smbus_read_byte_data+0x3c/0x4c)
[<805cf7ec>] (i2c_smbus_read_byte_data) from [<805c07d0>] (crtouch_ts_interrupt+0x90/0x14c)
[<805c07d0>] (crtouch_ts_interrupt) from [<8007e824>] (irq_thread_fn+0x2c/0x50)
[<8007e824>] (irq_thread_fn) from [<8007eb80>] (irq_thread+0x13c/0x188)
[<8007eb80>] (irq_thread) from [<800576b4>] (kthread+0xec/0x104)
[<800576b4>] (kthread) from [<8000fca8>] (ret_from_fork+0x14/0x2c)
l : 97284, 8852a240, clk_gate2_do_shared_clks+0x78/0xe4
lo: 97284, 8852a240, clk_pllv3_do_shared_clks+0x78/0xe4
k : 97284, 8852a240, clk_gate2_do_shared_clks+0x78/0xe4
ko: 97284, 8852a240, clk_pllv3_do_shared_clks+0x78/0xe4
u : 97282, 8852a240, clk_gate2_do_shared_clks+0xa4/0xe4
uo: 97180, 8852a240, clk_pllv3_do_shared_clks+0xa4/0xe4
... (repeats infinitely) ...

The problem

The output suggests that imx_sema4_mutex_lock() is being called from two places at once (clk_gate2_do_shared_clks() and clk_pllv3_do_shared_clks()). One call comes from the touchscreen interrupt handler; the other comes from some worker thread, but I don't know more about it.

Printing to the console from an interrupt handler is clearly not okay, so I suggest disabling all debug/error prints in the imx_sema4.c file. That would avoid the infinite message loop.

The question is what the root cause is. Is the debug print really the problem (i.e. it is a valid situation that mutex_ptr->gate_val == SEMA4_A9_LOCK), or does it just expose a problem hidden somewhere else (i.e. the debug print should never actually appear)?

As the aforementioned output shows, the condition (mutex_ptr->gate_val != SEMA4_A9_LOCK) in imx_sema4_mutex_unlock() holds true when the problematic situation arises. I guess it is because the gate_val is already 0, as the unlock function has been called twice, which corresponds to the two locking calls. That suggests the locking mechanism does not work correctly.
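The double-unlock reasoning can be illustrated with a toy model in plain C. It is a sketch under assumptions, not the driver code: the gate is reduced to a single byte that records only which core holds it (1 for the A9, as seen in SEMA4_Gate06), and toy_gate, toy_lock(), toy_unlock() and toy_two_contexts() are hypothetical names. The model assumes that a second A9 context carries on even though its lock attempt did not succeed.

```c
#include <errno.h>
#include <stdint.h>

#define GATE_UNLOCKED 0u
#define GATE_A9_LOCK  1u  /* value observed in SEMA4_Gate06 in the report */

/* Toy stand-in for one gate: it records the owning core, not the thread. */
struct toy_gate {
	uint8_t val;
	unsigned int bad_unlocks; /* "unlock an unlocked mutex" events */
};

/* Returns 0 on success, -EBUSY if the A9 side already holds the gate. */
static int toy_lock(struct toy_gate *g)
{
	if (g->val == GATE_A9_LOCK)
		return -EBUSY;
	g->val = GATE_A9_LOCK;
	return 0;
}

/* Releases the gate, or records a bad unlock if it was not held. */
static void toy_unlock(struct toy_gate *g)
{
	if (g->val == GATE_A9_LOCK)
		g->val = GATE_UNLOCKED;
	else
		g->bad_unlocks++;
}

/*
 * One interleaving of two A9 contexts, A and B, where B ignores -EBUSY
 * (an assumption of this model). Returns the number of bad unlocks.
 */
static unsigned int toy_two_contexts(void)
{
	struct toy_gate g = { GATE_UNLOCKED, 0 };

	(void)toy_lock(&g); /* A takes the gate */
	(void)toy_lock(&g); /* B sees the same "A9" value and gets -EBUSY */
	toy_unlock(&g);     /* the first unlock releases the gate */
	toy_unlock(&g);     /* the second unlock finds it already free */
	return g.bad_unlocks;
}
```

In this model the gate value cannot tell context A from context B, so the second unlock always lands on a free gate and toy_two_contexts() reports one bad unlock. Whether the real driver reaches this interleaving is exactly the open question.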

Is my reasoning correct? How can the problem be fixed properly? I guess this question might be interesting for OtavioSalvador or Daiane Angolini.

Original Attachment has been moved to: imx_sema4.patch.zip

davidpatton · ‎07-09-2018

Attached is the patch; it allows more than 1 A9 thread to hold a SEMA4. I can only test on the kernel I have. I put a comment in imx_sema4_lock() where it needs to go to sleep, but since I am not currently having contention issues between the A9 and the M4, I have not investigated how to put a thread on a wait_q.
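The attached patch is not reproduced here. As a rough illustration of the general idea only, one way to let several A9 contexts share a gate is to keep a hold count on the A9 side and release the gate when the last holder leaves. Everything in this sketch is an assumption: counted_gate, counted_lock() and counted_unlock() are hypothetical names, the GATE_M4 value is illustrative, and a real driver would protect these fields with a lock.

```c
#include <errno.h>
#include <stdint.h>

#define GATE_FREE 0u
#define GATE_A9   1u
#define GATE_M4   2u /* illustrative value for "held by the M4" */

struct counted_gate {
	uint8_t val;          /* simulated gate value */
	unsigned int holders; /* A9 contexts currently holding the gate */
};

/* Take the gate for one more A9 context; -EBUSY if the M4 holds it. */
static int counted_lock(struct counted_gate *g)
{
	if (g->val == GATE_M4)
		return -EBUSY; /* a real driver would sleep or retry here */
	g->val = GATE_A9;
	g->holders++;
	return 0;
}

/* Drop one hold; the gate is freed only when the last holder leaves. */
static int counted_unlock(struct counted_gate *g)
{
	if (g->holders == 0)
		return -EPERM; /* nothing to unlock */
	if (--g->holders == 0)
		g->val = GATE_FREE;
	return 0;
}
```

The -EBUSY branch marks the place where a waiting thread would have to sleep, which this sketch leaves open as well.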


daiane_angolini · ‎10-11-2017

For the RT kernel, I would take the one from kernel mainline. Another option is to follow the recipe in meta-freescale-3rd to take a branch with imx6 support.

Ultimately, please send the bug to the meta-freescale mailing list so we can enter a bug in Bugzilla.

kubiznak_petr · ‎07-04-2018

I finally had some time to at least try this blindly. I used the linux-fslc-imx-rt_4.1-2.0 recipe from the meta-freescale layer on the community rocko branch. I wasn't able to reproduce the bug with this realtime kernel, which is promising. It would be great to have this confirmed by someone else experiencing the same issue, e.g. davidpatton.

---

Just a note regarding the kernel version: linux-fslc-imx-rt_4.1-2.0 is the name of the recipe, but it corresponds to kernel version 4.1.38 (4.1.38-rt45-fslc+gee67fc7e072d in particular) on both the morty and rocko branches. That is the same kernel version I originally reported against, as it was the version corresponding to the linux-fslc-imx_4.1-2.0 recipe on morty. And according to the FSL Community BSP Release Notes 2.4 (draft document), the 4.1-2.0.x+git kernel "version" is still the default kernel for imx6sxsabresd. That's why I believe this bug report should be interesting for the community.

kubiznak_petr · ‎01-09-2018

Hi Daiane, I reported the bug 3 months ago, including a post to the mailing list ([meta-freescale] i.mx6sx linux-fslc sema4 bug). There was absolutely no reply on the mailing list, and I haven't found any recent update to the imx_sema4.c file in the linux-fslc kernel.

I don't consider blindly trying more and more kernel versions to be the right strategy. I would expect someone competent to look at the issue, try to reproduce it, and offer a solution, or to say that it was already fixed by such-and-such patches, so that I can either switch to a proven kernel version or backport the fix to my version. An analytical approach would be appreciated.

daiane_angolini · ‎01-10-2018

We recommended that you test with different kernel versions and providers only because you are reporting the bug against a kernel that is outside the support boundaries of both NXP and the community.

I don't know how to fix the bug, and I can only escalate it internally if it's reproducible on the reference board using the latest NXP IMX kernel for that SoC.
