i.mx6sx linux-fslc sema4 bug

kubiznak_petr · ‎10-04-2017

How to reproduce

I hit on a problem in linux-fslc 4.1.38, available at GitHub - Freescale/linux-fslc: Linux kernel source tree . It is related to the sema4 mutex locking mechanism (drivers/char/imx_amp/imx_sema4.c). The problem only arises when the M4 core is running and its RDC domain is set. I'm running a trivial FreeRTOS program based on the hello_world demo from the NXP's official distribution (https://www.nxp.com/webapp/Download?colCode=FreeRTOS_MX6SX_1.0.1_WIN&appType=license&Parent_nodeId=1... ). The program simply sets the RDC domain and loops forever:

RDC_SetDomainID(RDC, rdcMdaM4, BOARD_DOMAIN_ID, false);
while (1) ;

With Yocto (Morty), I built the qt4e-demo-image (fslc-framebuffer distro). When I boot the image, qtdemoE demo starts. The problem can be reproduced with an intensive I2C communication. In my case, I touch on the screen wildly, generating the touch events. My interrupt-driven touchscreen driver calls i2c_smbus_read_byte_data() to read the data.

What it does

After a while, the following message starts printing in the terminal infinitely:

drivers/char/imx_amp/imx_sema4.c -> _imx_sema4_mutex_lock 137 already locked, wait! num 6 val 1

The FreeRTOS on the M4 core continues without problem. If I stop the debugger and check the SEMA4_Gate06 register, its value is set to 1, indicating it is locked by the A9 core.

Workaround

After spending hours on this problem, I formed a testing patch (see attached). In the _imx_sema4_mutex_lock() function, it disables the error log for the first 5 occurences. As a result, instead of being imprisoned in the infinite error message loop, I just get a single message and everything seems to work fine:

286 Trying to unlock an unlock mutex.

When this repeats 5-times (which might take a while), the system enters the infinite message loop. Including my added prints, the output looks as follows:

286 Trying to unlock an unlock mutex.
286 Trying to unlock an unlock mutex.
286 Trying to unlock an unlock mutex.
286 Trying to unlock an unlock mutex.
286 Trying to unlock an unlock mutex.
drivers/char/imx_amp/imx_sema4.c -> _imx_sema4_mutex_lock 145 already locked, wait! num 6 val 1. 6
CPU: 0 PID: 66 Comm: irq/203-crtouch Tainted: G           O    4.1.38-fslc-01914-gb3f76d9ae87e-dirty #12
Hardware name: Freescale i.MX6 SoloX (Device Tree)
[<800184e0>] (unwind_backtrace) from [<80013cfc>] (show_stack+0x20/0x24)
[<80013cfc>] (show_stack) from [<8091b2ec>] (dump_stack+0x80/0x94)
[<8091b2ec>] (dump_stack) from [<80406be8>] (_imx_sema4_mutex_lock+0x11c/0x1c4)
[<80406be8>] (_imx_sema4_mutex_lock) from [<80406d58>] (imx_sema4_mutex_lock+0xa4/0x128)
[<80406d58>] (imx_sema4_mutex_lock) from [<80025738>] (clk_gate2_do_shared_clks+0x78/0xe4)
[<80025738>] (clk_gate2_do_shared_clks) from [<800258b8>] (clk_gate2_disable+0x58/0x7c)
[<800258b8>] (clk_gate2_disable) from [<806b6838>] (clk_core_disable+0x64/0x1e4)
[<806b6838>] (clk_core_disable) from [<806b7630>] (clk_disable+0x34/0x40)
[<806b7630>] (clk_disable) from [<805d2eec>] (i2c_imx_xfer+0x194/0xe28)
[<805d2eec>] (i2c_imx_xfer) from [<805ce5a0>] (__i2c_transfer+0x160/0x678)
[<805ce5a0>] (__i2c_transfer) from [<805ceb2c>] (i2c_transfer+0x74/0xa0)
[<805ceb2c>] (i2c_transfer) from [<805cf1a4>] (i2c_smbus_xfer+0x590/0x9d8)
[<805cf1a4>] (i2c_smbus_xfer) from [<805cf7ec>] (i2c_smbus_read_byte_data+0x3c/0x4c)
[<805cf7ec>] (i2c_smbus_read_byte_data) from [<805c07d0>] (crtouch_ts_interrupt+0x90/0x14c)
[<805c07d0>] (crtouch_ts_interrupt) from [<8007e824>] (irq_thread_fn+0x2c/0x50)
[<8007e824>] (irq_thread_fn) from [<8007eb80>] (irq_thread+0x13c/0x188)
[<8007eb80>] (irq_thread) from [<800576b4>] (kthread+0xec/0x104)
[<800576b4>] (kthread) from [<8000fca8>] (ret_from_fork+0x14/0x2c)
l : 97284, 8852a240, clk_gate2_do_shared_clks+0x78/0xe4
lo: 97284, 8852a240, clk_pllv3_do_shared_clks+0x78/0xe4
k : 97284, 8852a240, clk_gate2_do_shared_clks+0x78/0xe4
ko: 97284, 8852a240, clk_pllv3_do_shared_clks+0x78/0xe4
u : 97282, 8852a240, clk_gate2_do_shared_clks+0xa4/0xe4
uo: 97180, 8852a240, clk_pllv3_do_shared_clks+0xa4/0xe4
drivers/char/imx_amp/imx_sema4.c -> _imx_sema4_mutex_lock 145 already locked, wait! num 6 val 1. 7
CPU: 0 PID: 66 Comm: irq/203-crtouch Tainted: G           O    4.1.38-fslc-01914-gb3f76d9ae87e-dirty #12
Hardware name: Freescale i.MX6 SoloX (Device Tree)
[<800184e0>] (unwind_backtrace) from [<80013cfc>] (show_stack+0x20/0x24)
[<80013cfc>] (show_stack) from [<8091b2ec>] (dump_stack+0x80/0x94)
[<8091b2ec>] (dump_stack) from [<80406be8>] (_imx_sema4_mutex_lock+0x11c/0x1c4)
[<80406be8>] (_imx_sema4_mutex_lock) from [<80406d88>] (imx_sema4_mutex_lock+0xd4/0x128)
[<80406d88>] (imx_sema4_mutex_lock) from [<80025738>] (clk_gate2_do_shared_clks+0x78/0xe4)
[<80025738>] (clk_gate2_do_shared_clks) from [<800258b8>] (clk_gate2_disable+0x58/0x7c)
[<800258b8>] (clk_gate2_disable) from [<806b6838>] (clk_core_disable+0x64/0x1e4)
[<806b6838>] (clk_core_disable) from [<806b7630>] (clk_disable+0x34/0x40)
[<806b7630>] (clk_disable) from [<805d2eec>] (i2c_imx_xfer+0x194/0xe28)
[<805d2eec>] (i2c_imx_xfer) from [<805ce5a0>] (__i2c_transfer+0x160/0x678)
[<805ce5a0>] (__i2c_transfer) from [<805ceb2c>] (i2c_transfer+0x74/0xa0)
[<805ceb2c>] (i2c_transfer) from [<805cf1a4>] (i2c_smbus_xfer+0x590/0x9d8)
[<805cf1a4>] (i2c_smbus_xfer) from [<805cf7ec>] (i2c_smbus_read_byte_data+0x3c/0x4c)
[<805cf7ec>] (i2c_smbus_read_byte_data) from [<805c07d0>] (crtouch_ts_interrupt+0x90/0x14c)
[<805c07d0>] (crtouch_ts_interrupt) from [<8007e824>] (irq_thread_fn+0x2c/0x50)
[<8007e824>] (irq_thread_fn) from [<8007eb80>] (irq_thread+0x13c/0x188)
[<8007eb80>] (irq_thread) from [<800576b4>] (kthread+0xec/0x104)
[<800576b4>] (kthread) from [<8000fca8>] (ret_from_fork+0x14/0x2c)
l : 97284, 8852a240, clk_gate2_do_shared_clks+0x78/0xe4
lo: 97284, 8852a240, clk_pllv3_do_shared_clks+0x78/0xe4
k : 97284, 8852a240, clk_gate2_do_shared_clks+0x78/0xe4
ko: 97284, 8852a240, clk_pllv3_do_shared_clks+0x78/0xe4
u : 97282, 8852a240, clk_gate2_do_shared_clks+0xa4/0xe4
uo: 97180, 8852a240, clk_pllv3_do_shared_clks+0xa4/0xe4
... (repeats infinitely) ...

The problem

The output shows the problem is in calling the imx_sema4_mutex_lock() function from two places at once (clk_gate2_do_shared_clks() and clk_pllv3_do_shared_clks()). One comes from the touchscreen interrupt handler, second comes from some worker thread, but I don't know more details about it.

It is obvious that printing anything to the console from an interrupt handler is not okay. So I suggest to disable all debug/error prints in the imx_sema4.c file. Then the infinite message loop would be avoided.

The question is, what is the root cause. Is the debug print really the problem (i.e. it is a valid situation that mutex_ptr->gate_val == SEMA4_A9_LOCK), or does it just expose a problem hidden somewhere else (i.e. the debug print should actually never appear).

As the aforementioned output shows, the condition (mutex_ptr->gate_val != SEMA4_A9_LOCK) in imx_sema4_mutex_unlock() holds true when the problematic situation arises. I guess it is because the gate_val is already 0, as the unlock function has been called twice, which corresponds to the two locking calls. That suggests the locking mechanism does not work correctly.

Is my reasoning correct? How to properly fix the problem? I guess this question might be interesting for OtavioSalvador or Daiane Angolini .

Original Attachment has been moved to: imx_sema4.patch.zip

davidpatton · ‎07-09-2018

Attached is the patch; it allows more than 1 A9 thread to hold a SEMA4...I can only test on the kernel I have. I put a comment in imx_sema4_lock() where it needs to go to sleep, but since I am not currently having contention issues between the A9/M4, I have not investigated how to put a thread on a wait_q.

在原帖中查看解决方案

davidpatton · ‎06-19-2018

I see a problem in the sema4 driver. An A9 thread is never put to sleep if it cannot obtain the sema4. A wait_q is created, and the ISR will wake up any thread on the wait_q, but the driver never puts a thread on the wait_q. In imx_sema4_mutex_lock() it just continuously tries to get the sema4, thus locking out the thread that holds the sema4 and we have a deadlock, in my case it is constantly spitting out the trace:

drivers/char/imx_amp/imx_sema4.c -> _imx_sema4_mutex_lock 137 already locked, wait! num 6 val 1.

If the M4 actually has the sema4, the A9 will be stuck with 1 thread in a tight loop, waiting for it. I'm fairly certain this is not correct behavior. If the A9 cannot obtain the semaphore, it should be put to sleep.

kubiznak_petr · ‎06-24-2018

Hi David, thank you for your investigation! I really appreciate it. It doesn't seem that anyone from NXP is ever going to dig into the issue. Do you think you would be able to patch the driver? I would be willing to do it, regrettably I don't know kernel programming that well. I could at least test the patch on version 4.1.38.

BiyongSUN · ‎06-26-2018

Have you ever tried NXP official BSP release?

NXP didn't do any release on 4.1.38

Please git the source code from the NXP git site.

kubiznak_petr · ‎06-29-2018

Hi Biyong, thanks for your reply. 4.1.15 was an official release, yet it contains exactly the same driver: linux-imx.git - i.MX Linux Kernel (yes, I did compare the files).

And as David recently posted, he's experiencing the same issue.

BiyongSUN · ‎07-01-2018

If you want to use the i2c with the "protocol", you need virtualization.

The hardware semaphore sema4 is just "hardware" mutex to let the hardware ready for use.

Here should be the i2c.

kubiznak_petr · ‎07-01-2018

I'm sorry, I don't really understand what you are suggesting. I have checked the drivers/i2c/i2c-core.c and drivers/i2c/busses/i2c-imx.c files for changes, and they are the same in 4.1.15 and 4.1.38.

Anyway, I am not sure that the bug relates to the I2C operation only. davidpatton‌, may you please describe your use case in more details?

davidpatton · ‎07-05-2018

My use case is an imx6sx with the M4 running "hello world" loaded in u-boot.This M4 binary does nothing with the SEMA4's, just spits out "Hello World" and a count. During linux kernel boot, 2 different threads will call shared clock init's, they check to see if the M4 is out of reset, if it is, they try to acquire a SEMA4, the first thread will get it, when the other thread enters the picture, it will discover that the M4 is running, so it will check the same SEMA4, since the other thread in the A9 already has it, _imx_sema4_mutex_lock() will return -EBUSY, this is bug #1; imx_sema4_mutex_lock() will then spin forever in a tight loop calling _imx_sema4_mutex_lock(), this is bug #2.

We'll start with bug #2, it's fairly simple; in imx_sema4_mutex_lock(), if _imx_sema4_mutex_lock() returns -EBUSY, the thread should be placed in the wait_q to be awakened by the interrupt routine. When it wakes up, it can again attempt to acquire the SEMA4. It cannot simply spin forever; although this did help point out bug #1 probably easier than it would have been otherwise.

For bug #1, _imx_sema4_mutex_lock() should only return -EBUSY when the other core, i.e. the M4, has the SEMA4. It is not a BUSY state for another thread, in the same core, to hold the SEMA4. I resolved this by adding a counting mechanism, first thread actually acquires the SEMA4, other threads attempting to acquire it, just increment the count; when the SEMA4 is being released, only when the count hits 0 is it actually release. I added a print to _imx_sema4_mutex_lock(), the count always goes to 2 during boot, so I believe this is the appropriate action for this deadlock.

kubiznak_petr · ‎07-08-2018

Thanks a lot for your reply, David. Could you please share your patch? I'd like to test it with my case and compare with the RT kernel. I would also appreciate if you could test with RT kernel as well, but I understand if you're too busy, of course.

davidpatton · ‎07-09-2018

Attached is the patch; it allows more than 1 A9 thread to hold a SEMA4...I can only test on the kernel I have. I put a comment in imx_sema4_lock() where it needs to go to sleep, but since I am not currently having contention issues between the A9/M4, I have not investigated how to put a thread on a wait_q.

kubiznak_petr · ‎07-12-2018

Thank you David, perfect. I've tested with both RT and non-RT kernel. In non-RT, the message "imx_sema4_mutex_unlock() - count = 1." sporadically appears, but no deadlock happens. In the RT kernel, I didn't see the message nor deadlock during my test.

I'm having no A9/M4 issues as well, so for my case this is sufficient. Thank you again for helping to solve this issue.

BiyongSUN · ‎07-02-2018

The hardware sema4 is for set flag for like writing the register here is the i2c register.

But it could not handle the i2c protocol sequence.

soyo_lin · ‎02-16-2020

About this bug, Is there any patch from NXP?

davidpatton · ‎06-13-2018

Was this ever resolved? I'm having the same issue; my kernel is older than the original posters (4.1.15) so it makes sense that I'd have it, but I need a fix. Thanks...

daiane_angolini · ‎10-09-2017

Hi Petr, sorry for my late reply, it was a very busy week.

Have you tried to follow the mutex lifetime? I mean, which other components deal with the same?

Another point, have you tried to reproduce the same using the 4.1.5_2.0.0_GA kernel?

daiane_angolini · ‎10-09-2017

Another point. Have you tried to use the real time kernel?

kubiznak_petr · ‎10-10-2017

Hi Daiane, thank you for your reply.

I have not tested with those kernels since I don't have them. Can you please provide such zImages?

I didn't follow the mutex lifetime, as I don't know how to efficiently do it. I never had a chance to learn something more efficient than kernel logs. How would you do that?

Just by luck, I found that the evbug kernel module might be incident with the issue. When I rmmod evbug, I never see the issue appear. When I insmod evbug again, the issue reappears. I don't think it's the root cause, though.

OtavioSalvador · ‎10-17-2017

If you can, check our current linux-fslc-imx release as it has several fixes. The rocko branch has it up to date.

kubiznak_petr · ‎07-03-2018

Hi Otavio, thank you for your reply. I've made the test with recent linux-fslc-imx (4.1-2.0.x+gitAUTOINC+6c1ad49339-r0) in rocko, with the same result. The imx_sema4.c file equals the one from version 4.1.38.

Can now the issue be escalated? Or is there an option to switch to another kernel with full i.MX6SoloX support?

BiyongSUN · ‎07-03-2018

Below is your descript of "the problem"

It is like a protocol sequence.

It needs virtualization to do so.

Or need lock a complete i2c communication of A9 side. then release to M4. Then M4 finish a complete i2c communication then release to A9.

It could not be lock at anytime or middle of i2c communication. That is why need virtualization.

The hardware semaphore is not to handle a sequence. that is why the log shows lots of try to lock.

It is not a GPIO, it is i2c bus.

If it is just a GPIO, it can just use a hardware semaphore to do so. you can try a GPIO to see what is happening.

That is the last reply from me.