I hit on a problem in linux-fslc 4.1.38, available at GitHub - Freescale/linux-fslc: Linux kernel source tree . It is related to the sema4 mutex locking mechanism (drivers/char/imx_amp/imx_sema4.c). The problem only arises when the M4 core is running and its RDC domain is set. I'm running a trivial FreeRTOS program based on the hello_world demo from the NXP's official distribution (https://www.nxp.com/webapp/Download?colCode=FreeRTOS_MX6SX_1.0.1_WIN&appType=license&Parent_nodeId=1... ). The program simply sets the RDC domain and loops forever:
RDC_SetDomainID(RDC, rdcMdaM4, BOARD_DOMAIN_ID, false);
while (1) ;
With Yocto (Morty), I built the qt4e-demo-image (fslc-framebuffer distro). When I boot the image, qtdemoE demo starts. The problem can be reproduced with an intensive I2C communication. In my case, I touch on the screen wildly, generating the touch events. My interrupt-driven touchscreen driver calls i2c_smbus_read_byte_data() to read the data.
After a while, the following message starts printing in the terminal infinitely:
drivers/char/imx_amp/imx_sema4.c -> _imx_sema4_mutex_lock 137 already locked, wait! num 6 val 1
The FreeRTOS on the M4 core continues without problem. If I stop the debugger and check the SEMA4_Gate06 register, its value is set to 1, indicating it is locked by the A9 core.
After spending hours on this problem, I formed a testing patch (see attached). In the _imx_sema4_mutex_lock() function, it disables the error log for the first 5 occurences. As a result, instead of being imprisoned in the infinite error message loop, I just get a single message and everything seems to work fine:
286 Trying to unlock an unlock mutex.
When this repeats 5-times (which might take a while), the system enters the infinite message loop. Including my added prints, the output looks as follows:
286 Trying to unlock an unlock mutex. 286 Trying to unlock an unlock mutex. 286 Trying to unlock an unlock mutex. 286 Trying to unlock an unlock mutex. 286 Trying to unlock an unlock mutex. drivers/char/imx_amp/imx_sema4.c -> _imx_sema4_mutex_lock 145 already locked, wait! num 6 val 1. 6 CPU: 0 PID: 66 Comm: irq/203-crtouch Tainted: G O 4.1.38-fslc-01914-gb3f76d9ae87e-dirty #12 Hardware name: Freescale i.MX6 SoloX (Device Tree) [<800184e0>] (unwind_backtrace) from [<80013cfc>] (show_stack+0x20/0x24) [<80013cfc>] (show_stack) from [<8091b2ec>] (dump_stack+0x80/0x94) [<8091b2ec>] (dump_stack) from [<80406be8>] (_imx_sema4_mutex_lock+0x11c/0x1c4) [<80406be8>] (_imx_sema4_mutex_lock) from [<80406d58>] (imx_sema4_mutex_lock+0xa4/0x128) [<80406d58>] (imx_sema4_mutex_lock) from [<80025738>] (clk_gate2_do_shared_clks+0x78/0xe4) [<80025738>] (clk_gate2_do_shared_clks) from [<800258b8>] (clk_gate2_disable+0x58/0x7c) [<800258b8>] (clk_gate2_disable) from [<806b6838>] (clk_core_disable+0x64/0x1e4) [<806b6838>] (clk_core_disable) from [<806b7630>] (clk_disable+0x34/0x40) [<806b7630>] (clk_disable) from [<805d2eec>] (i2c_imx_xfer+0x194/0xe28) [<805d2eec>] (i2c_imx_xfer) from [<805ce5a0>] (__i2c_transfer+0x160/0x678) [<805ce5a0>] (__i2c_transfer) from [<805ceb2c>] (i2c_transfer+0x74/0xa0) [<805ceb2c>] (i2c_transfer) from [<805cf1a4>] (i2c_smbus_xfer+0x590/0x9d8) [<805cf1a4>] (i2c_smbus_xfer) from [<805cf7ec>] (i2c_smbus_read_byte_data+0x3c/0x4c) [<805cf7ec>] (i2c_smbus_read_byte_data) from [<805c07d0>] (crtouch_ts_interrupt+0x90/0x14c) [<805c07d0>] (crtouch_ts_interrupt) from [<8007e824>] (irq_thread_fn+0x2c/0x50) [<8007e824>] (irq_thread_fn) from [<8007eb80>] (irq_thread+0x13c/0x188) [<8007eb80>] (irq_thread) from [<800576b4>] (kthread+0xec/0x104) [<800576b4>] (kthread) from [<8000fca8>] (ret_from_fork+0x14/0x2c) l : 97284, 8852a240, clk_gate2_do_shared_clks+0x78/0xe4 lo: 97284, 8852a240, clk_pllv3_do_shared_clks+0x78/0xe4 k : 97284, 8852a240, clk_gate2_do_shared_clks+0x78/0xe4 ko: 97284, 8852a240, clk_pllv3_do_shared_clks+0x78/0xe4 u : 97282, 8852a240, clk_gate2_do_shared_clks+0xa4/0xe4 uo: 97180, 8852a240, clk_pllv3_do_shared_clks+0xa4/0xe4 drivers/char/imx_amp/imx_sema4.c -> _imx_sema4_mutex_lock 145 already locked, wait! num 6 val 1. 7 CPU: 0 PID: 66 Comm: irq/203-crtouch Tainted: G O 4.1.38-fslc-01914-gb3f76d9ae87e-dirty #12 Hardware name: Freescale i.MX6 SoloX (Device Tree) [<800184e0>] (unwind_backtrace) from [<80013cfc>] (show_stack+0x20/0x24) [<80013cfc>] (show_stack) from [<8091b2ec>] (dump_stack+0x80/0x94) [<8091b2ec>] (dump_stack) from [<80406be8>] (_imx_sema4_mutex_lock+0x11c/0x1c4) [<80406be8>] (_imx_sema4_mutex_lock) from [<80406d88>] (imx_sema4_mutex_lock+0xd4/0x128) [<80406d88>] (imx_sema4_mutex_lock) from [<80025738>] (clk_gate2_do_shared_clks+0x78/0xe4) [<80025738>] (clk_gate2_do_shared_clks) from [<800258b8>] (clk_gate2_disable+0x58/0x7c) [<800258b8>] (clk_gate2_disable) from [<806b6838>] (clk_core_disable+0x64/0x1e4) [<806b6838>] (clk_core_disable) from [<806b7630>] (clk_disable+0x34/0x40) [<806b7630>] (clk_disable) from [<805d2eec>] (i2c_imx_xfer+0x194/0xe28) [<805d2eec>] (i2c_imx_xfer) from [<805ce5a0>] (__i2c_transfer+0x160/0x678) [<805ce5a0>] (__i2c_transfer) from [<805ceb2c>] (i2c_transfer+0x74/0xa0) [<805ceb2c>] (i2c_transfer) from [<805cf1a4>] (i2c_smbus_xfer+0x590/0x9d8) [<805cf1a4>] (i2c_smbus_xfer) from [<805cf7ec>] (i2c_smbus_read_byte_data+0x3c/0x4c) [<805cf7ec>] (i2c_smbus_read_byte_data) from [<805c07d0>] (crtouch_ts_interrupt+0x90/0x14c) [<805c07d0>] (crtouch_ts_interrupt) from [<8007e824>] (irq_thread_fn+0x2c/0x50) [<8007e824>] (irq_thread_fn) from [<8007eb80>] (irq_thread+0x13c/0x188) [<8007eb80>] (irq_thread) from [<800576b4>] (kthread+0xec/0x104) [<800576b4>] (kthread) from [<8000fca8>] (ret_from_fork+0x14/0x2c) l : 97284, 8852a240, clk_gate2_do_shared_clks+0x78/0xe4 lo: 97284, 8852a240, clk_pllv3_do_shared_clks+0x78/0xe4 k : 97284, 8852a240, clk_gate2_do_shared_clks+0x78/0xe4 ko: 97284, 8852a240, clk_pllv3_do_shared_clks+0x78/0xe4 u : 97282, 8852a240, clk_gate2_do_shared_clks+0xa4/0xe4 uo: 97180, 8852a240, clk_pllv3_do_shared_clks+0xa4/0xe4 ... (repeats infinitely) ...
The output shows the problem is in calling the imx_sema4_mutex_lock() function from two places at once (clk_gate2_do_shared_clks() and clk_pllv3_do_shared_clks()). One comes from the touchscreen interrupt handler, second comes from some worker thread, but I don't know more details about it.
It is obvious that printing anything to the console from an interrupt handler is not okay. So I suggest to disable all debug/error prints in the imx_sema4.c file. Then the infinite message loop would be avoided.
The question is, what is the root cause. Is the debug print really the problem (i.e. it is a valid situation that mutex_ptr->gate_val == SEMA4_A9_LOCK), or does it just expose a problem hidden somewhere else (i.e. the debug print should actually never appear).
As the aforementioned output shows, the condition (mutex_ptr->gate_val != SEMA4_A9_LOCK) in imx_sema4_mutex_unlock() holds true when the problematic situation arises. I guess it is because the gate_val is already 0, as the unlock function has been called twice, which corresponds to the two locking calls. That suggests the locking mechanism does not work correctly.
Is my reasoning correct? How to properly fix the problem? I guess this question might be interesting for OtavioSalvador or Daiane Angolini .
Original Attachment has been moved to: imx_sema4.patch.zip
已解决! 转到解答。
Attached is the patch; it allows more than 1 A9 thread to hold a SEMA4...I can only test on the kernel I have. I put a comment in imx_sema4_lock() where it needs to go to sleep, but since I am not currently having contention issues between the A9/M4, I have not investigated how to put a thread on a wait_q.
I see a problem in the sema4 driver. An A9 thread is never put to sleep if it cannot obtain the sema4. A wait_q is created, and the ISR will wake up any thread on the wait_q, but the driver never puts a thread on the wait_q. In imx_sema4_mutex_lock() it just continuously tries to get the sema4, thus locking out the thread that holds the sema4 and we have a deadlock, in my case it is constantly spitting out the trace:
drivers/char/imx_amp/imx_sema4.c -> _imx_sema4_mutex_lock 137 already locked, wait! num 6 val 1.
If the M4 actually has the sema4, the A9 will be stuck with 1 thread in a tight loop, waiting for it. I'm fairly certain this is not correct behavior. If the A9 cannot obtain the semaphore, it should be put to sleep.
Hi David, thank you for your investigation! I really appreciate it. It doesn't seem that anyone from NXP is ever going to dig into the issue. Do you think you would be able to patch the driver? I would be willing to do it, regrettably I don't know kernel programming that well. I could at least test the patch on version 4.1.38.
Hi Biyong, thanks for your reply. 4.1.15 was an official release, yet it contains exactly the same driver: linux-imx.git - i.MX Linux Kernel (yes, I did compare the files).
And as David recently posted, he's experiencing the same issue.
I'm sorry, I don't really understand what you are suggesting. I have checked the drivers/i2c/i2c-core.c and drivers/i2c/busses/i2c-imx.c files for changes, and they are the same in 4.1.15 and 4.1.38.
Anyway, I am not sure that the bug relates to the I2C operation only. davidpatton, may you please describe your use case in more details?
My use case is an imx6sx with the M4 running "hello world" loaded in u-boot.This M4 binary does nothing with the SEMA4's, just spits out "Hello World" and a count. During linux kernel boot, 2 different threads will call shared clock init's, they check to see if the M4 is out of reset, if it is, they try to acquire a SEMA4, the first thread will get it, when the other thread enters the picture, it will discover that the M4 is running, so it will check the same SEMA4, since the other thread in the A9 already has it, _imx_sema4_mutex_lock() will return -EBUSY, this is bug #1; imx_sema4_mutex_lock() will then spin forever in a tight loop calling _imx_sema4_mutex_lock(), this is bug #2.
We'll start with bug #2, it's fairly simple; in imx_sema4_mutex_lock(), if _imx_sema4_mutex_lock() returns -EBUSY, the thread should be placed in the wait_q to be awakened by the interrupt routine. When it wakes up, it can again attempt to acquire the SEMA4. It cannot simply spin forever; although this did help point out bug #1 probably easier than it would have been otherwise.
For bug #1, _imx_sema4_mutex_lock() should only return -EBUSY when the other core, i.e. the M4, has the SEMA4. It is not a BUSY state for another thread, in the same core, to hold the SEMA4. I resolved this by adding a counting mechanism, first thread actually acquires the SEMA4, other threads attempting to acquire it, just increment the count; when the SEMA4 is being released, only when the count hits 0 is it actually release. I added a print to _imx_sema4_mutex_lock(), the count always goes to 2 during boot, so I believe this is the appropriate action for this deadlock.
Thanks a lot for your reply, David. Could you please share your patch? I'd like to test it with my case and compare with the RT kernel. I would also appreciate if you could test with RT kernel as well, but I understand if you're too busy, of course.
Attached is the patch; it allows more than 1 A9 thread to hold a SEMA4...I can only test on the kernel I have. I put a comment in imx_sema4_lock() where it needs to go to sleep, but since I am not currently having contention issues between the A9/M4, I have not investigated how to put a thread on a wait_q.
Thank you David, perfect. I've tested with both RT and non-RT kernel. In non-RT, the message "imx_sema4_mutex_unlock() - count = 1." sporadically appears, but no deadlock happens. In the RT kernel, I didn't see the message nor deadlock during my test.
I'm having no A9/M4 issues as well, so for my case this is sufficient. Thank you again for helping to solve this issue.
Hi Petr, sorry for my late reply, it was a very busy week.
Have you tried to follow the mutex lifetime? I mean, which other components deal with the same?
Another point, have you tried to reproduce the same using the 4.1.5_2.0.0_GA kernel?
Hi Daiane, thank you for your reply.
I have not tested with those kernels since I don't have them. Can you please provide such zImages?
I didn't follow the mutex lifetime, as I don't know how to efficiently do it. I never had a chance to learn something more efficient than kernel logs. How would you do that?
Just by luck, I found that the evbug kernel module might be incident with the issue. When I rmmod evbug, I never see the issue appear. When I insmod evbug again, the issue reappears. I don't think it's the root cause, though.
Hi Otavio, thank you for your reply. I've made the test with recent linux-fslc-imx (4.1-2.0.x+gitAUTOINC+6c1ad49339-r0) in rocko, with the same result. The imx_sema4.c file equals the one from version 4.1.38.
Can now the issue be escalated? Or is there an option to switch to another kernel with full i.MX6SoloX support?
Below is your descript of "the problem"
It is like a protocol sequence.
It needs virtualization to do so.
Or need lock a complete i2c communication of A9 side. then release to M4. Then M4 finish a complete i2c communication then release to A9.
It could not be lock at anytime or middle of i2c communication. That is why need virtualization.
The hardware semaphore is not to handle a sequence. that is why the log shows lots of try to lock.
It is not a GPIO, it is i2c bus.
If it is just a GPIO, it can just use a hardware semaphore to do so. you can try a GPIO to see what is happening.
That is the last reply from me.
The output shows the problem is in calling the imx_sema4_mutex_lock() function from two places at once (clk_gate2_do_shared_clks() and clk_pllv3_do_shared_clks()). One comes from the touchscreen interrupt handler, second comes from some worker thread, but I don't know more details about it.
It is obvious that printing anything to the console from an interrupt handler is not okay. So I suggest to disable all debug/error prints in the imx_sema4.c file. Then the infinite message loop would be avoided.
The question is, what is the root cause. Is the debug print really the problem (i.e. it is a valid situation that mutex_ptr->gate_val == SEMA4_A9_LOCK), or does it just expose a problem hidden somewhere else (i.e. the debug print should actually never appear).
As the aforementioned output shows, the condition (mutex_ptr->gate_val != SEMA4_A9_LOCK) in imx_sema4_mutex_unlock() holds true when the problematic situation arises. I guess it is because the gate_val is already 0, as the unlock function has been called twice, which corresponds to the two locking calls. That suggests the locking mechanism does not work correctly.
Is my reasoning correct? How to properly fix the problem? I guess this question might be interesting for OtavioSalvador or Daiane Angolini .
Original Attachment has been moved to: