Potential cache coherency problems with Linux for T4240?

alexray · ‎07-22-2015

All,

I have been evaluating a test program on a T4240RDB and have noticed a consistently reproducible mutex deadlock when I scale it across multiple CPU cores on the T4240. The test program is essentially an IPC flood test that validates the messaging transport used by our middleware. When the program starts, pub/sub threads are created and affinitized to various CPU cores. pthread mutexes protect the critical sections.
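For readers unfamiliar with affinitizing threads on Linux, here is a minimal sketch of pinning the calling thread to one core using the GNU extension pthread_setaffinity_np. The helper name pin_current_thread is made up for illustration; the test program's actual code is not shown in this thread.

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: restrict the calling thread to a single CPU core.
// Returns 0 on success, or an errno-style code (e.g. EINVAL if the core
// is not available to this process).
int pin_current_thread(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);        // start with an empty CPU mask
    CPU_SET(core, &set);   // allow only the requested core
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Each pub/sub thread would call this once at startup with its assigned core number, and the return value should be checked, since a silently failed pin changes which cores actually contend for a lock.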

I can run this test on my x86 host running Linux across all 32 virtual cores for several hours with no issues. I can also run it affinitized to two virtual cores on the same CPU, and it works in that configuration. But as soon as I run it on a third core, it fails every time.

The deadlock condition exhibits itself as follows:

Thread 1 (affinitized to core 1) locks mutex A' to access shared resource A

Thread 2 (affinitized to core 3) locks mutex B' to access shared resource B. Thread 2 then tries to lock mutex A' to access shared resource A

Thread 1 releases the lock on mutex A' after modifying shared resource A. Thread 1 then tries to lock mutex B' to access shared resource B.

Thread 2 is never notified that mutex A' has been unlocked. As seen from Thread 2, mutex(A').__data.__owner still identifies Thread 1 as the owner.

A stack trace of Thread 1 shows it on a line of code that comes after the unlock of mutex A'.

I have noticed this deadlock condition occurs across different threads as well as across different mutexes in my system.
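One low-cost check when a lock appears stuck like this is to build with error-checking mutexes, which make pthreads report misuse through return codes instead of hanging. The following is only a sketch and not the original program's code; init_errorcheck_mutex is a hypothetical helper.

```cpp
#include <pthread.h>
#include <cerrno>

// Hypothetical helper: initialise *m as an error-checking mutex.
// Returns 0 on success, or the error code from the failing pthread call.
int init_errorcheck_mutex(pthread_mutex_t* m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0) rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);  // attr is no longer needed after init
    return rc;
}
```

With this mutex type, a second lock by the owning thread returns EDEADLK and an unlock by a thread that does not hold the mutex returns EPERM, so checking the return value of every lock and unlock call can expose bookkeeping errors in a wrapper.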

Does anyone have any suggestions on how to proceed with troubleshooting this issue? I'm fairly new to multicore troubleshooting.

Thanks,

Alex Ray

alexray · ‎07-23-2015

I ended up figuring out the problem using Helgrind (a Valgrind tool). What an incredibly useful tool! Our middleware implemented a recursive mutex on top of pthread_mutex, and some of the private data members of that wrapper class weren't protected by a separate mutex, which resulted in this issue. This question can be closed.
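The middleware's wrapper is not shown here, so the following is only a sketch of one way to avoid this class of bug: let pthreads keep the owner and recursion count itself via PTHREAD_MUTEX_RECURSIVE, so the wrapper has no private bookkeeping fields of its own to race on. The class name RecursiveMutex and its methods are hypothetical.

```cpp
#include <pthread.h>

// Hypothetical wrapper. A hand-rolled recursive mutex typically stores an
// owner ID and a recursion count next to the pthread_mutex_t; if those
// fields are read or written outside any lock, two cores can observe
// stale values. Here pthreads tracks both internally instead.
class RecursiveMutex {
public:
    RecursiveMutex() {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&mutex_, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    ~RecursiveMutex() { pthread_mutex_destroy(&mutex_); }

    // Copying a mutex is meaningless, so forbid it.
    RecursiveMutex(const RecursiveMutex&) = delete;
    RecursiveMutex& operator=(const RecursiveMutex&) = delete;

    // Each method returns 0 on success or a pthread error code.
    int lock()    { return pthread_mutex_lock(&mutex_); }
    int trylock() { return pthread_mutex_trylock(&mutex_); }
    int unlock()  { return pthread_mutex_unlock(&mutex_); }

private:
    pthread_mutex_t mutex_;  // the only state; no separate owner/count fields
};
```

The same thread can call lock() several times and must call unlock() the same number of times before another thread can acquire the mutex. If the wrapper does need extra state of its own, that state would have to be guarded by its own lock or made atomic.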

