Potential cache coherency problems with Linux for T4240?

alexray
Contributor II

All,

I have been evaluating a test program on a T4240RDB and I have noticed a consistently reproducible mutex deadlock when I scale it to use multiple CPU cores on the T4240. The test program is essentially an IPC flood test that validates the messaging transport used by our middleware. When the program comes up, pub/sub threads are started and affinitized to various CPU cores. pthread_mutexes are used to protect critical sections.
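
In miniature, the setup looks something like this (the names here are illustrative, not our actual middleware code): threads pinned with pthread_setaffinity_np, hammering a pthread_mutex-protected resource in a loop.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static long shared_a;                       /* stands in for shared resource A */

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg)
{
    pin_to_core((int)(long)arg);
    for (long i = 0; i < 10000000; i++) {   /* flood loop */
        pthread_mutex_lock(&lock_a);
        shared_a++;                         /* critical section */
        pthread_mutex_unlock(&lock_a);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)3L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}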

I can run this test on my x86 host running Linux across all 32 virtual cores for several hours with no issues. I can also run it on the T4240 affinitized to two virtual cores and it works in that configuration. But as soon as I add a third core, it fails every time.

The deadlock condition exhibits itself as follows:

Thread 1 (affinitized to core 1) locks mutex A' to access shared resource A

Thread 2 (affinitized to core 3) locks mutex B' to access shared resource B. Thread 2 then tries to lock mutex A' to access shared resource A.

Thread 1 releases the lock on mutex A' after modifying shared resource A. Thread 1 then tries to lock mutex B' to access shared resource B.

Thread 2 is never notified that mutex A' has been unlocked: as seen from thread 2, mutex A's __data.__owner field still points to thread 1.

A stack trace of thread 1 shows it on a line of code that occurs after the unlock of mutex A'.
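
To be explicit about why this looks like a coherency problem rather than an ordinary lock-order inversion: thread 1 never holds both mutexes at once, so no interleaving of the sequence above should be able to deadlock. A minimal rendering of the sequence (illustrative, not our actual code) is below; with correctly behaving mutexes it should always run to completion.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t mutex_a = PTHREAD_MUTEX_INITIALIZER;  /* protects A */
static pthread_mutex_t mutex_b = PTHREAD_MUTEX_INITIALIZER;  /* protects B */

static void *thread1(void *arg)
{
    pthread_mutex_lock(&mutex_a);    /* step 1: T1 holds A' */
    usleep(1000);                    /* modify shared resource A */
    pthread_mutex_unlock(&mutex_a);  /* step 3: release A'; T2 should now acquire it */
    pthread_mutex_lock(&mutex_b);    /* T1 then waits for B' */
    pthread_mutex_unlock(&mutex_b);
    return NULL;
}

static void *thread2(void *arg)
{
    pthread_mutex_lock(&mutex_b);    /* step 2: T2 holds B'... */
    pthread_mutex_lock(&mutex_a);    /* ...and blocks until A' is released */
    pthread_mutex_unlock(&mutex_a);
    pthread_mutex_unlock(&mutex_b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("no deadlock with plain pthread mutexes");
    return 0;
}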

I have noticed this deadlock condition occurs across different threads as well as across different mutexes in my system.

Does anyone have any suggestions on how to proceed with troubleshooting this issue? I'm fairly new to multicore troubleshooting.

Thanks,

Alex Ray

1 Solution
alexray
Contributor II

I ended up figuring out the problem using helgrind, a Valgrind tool. What an incredibly useful tool! Our middleware implemented a recursive mutex as a wrapper around pthread_mutex, and some of the private data members of that wrapper class weren't protected using a separate mutex, which resulted in this issue. This question can be closed.
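
For anyone who hits something similar, the invocation is just:

valgrind --tool=helgrind ./your_test_program

and the class of bug it flagged looks roughly like this (a sketch of the pattern, not our actual wrapper): recursive-mutex bookkeeping fields read and written without synchronization, so another core can act on stale owner/count values. This kind of wrapper can appear to work on x86's stronger memory model and then fall over on weakly ordered PowerPC cores, which would explain why it only failed on the T4240.

#include <pthread.h>

struct recursive_mutex {
    pthread_mutex_t lock;
    pthread_t       owner;   /* written outside any lock: races across cores */
    int             count;   /* unsynchronized bookkeeping */
};

static void rm_lock(struct recursive_mutex *m)
{
    /* Racy fast path: 'owner' and 'count' may be stale on this core. */
    if (m->count > 0 && pthread_equal(m->owner, pthread_self())) {
        m->count++;
        return;
    }
    pthread_mutex_lock(&m->lock);
    m->owner = pthread_self();
    m->count = 1;
}

static void rm_unlock(struct recursive_mutex *m)
{
    if (--m->count == 0)
        pthread_mutex_unlock(&m->lock);
}

The safer route is glibc's built-in recursive type instead of hand-rolled bookkeeping:

static void make_recursive(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
}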
