T2080: Write never observed by other cores

roffelsen · ‎07-01-2022

I have a condition that is causing cores to become stuck in loops waiting on each other because a 32-bit write from one of the cores is never observed by the other core. I used the following sample code to demonstrate the problem:

static INLINE void loadStore_storeBarrier() { asm volatile(".long 0x7C0104AC":::"memory"); } //esync 0x1/sync 0, 0x1
static INLINE void load_loadStoreBarrier() { asm volatile(".long 0x7C0804AC":::"memory"); }  //esync 0x8/sync 0, 0x8

#define TotalNumCores 4
uint32_t _runningCore = 0;
uint8_t interferenceMemory[64] __attribute__((aligned (64))); // Cache-line aligned and sized so that it does not share a cache line with _runningCore.

// Spin waiting for *addr == value.
static NOINLINE void waitUntilEqual(volatile uint32_t *addr, uint32_t value)
{
  while (*addr != value)
  {
    interferenceMemory[1] = 0xDE;
    asm volatile("":::"memory"); // force the compiler to write to interferenceMemory each pass through the loop.
  }
}

// Waits until _runningCore == currentCore.
// Preconditions:
//  - must be called with processor interrupts disabled
// Postconditions:
//  - An acquire fence (load_loadStoreBarrier()) has occurred.
static NOINLINE void serializeStart(uint32_t currentCore)
{
  waitUntilEqual( &_runningCore, currentCore);
  load_loadStoreBarrier(); // acquire operation is _runningCore==currentCore
}
// Increment _runningCore to notify the next core it can start the serialized operation.
// and wait for all cores to call serializeEnd().
// Preconditions:
//  - must be called with processor interrupts disabled
static NOINLINE void serializeEnd(uint32_t currentCore)
{
  loadStore_storeBarrier();   // release operation is _runningCore=
  interferenceMemory[0] = 0xDE;
  asm volatile("":::"memory"); // force the compiler to write to interferenceMemory before _runningCore is incremented.
  _runningCore = _runningCore + 1; // Only one core can modify _runningCore at a time so no need for an atomic add.
  waitUntilEqual( &_runningCore, TotalNumCores);
}
  ... 
  // memory is configured to be write back cached and Memory Coherence required
  // Only 1 thread per core is enabled
  // Cores are numbered 0-3. currentCoreNumber is the number associated with the calling core.
  // processor interrupts are disabled (i.e. msr[EE] == 0)
  serializeStart(currentCoreNumber);
  ... // Do some work
  serializeEnd(currentCoreNumber);

When I run this code, core 0 gets stuck in the waitUntilEqual() called by serializeEnd() and all other cores are stuck in the waitUntilEqual() called by serializeStart(). If I make any one of the following modifications to the code, it runs as expected:
1) Comment out either (or both) write(s) to interferenceMemory.
2) add a "miso" after the _runningCore = _runningCore + 1;
3) change the _runningCore = _runningCore + 1; to use a lwarx/stwcx. to perform an atomic add.
4) Enable both threads on each core and increase the number of cores to 8.
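
For readers without a Power toolchain, modification 3 can be sketched in portable C11. This is only an illustration, not the code from the post: it assumes a compiler that provides <stdatomic.h>, the names running_core, serialize_start, serialize_release and serialize_end are hypothetical stand-ins, and the interferenceMemory writes are left out. On Power targets, compilers typically lower atomic_fetch_add to a lwarx/stwcx. loop, but that should be confirmed in the generated assembly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOTAL_NUM_CORES 4u /* mirrors TotalNumCores in the post */

/* Hypothetical stand-in for _runningCore. */
_Atomic uint32_t running_core = 0;

/* Spin until running_core == value. The acquire load keeps later
 * accesses from being ordered before the successful read. */
static void wait_until_equal(uint32_t value)
{
    while (atomic_load_explicit(&running_core, memory_order_acquire) != value) {
        /* spin */
    }
}

/* Wait for this core's turn. */
void serialize_start(uint32_t core)
{
    assert(core < TOTAL_NUM_CORES);
    wait_until_equal(core);
}

/* Hand the turn to the next core with an atomic read-modify-write
 * instead of the plain load/add/store. Returns the new counter value. */
uint32_t serialize_release(uint32_t core)
{
    assert(core < TOTAL_NUM_CORES);
    return atomic_fetch_add_explicit(&running_core, 1u, memory_order_release) + 1u;
}

/* Hand over the turn, then wait until every core has done the same. */
void serialize_end(uint32_t core)
{
    (void)serialize_release(core);
    wait_until_equal(TOTAL_NUM_CORES);
}
```

Each core would call serialize_start(n), do its work, then call serialize_end(n), as in the original sample.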

Is this expected processor behavior? I can find no documentation stating how long a write from one core will take to become visible to another core, but it seems odd that the time would be indefinite.

yipingwang · ‎08-04-2022

The piece of code shared by customer is partial.
Has customer used Codewarrior tool to run this code? If yes, please ask them to share the source code of this.
Also, we would like to know the MMU settings of the addresses written in this code.

roffelsen · ‎08-23-2022

> Has customer used Codewarrior tool to run this code?

No, I do not have Codewarrior.

> Also, we would like to know the MMU settings of the addresses written in this code.

In my example _runningCore and interferenceMemory are on the same page. The MMU setting of each core for the page is:

core 0:
address| physical |ts|tid|tgs|tlpid| idx|way|set| pagesize | IPROT | WIMGE | U0123 | URWX | SRWX | X01 | mas1 mas2 mas3 mas7
01812000--01812FFF| 06182000--06182FFF | 1| 01| 0| 00| 292| 5| 12| 4 KB | - | --M-- | U---- | URWX | SRWX | X-- | 80011100 01812004 0618203F 00000000

core 2:
address| physical |ts|tid|tgs|tlpid| idx|way|set| pagesize | IPROT | WIMGE | U0123 | URWX | SRWX | X01 | mas1 mas2 mas3 mas7
01812000--01812FFF| 06182000--06182FFF | 1| 01| 0| 00| 092| 1| 12| 4 KB | - | --M-- | U---- | URWX | SRWX | X-- | 80011100 01812004 0618203F 00000000

core 4:
address| physical |ts|tid|tgs|tlpid| idx|way|set| pagesize | IPROT | WIMGE | U0123 | URWX | SRWX | X01 | mas1 mas2 mas3 mas7
01812000--01812FFF| 06182000--06182FFF | 1| 01| 0| 00| 092| 1| 12| 4 KB | - | --M-- | U---- | URWX | SRWX | X-- | 80011100 01812004 0618203F 00000000

core 6
address| physical |ts|tid|tgs|tlpid| idx|way|set| pagesize | IPROT | WIMGE | U0123 | URWX | SRWX | X01 | mas1 mas2 mas3 mas7
01812000--01812FFF| 06182000--06182FFF | 1| 01| 0| 00| 092| 1| 12| 4 KB | - | --M-- | U---- | URWX | SRWX | X-- | 80011100 01812004 0618203F 00000000
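
As a cross-check of the dump above, the raw MAS values can be decoded by hand. The sketch below is illustrative only: the field positions are my reading of the EREF/e6500 MAS register layout (in particular the 5-bit TSIZE field with page size 2^TSIZE KB is an assumption and should be checked against the manual), and decode_mas and tlb_entry_t are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WIMGE_M 0x04u /* Memory Coherence Required bit within the WIMGE field */

/* Hypothetical container for the decoded fields. */
typedef struct {
    bool     valid;   /* MAS1[V] */
    bool     iprot;   /* MAS1[IPROT] */
    uint32_t tid;     /* MAS1[TID] */
    uint32_t ts;      /* MAS1[TS] */
    uint32_t page_kb; /* page size in KB, assuming 2^TSIZE KB */
    uint32_t epn;     /* effective page address from MAS2 */
    uint32_t wimge;   /* low five bits of MAS2 */
    uint32_t rpn;     /* real page address from MAS3 (low 32 bits) */
    uint32_t perms;   /* UX SX UW SW UR SR from MAS3 */
} tlb_entry_t;

/* Decode the 32-bit MAS1/MAS2/MAS3 values as printed by the debugger. */
void decode_mas(uint32_t mas1, uint32_t mas2, uint32_t mas3, tlb_entry_t *out)
{
    assert(out != NULL);
    out->valid   = (mas1 & 0x80000000u) != 0u;
    out->iprot   = (mas1 & 0x40000000u) != 0u;
    out->tid     = (mas1 >> 16) & 0x3FFFu;
    out->ts      = (mas1 >> 12) & 0x1u;
    out->page_kb = 1u << ((mas1 >> 7) & 0x1Fu); /* assumed TSIZE encoding */
    out->epn     = mas2 & 0xFFFFF000u;
    out->wimge   = mas2 & 0x1Fu;
    out->rpn     = mas3 & 0xFFFFF000u;
    out->perms   = mas3 & 0x3Fu;
}
```

Under these assumptions, 80011100 01812004 0618203F decodes to a valid 4 KB entry with TS=1, TID=1, only the M bit set in WIMGE, and all six permission bits set, which agrees with the columns in the dump.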

One thing I discovered while reproducing my results to collect this data: if L1 is disabled on all cores, the code completes as expected.

yipingwang · ‎08-29-2022

Refer to EREF_RM.pdf from www.nxp.com.
This document does not mention the attributes 'Way' and 'Set'. What is the purpose of setting these attributes?
Our suggestion is to remove these attributes and try running the code again.
It is possible that they affect the MMU settings and that this is why the cores get stuck.

rweiss · ‎08-29-2022

yipingwang: Check the e6500 core reference manual. The L2 TLB array TLB0 is a 1024-entry, 8-way set-associative unified array, i.e. it has sets and ways like a cache.

yipingwang · ‎09-04-2022

We wanted to rule out any issue with MMU settings.
If you had a CodeWarrior project for this scenario, we could have made some changes to that code and observed their effects.
Since that is not possible, we will create test code for this scenario in the CodeWarrior tool. This might take some days on our side.

yipingwang · ‎10-09-2022

I am not able to reproduce this issue.

A 32-bit write from one of the cores is observed by the other core. I have written sample code in CodeWarrior_Power_PC; the zip file is attached.

Core 0 checks in a loop whether the data at a DDR location is 0x1.

Core 1 writes data 0x1 to that DDR location after some delay.

Both the cores are running simultaneously.

After data is changed it prints 'Success'.

Please make a similar CW program if there is any additional query.

Also, the Linux kernel supports symmetric multi-processing (SMP) therefore a 32-bit write from one of the cores is observed by the other core.

Steps to run:

1. Install CodeWarrior Suite for Power Architecture.
2. There is a .cproject file in the t2_main folder. Click on it to open the CW project.
3. Run Debug configuration 't2_main_RAM_core00_T2080_Download' and 't2_main_RAM_core00_T2080_Download'.
4. Select 'multicore resume' and run both cores.

roffelsen · ‎10-10-2022

I can not see the attached zip file.

yipingwang · ‎10-10-2022

Please refer to the attachment.

roffelsen · ‎10-11-2022

I don't have CodeWarrior so I can't provide you with a working example. What I can tell you is that the code you wrote did not have something like interferenceMemory written just before core 2 writes to *addr and then written in a tight loop after core 2 writes to *addr. As I documented, without either of these writes to interferenceMemory the code I provided completes as expected. From the outside it appears the processor continually accelerates the writes to interferenceMemory ahead of the write to *addr because there was a write pending to the same cache line prior to the write to *addr.
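
To make the access pattern concrete, here is a host-side sketch using C11 atomics and POSIX threads. It only shows the ordering of the stores described above (one interference write just before the flag write, then interference writes in a tight loop on both sides); it is not expected to reproduce the hang on a desktop machine, and it uses release/acquire atomics rather than the explicit sync variants from the original sample. The names flag, ack and run_access_pattern are hypothetical, and ack exists only so the writer thread can terminate.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

static _Atomic uint32_t flag = 0; /* plays the role of *addr */
static _Atomic uint32_t ack = 0;  /* hypothetical: tells the writer to stop */
static _Alignas(64) _Atomic uint8_t interference[64]; /* its own cache line */

/* Writer: interference store, then the flag store, then a tight loop
 * of interference stores while waiting. */
static void *writer(void *arg)
{
    (void)arg;
    atomic_store_explicit(&interference[0], 0xDE, memory_order_relaxed);
    atomic_store_explicit(&flag, 1u, memory_order_release);
    while (atomic_load_explicit(&ack, memory_order_acquire) == 0u) {
        atomic_store_explicit(&interference[1], 0xDE, memory_order_relaxed);
    }
    return NULL;
}

/* Reader: polls the flag, storing to the interference line on each pass. */
static void *reader(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) != 1u) {
        atomic_store_explicit(&interference[1], 0xDE, memory_order_relaxed);
    }
    atomic_store_explicit(&ack, 1u, memory_order_release);
    return NULL;
}

/* Runs one writer and one reader; returns 0 once both have finished. */
int run_access_pattern(void)
{
    pthread_t w, r;
    if (pthread_create(&r, NULL, reader, NULL) != 0) {
        return -1;
    }
    if (pthread_create(&w, NULL, writer, NULL) != 0) {
        atomic_store(&flag, 1u); /* release the reader so it can exit */
        pthread_join(r, NULL);
        return -1;
    }
    int rc_w = pthread_join(w, NULL);
    int rc_r = pthread_join(r, NULL);
    assert(rc_w == 0 && rc_r == 0);
    (void)rc_w;
    (void)rc_r;
    return atomic_load(&flag) == 1u ? 0 : -1;
}
```

A test on the T2080 would need the same three elements: the interference store immediately before the flag store, and the interference stores inside both spin loops.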

roffelsen · ‎08-29-2022

The MMU data I sent you was collected using a Lauterbach ICD at the point the problem is observed. While the EREF_RM does not mention way/set, the e6500 manual does. All of the TLB entries provided are from TLB0. According to the e6500 manual, for TLB0 the set is determined by the virtual address and the way is selected using MAS0[NV]. The OS I am using uses the e6500's built-in round-robin replacement for TLB0 and auto update of MAS0[NV] on a TLB error interrupt.