How to use the D cache in an MPC5748G multi-core MCU


peter_vranken
Contributor IV

Dear NXP Team,

I'm starting development with an MPC5748G device. The cores are running
with the I and D caches enabled. I would like to understand which options
I have for implementing inter-core communication.

1) The easiest way seems to be using dedicated uncached memory areas plus
some critical section mechanism (using the hardware semaphores) or memory
barriers (using mbar), depending on the kind of data flow. This sounds
straightforward, but where are the unexpected pitfalls?
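
To make this concrete, here is a minimal sketch of the gate-based
critical section I have in mind. The semaphore base address, the gate
register layout, and the core number are placeholders to be checked
against the reference manual:

#include <stdint.h>

/* Placeholder address of one gate register of the hardware semaphore
   module; the real base address and register layout need to be taken from
   the reference manual. */
#define SEMA42_GATE0    (*(volatile uint8_t *)0xFC03C000u)

/* Logical number of the core this code runs on; a placeholder, too. */
#define MY_CORE_ID      0u

/* A gate is locked by writing the core number plus one and confirming by
   read-back. */
#define MY_LOCK_VALUE   ((uint8_t)(MY_CORE_ID + 1u))

static inline void enterCriticalSection(void)
{
    /* The gate accepts the write only if it was unlocked; spin until the
       read-back confirms ownership. */
    do
        SEMA42_GATE0 = MY_LOCK_VALUE;
    while(SEMA42_GATE0 != MY_LOCK_VALUE);
}

static inline void leaveCriticalSection(void)
{
    /* Only the owning core's write of zero releases the gate. */
    SEMA42_GATE0 = 0u;
}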

2) My understanding of the D cache is that a writing core puts the data
into its cache and into the main memory behind it at the same time. If
this is right, then it would become possible to safely implement a
uni-directional data flow through a shared memory, which is used with D
cache on the producer side and without cache on the consumer side. Right?

Is the secondary store into main memory done immediately, in the same
write cycle as the update of the cache contents? Or is the secondary
store subject to some buffer-and-flush strategy, so that the ordering of
the stores in main memory could differ from the ordering of the primary
stores in the cache? This concern leads to the next question:

If we use a memory-barrier-based notification (e.g. first update the
payload data, then place a barrier, finally update the notification
flag), will the guarantee that the CPU first completely writes the data
and only then writes the flag still hold for the main memory (i.e. the
secondarily written storage)?
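
Expressed in code, the intended pattern looks like this minimal sketch.
All names are illustrative, and whether a plain mbar suffices here is
part of my question:

#include <stdint.h>

/* Shared objects, assumed to be visible to both cores. */
extern volatile uint32_t sharedPayload[16];
extern volatile uint32_t notificationFlag;

void publish(const uint32_t newData[16])
{
    unsigned int i;
    for(i = 0; i < 16; ++i)
        sharedPayload[i] = newData[i];

    /* The barrier shall ensure that all payload stores have become
       visible before the store to the flag below. */
    asm volatile ( "mbar\n\t" ::: "memory" );

    /* Only now raise the notification for the consumer. */
    notificationFlag = 1;
}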

3) The tight coupling of cores and memory through the crossbars tempts
one to consider the complete RAM space as a shared memory. I wonder if
this can be implemented with the D cache on. Is there a hardware
mechanism that notifies the cache of core A that its contents have become
invalid because of a write to the corresponding addresses by core B? Is
there otherwise a software way of notifying (B to A) or of invalidating
the other core's cache?

4) Does it make any difference if DMA is used to write data into RAM?
Will a cached read of the DMA destination address area fail? Or is there
a hardware mechanism that invalidates the cache of the reading core so
that it really gets the DMA-written information? Or is the concept that a
core itself invalidates its own cache after it has received the
DMA-complete interrupt but prior to reading the DMA-written contents?

If so, this concept could be implemented for core-to-core communication,
too, using a software interrupt. Right? Is this a typical or even
recommended way to do it?
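
For illustration, such a self-invalidation helper might look like the
following sketch. The cache line size is an assumption to be checked in
the reference manual, and dcbi is a supervisor-level instruction:

#include <stdint.h>

/* Assumed D cache line size; to be checked in the reference manual. */
#define CACHE_LINE_SIZE 32u

/* Invalidate all D cache lines that overlap [address, address+noBytes).
   To be called e.g. in the DMA-complete interrupt prior to reading the
   received data. */
static void invalidateDCacheRange( const volatile void *address
                                 , uint32_t noBytes
                                 )
{
    uint32_t addr = (uint32_t)address & ~(CACHE_LINE_SIZE - 1u);
    const uint32_t end = (uint32_t)address + noBytes;
    while(addr < end)
    {
        /* dcbi invalidates the data cache block containing the address. */
        asm volatile ( "dcbi 0, %0\n\t" :: "r" (addr) : "memory" );
        addr += CACHE_LINE_SIZE;
    }

    /* Wait for completion before any subsequent load. */
    asm volatile ( "msync\n\t" ::: "memory" );
}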

5) Any further hints? Is there specific documentation available on these
topics?

Best regards

Peter


lukaszadrapa
NXP TechSupport

Hi,

1. The first important point is that there is no hardware support for cache coherency. Some devices have a Cache Coherency Unit, but this is not the case for the MPC5748G.
So, the best way is to have a cache-inhibited area in RAM for the shared resources (a small placement example follows below this list). This will save a lot of effort. Otherwise it would be necessary to invalidate the cache every time before reading, because the RAM could have been changed by another core or by DMA.
2. The cache supports only write-through mode, so the uni-directional data flow is possible. There won't be coherency issues in this case.
The write to RAM is not done in the same cycle as the write to the cache. Search for "late-write buffer" in the reference manual for more details.
3. As mentioned above, there is no cache coherency unit.
4. It doesn't matter if the data are changed by another core or by DMA; the coherency situation is the same in both cases.
5. I'm not aware of such documentation.
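
As an illustration of point 1, the shared objects can be collected in a
dedicated section, which the linker file then places into the RAM range
that has been configured as cache-inhibited. A minimal sketch, assuming
GCC and an example section name:

#include <stdint.h>

/* The linker file must map ".uncached_data" to the cache-inhibited RAM
   range. */
#define UNCACHED __attribute__((section(".uncached_data")))

/* All inter-core shared objects go into the cache-inhibited area. */
UNCACHED volatile uint32_t sharedCounter;
UNCACHED volatile uint8_t  coreReadyFlag[3];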

Regards,
Lukas

1045302770
Contributor III

Hi Lukas,

I would like to know why '4. It doesn't matter if the data are changed by another core or by DMA'.

In my project, I write the data periodically on core1 (with no cache) and read them continuously on core0 (with the cache enabled). I found that core0 does get the updated data. I wonder why core0's cached copy is updated when core1 writes; is there any hardware mechanism that updates it? Is there any hint about this in the reference manual?

Looking forward to your reply!

Best regards

Victor


peter_vranken
Contributor IV

Hi Lukas,

Thanks a lot for your clarifying response. In the meantime I continued by
comparing the store with reservation against the decorated stores for
signaling purposes and found the latter advantageous:

- More transparent documentation of the behavior with respect to cache and
  other cores
- Very lean in use
- Using GCC inline assembly, the instructions integrate very well into the
  emitted machine code

By the way, a minor disadvantage is the S32DS disassembler, which doesn't
recognize all of these instructions and displays them as raw byte code
rather than as instructions. This is likely due to the fact that the EREF
document doesn't specify the cache-bypass variants of the instructions.

Although it works well so far, some questions arise:

- For the sake of simplicity of use I would prefer to only use the cache
  bypassing instructions. If these instructions are used in a cache
  inhibited memory region, then the normal instruction (without "cb")
  should be equivalent. Question: Does the use of the cache bypassing
  instruction have a disadvantage in this case? Will it e.g. take more
  clock ticks to complete?
- I found nothing said about the memory ordering aspects of these
  operations. How do they relate to normal loads and stores? My main
  intention is the construction of mutexes and semaphores, so will I need
  to combine the decorated load and store instructions with an additional
  "msync" to get a safe critical section?
- I'm building some intrinsic inline functions for embedding the
  instructions in C code (example below). Question: Does a well-elaborated
  collection of such functions already exist?

Regards

Peter

Sample code:

#include <stdint.h>

/**
 * XOR an 8 Bit word with specified operand at a given address in an atomic and uncached
 * way. After return from this function the modified word will really be in main memory
 * from where it can be read from the same or another core using std_loadByteAtomic().\n
 *   This function can be used to implement the release of a set of mutexes in a vector of
 * up to 8 mutexes.
 *   @param operand
 * The operand of the XOR. For the principal use case of a mutex implementation this word
 * would contain a 1 bit for all previously acquired and now released mutexes and 0 bits
 * anywhere else.
 *   @param address
 * The address in main memory at which to apply the XOR operation. It doesn't matter
 * whether the memory region is cached or cache inhibited.
 */
static inline void std_xorByteAtomic(uint8_t operand, uint8_t *address)
{
    /* Decoration for Logical Exclusive-OR (XOR), see RM 18.3.1.1.3. */
    const uint32_t decoration = 0xf0000000;
    
    /* See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html */
    asm volatile ( /* AssemblerTemplate */
                   "stbdcbx %0, %1, %2\n\t"
                 : /* OutputOperands */
                 : /* InputOperands */ "r" (operand), "r" (decoration), "r" (address)
                 : /* Clobbers */ "memory"
                 );

} /* End of std_xorByteAtomic */
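
/**
 * Full memory barrier. std_fullMemoryBarrier() is used by
 * mtx_releaseMutex() below. This is a minimal sketch, assuming that
 * "msync" is the appropriate barrier; whether a weaker barrier would
 * suffice is one of the open questions above.
 */
static inline void std_fullMemoryBarrier(void)
{
    /* msync waits for all preceding memory accesses to complete before
       any subsequent memory access is performed. */
    asm volatile ( "msync\n\t" ::: "memory" );

} /* End of std_fullMemoryBarrier */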

/**
 * Release a previously acquired, single mutex.
 *   @param pMutex
 * The mutex object by reference.
 *   @remark
 * This function is meant as a basic implementation only. It doesn't perform any checks. If a
 * mutex is released which had not been successfully acquired before using
 * mtx_acquireMutex() then the mutual exclusion functionality will fail.
 */
static inline void mtx_releaseMutex(mtx_mutex_t *pMutex)
{
    /* A memory barrier here ensures that all operations inside the critical section, which
       is guarded by the mutex, have completed before we return the mutex. */
    std_fullMemoryBarrier();
    
    /* We can unconditionally reset the bit: We own the mutex and nobody else will
       interfere on the given bit. */
    std_xorByteAtomic(/* operand */ 0x01, /* address */ (uint8_t*)pMutex);
    
} /* End of mtx_releaseMutex */
