Peter Vranken

MPC5775B/E, Experiences with D-Cache and asynchronous MC Exceptions from MPU

Discussion created by Peter Vranken on Jan 27, 2020

Here are some experiences from migrating an OS kernel from MPC5643L
(e200z4) to MPC5775B/E (e200z7), which are hopefully of general interest.
Note, we can't give any guarantee that everything said below is fully
correct, so please feel free to correct and comment wherever appropriate.

 

The major differences between the two cores or MCUs are the presence of
the data cache only in the z7, the additional store buffers of the z7 and
the MPUs. The MPC5775B/E has two MPUs and they are differently connected
inside the MCU.

 

1) MPU

 

The reference manual (RM) of the MPC5775B/E tells little or nothing about
how the two MPUs cooperate. Actually, they seem to listen solely to those
addresses which belong to the devices behind them. Access to any other
address in the 32-bit address space is not affected at all and, in
particular, cannot be protected by the MPUs.

 

Each MPU has one half of the RAM space and one peripheral bridge behind
it. Further details are well-documented in the RM. A memory area
descriptor needs to be written to the MPU in whose scope the area lies.
To grant access to the entire I/O space, one would have to put an area
descriptor in each of the MPUs.

 

If an area descriptor relates to a portion of RAM then one has to look at
the start and end address of the area: The area can lie entirely within
the scope of MPU 1, entirely within the scope of MPU 2, or partly within
both scopes. So we have to put either one complete area descriptor into
one of the MPUs or two descriptors of partial areas, one into each MPU.
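
The case distinction for RAM areas can be sketched in C as follows. This
is only an illustration: MPU_SPLIT_ADDR, AreaDesc and splitArea are
hypothetical names, and the address at which the scope changes from one
MPU to the other is an assumed value that has to be replaced by the one
from the RM.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: first RAM address that belongs to the scope of MPU 2.
   The real value must be taken from the reference manual. */
#define MPU_SPLIT_ADDR 0x40040000u

/* One memory area, given by its inclusive start and end address. */
typedef struct { uint32_t start, end; } AreaDesc;

/* Split the area [start, end] into the parts that have to be written to
   MPU 1 and MPU 2. The return value is a bit mask: bit 0 set means *mpu1
   is valid, bit 1 set means *mpu2 is valid. */
static unsigned splitArea(uint32_t start, uint32_t end,
                          AreaDesc *mpu1, AreaDesc *mpu2)
{
    unsigned used = 0u;
    assert(start <= end);
    if (start < MPU_SPLIT_ADDR) {
        /* Area begins in the scope of MPU 1; clip it at the boundary. */
        mpu1->start = start;
        mpu1->end = end < MPU_SPLIT_ADDR ? end : MPU_SPLIT_ADDR - 1u;
        used |= 1u;
    }
    if (end >= MPU_SPLIT_ADDR) {
        /* Area ends in the scope of MPU 2; clip it at the boundary. */
        mpu2->start = start >= MPU_SPLIT_ADDR ? start : MPU_SPLIT_ADDR;
        mpu2->end = end;
        used |= 2u;
    }
    return used;
}
```

A return value of 3 is the case of two partial descriptors, one for each
MPU; 1 or 2 means that a single complete descriptor is sufficient.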

 

The area descriptors have dedicated bit fields to specify the access
separately for the two z7 cores of the MPC5775B/E, so it's quite
transparent how to handle the dual-core issue with respect to MPU (in
factory state, the MPC5643L has just one logical core).

 

RAM and peripheral bridges sit behind the MPUs and access can be managed
by the MPUs but it is not possible to use the MPU to restrict accesses to
the ROM. If ROM access should be restricted then the MMU needs to be used
instead. While easy in configuration and handling, the MMU has the
disadvantage of supporting only power-of-two sized areas, where the MPU
supports areas of arbitrary size.
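
To illustrate what the power-of-two restriction means in practice, here
is a small C sketch that covers an arbitrary address range with naturally
aligned power-of-two areas. Pow2Area and coverWithPow2Areas are
hypothetical names. The sketch assumes that every area must be aligned to
its own size and it ignores the minimum page size and the limited number
of MMU entries, which both need to be checked against the core manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One power-of-two sized area, aligned to its own size. */
typedef struct { uint32_t start, size; } Pow2Area;

/* Cover [start, start+size) with as few aligned power-of-two areas as
   the greedy choice yields. Returns the number of areas written to out. */
static size_t coverWithPow2Areas(uint32_t start, uint32_t size,
                                 Pow2Area out[], size_t maxAreas)
{
    size_t n = 0;
    while (size > 0u) {
        /* Largest power of two the current start address is aligned to
           (the lowest set bit of the address). */
        uint32_t chunk = start != 0u ? (start & (0u - start)) : 0x80000000u;
        /* Shrink it until it fits into what is left to cover. */
        while (chunk > size)
            chunk >>= 1;
        assert(n < maxAreas);
        out[n].start = start;
        out[n].size = chunk;
        ++n;
        start += chunk;
        size -= chunk;
    }
    return n;
}
```

For example, a 96 KiB region starting at 0x00010000 needs two areas in
this model (64 KiB plus 32 KiB), where a single MPU area descriptor would
do.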

 

2) D-Cache and Store Buffers

 

All experiments have been made with the D cache in write-through mode. All
the CPU's stores go into the corresponding cache line and at the same time
into the next store buffer, from where they are written onto the bus as
soon as the bus traffic permits and as soon as the particular store buffer
is the next one to be served. (The store buffers form a FIFO: they are
served in the order in which they were filled.)

 

1. Problem: The actual write to the RAM happens significantly later than
the CPU store instruction (can be dozens of instructions later). If we use
MPU access protection for RAM address space then we see a potential access
violation exception ("Machine Check" exception, or IVOR #1) at a time,
when it is practically impossible to still relate the exception to the
particular failing instruction. The failure information stored in the MPU
is useful for debugging but not sufficient to make this safely possible.

 

This effect complicates the implementation of a failure handler but
doesn't affect the normal execution of an error-free program. The second
problem, however, does.

 

2. Problem: The store buffers keep track of the address and value to write
and of the problem state of the CPU at the time of writing (supervisor vs.
user mode). Implicitly, since the buffers belong to the core itself, they
also carry the information which core did the store. Unfortunately, they
do not store the information under which process ID (core's register PID0)
the store was made (see Core Reference Manual, section 11.9, p. 613).

 

"Unfortunately", because the process ID is, like all the other mentioned
pieces of information, an attribute of the MPUs' area descriptors. For the
access permission check, the MPU won't refer to the process ID at the time
of the store instruction but always to the current one. Due to this
limitation of the store buffers, a write may appear to the MPU, at the
time it is checked, different from how it was actually issued by the CPU's
store instruction (namely if the contents of CPU register PID0 have
changed in the meantime). If process ID based memory access control is
used, then it may easily happen that the MPU raises an exception on a
perfectly correct memory access.

 

3. Problem: As said, a CPU store instruction writes to both the D cache
and a store buffer. The time delay between the store instruction and the
MPU access check means a significant time span during which the CPU is
exposed to corrupted, invalid memory contents, despite the presence of an
MPU: the cache contents hold the result of the forbidden store but the MPU
has not yet detected or reported it.

 

Although this sounds quite critical, for our particular kernel only a few
machine instructions needed to be changed for successful migration from
MPC5643L without data cache to MPC5775B/E with cache; it's all about
synchronization.

 

The instruction mbar has the effect of flushing all store buffers. It
completes only when all store buffers have been written to the memories or
peripheral registers (at least to the extent that it is decided that the
write will succeed without exception).

 

The mbar instruction solves problem 2. - just place an mbar immediately in
front of all instructions, which may alter register PID0, effectively only
an mtspr 48,rX.
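
The 2nd problem and the effect of the mbar can be reproduced with a small
host-side model in C. It is a toy simulation, not target code: all names
(Core, store, drain, setPid, runScenario), the number of store buffers and
the MPU rule are assumptions made for the illustration. The only property
taken from the description above is that a buffered store carries no
process ID and is checked against the PID0 value that is current when the
buffer is written out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_STORE_BUFFERS 8   /* hypothetical depth of the FIFO */

/* A buffered store: note that there is no field for the process ID. */
typedef struct { uint32_t addr, value; } BufferedStore;

typedef struct {
    BufferedStore fifo[NUM_STORE_BUFFERS];
    size_t count;
    uint32_t pid0;            /* models the core's register PID0 */
    unsigned violations;      /* MPU access violations seen so far */
} Core;

/* Toy MPU rule: 0x1000..0x1FFF may only be written with process ID 1. */
static bool mpuPermits(uint32_t addr, uint32_t pid)
{
    if (addr >= 0x1000u && addr <= 0x1FFFu)
        return pid == 1u;
    return true;
}

/* The store instruction only queues the write. */
static void store(Core *c, uint32_t addr, uint32_t value)
{
    assert(c->count < NUM_STORE_BUFFERS);
    c->fifo[c->count].addr = addr;
    c->fifo[c->count].value = value;
    ++c->count;
}

/* Models mbar (or a later spontaneous drain): every pending write is now
   checked by the MPU against the *current* PID0. */
static void drain(Core *c)
{
    for (size_t i = 0; i < c->count; ++i)
        if (!mpuPermits(c->fifo[i].addr, c->pid0))
            ++c->violations;
    c->count = 0;
}

/* Models "mtspr 48,rX", optionally preceded by an mbar. */
static void setPid(Core *c, uint32_t pid, bool mbarFirst)
{
    if (mbarFirst)
        drain(c);
    c->pid0 = pid;
}

/* Process 1 does a perfectly legal store, then PID0 is changed to 2.
   Returns the number of violations the MPU reports. */
static unsigned runScenario(bool mbarFirst)
{
    Core c = { .count = 0, .pid0 = 1u, .violations = 0u };
    store(&c, 0x1000u, 42u);
    setPid(&c, 2u, mbarFirst);
    drain(&c);                /* at the latest, the buffers drain now */
    return c.violations;
}
```

In this model runScenario(false) reports one violation for a legal store,
while runScenario(true) reports none.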

 

The other problems can't be generally solved, but the solution for our
particular kernel will still have some generality. A solution was possible
because the kernel doesn't claim to somehow repair the operation of
failing code, i.e. code, which leads to an exception. For a repair we
would need to know all details about the failing instruction, which is not
possible. Instead, the kernel just wants to recognize failing processes
in order to stop them - like a Windows "Abnormal Program Termination" or a
Unix "Core Dump". And it must of course ensure that it can't harmfully
affect any other process (see 3rd problem).

 

The solution rests on the fact that switching between different processes
only happens under control of the kernel itself; we simply know when this
happens. Effectively, a switch will always be initiated by an interrupt,
mainly an External Interrupt (IVOR #4) or a system call (IVOR #8). The
mbar instruction is needed at the beginning of both interrupt handlers:
It completes all memory write operations which are still pending in the
store buffers and which could still cause an exception. If one of the
flushed store buffers should indeed cause an exception then the mbar
instruction is the very instruction that is preempted by the Machine
Check or IVOR #1 exception. If it is placed on entry into the interrupt
handler then no process switch has taken place yet and the MC exception
handler knows that the faulty process is the (still) current one.
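
Why the position of the mbar matters for identifying the failing process
can be shown with another toy model in C. Again, this is a host-side
sketch under simplifying assumptions, not kernel code: Kernel,
interruptHandler, whoGetsBlamed and the process numbers 7 and 8 are made
up, and the model lets the process switch proceed even after a Machine
Check, where a real kernel would rather stop the failing process.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_PROCESS (-1)

typedef struct {
    unsigned pendingViolations; /* forbidden stores still in store buffers */
    int currentProcess;         /* process the kernel considers running */
    int blamedProcess;          /* conclusion of the Machine Check handler */
} Kernel;

/* Models the IVOR #1 handler: all it can do is blame the current process. */
static void machineCheck(Kernel *k)
{
    k->blamedProcess = k->currentProcess;
}

/* Models the store buffers being written out, either forced by an mbar or
   some time later on their own. A forbidden store raises a Machine Check. */
static void drainStoreBuffers(Kernel *k)
{
    if (k->pendingViolations > 0u) {
        k->pendingViolations = 0u;
        machineCheck(k);
    }
}

/* Models an IVOR #4 or #8 handler that ends up switching the process. */
static void interruptHandler(Kernel *k, int nextProcess, bool mbarOnEntry)
{
    if (mbarOnEntry)
        drainStoreBuffers(k);   /* mbar as the very first instruction */
    k->currentProcess = nextProcess;
}

/* Process 7 has issued a forbidden store, then an interrupt switches to
   process 8. Returns the process that is blamed for the violation. */
static int whoGetsBlamed(bool mbarOnEntry)
{
    Kernel k = { 1u, 7, NO_PROCESS };
    interruptHandler(&k, 8, mbarOnEntry);
    drainStoreBuffers(&k);      /* at the latest, the buffers drain now */
    return k.blamedProcess;
}
```

Without the mbar the innocent process 8 is blamed in this model; with the
mbar on entry it is process 7, which actually issued the store.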

 

Because of the 3rd problem, it is most important that the mbar instruction
is placed prior to the first RAM access (either load or store). All
contents of the cache need to be doubted until all exceptions have been
handled, which is guaranteed only after the mbar.

 

Note, it's not a contradiction that RAM load instructions before the mbar
may suffer from corrupted cache contents: those would still belong to the
same failing process, which is going to die anyway as a consequence of the
exception it caused.

 

We demanded an mbar on entry into IVOR #4 and #8, but actually, most other
CPU interrupts are also candidates for a process switch and we need to
place an mbar there, too. Imagine a process contains an instruction that
causes an MMU page fault. It'll branch e.g. into the IVOR #13 exception.
This means a process switch from the failing user process to the kernel
process. If the failing instruction in the process being left was preceded
by a load or store causing an MPU access violation, then the exception
raised by the MPU would come significantly after entry into the IVOR #13
exception. Only with the help of the synchronizing mbar can the kernel
safely decide that both exceptions originate from the user process. (Note
that the IVOR #13 still precedes the MPU exception, but the kernel can
recognize that the very first instruction of the #13 handler, the mbar,
has been preempted and can take this as an indication that the current
user process failed.)

 

Moreover, even the handler for the MPU exceptions, the Machine Check or
IVOR #1 handler, requires an mbar on entry and for the same reason: The
failing process may have issued several access violating store
instructions in quick succession (think about a memcpy to a protected
region). Without the mbar the MPU would raise a further exception while
the IVOR #1 handler is still in the middle of processing the first one.
(And this cannot be tackled by just keeping bit MSR[MC] zero for a while:
see side-effect of write access violation, i.e. 3rd problem.)

 

By the way, if we have an mbar instruction on entry into all exception
handlers anyway, then we no longer need those in front of the instruction
mtspr 48,rX to solve the 2nd problem: The critical instruction to alter
register PID0 is surely not used outside of any of these handlers and the
mbar instruction on entry into the handlers already fulfills the demand of
not having a store buffered across a change of PID0.

 

With the mbar on entry into the exception handlers we solve the first two
problems, but the 3rd problem still remains.

 

A good deal of the 3rd problem is already solved with the same mbar
instructions on entry into the exception handlers: After the mbar at the
latest, we can be sure that there are no more pending exceptions which
originate from the user process being left. However, this process may have
issued bad stores, which corrupted parts of the data cache. Fortunately,
if so, then (and again because of the mbar) we surely got into the MC
handler, IVOR #1. Here, and necessarily before we use the possibly
corrupted RAM for the very first time, we can invalidate the cache.
 

Consider that the MPU may report an access violation much later than
desirable but it still ensures that the bad load or store doesn't have an
effect behind the MPU; neither an I/O device's register nor the main
memory contents are touched. By invalidating the cache, we force the cache
management to fetch the same data anew from main memory at next use
(whenever that will be). And in main memory, it has surely not been
corrupted.

 

The MPUs report the address of the access violation, but this information
is incomplete and must not be used to invalidate only the corresponding
cache line. Several store buffers can have corrupted many different cache
lines at a time and only the most recent failure is reported by the MPUs.
It's unavoidable to invalidate the entire D cache contents.
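
The following toy model in C shows why invalidating only the reported
cache line is not enough. It is a host-side sketch under assumptions: one
word per cache line, four words of RAM, an MPU that forbids writes to
words 2 and 3, and made-up names (System, store, mbar, load,
protectedDataIntact). What it takes from the text is only that a store
updates the cache immediately, that the MPU blocks the write to RAM later
and that the MPU remembers just the most recent failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_WORDS 4               /* toy RAM, one word per cache line */

typedef struct {
    uint32_t ram[NUM_WORDS];
    uint32_t cache[NUM_WORDS];
    bool     valid[NUM_WORDS];
    struct { size_t addr; uint32_t value; } fifo[NUM_WORDS];
    size_t   pending;             /* number of filled store buffers */
    bool     mpuFault;            /* MPU has seen a violation */
    size_t   mpuFaultAddr;        /* only the most recent one is kept */
} System;

/* Toy MPU rule: words 2 and 3 are write protected. */
static bool mpuPermitsWrite(size_t addr) { return addr < 2; }

/* Write-through store: the cache is updated at once, RAM only later. */
static void store(System *s, size_t addr, uint32_t value)
{
    assert(addr < NUM_WORDS && s->pending < NUM_WORDS);
    s->cache[addr] = value;
    s->valid[addr] = true;
    s->fifo[s->pending].addr = addr;
    s->fifo[s->pending].value = value;
    ++s->pending;
}

/* Models mbar: all store buffers are written out, the MPU blocks the
   forbidden ones and records the address of the latest failure. */
static void mbar(System *s)
{
    for (size_t i = 0; i < s->pending; ++i) {
        size_t a = s->fifo[i].addr;
        if (mpuPermitsWrite(a)) {
            s->ram[a] = s->fifo[i].value;
        } else {
            s->mpuFault = true;
            s->mpuFaultAddr = a;
        }
    }
    s->pending = 0;
}

/* Load through the cache; an invalid line is refilled from RAM. */
static uint32_t load(System *s, size_t addr)
{
    assert(addr < NUM_WORDS);
    if (!s->valid[addr]) {
        s->cache[addr] = s->ram[addr];
        s->valid[addr] = true;
    }
    return s->cache[addr];
}

/* Two forbidden stores, then either the whole cache or only the reported
   line is invalidated. Returns true if both words read back correctly. */
static bool protectedDataIntact(bool invalidateEntireCache)
{
    System s = { .ram = { 10u, 11u, 12u, 13u } };
    store(&s, 2, 0xBADu);
    store(&s, 3, 0xBADu);
    mbar(&s);
    assert(s.mpuFault);
    if (invalidateEntireCache) {
        for (size_t i = 0; i < NUM_WORDS; ++i)
            s.valid[i] = false;
    } else {
        s.valid[s.mpuFaultAddr] = false;
    }
    return load(&s, 2) == 12u && load(&s, 3) == 13u;
}
```

With only the reported line invalidated, word 2 still reads back the
forbidden value from the cache in this model; invalidating everything
restores both words from the untouched RAM.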

 

Note, saving the context in the IVOR #1 handler without using the RAM is
no contradiction. Only two registers (CR and a GPR) are needed to do the
cache invalidation and these can be saved e.g. in USPR registers or in the
(redundant) GPRs 2 and 13. Or a cache line may be locked. Or unaffected,
uncached RAM may be defined.

 

After forcing all possibly pending exceptions to occur and having
invalidated the D cache in case of MPU exceptions, we have caught and
repaired all failures and can continue in the usual way.

 

Summarizing:

 

- Each MPU only protects the address space of one peripheral bridge (i.e.
  half the I/O devices) and half the RAM address space
- RAM and I/O protection rules (memory area descriptors) need to be
  written to the corresponding MPU
- ROM access is restrictable by MMU only
- MPU access violation exceptions are significantly delayed and can't be
  related to a particular instruction
- The mbar instructions on entry into the exception handlers ensure that
  all possibly pending exceptions which originate from the preempted
  process being left have been raised before the mbar completes
- After the mbar, all raised MPU exceptions will have been handled by
  invalidation of the possibly corrupted D cache
- After the mbar, all side-effects of the exception causing instructions
  are either suppressed (trapped by the CPU, blocked by the MPU) or
  eliminated (D cache invalidation)
- The failing process is safely identified despite the delayed exception
