SRR1 error in Exception of MPC8270

poloyellow · 05-26-2017

To whom it may concern,

Thank you for your attention to this letter. I have some questions about the MPC8270 and need your help.

On the MPC8270, we designed a multitask application consisting of the following tasks:

-Task A: runs periodically, issues a PCI read request, and sends a heartbeat to task B. In the failure scenario the PCI device is turned off, so the read causes the TEA signal to be asserted; task A then triggers a machine check exception (0x200) and cannot send the heartbeat.

-Task B: receives the heartbeat and, when the heartbeat is not received, deletes and re-creates task A.

-Interrupt: a periodic 250 ms timer interrupt triggers the external interrupt exception (0x500).

Because the 0x200 exception is asynchronous, it occurs within several instructions after the PCI read instruction. In the failure case, we observed that the machine check exception was triggered a very short time (less than 200 ns) after the 0x500 exception was triggered. The CPU takes the 0x200 exception and runs the exception handler. In the handler, SRR0 holds one of the instruction addresses of task A, not the address of the external interrupt exception. SRR1, however, holds an incorrect value (0x1000) instead of the MSR state of task A while it was running (the correct value should be 0xB932). Because of the wrong SRR1 value, the system crashes later in execution.

Since 0x200 has a higher priority than 0x500, we suspect that the hardware logic misbehaves when the 0x200 exception preempts the 0x500 exception. This would cause SRR1 to hold the state saved for the 0x500 exception instead of the correct program state.
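One way to see what 0x1000 means is to decode the low 16 bits of the value, which SRR1 receives from the MSR when an exception is taken (the upper SRR1 bits carry exception-specific information and are not decoded here). The sketch below is a hypothetical helper, `decode_msr`, and its bit masks are a reading of the 32-bit PowerPC MSR layout that should be checked against the G2 Core Reference Manual. With these masks, 0xB932 decodes to EE|FP|ME|FE0|FE1|IR|DR|RI, while 0x1000 is ME alone. Assuming MSR[IP] and MSR[ILE] are clear, ME alone is what the MSR looks like right after an exception is taken (EE, IR, DR and RI cleared, ME kept), which is consistent with the suspicion that SRR1 captured the state inside the 0x500 handler.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Low-half MSR bits as assumed here (32-bit PowerPC layout).
 * Verify against the G2 Core Reference Manual before relying on them. */
struct msr_bit {
    uint32_t mask;
    const char *name;
};

static const struct msr_bit msr_bits[] = {
    {0x8000u, "EE"},  {0x4000u, "PR"}, {0x2000u, "FP"}, {0x1000u, "ME"},
    {0x0800u, "FE0"}, {0x0400u, "SE"}, {0x0200u, "BE"}, {0x0100u, "FE1"},
    {0x0040u, "IP"},  {0x0020u, "IR"}, {0x0010u, "DR"}, {0x0002u, "RI"},
    {0x0001u, "LE"},
};

/* Hypothetical helper: writes the names of the set low-half bits,
 * separated by '|', into buf (len must be at least 1). */
static void decode_msr(uint32_t msr, char *buf, size_t len)
{
    size_t used = 0;

    buf[0] = '\0';
    for (size_t i = 0; i < sizeof msr_bits / sizeof msr_bits[0]; i++) {
        if (msr & msr_bits[i].mask) {
            int n = snprintf(buf + used, len - used, "%s%s",
                             used ? "|" : "", msr_bits[i].name);
            if (n < 0 || (size_t)n >= len - used)
                return; /* buffer full: output is truncated */
            used += (size_t)n;
        }
    }
}
```

Calling `decode_msr(0x1000u, buf, sizeof buf)` with a 128-byte buffer leaves "ME" in `buf`.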

We would like to know whether the hardware has the fault we suspect, and what software can do to avoid the problem in the current situation.

We look forward to your feedback and suggestions.

Very truly yours.

Polo Yellow

alexander_yakov · 05-26-2017

The machine check exception is not guaranteed to be recoverable. The G2 Core Reference Manual, Section 5.5.2, says the following:

Note that the G2 core makes no attempt to force recoverability on a machine check;
however, it does guarantee that the machine check exception is always taken immediately
upon request, with a nonpredicted address saved in SRR0, regardless of the current
machine state. Because pending stores in the store queue (see Figure 7-4) are not canceled
when a machine check exception occurs, two consecutive stores that result in the assertion
of core_tea can cause the processor to checkstop. To prevent a checkstop in this case, a sync
instruction must be placed between two stores that can result in assertion of core_tea.

Software can use the machine check exception in a recoverable mode to probe memory. For
this case, a sync, load, sync instruction sequence is used. If the load access results in a
system error (for example, the assertion of core_tea), the processor can handle this in a
recoverable state. If the sync instruction is not used, a second access to the same address as
the first load could cause the processor to enter the checkstop state.
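A minimal sketch of the two rules quoted above, using GCC-style inline assembly for 32-bit PowerPC: a probe read wrapped in sync, load, sync, and a store followed by sync so that two stores that might assert core_tea are never back to back. The function names `probe_read32` and `guarded_write32` are hypothetical, and the `#else` branches exist only so the sketch compiles and can be tested on a non-PowerPC host. The machine check handler side is not shown; it would still need some way to recognize that SRR0 points at the probe before treating the exception as recoverable.

```c
#include <stdint.h>

/* Hypothetical helper: 32-bit read using the sync, load, sync sequence.
 * A TEA on the load raises a machine check exception. */
static inline uint32_t probe_read32(volatile const uint32_t *addr)
{
    uint32_t value;
#if defined(__powerpc__) || defined(__PPC__)
    __asm__ volatile("sync\n\t"
                     "lwz %0,0(%1)\n\t"
                     "sync"
                     : "=r"(value)
                     : "b"(addr) /* "b": any GPR except r0, usable as a base */
                     : "memory");
#else
    /* Host build only: plain volatile read so the sketch can be tested. */
    value = *addr;
#endif
    return value;
}

/* Hypothetical helper: store followed by sync, so consecutive calls always
 * have a sync between two stores that could assert core_tea. */
static inline void guarded_write32(volatile uint32_t *addr, uint32_t value)
{
#if defined(__powerpc__) || defined(__PPC__)
    __asm__ volatile("stw %0,0(%1)\n\t"
                     "sync"
                     :
                     : "r"(value), "b"(addr)
                     : "memory");
#else
    /* Host build only: plain volatile write. */
    *addr = value;
#endif
}
```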

Here is a direct link to this document:

http://www.nxp.com/assets/documents/data/en/reference-manuals/G2CORERM.pdf

Have a great day,
Alexander
TIC

poloyellow · 05-30-2017

I reread Section 5.5.2 of the G2 Core Reference Manual and now understand the implication for SRR0. I think your reply answers my question; thank you very much.

By the way, the e500 core has separate registers, MCSRR0 and MCSRR1, for the machine check exception. Do they guarantee that the saved machine state and the saved address are consistent? Thank you!

poloyellow · 05-30-2017

Hi, I'm glad to receive your reply. I agree that the machine check exception is not guaranteed to be recoverable. But in my opinion the saved exception context should be consistent; that is, SRR0 and SRR1 should be paired. Assume that when the 0x200 exception occurs, the application's PC is 0x1000000 and its machine state (MSR) is 0xB932. Then either SRR0 should be 0x1000000 and SRR1 should be 0x4B930 (for TEA), or, if the external interrupt had already been taken, SRR0 should be 0x500 and SRR1 should be 0x4100. However, in the test case SRR0 is always 0x1000000 and SRR1 is 41000. The two registers are not paired. Is this state normal?
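The pairing argument can be written down as a check. The sketch below is hypothetical: `classify_srr` compares only the low 16 bits of SRR1 (assumed to be copied from the MSR) and ignores MSR[RI] (0x0002), because the low half of the expected value quoted above (0x4B930, i.e. 0xB930) differs from the task MSR 0xB932 only in that bit. It also assumes the handler-entry MSR is ME alone (0x1000, with MSR[IP] and MSR[ILE] clear) and that the caller passes the address range of the external interrupt entry code, for example 0x500 to 0x600. Under those assumptions, SRR0 = 0x1000000 with SRR1 = 0x1000 is reported as a mismatch: the SRR1 half describes a handler context while the SRR0 half points into task A.

```c
#include <stdint.h>

/* Hypothetical classification of a saved SRR0/SRR1 pair. */
enum srr_pairing {
    SRR_PAIR_TASK,     /* SRR1 matches the task MSR, SRR0 outside the vector code */
    SRR_PAIR_HANDLER,  /* SRR1 looks like handler-entry state, SRR0 inside it */
    SRR_PAIR_MISMATCH  /* SRR0 and SRR1 do not describe the same context */
};

#define SRR1_MSR_COPY 0x0000FFFFu /* assumption: low half of SRR1 comes from the MSR */
#define MSR_RI        0x00000002u /* excluded from the comparison */
#define MSR_ME        0x00001000u /* assumed handler-entry MSR (IP = ILE = 0) */

static enum srr_pairing classify_srr(uint32_t srr0, uint32_t srr1,
                                     uint32_t task_msr,
                                     uint32_t vec_lo, uint32_t vec_hi)
{
    const uint32_t keep = SRR1_MSR_COPY & ~MSR_RI;
    const uint32_t saved = srr1 & keep;
    /* Is SRR0 inside the external interrupt entry code [vec_lo, vec_hi)? */
    const int in_vector = srr0 >= vec_lo && srr0 < vec_hi;

    if (saved == (task_msr & keep) && !in_vector)
        return SRR_PAIR_TASK;
    if (saved == (MSR_ME & keep) && in_vector)
        return SRR_PAIR_HANDLER;
    return SRR_PAIR_MISMATCH;
}
```

Such a check could, for example, let a machine check handler refuse to return through SRR0/SRR1 when the two halves disagree, rather than crashing later.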