MPC8641D Bus Error


861 Views
jeffbaker
Contributor I

Hi,

We have a customer experiencing field issues on the MPC8641D that we believe may be caused by a hardware problem.  The issue is critical for them but is extremely difficult to reproduce in a lab.  After many months we have a test scenario that reproduces the failure in about 8 hours.  Here is a brief description from our kernel engineer.  Worth noting: when we disable speculative data prefetch (HID0[SPD]) on each core, the issue can no longer be reproduced.

"With the S0 output, the MSSSR0 register is indicating the TEA signal was asserted, but none of the registers that indicate things that feed into TEA are indicating an error. For the USB1 file, the EDR[LAE] bit is set, but it's occurring very early after a context-synchronizing instruction (the SC instruction that started a kernel call sequence) and there are no instructions anywhere in the vicinity that would cause a reference to the address that the ELADR register is showing."

"S0" and "USB1" are the names of the units involved in the test.

Thanks,

Jeff

0 Kudos
3 Replies

530 Views
williammartin
Contributor I

Was this resolved?  What was the reason for the MSSSR0 TEA bit being set?

0 Kudos

530 Views
LPP
NXP Employee

>"S0" and "USB1"

This is not informative. When the error was reported, apart from MSSSR0 indicating a TEA, was a 0x200 exception specifically reported? That would help narrow it down to L1 or L2 cache activity. At the machine check, examine the associated registers to determine the error source (dump the SRR1, L2ERRDET, and MSSSR0 registers).

Any IP block that can be associated with a read transaction error will assert TEA to the requesting core in the event of a read error. This is independent of any error reporting via MPIC.

The following IP could assert TEA_ for a read response where an error occurred:

MCM - error detect register at h'0_1E00

DDR1 - error detect register at h'0_2E40

DDR2 - error detect register at h'0_6E40

LBC - error detect register at h'0_50B0

PEX1 - error detect register at h'0_8E00

PEX2 - error detect register at h'0_9E00

SRIO - error status registers at h'C_0158, h'C_0608, h'C_060C and h'C_0640

This is based on our understanding of the IP blocks. Any TEA_ assertion should be evident in one of these error detect registers.
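The checklist above can be turned into a small triage helper. The following is only a sketch: the offsets are the ones listed in this reply, while the `dump` dictionary (CCSR offset to 32-bit value) and the function name are hypothetical stand-ins for however your crash handler captures the registers.

```python
# CCSR offsets of the error detect/status registers listed in the reply above.
ERROR_REGISTERS = [
    ("MCM", 0x01E00),
    ("DDR1", 0x02E40),
    ("DDR2", 0x06E40),
    ("LBC", 0x050B0),
    ("PEX1", 0x08E00),
    ("PEX2", 0x09E00),
    ("SRIO", 0xC0158),
    ("SRIO", 0xC0608),
    ("SRIO", 0xC060C),
    ("SRIO", 0xC0640),
]


def flag_error_sources(dump):
    """Return (block, offset, value) for every listed register that is non-zero.

    `dump` is a hypothetical dict mapping CCSR offset -> captured 32-bit value.
    Registers missing from the dump are skipped rather than assumed to be zero.
    """
    flagged = []
    for block, offset in ERROR_REGISTERS:
        value = dump.get(offset)
        if value:  # skips both None (not captured) and 0 (no error latched)
            flagged.append((block, offset, value))
    return flagged


if __name__ == "__main__":
    # Made-up values purely to show the call; not taken from a real crash.
    example = {0x01E00: 0x00000001, 0x02E40: 0x00000000}
    for block, offset, value in flag_error_sources(example):
        print(f"{block}: CCSR 0x{offset:05x} = 0x{value:08x}")
```

A non-zero value only says that something is latched in that block; which bit it is, and whether it can drive TEA_, still has to be checked against the reference manual.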

0 Kudos

530 Views
brianstecher
Contributor I

Yes, there was a 0x200 exception reported. This is why the OS crashed: we took the machine check exception and the MSR[RI] bit indicated that the system was not in a recoverable state.

This is a dump of the HID and CCSR register values at the time of the crash for the S0 machine:

SPR 1008 = 0x8412c1bc (HID0)
SPR 1009 = 0x00010c80 (HID1)
SPR 1014 = 0x00008020 (MSSCR0)
SPR 1015 = 0x00001000 (MSSSR0)
SPR 1011 = 0x00000000 (ICTRL)
SPR 1017 = 0xb0000000 (L2CR)
SPR 0988 = 0x00000000 (L2CAPTDATAHI)
SPR 0989 = 0x00000000 (L2CAPTDATALO)
SPR 0990 = 0x00000000 (L2CAPTECC)
SPR 0991 = 0x00000000 (L2ERRDET)
SPR 0992 = 0x00000000 (L2ERRDIS)
SPR 0993 = 0x00000000 (L2ERRINTEN)
SPR 0994 = 0x00000000 (L2ERRATTR)
SPR 0995 = 0x00000000 (L2ERRADDR)
SPR 0996 = 0x00000000 (L2ERREADDR)
SPR 0997 = 0x00000000 (L2ERRCTL)
SPR 0538 = 0x340007fe (DBAT1U)
SPR 0539 = 0xfc00002a (DBAT1L)
SPR 0540 = 0x300007fe (DBAT2U)
SPR 0541 = 0xf800002a (DBAT2L)
SPR 0542 = 0xfffc0003 (DBAT3U)
SPR 0543 = 0x00040011 (DBAT3L)
CCSR 0x00000c08 = 0x00000000 (LAWBAR0)
CCSR 0x00000c10 = 0x00000000 (LAWAR0)
CCSR 0x00000c28 = 0x00000000 (LAWBAR1)
CCSR 0x00000c30 = 0x80f0001c (LAWAR1)
CCSR 0x00000c48 = 0x00080000 (LAWBAR2)
CCSR 0x00000c50 = 0x8000001b (LAWAR2)
CCSR 0x00000c68 = 0x000c0000 (LAWBAR3)
CCSR 0x00000c70 = 0x80c0001b (LAWAR3)
CCSR 0x00000c88 = 0x000f8100 (LAWBAR4)
CCSR 0x00000c90 = 0x80400015 (LAWAR4)
CCSR 0x00000ca8 = 0x000e2000 (LAWBAR5)
CCSR 0x00000cb0 = 0x80000017 (LAWAR5)
CCSR 0x00000cc8 = 0x000e3000 (LAWBAR6)
CCSR 0x00000cd0 = 0x80100017 (LAWAR6)
CCSR 0x00000ce8 = 0x000fe000 (LAWBAR7)
CCSR 0x00000cf0 = 0x80400018 (LAWAR7)
CCSR 0x00000d08 = 0x00020000 (LAWBAR8)
CCSR 0x00000d10 = 0x8160001c (LAWAR8)
CCSR 0x00000d28 = 0x00090000 (LAWBAR9)
CCSR 0x00000d30 = 0x8010001b (LAWAR9)
CCSR 0x00001e00 = 0x00000000 (EDR)
CCSR 0x00001e08 = 0x00000000 (EER)
CCSR 0x00001e0c = 0x00125000 (EATR)
CCSR 0x00001e10 = 0x48061000 (ELADR)
CCSR 0x00001e14 = 0x00000000 (EHADR)
CCSR 0x00005000 = 0xffe01001 (BR0)
CCSR 0x00005004 = 0xffe01c30 (OR0)
CCSR 0x00005008 = 0xff401001 (BR1)
CCSR 0x0000500c = 0xfff01814 (OR1)
CCSR 0x00005010 = 0xff501001 (BR2)
CCSR 0x00005014 = 0xfff01c20 (OR2)
CCSR 0x00005018 = 0xff600801 (BR3)
CCSR 0x0000501c = 0xfff01c10 (OR3)
CCSR 0x00005020 = 0xff300801 (BR4)
CCSR 0x00005024 = 0xfff01c10 (OR4)
CCSR 0x00005028 = 0xff701001 (BR5)
CCSR 0x0000502c = 0xfff01814 (OR5)
CCSR 0x00005030 = 0x00000000 (BR6)
CCSR 0x00005034 = 0x00000000 (OR6)
CCSR 0x00005038 = 0x00000000 (BR7)
CCSR 0x0000503c = 0x00000000 (OR7)
CCSR 0x000050b0 = 0x00080000 (LTESR)
CCSR 0x000050b4 = 0x00000000 (LTEDR)
CCSR 0x000050b8 = 0x00000000 (LTEIR)
CCSR 0x000050bc = 0x10120001 (LTEATR)
CCSR 0x000050c0 = 0xf81002e0 (LTEAR)
CCSR 0x00002e40 = 0x00000000 (1ERR_DETECT)
CCSR 0x00002e44 = 0x00000000 (1ERR_DISABLE)
CCSR 0x00002e4c = 0x00000000 (1CAPTURE_ATTR)
CCSR 0x00002e50 = 0x00000000 (1CAPTURE_ADDR)
CCSR 0x00002e54 = 0x00000000 (1CAPTURE_XADDR)
CCSR 0x00002e58 = 0x00ff0000 (1ERR_SBE)
CCSR 0x00006e40 = 0x00000000 (2ERR_DETECT)
CCSR 0x00006e44 = 0x00000000 (2ERR_DISABLE)
CCSR 0x00006e4c = 0x00000000 (2CAPTURE_ATTR)
CCSR 0x00006e50 = 0x00000000 (2CAPTURE_ADDR)
CCSR 0x00006e54 = 0x00000000 (2CAPTURE_XADDR)
CCSR 0x00006e58 = 0x00010000 (2ERR_SBE)
CCSR 0x000c0608 = 0x00000000 (LTLEDCSR)
CCSR 0x000c060c = 0x00000000 (LTLEECSR)
CCSR 0x00008020 = 0x00000000 (1PEX_PME_MES_DR)
CCSR 0x00008e00 = 0x00000000 (1PEX_ERR_DR)
CCSR 0x00008e20 = 0x00000000 (1PEX_ERR_CAP_STAT)
CCSR 0x00008e28 = 0x00000000 (1PEX_ERR_CAP_R0)
CCSR 0x00008e2c = 0x00000000 (1PEX_ERR_CAP_R1)
CCSR 0x00008e30 = 0x00000000 (1PEX_ERR_CAP_R2)
CCSR 0x00008e34 = 0x00000000 (1PEX_ERR_CAP_R3)
CCSR 0x00009020 = 0x00000000 (2PEX_PME_MES_DR)
CCSR 0x00009e00 = 0x00000000 (2PEX_ERR_DR)
CCSR 0x00009e20 = 0x00000000 (2PEX_ERR_CAP_STAT)
CCSR 0x00009e28 = 0x00000000 (2PEX_ERR_CAP_R0)
CCSR 0x00009e2c = 0x00000000 (2PEX_ERR_CAP_R1)
CCSR 0x00009e30 = 0x00000000 (2PEX_ERR_CAP_R2)
CCSR 0x00009e34 = 0x00000000 (2PEX_ERR_CAP_R3)

These are the main register contents at the time of the crash - the first 32 values are GPR0 through 31, followed by CTR, LR, IAR, CR, spare1, spare2, VRSAVE

00000055 0effcff0 fe3200a8 485b4138 00000000 0000001b 00000040 00000000
00000000 301d000b 4853f4cc 4853f4d0 60eb0fdc 485bb3dc 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 20000088 0ff4c2e0 0ff4c000
48046014 fe3200a8 00141004 00000c8c 40000088 00000000 010a034c 0ff71388
00000000

As you can see, we were at 0xc8c, early in the system call entry sequence, when the machine check occurred. MSSSR0 is indicating a TEA, but I can see no indication in the lower-level registers of why TEA was asserted. The EDR/EATR registers indicate a previously cleared TLB invalidation, but that's about it.
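When matching dump values against the manuals, keep in mind that Power Architecture documentation numbers bits from the most significant end (bit 0 is the MSB). A minimal sketch of that conversion follows; the helper name is made up for illustration, and which field name belongs to a given bit position still has to be confirmed in the e600 and MPC8641D reference manuals.

```python
def set_bits_msb0(value, width=32):
    """List the set bits of `value` using MSB-0 numbering (bit 0 = most significant)."""
    return [bit for bit in range(width) if value & (1 << (width - 1 - bit))]


if __name__ == "__main__":
    # Values copied from the S0 dump above.
    print("MSSSR0 set bits:", set_bits_msb0(0x00001000))
    print("EDR set bits:   ", set_bits_msb0(0x00000000))
```

For the S0 values, exactly one MSSSR0 bit is set (bit 19 in MSB-0 numbering) and EDR has none, which matches the description above: a TEA is flagged at the core with no corresponding MCM error.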

Here are the same registers for the USB1 machine:

SPR 1009 = 0x00010c80 (HID1)
SPR 1014 = 0x00008000 (MSSCR0)
SPR 1015 = 0x00001000 (MSSSR0)
SPR 1011 = 0x00000000 (ICTRL)
SPR 1017 = 0xb0000000 (L2CR)
SPR 0988 = 0x00000000 (L2CAPTDATAHI)
SPR 0989 = 0x00000000 (L2CAPTDATALO)
SPR 0990 = 0x00000000 (L2CAPTECC)
SPR 0991 = 0x00000000 (L2ERRDET)
SPR 0992 = 0x00000000 (L2ERRDIS)
SPR 0993 = 0x00000000 (L2ERRINTEN)
SPR 0994 = 0x00000000 (L2ERRATTR)
SPR 0995 = 0x00000000 (L2ERRADDR)
SPR 0996 = 0x00000000 (L2ERREADDR)
SPR 0997 = 0x00000000 (L2ERRCTL)
SPR 0538 = 0x340007fe (DBAT1U)
SPR 0539 = 0xfc00002a (DBAT1L)
SPR 0540 = 0x300007fe (DBAT2U)
SPR 0541 = 0xf800002a (DBAT2L)
SPR 0542 = 0xfffc0003 (DBAT3U)
SPR 0543 = 0x00020011 (DBAT3L)
CCSR 0x00000c08 = 0x00000000 (LAWBAR0)
CCSR 0x00000c10 = 0x00000000 (LAWAR0)
CCSR 0x00000c28 = 0x00000000 (LAWBAR1)
CCSR 0x00000c30 = 0x80f0001c (LAWAR1)
CCSR 0x00000c48 = 0x00080000 (LAWBAR2)
CCSR 0x00000c50 = 0x8000001b (LAWAR2)
CCSR 0x00000c68 = 0x000c0000 (LAWBAR3)
CCSR 0x00000c70 = 0x80c0001b (LAWAR3)
CCSR 0x00000c88 = 0x000f8100 (LAWBAR4)
CCSR 0x00000c90 = 0x80400015 (LAWAR4)
CCSR 0x00000ca8 = 0x000e2000 (LAWBAR5)
CCSR 0x00000cb0 = 0x80000017 (LAWAR5)
CCSR 0x00000cc8 = 0x000e3000 (LAWBAR6)
CCSR 0x00000cd0 = 0x80100017 (LAWAR6)
CCSR 0x00000ce8 = 0x000fe000 (LAWBAR7)
CCSR 0x00000cf0 = 0x80400018 (LAWAR7)
CCSR 0x00000d08 = 0x00020000 (LAWBAR8)
CCSR 0x00000d10 = 0x8160001c (LAWAR8)
CCSR 0x00000d28 = 0x00090000 (LAWBAR9)
CCSR 0x00000d30 = 0x8010001b (LAWAR9)
CCSR 0x00001e00 = 0x00000001 (EDR)
CCSR 0x00001e08 = 0x00000000 (EER)
CCSR 0x00001e0c = 0x00104001 (EATR)
CCSR 0x00001e10 = 0x47f54358 (ELADR)
CCSR 0x00001e14 = 0x00000000 (EHADR)
CCSR 0x00005000 = 0xffe01001 (BR0)
CCSR 0x00005004 = 0xffe01c30 (OR0)
CCSR 0x00005008 = 0xff401001 (BR1)
CCSR 0x0000500c = 0xfff01814 (OR1)
CCSR 0x00005010 = 0xff501001 (BR2)
CCSR 0x00005014 = 0xfff01c20 (OR2)
CCSR 0x00005018 = 0xff600801 (BR3)
CCSR 0x0000501c = 0xfff01c10 (OR3)
CCSR 0x00005020 = 0xff300801 (BR4)
CCSR 0x00005024 = 0xfff01c10 (OR4)
CCSR 0x00005028 = 0xff701001 (BR5)
CCSR 0x0000502c = 0xfff01814 (OR5)
CCSR 0x00005030 = 0x00000000 (BR6)
CCSR 0x00005034 = 0x00000000 (OR6)
CCSR 0x00005038 = 0x00000000 (BR7)
CCSR 0x0000503c = 0x00000000 (OR7)
CCSR 0x000050b0 = 0x00080000 (LTESR)
CCSR 0x000050b4 = 0x00000000 (LTEDR)
CCSR 0x000050b8 = 0x00000000 (LTEIR)
CCSR 0x000050bc = 0x10100001 (LTEATR)
CCSR 0x000050c0 = 0xf8100260 (LTEAR)
CCSR 0x00002e40 = 0x00000000 (1ERR_DETECT)
CCSR 0x00002e44 = 0x00000000 (1ERR_DISABLE)
CCSR 0x00002e4c = 0x00000000 (1CAPTURE_ATTR)
CCSR 0x00002e50 = 0x00000000 (1CAPTURE_ADDR)
CCSR 0x00002e54 = 0x00000000 (1CAPTURE_XADDR)
CCSR 0x00002e58 = 0x00ff0000 (1ERR_SBE)
CCSR 0x00006e40 = 0x00000000 (2ERR_DETECT)
CCSR 0x00006e44 = 0x00000000 (2ERR_DISABLE)
CCSR 0x00006e4c = 0x00000000 (2CAPTURE_ATTR)
CCSR 0x00006e50 = 0x00000000 (2CAPTURE_ADDR)
CCSR 0x00006e54 = 0x00000000 (2CAPTURE_XADDR)
CCSR 0x00006e58 = 0x00010000 (2ERR_SBE)
CCSR 0x000c0608 = 0x00000000 (LTLEDCSR)
CCSR 0x000c060c = 0x00000000 (LTLEECSR)
CCSR 0x00008020 = 0x00000000 (1PEX_PME_MES_DR)
CCSR 0x00008e00 = 0x00000000 (1PEX_ERR_DR)
CCSR 0x00008e20 = 0x00000000 (1PEX_ERR_CAP_STAT)
CCSR 0x00008e28 = 0x00000000 (1PEX_ERR_CAP_R0)
CCSR 0x00008e2c = 0x00000000 (1PEX_ERR_CAP_R1)
CCSR 0x00008e30 = 0x00000000 (1PEX_ERR_CAP_R2)
CCSR 0x00008e34 = 0x00000000 (1PEX_ERR_CAP_R3)
CCSR 0x00009020 = 0x00000000 (2PEX_PME_MES_DR)
CCSR 0x00009e00 = 0x00000000 (2PEX_ERR_DR)
CCSR 0x00009e20 = 0x00000000 (2PEX_ERR_CAP_STAT)
CCSR 0x00009e28 = 0x00000000 (2PEX_ERR_CAP_R0)
CCSR 0x00009e2c = 0x00000000 (2PEX_ERR_CAP_R1)
CCSR 0x00009e30 = 0x00000000 (2PEX_ERR_CAP_R2)
CCSR 0x00009e34 = 0x00000000 (2PEX_ERR_CAP_R3)

And here is USB1's main register context:

00000055 47f53f90 48545460 485b4138 00000000 0000001b 00000040 00000000
00000000 301d000b 4853f4cc 4853f4d0 5fe4f01b 485bb3dc 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 20000088 00000000 0ff41710
48046014 fe3200a8 00141000 00000c24 20000088 00000000 00000000 00000000
00000000

In this case, EDR/EATR indicate an LAE failure on a read from core 0 (the same core that took the machine check exception that brought the OS down).
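One way to sanity-check the LAE is to test the captured address against the local access windows in the dump. The sketch below does that under stated assumptions that should be verified against the MPC8641D reference manual: LAWBAR is taken to hold the window base address shifted right by 12 bits, the most significant LAWAR bit is taken as the enable, and the low six LAWAR bits are taken as a SIZE field encoding a window of 2^(SIZE+1) bytes. The function names are illustrative only.

```python
# (index, LAWBAR, LAWAR) pairs copied from the USB1 dump.
LAWS = [
    (0, 0x00000000, 0x00000000),
    (1, 0x00000000, 0x80F0001C),
    (2, 0x00080000, 0x8000001B),
    (3, 0x000C0000, 0x80C0001B),
    (4, 0x000F8100, 0x80400015),
    (5, 0x000E2000, 0x80000017),
    (6, 0x000E3000, 0x80100017),
    (7, 0x000FE000, 0x80400018),
    (8, 0x00020000, 0x8160001C),
    (9, 0x00090000, 0x8010001B),
]


def window_range(lawbar, lawar):
    """Return (start, end) of an enabled window, or None if it is disabled.

    Assumed layout: enable = MSB of LAWAR, SIZE = low 6 bits of LAWAR,
    window size = 2**(SIZE + 1) bytes, base = LAWBAR << 12.
    """
    if not lawar & 0x80000000:
        return None
    base = lawbar << 12
    size = 1 << ((lawar & 0x3F) + 1)
    return base, base + size


def matching_windows(addr):
    """Return the indices of all enabled windows that contain `addr`."""
    hits = []
    for index, lawbar, lawar in LAWS:
        rng = window_range(lawbar, lawar)
        if rng is not None and rng[0] <= addr < rng[1]:
            hits.append(index)
    return hits


if __name__ == "__main__":
    eladr = 0x47F54358  # ELADR from the USB1 dump (EHADR is 0)
    print("windows containing ELADR:", matching_windows(eladr))
```

If the decoding assumptions hold, no enabled window covers 0x47f54358, which would be consistent with the MCM flagging a local access error for that address.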

This occurred even earlier in the system call exception handling sequence (0xc24). The instructions that the OS has executed since the SC (a context-synchronizing instruction according to the documentation) are:

c00:       7f f0 43 a6     mtsprg  0,r31
c04:       7f d1 43 a6     mtsprg  1,r30
c08:       7f b2 43 a6     mtsprg  2,r29
c0c:       7f a0 00 26     mfcr    r29
c10:       3f e0 00 00     lis     r31,0
c14:       3b ff 00 00     addi    r31,r31,0
c18:       7f d3 42 a6     mfsprg  r30,3
c1c:       57 de 10 3a     rlwinm  r30,r30,2,0,29
c20:       7f fe f8 2e     lwzx    r31,r30,r31
c24:       90 3f 02 fc     stw     r1,764(r31)

At this point MSR[IR,DR] are both 0, but EHADR/ELADR indicate that the failing address is 0x47f54358, which looks like a virtual address, not a physical one. There's nothing in the registers that would be referencing that address, nor are there any instructions since the context synchronization point or following the crash point that would generate that address.
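To make that check concrete, the sketch below decodes the store at 0xc24 as a D-form instruction and computes its effective address from the USB1 register context. It assumes the first 32 words of the context dump are GPR0 through GPR31 in order (as described for the S0 dump) and that the dumped r31 is the value that was live at 0xc24; the helper names are illustrative only.

```python
# First 32 words of USB1's register context, taken to be GPR0..GPR31.
USB1_GPRS = """
00000055 47f53f90 48545460 485b4138 00000000 0000001b 00000040 00000000
00000000 301d000b 4853f4cc 4853f4d0 5fe4f01b 485bb3dc 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 20000088 00000000 0ff41710
"""


def decode_d_form(word):
    """Split a 32-bit D-form instruction into (opcode, RS, RA, signed D)."""
    opcode = word >> 26
    rs = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    d = word & 0xFFFF
    if d & 0x8000:  # sign-extend the 16-bit displacement
        d -= 0x10000
    return opcode, rs, ra, d


def effective_address(word, gprs):
    """Effective address of a D-form load/store: (RA|0) + D, truncated to 32 bits."""
    _, _, ra, d = decode_d_form(word)
    base = 0 if ra == 0 else gprs[ra]
    return (base + d) & 0xFFFFFFFF


if __name__ == "__main__":
    gprs = [int(w, 16) for w in USB1_GPRS.split()]
    stw = 0x903F02FC  # "stw r1,764(r31)" at 0xc24
    print("decoded:", decode_d_form(stw))
    print(f"store EA: 0x{effective_address(stw, gprs):08x}")
    print("ELADR:    0x47f54358")
```

With translation off, the effective address is the physical address, so under these assumptions the store targets r31 + 764 rather than the address captured in ELADR.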

0 Kudos