I need help debugging an error captured by the CCF on the T1024


tiagobrusamarel
Contributor V

When booting Linux on a custom board with the T1024 SoC, I got the following errors reported by EDAC:

EDAC MC0: Giving out device to module MPC85xx_edac controller mpc85xx_mc_err: DEV mpc85xx_mc_err (INTERRUPT)
EDAC MPC85xx MC0: Err Detect Register: 0x80000005
EDAC MPC85xx MC0: Faulty Data bit: 15
EDAC MPC85xx MC0: Expected Data / ECC:    0xdeadbeef_deadbeef / 0xee
EDAC MPC85xx MC0: Captured Data / ECC:    0xdeadbeef_dead3eef / 0xee
EDAC MPC85xx MC0: Err addr: 0x29456600
EDAC MPC85xx MC0: PFN: 0x00029456
EDAC MC0: 1 CE mpc85xx_mc_err on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x29456 offset:0x600 grain:8 syndrome:0xee)
EDAC MPC85xx MC0: Err Detect Register: 0x80000005
EDAC MPC85xx MC0: Faulty Data bit: 15
EDAC MPC85xx MC0: Expected Data / ECC:    0xf18ffebc_3d1afd33 / 0xf2
EDAC MPC85xx MC0: Captured Data / ECC:    0xf18ffebc_3d1a7d33 / 0xf2
EDAC MPC85xx MC0: Err addr: 0x2efabfd0
EDAC MPC85xx MC0: PFN: 0x0002efab
EDAC MC0: 1 CE mpc85xx_mc_err on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x2efab offset:0xfd0 grain:8 syndrome:0xf2)
EDAC MPC85xx MC0: Err Detect Register: 0x80000005
EDAC MPC85xx MC0: Faulty Data bit: 15
EDAC MPC85xx MC0: Expected Data / ECC:    0x7fe4fb78_7fa3eb78 / 0x16
EDAC MPC85xx MC0: Captured Data / ECC:    0x7fe4fb78_7fa36b78 / 0x16
EDAC MPC85xx MC0: Err addr: 0x00315980
EDAC MPC85xx MC0: PFN: 0x00000315
EDAC MC0: 1 CE mpc85xx_mc_err on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x315 offset:0x980 grain:8 syndrome:0x16)
EDAC MPC85xx MC0: Err Detect Register: 0x80000005
EDAC MPC85xx MC0: Faulty Data bit: 15
EDAC MPC85xx MC0: Expected Data / ECC:    0x7fe4fb78_7fa3eb78 / 0x16
EDAC MPC85xx MC0: Captured Data / ECC:    0x7fe4fb78_7fa36b78 / 0x16
EDAC MPC85xx MC0: Err addr: 0xffe043910
EDAC MPC85xx MC0: PFN: 0x00ffe043
EDAC MPC85xx MC0: PFN out of range!
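
As a cross-check of the log above, here is a minimal C sketch that recomputes the page/offset split of each error address and finds which bit differs between the expected and captured data. The helper names (`err_pfn`, `err_offset`, `flipped_bit`) are hypothetical and not part of the kernel driver, and the 4 KiB page size is an assumption that happens to match the PFN values printed in the log.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, consistent with the PFN values in the log. */
#define PAGE_SHIFT_4K 12

/* Page frame number of a physical error address. */
static uint64_t err_pfn(uint64_t err_addr)
{
    return err_addr >> PAGE_SHIFT_4K;
}

/* Byte offset of the error address within its page. */
static uint32_t err_offset(uint64_t err_addr)
{
    return (uint32_t)(err_addr & ((1u << PAGE_SHIFT_4K) - 1));
}

/*
 * Index (0 = least significant) of the single bit that differs between
 * the expected and captured 64-bit data words. Returns -1 if the words
 * are equal or differ in more than one bit.
 */
static int flipped_bit(uint64_t expected, uint64_t captured)
{
    uint64_t diff = expected ^ captured;
    int bit = 0;

    if (diff == 0 || (diff & (diff - 1)) != 0)
        return -1;
    while ((diff & 1) == 0) {
        diff >>= 1;
        bit++;
    }
    return bit;
}
```

Applied to the three complete reports, `flipped_bit` returns 15 each time, and in every case the captured word has that bit cleared where a 1 was expected. For the last report, `err_pfn(0xffe043910)` gives 0xffe043, which matches the PFN that the driver flags as out of range.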

At this point I connected to the target with CodeWarrior and dumped all SoC registers in order to analyze them and understand the cause of the issue. I noticed the following values in the CCF registers:

<register name="CCM_CESR">
<value>0x80000001</value>
<location>0xffe018e40`Physical cache - inhibited</location>
<custom-groups></custom-groups>
</register>
<register name="CCM_CEDDR">
<value>0x00000000</value>
<location>0xffe018e44`Physical cache - inhibited</location>
<custom-groups></custom-groups>
</register>
<register name="CCM_CEIER">
<value>0x00000000</value>
<location>0xffe018e48`Physical cache - inhibited</location>
<custom-groups></custom-groups>
</register>
<register name="CCM_CECAR">
<value>0x01240000</value>
<location>0xffe018e4c`Physical cache - inhibited</location>
<custom-groups></custom-groups>
</register>
<register name="CCM_CECADRH">
<value>0x00000001</value>
<location>0xffe018e50`Physical cache - inhibited</location>
<custom-groups></custom-groups>
</register>
<register name="CCM_CECADRL">
<value>0x8097C210</value>
<location>0xffe018e54`Physical cache - inhibited</location>
<custom-groups></custom-groups>
</register>
<register name="CCM_CECA2R">
<value>0x00000010</value>
<location>0xffe018e58`Physical cache - inhibited</location>
<custom-groups></custom-groups>
</register>

They seem to indicate that an error was detected and captured. I then tried to decode the values:

CCM_CESR = 0x80000001 -> Error captured (CAP=1b), Local Access Error (CETYPE=00000b), Local Access Error Detected (LAEDET=1b)
CCM_CECAR = 0x01240000 -> SRC_ID_SRC_GROUP=1001001b <--> This value seems to be reserved
CCM_CECADRH = 0x00000001 -> ADDRH = 0x1
CCM_CECADRL = 0x8097C210 -> ADDRL = 0x8097C210
CCM_CECA2R = 0x00000010 -> I could not find any info about this register in the reference manual
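
The decoding above can be sketched in C as follows. The field positions are assumptions inferred only from the values and field names in this post (CAP as the most significant bit of CCM_CESR, LAEDET as the least significant bit, and a 7-bit source field starting at bit 18 of CCM_CECAR); they have not been checked against the reference manual, and all macro and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed bit positions, inferred from the decoding in this post only. */
#define CESR_CAP         0x80000000u  /* error captured */
#define CESR_LAEDET      0x00000001u  /* local access error detected */
#define CECAR_SRC_SHIFT  18           /* assumed position of the source field */
#define CECAR_SRC_MASK   0x7Fu        /* assumed 7-bit source field */

/* Nonzero if the capture bit is set in CCM_CESR. */
static int cesr_captured(uint32_t cesr)
{
    return (cesr & CESR_CAP) != 0;
}

/* Nonzero if a local access error is flagged in CCM_CESR. */
static int cesr_local_access_error(uint32_t cesr)
{
    return (cesr & CESR_LAEDET) != 0;
}

/* Source ID/group field extracted from CCM_CECAR. */
static uint32_t cecar_source(uint32_t cecar)
{
    return (cecar >> CECAR_SRC_SHIFT) & CECAR_SRC_MASK;
}

/* Captured address assembled from CCM_CECADRH (high) and CCM_CECADRL (low). */
static uint64_t ccm_captured_addr(uint32_t adrh, uint32_t adrl)
{
    return ((uint64_t)adrh << 32) | adrl;
}
```

With the dumped values, `cecar_source(0x01240000)` gives 0x49 (1001001b) and `ccm_captured_addr(0x1, 0x8097C210)` gives 0x18097C210, which is the full address to compare against the board's memory map.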

Does anyone know how the CCF works and could suggest the next steps in my investigation?

yipingwang
NXP TechSupport

Hello Tiago Brusamarello,

Please refer to mpc85xx_mc_check() in drivers/edac/mpc85xx_edac.c in the Linux kernel source. The error messages above are printed by the following code:
        /*
         * Analyze single-bit errors on 64-bit wide buses
         * TODO: Add support for 32-bit wide buses
         */
        if ((err_detect & DDR_EDE_SBE) && (bus_width == 64)) {
                sbe_ecc_decode(cap_high, cap_low, syndrome,
                                &bad_data_bit, &bad_ecc_bit);

                if (bad_data_bit != -1)
                        mpc85xx_mc_printk(mci, KERN_ERR,
                                "Faulty Data bit: %d\n", bad_data_bit);
                if (bad_ecc_bit != -1)
                        mpc85xx_mc_printk(mci, KERN_ERR,
                                "Faulty ECC bit: %d\n", bad_ecc_bit);

                mpc85xx_mc_printk(mci, KERN_ERR,
                        "Expected Data / ECC:\t%#8.8x_%08x / %#2.2x\n",
                        cap_high ^ (1 << (bad_data_bit - 32)),
                        cap_low ^ (1 << bad_data_bit),
                        syndrome ^ (1 << bad_ecc_bit));
        }

According to the above code, a DDR single-bit ECC error was detected, with the faulty data at bit 15.
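
To make the `err_detect & DDR_EDE_SBE` test concrete, here is a minimal C sketch that splits the logged "Err Detect Register" value into the flags the driver checks. The three mask values are the ones defined in the kernel's mpc85xx EDAC header; the struct and function names are hypothetical, and any bit outside those three masks is only reported as "other" rather than interpreted.

```c
#include <stdint.h>

/* Mask values as defined in the kernel's mpc85xx EDAC header. */
#define DDR_EDE_MME 0x80000000u  /* multiple memory errors */
#define DDR_EDE_MBE 0x00000008u  /* multi-bit error */
#define DDR_EDE_SBE 0x00000004u  /* single-bit error */

/* Hypothetical container for the decoded flags. */
struct ede_flags {
    int multiple;    /* DDR_EDE_MME set */
    int multi_bit;   /* DDR_EDE_MBE set */
    int single_bit;  /* DDR_EDE_SBE set */
    uint32_t other;  /* bits not covered by the three masks above */
};

/* Split an Err Detect Register value into the flags above. */
static struct ede_flags decode_err_detect(uint32_t err_detect)
{
    struct ede_flags f;

    f.multiple   = (err_detect & DDR_EDE_MME) != 0;
    f.multi_bit  = (err_detect & DDR_EDE_MBE) != 0;
    f.single_bit = (err_detect & DDR_EDE_SBE) != 0;
    f.other      = err_detect & ~(DDR_EDE_MME | DDR_EDE_MBE | DDR_EDE_SBE);
    return f;
}
```

For the logged value 0x80000005 this reports `multiple` and `single_bit` set, `multi_bit` clear, and `other` equal to 0x1. The meaning of that remaining bit is not covered by the code quoted above and would need to be looked up in the DDR controller chapter of the reference manual.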

Please refer to section "11.3 Hardware diagnostics" in the CodeWarrior user manual (C:\Freescale\CW_PA_v10.5.1_new\PA\Help\PDF\Targeting_PA_Processors.pdf) to run DDR memory hardware diagnostics.

Please use the DDRv tool, installed on top of the QCVS tool in the CodeWarrior IDE, to tune and optimize the DDR controller configuration parameters and perform further validation.


Have a great day,
TIC
