LS1046A AER random errors - Corrected Errors


6,366 Views
sabirsyed
Contributor I

Hi,

When we run our custom LS1046A board for a long time, we see random AER errors, mostly during overnight runs. There is no impact on performance.

pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
pcieport 0002:00:00.0: device [1957:81c1] error status/mask=00000001/00006000
pcieport 0002:00:00.0: [ 0] Receiver Error (First)

It would be helpful if someone could indicate whether these AER errors can be related to signal integrity or electrical margins on the PCIe interface.

Thank you

Sabir

15 Replies

5,951 Views
srinivas_hk
Contributor I

I am attaching the files mentioned in my earlier ticket. Please provide your inputs.


6,033 Views
srinivas_hk
Contributor I

Yiping,

We implemented the above erratum workaround as well, and we are still seeing the issue. What else can we look at here? Should we check anything on the link partner side? Below is a summary of the testing and analysis we have done so far.

1. Implemented all the PCIe-related errata workarounds you asked for: LNaSSCR1[RXEQ_BST_1], ASPM_CTL off, A-008851.

2. In Gen2 testing we didn't hit the issue.

3. At Gen3 we see it only on the x2 link and not on the x1 link. Connectivity details were provided in earlier communication.

4. Captured the eye close to the BGA on the LS1046A RX for the x2 lanes. It looks better. I am attaching the scope shots, since they weren't shared earlier.

Below are the logs of the failure. As timestamped below, it has occurred thrice in overnight testing. The time mentioned is in seconds.

[85993.664288] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[85993.671900] pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[85993.682091] pcieport 0002:00:00.0: device [1957:81c1] error status/mask=00000001/00006000
[85993.690441] pcieport 0002:00:00.0: [ 0] Receiver Error (First)
[89880.073959] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[89880.081574] pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[89880.091758] pcieport 0002:00:00.0: device [1957:81c1] error status/mask=00000001/00006000
[89880.100110] pcieport 0002:00:00.0: [ 0] Receiver Error (First)
[101222.238253] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[101222.245955] pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[101222.256244] pcieport 0002:00:00.0: device [1957:81c1] error status/mask=00000041/00006000
[101222.264685] pcieport 0002:00:00.0: [ 0] Receiver Error (First)
[101222.271563] pcieport 0002:00:00.0: [ 6] Bad TLP
[102005.641374] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[102005.649060] pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[102005.659348] pcieport 0002:00:00.0: device [1957:81c1] error status/mask=00000001/00006000
[102005.667789] pcieport 0002:00:00.0: [ 0] Receiver Error (First)
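To answer the "how many and at what interval" question from a log like the one above, the report headers can be counted and the gaps between them computed. The following is only a minimal sketch: the regular expression is an assumption based on the message format shown in this thread, and the function names are hypothetical.

```python
import re

# Matches the first line of each AER report as it appears in this thread, e.g.
# "[85993.664288] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000"
AER_EVENT = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+pcieport\s+(\S+): AER: .*Corrected error received"
)

def aer_event_times(log_text):
    """Return the kernel timestamps (in seconds) of each corrected-error report."""
    times = []
    for line in log_text.splitlines():
        match = AER_EVENT.match(line.strip())
        if match:
            times.append(float(match.group(1)))
    return times

def intervals(times):
    """Return the gaps in seconds between consecutive reports."""
    return [later - earlier for earlier, later in zip(times, times[1:])]

# Report headers copied from the log above.
LOG = """\
[85993.664288] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[89880.073959] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[101222.238253] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[102005.641374] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
"""

times = aer_event_times(LOG)
print(len(times), "reports")
for gap in intervals(times):
    print(f"{gap:.1f} s ({gap / 3600:.2f} h) since previous report")
```

Feeding the full `dmesg` output into `aer_event_times` instead of the excerpt would give the same statistics over a complete overnight run.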

I would like to know whether there will be any throughput degradation due to this issue, or any functional issue that we need to worry about.

--

Krish


6,033 Views
yipingwang
NXP TechSupport

Hello Sabir,

The log points to correctable errors detected by our PEX2 RC acting as the receiver. These errors could come from either board-level signal integrity or the EP device. If you want to fix the problem, you need to hook up a PCIe protocol analyzer, capture a trace, and narrow down the problem.

Bit 0 of the Correctable Error Status register (RXE = Receiver Error) is generally related to board-level signal integrity or noise. It does not appear to be causing harm, unless proven otherwise by an error register dump from our AER block or by a PCIe protocol analyzer trace. Unless further information is provided (for example, a raw trace file), we cannot help.

Thanks,

Yiping


6,033 Views
srinivas_hk
Contributor I

Yiping,

I am working with Sabir on this issue. We can share the error register dump taken during the error, but we would like to know the offsets of the registers to dump. There is the Correctable Error Status register (0x110) and the Device Status register (0x7a), but these are all top-level registers. I couldn't find a way to dump an error trace via software.
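For reference, the AER registers can be located by walking the PCIe extended capability list rather than hard-coding an offset. The sketch below works on a raw copy of the port's configuration space (for example read from the Linux sysfs `config` file, which typically requires root to read past the standard header). The 0x110 offset quoted above would be consistent with an AER capability at 0x100, but that is an assumption; the function names are hypothetical.

```python
import struct

AER_EXT_CAP_ID = 0x0001  # Advanced Error Reporting extended capability ID
AER_COR_STATUS = 0x10    # Correctable Error Status, relative to the AER capability
AER_COR_MASK = 0x14      # Correctable Error Mask, relative to the AER capability

def find_ext_capability(config, cap_id):
    """Walk the extended capability list (it starts at 0x100) in raw config bytes."""
    offset = 0x100
    seen = set()
    while offset and offset + 4 <= len(config) and offset not in seen:
        seen.add(offset)
        (header,) = struct.unpack_from("<I", config, offset)
        if header in (0, 0xFFFFFFFF):
            return None                   # empty list or unreadable config space
        if header & 0xFFFF == cap_id:     # bits [15:0] hold the capability ID
            return offset
        offset = (header >> 20) & 0xFFC   # bits [31:20] hold the next pointer
    return None

def read_aer_correctable(config):
    """Return (aer_base, correctable_status, correctable_mask) or None."""
    base = find_ext_capability(config, AER_EXT_CAP_ID)
    if base is None or base + AER_COR_MASK + 4 > len(config):
        return None
    status, mask = struct.unpack_from("<II", config, base + AER_COR_STATUS)
    return base, status, mask

def read_sysfs_config(bdf):
    """Read the whole config space of a device such as '0002:00:00.0' via sysfs."""
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as f:
        return f.read()
```

On the target, `read_aer_correctable(read_sysfs_config("0002:00:00.0"))` would return the status and mask values that the kernel prints as `error status/mask=`. Note that the kernel AER driver clears the status bits after reporting them, so a later read may show zero.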

Similar to this, we also saw the PCIe RX error with the Bad TLP bit set.

[188134.083503] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[188134.091206] pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[188134.101513] pcieport 0002:00:00.0:   device [1957:81c1] error status/mask=00000041/00006000
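The `error status/mask` pair in such a line can be decoded bit by bit. The sketch below uses the bit assignments of the AER Correctable Error Status register from the PCIe specification; the function names are hypothetical, and the regular expression is an assumption based on the log format shown in this thread.

```python
import re

# Correctable Error Status / Mask bit positions (PCIe AER capability).
COR_ERR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

STATUS_MASK = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")

def decode_correctable(value):
    """Return [(bit, name)] for every known bit that is set in value."""
    return [(bit, name) for bit, name in sorted(COR_ERR_BITS.items())
            if value & (1 << bit)]

def decode_log_line(line):
    """Decode the status and mask fields of an AER 'error status/mask' log line."""
    match = STATUS_MASK.search(line)
    if match is None:
        return None
    status, mask = (int(group, 16) for group in match.groups())
    return {"status": decode_correctable(status), "masked": decode_correctable(mask)}

line = "pcieport 0002:00:00.0:   device [1957:81c1] error status/mask=00000041/00006000"
print(decode_log_line(line))
```

For the line above, status 0x41 has bits 0 and 6 set (Receiver Error and Bad TLP), and mask 0x6000 means bits 13 and 14 are masked from reporting.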

--

Krish


6,033 Views
yipingwang
NXP TechSupport

When the Physical Layer logic detects an error, it sets the Receiver Error Status bit in the Correctable Error Status register of the PCI Express Extended Advanced Error Capabilities register set. To say more on this issue, please provide the following:

1) Which LSDK version are you using?
2) From the log, it seems you are using PCIe controller 3. Is it running at Gen1/2/3, and at x1 or x2?
3) Are you running a performance test overnight, or just leaving the system idle?
4) Provide a full dump of the PCIe and SerDes registers when the error occurs.
5) If possible, please share a picture or block diagram of your setup.
6) Provide the RCW and PBI as well.
7) Does the issue occur only at Gen1, Gen2, or Gen3, or in all cases?
8) How many errors occur, and at what interval?
9) If you clear this error bit and then run the transaction, does the bit get set again immediately or after some time?
10) What clocking scheme are you using?
11) Are the SerDes PLL filters designed as per the design checklist?
12) Are you using spread spectrum clocking?


6,033 Views
srinivas_hk
Contributor I

Yiping ,

Below are my answers.

1) Which LSDK is using?

Krish: LSDK-19.09-V4.14

2) From the log, it seems you are using PCIe controller 3. Is it at gen 1/2/3 and x1/2 ?

Krish: Yes, PCIe 3.0. One link is x1 and the other link is x2. At present we are seeing the errors on the x2 link. The x1 link isn't under test yet; we are planning to test it.

3) Are they doing performance test overnight or just leave the system idle?

Krish: It's overnight, long-duration testing.

4) Provide PCIe and SerDes registers whole dump when error occurs.

Krish: The entire PCIe and SerDes register set, or anything specific? The errors appear, get cleared, and reappear after a few hours. So is it OK to take the dump after the last error seen, even if it has been cleared?

5) If possible, please share the picture/block diagram of your setup.

6) Provide RCW and PBI as well.

Krish: Do you need the bin file?

7) Is the issue persistent with gen1 or gen2 or gen3 only or in all cases?

Krish : We have seen at gen3, not tested at gen2 yet.

8) How many and in how much interval these errors come?

Krish: In 12 hours of testing, we saw it 2-3 times, and there isn't a consistent interval between occurrences. We will check that once more.

9) If you clear this error bit and then run the transaction , does this bit set immediately or after some time?

Krish: It takes some time

10) What clocking scheme are you using?

Krish: The RC and EP both have a reference clock for TX; data clocked architecture.

11) Are PLL filters of SerDes designed as per the design checklist?

Krish: Yes, we have followed it. We will check one more time.

12) Are you using spread spectrum clocking?

Krish: We aren't using it.

PS: I just filed a ticket in NXP portal to have this issue tracked.

--

Krish


6,033 Views
yipingwang
NXP TechSupport

4) Provide PCIe and SerDes registers the whole dump when error occurs.

Krish: Entire pcie and serdes register set or anything specific? Errors appear and gets cleared and appears after few hours. So is it ok to take the dump post last error seen and even if it gets cleared?

[NXP]: Please provide the entire PCIe and SerDes register dump. We would like to see if any other error occurs when the receiver error comes.

6) Provide RCW and PBI as well.

Krish: Do you need bin file?

[NXP]: Yes, please share the bin file.

7) Is the issue persistent with gen1 or gen2 or gen3 only or in all cases?

Krish: We have seen at gen3, not tested at gen2 yet.

[NXP]: Try your test with gen2 and gen1. Let us know whether an error occurs or not.


6,034 Views
srinivas_hk
Contributor I

Yiping,

I am attaching the rest of the files (PCIe and SerDes register dump taken after the error) and the RCW bin file.

--

Krish


6,035 Views
yipingwang
NXP TechSupport

Good to know that the issue hasn't occurred on x1 Gen3 so far.

Below are the ranges for the register dump:

For SerDes: 0x1EB00000 - 0x1EB0199C

For PCIe: 0x3600000 - 0x3601038 and PEX_LUT: 0x3680000 - 0x36C07FC

To say more on this issue, we also need the other information that was requested earlier.
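A register range like the ones above can be dumped from Linux user space by mapping physical memory. The following is a rough sketch only, with several assumptions: that `/dev/mem` is available and not restricted by the kernel configuration, that the addresses quoted in this thread are correct for the board (they should be checked against the reference manual), and that the byte order chosen matches the block being read. The function name `dump_words` is hypothetical, and reading through `mmap` does not guarantee that each register is accessed as a single 32-bit read.

```python
import mmap
import struct

def dump_words(path, start, end, byteorder="little"):
    """Return [(address, value)] for each 32-bit word in [start, end], inclusive.

    start and end are assumed to be 4-byte aligned. byteorder is an assumption
    that must match the register block being dumped.
    """
    fmt = "<I" if byteorder == "little" else ">I"
    page = mmap.ALLOCATIONGRANULARITY
    base = start - (start % page)   # mmap offsets must be granularity-aligned
    length = (end + 4) - base       # map up to and including the last word
    words = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ, offset=base) as mem:
            for addr in range(start, end + 4, 4):
                (value,) = struct.unpack_from(fmt, mem, addr - base)
                words.append((addr, value))
    return words

def format_dump(words):
    """Format the dump as one 'address: value' line per register."""
    return "\n".join(f"0x{addr:08X}: 0x{value:08X}" for addr, value in words)

# Example use on the target (requires root), with the PCIe range quoted in this thread:
# print(format_dump(dump_words("/dev/mem", 0x3600000, 0x3601038)))
```

Taking one dump right after an error is logged and another on a clean run makes it easier to compare the two and spot registers that changed.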


6,035 Views
srinivas_hk
Contributor I

Yiping,

I am attaching the log. All the register dumps are in one file. I had sent a few of them in an earlier response; please discard those and use this file.

Register sets included:

PEX1 LUT, PEX1 PCIe

PEX2 LUT, PEX2 PCIe

SerDes 2 register set

Please note that we saw the error on the PEX2 controller and the corresponding SerDes 2 lanes.

--

Krish


6,035 Views
yipingwang
NXP TechSupport

Here are my comments:

1) LNaSSCR1[RXEQ_BST_1] is set, as per the given SerDes register dump. For PCI Express Gen3 (8.0 GT/s), it is recommended to set the Rx Equalization Boost bit to 0b for all lanes in use during the Pre-boot Initialization (PBI) stage. The default value is not suitable for normal channel loss. Setting LNaSSCR1[RXEQ_BST_1] is allowed only if the customer has simulated their channel and determined that it is a high-loss channel. Refer to "Optimal setting for the SerDes channel Rx Equalization Boost bit" in the LS1046A design checklist (AN5252) for more details.

2) Please set the Active State Power Management (ASPM) Control (ASPM_CTL) field of the link partner to 2'b00.

3) "Try your test with gen2 and gen1. Let us know whether error occurs or not." Is there any update on this?

4) The workaround for A-008851, "Invalid transmitter/receiver preset values are used in Gen3 equalization phases during link training for RC mode", is not applied in the PBI. Please apply the workaround as described in the LS1046A chip errata, and check the other PCIe-related errata as well.


6,035 Views
srinivas_hk
Contributor I

Yiping,

It took some time for us to implement and test the changes, so my response was delayed. The test is underway.

We were able to implement 1 and 4 with an RCW change. The test is in progress.

Regarding 2, Active State Power Management: how do we implement this? Can you please elaborate? What would be the impact of not implementing it?

Regarding 3, we did run at Gen2 mode, we aren't hitting the issue.

--

Krish


6,035 Views
srinivas_hk
Contributor I

Yiping,

Can you please provide your feedback?

Also, in the test we ran with the two workarounds (LNaSSCR1[RXEQ_BST_1] and A-008851) implemented, we are still seeing the errors. I am attaching the register dump. I can see that the workarounds are applied, but the issue is still seen.

So is "Please set Active State Power Management (ASPM) Control (ASPM_CTL) bit as 1'b00 of Link partner" critical to this? I am looking for your input on how to implement it. Should this be done on the link partner side or on the NXP side?

We are at a critical stage of the release. Your expertise is valuable; please respond.

--

Krish


6,034 Views
yipingwang
NXP TechSupport

There is an erratum A-010053 regarding ASPM in LS1046ACE. Its workaround is to disable ASPM through PCI Express link control register [ASPM_CTL].
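As an illustration of what disabling ASPM through the Link Control register involves, the sketch below finds the PCI Express capability in a raw copy of a device's configuration space and computes the Link Control value with the ASPM Control field (bits [1:0]) cleared. It only computes the value; how and where to write it back (PBI, boot software, or a configuration-space tool) is not specified in this thread and is left out. The function names are hypothetical.

```python
import struct

PCI_CAPABILITY_POINTER = 0x34  # standard config header: pointer to first capability
PCI_CAP_ID_EXP = 0x10          # PCI Express capability ID
PCI_EXP_LNKCTL = 0x10          # Link Control register, relative to the PCIe capability
ASPM_CTL_MASK = 0x0003         # Link Control bits [1:0]; 00b means ASPM disabled

def find_capability(config, cap_id):
    """Walk the standard capability list in raw config bytes."""
    ptr = config[PCI_CAPABILITY_POINTER] & 0xFC
    seen = set()
    while ptr and ptr + 2 <= len(config) and ptr not in seen:
        seen.add(ptr)
        if config[ptr] == cap_id:
            return ptr
        ptr = config[ptr + 1] & 0xFC   # next-capability pointer
    return None

def aspm_disabled_value(config):
    """Return (lnkctl_offset, current_value, value_with_aspm_off) or None."""
    cap = find_capability(config, PCI_CAP_ID_EXP)
    if cap is None or cap + PCI_EXP_LNKCTL + 2 > len(config):
        return None
    (lnkctl,) = struct.unpack_from("<H", config, cap + PCI_EXP_LNKCTL)
    return cap + PCI_EXP_LNKCTL, lnkctl, lnkctl & ~ASPM_CTL_MASK & 0xFFFF
```

If the current value already has bits [1:0] equal to 00b, ASPM is already disabled on that device and no write is needed. The same check applies to both ends of the link.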


6,033 Views
srinivas_hk
Contributor I

Yiping,

Sorry for the late response. We have been testing the other PCIe link, x1 Gen3, to see if the error occurs. So far we haven't hit the issue, so we are moving the setup back to test the x2 link. I will share the register dump as soon as the error is observed. In the meantime, I would like to confirm whether the register set below is enough, or whether anything else needs to be considered.

As mentioned earlier, we have routed the PCIe lanes to SerDes2 and the PEX2 controller.

1. SerDes registers: 0x1EB00000 - 0x1EB0199C

2. PCIe register set: 0x3600000 - 0x3601038

--

Krish
