Hi,
When we run our LS1046A custom board for a long time, we see random errors, mostly during overnight runs. There is no impact on performance.
pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
pcieport 0002:00:00.0: device [1957:81c1] error status/mask=00000001/00006000
pcieport 0002:00:00.0: [ 0] Receiver Error (First)
It would be helpful if someone could point out whether these AER errors can be related to signal integrity or electrical margins on the PCIe interface.
Thank you
Sabir
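For readers decoding the `error status/mask=00000001/00006000` field in the log above, here is a minimal Python sketch. It assumes the standard PCIe AER layout of the Correctable Error Status and Mask registers; the function names and the bit labels are chosen for this illustration and are not taken from any NXP tool.

```python
# Bit positions of the AER Correctable Error Status/Mask registers
# (standard PCIe AER layout; the labels are informal, chosen for this sketch).
AER_COR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_cor_bits(value):
    """Return the names of all known bits set in a 32-bit AER value."""
    return [name for bit, name in sorted(AER_COR_BITS.items()) if value & (1 << bit)]


def parse_status_mask(line):
    """Extract (status, mask) as integers from an 'error status/mask=' log line."""
    field = line.split("status/mask=")[1].split()[0]
    status_hex, mask_hex = field.split("/")
    return int(status_hex, 16), int(mask_hex, 16)


line = "pcieport 0002:00:00.0: device [1957:81c1] error status/mask=00000001/00006000"
status, mask = parse_status_mask(line)
print("status bits set:", decode_cor_bits(status))
print("mask bits set:  ", decode_cor_bits(mask))
```

The status value reports what was detected; bits set in the mask value are error types that are masked from reporting.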
Yiping,
We implemented the above erratum as well, and we are still seeing the issue. What else can we look at here? Should we check anything on the link partner side? Below is a summary of the testing and analysis we have done so far.
1. Implemented all the PCIe-related errata you asked for: LNaSSCR1[RXEQ_BST_1], ASPM_CTL off, A-008851.
2. In Gen2 testing we didn't hit the issue.
3. At Gen3 we see it only on the x2 lanes and not on the x1 lane. Connectivity was provided in an earlier communication.
4. Captured the eye close to the BGA on the LS1046A RX for the x2 lanes. It looks better. I am attaching the scope shots, since they weren't shared earlier.
Below are the logs of the failure. As the timestamps show, it has occurred three times in overnight testing. Times are in seconds.
[85993.664288] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[85993.671900] pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[85993.682091] pcieport 0002:00:00.0: device [1957:81c1] error status/mask=00000001/00006000
[85993.690441] pcieport 0002:00:00.0: [ 0] Receiver Error (First)
[89880.073959] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[89880.081574] pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[89880.091758] pcieport 0002:00:00.0: device [1957:81c1] error status/mask=00000001/00006000
[89880.100110] pcieport 0002:00:00.0: [ 0] Receiver Error (First)
[101222.238253] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[101222.245955] pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[101222.256244] pcieport 0002:00:00.0: device [1957:81c1] error status/mask=00000041/00006000
[101222.264685] pcieport 0002:00:00.0: [ 0] Receiver Error (First)
[101222.271563] pcieport 0002:00:00.0: [ 6] Bad TLP
[102005.641374] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[102005.649060] pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[102005.659348] pcieport 0002:00:00.0: device [1957:81c1] error status/mask=00000001/00006000
[102005.667789] pcieport 0002:00:00.0: [ 0] Receiver Error (First)
I would like to know whether this issue will cause any throughput degradation or any functional problem that we need to worry about.
--
Krish
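The spacing between the error bursts in the log above can be worked out from the kernel timestamps. This is a minimal Python sketch, assuming each burst starts with an `AER: ... Corrected error received` line and that the bracketed value is seconds since boot; `burst_times` and `intervals` are hypothetical helper names. The sample is trimmed to the relevant lines of the log.

```python
import re

# Kernel timestamp followed by the first line of an AER burst.
BURST_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s.*AER: .*Corrected error received")


def burst_times(log_text):
    """Return the timestamp (seconds since boot) of each AER burst."""
    times = []
    for line in log_text.splitlines():
        match = BURST_RE.match(line.strip())
        if match:
            times.append(float(match.group(1)))
    return times


def intervals(times):
    """Return the gaps in seconds between consecutive timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


sample = """\
[85993.664288] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[85993.671900] pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[89880.073959] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[101222.238253] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[102005.641374] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
"""

for gap in intervals(burst_times(sample)):
    print(f"{gap:.1f} s ({gap / 60:.1f} min)")
```

For these timestamps the gaps range from roughly 13 minutes to a little over 3 hours, which matches the description of the errors as irregular.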
Hello Sabir,
The log points to correctable errors detected by our PEX2 RC acting as the receiver. These errors could come from either board-level signal integrity or the EP device. To fix the problem, you need to hook up a PCIe protocol analyzer, capture a trace, and narrow down the problem.
Bit 0 of the Correctable Error Status register (RXE = Receiver Error) is usually related to board-level signal integrity or noise. It does not appear to be causing harm, unless that is proven by either an error register dump from our AER block or a PCIe protocol analyzer trace. Without further information (for example, a raw trace file), we cannot help.
Thanks,
Yiping
Yiping,
I am working with Sabir on this issue. We can share the error register dump taken during the error, but we would like to know the offsets of the registers you need. There is the Correctable Error Status register at 0x110 and the Device Status register at 0x7a, but these are all top-level registers. I couldn't find a way to dump an error trace via software.
Similarly, we also saw the PCIe RX error with the Bad TLP bit set.
[188134.083503] pcieport 0002:00:00.0: AER: Multiple Corrected error received: id=0000
[188134.091206] pcieport 0002:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[188134.101513] pcieport 0002:00:00.0: device [1957:81c1] error status/mask=00000041/00006000
--
Krish
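On the question of register offsets: the 0x110 value quoted above is consistent with an AER extended capability located at 0x100, since the Correctable Error Status register sits at capability base + 0x10 in the standard AER layout. The Python sketch below walks the extended capability list in a raw config-space image and reads the main AER registers. It assumes a Linux host where the config space is readable as a binary file under sysfs; the function names and the dictionary keys are hypothetical.

```python
import struct

AER_CAP_ID = 0x0001    # Advanced Error Reporting extended capability ID
EXT_CAP_START = 0x100  # extended capabilities start here in config space

# Register offsets relative to the start of the AER capability.
AER_REGS = {
    "uncorrectable_status": 0x04,
    "uncorrectable_mask": 0x08,
    "uncorrectable_severity": 0x0C,
    "correctable_status": 0x10,
    "correctable_mask": 0x14,
    "capabilities_control": 0x18,
}


def read32(cfg, offset):
    """Read a little-endian 32-bit value from a config-space image."""
    return struct.unpack_from("<I", cfg, offset)[0]


def find_ext_cap(cfg, cap_id):
    """Return the offset of an extended capability, or None if absent."""
    offset, seen = EXT_CAP_START, set()
    while offset and offset + 4 <= len(cfg) and offset not in seen:
        seen.add(offset)
        header = read32(cfg, offset)
        if header in (0, 0xFFFFFFFF):
            return None
        if header & 0xFFFF == cap_id:
            return offset
        offset = (header >> 20) & 0xFFC  # next-capability pointer, bits 31:20
    return None


def dump_aer(cfg):
    """Return {register name: value} for the AER block, or {} if not found."""
    base = find_ext_cap(cfg, AER_CAP_ID)
    if base is None:
        return {}
    return {name: read32(cfg, base + off) for name, off in AER_REGS.items()}


def read_config_space(path):
    """Read a config-space image, e.g. from a sysfs 'config' file."""
    with open(path, "rb") as handle:
        return handle.read()
```

A possible use would be `dump_aer(read_config_space("/sys/bus/pci/devices/0002:00:00.0/config"))`, with the device address taken from the log. Reading the extended region through sysfs generally requires root; without it the image is too short and the sketch returns an empty result.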
When the Physical Layer logic detects an error, it sets the Receiver Error Status bit in the Correctable Error Status register of the PCI Express Extended Advanced Error Capabilities register set. To say more on this issue, please provide the following:
1) Which LSDK are you using?
2) From the log, it seems you are using PCIe controller 3. Is it running at Gen1/2/3, and at x1 or x2?
3) Are you running a performance test overnight or just leaving the system idle?
4) Provide PCIe and SerDes registers whole dump when error occurs.
5) If possible, please share the picture/block diagram of your setup.
6) Provide RCW and PBI as well.
7) Does the issue occur at Gen1, Gen2, or Gen3 only, or in all cases?
8) How many of these errors occur, and at what interval?
9) If you clear this error bit and then run the transaction, is the bit set again immediately or after some time?
10) What clocking scheme are you using?
11) Are PLL filters of SerDes designed as per the design checklist?
12) Are you using spread spectrum clocking?
Yiping,
My answers are below.
1) Which LSDK are you using?
Krish: LSDK-19.09-V4.14
2) From the log, it seems you are using PCIe controller 3. Is it running at Gen1/2/3, and at x1 or x2?
Krish: Yes, PCIe 3.0. One link is x1 and the other link is x2. At present we are seeing the issue on the x2 lanes. The x1 lane isn't under test yet; we are planning to test it.
3) Are you running a performance test overnight or just leaving the system idle?
Krish: It is long-duration overnight testing.
4) Provide PCIe and SerDes registers whole dump when error occurs.
Krish: Do you need the entire PCIe and SerDes register set, or anything specific? The errors appear, get cleared, and appear again after a few hours. Is it OK to take the dump after the last error seen, even if it has been cleared?
5) If possible, please share the picture/block diagram of your setup.
6) Provide RCW and PBI as well.
Do you need the bin file?
7) Does the issue occur at Gen1, Gen2, or Gen3 only, or in all cases?
Krish: We have seen it at Gen3; we have not tested at Gen2 yet.
8) How many of these errors occur, and at what interval?
Krish: In 12 hours of testing we saw it 2-3 times, and there isn't a consistent interval between occurrences. We will check that again.
9) If you clear this error bit and then run the transaction, is the bit set again immediately or after some time?
Krish: It takes some time.
10) What clocking scheme are you using?
Krish: Both RC and EP have a ref clock for TX; data clock architecture.
11) Are PLL filters of SerDes designed as per the design checklist?
Krish: Yes, we have followed it; we will check one more time.
12) Are you using spread spectrum clocking?
Krish: We aren't using it.
PS: I just filed a ticket in the NXP portal to have this issue tracked.
--
Krish
4) Provide PCIe and SerDes registers the whole dump when error occurs.
Krish: Do you need the entire PCIe and SerDes register set, or anything specific? The errors appear, get cleared, and appear again after a few hours. Is it OK to take the dump after the last error seen, even if it has been cleared?
[NXP]: Please provide the entire PCIe and SerDes register dump. We would like to see whether any other error occurs when the receiver error is reported.
6) Provide RCW and PBI as well.
Do you need the bin file?
[NXP]: Yes, please share the bin file.
7) Is the issue persistent with gen1 or gen2 or gen3 only or in all cases?
Krish: We have seen it at Gen3; we have not tested at Gen2 yet.
[NXP]: Try your test with Gen2 and Gen1. Let us know whether an error occurs or not.
Good to know that the issue hasn't appeared on x1 Gen3 yet.
Below are the ranges for register dump:
For SerDes: 0x1EB00000 - 0x1EB0199C
For PCIe: 0x3600000 - 0x3601038 and PEX_LUT: 0x3680000 - 0x36C07FC
To say more on this issue, we also need the customer to provide the other information requested earlier.
Yiping,
I am attaching the log. The dumps of all the register sets are in one file. I had sent a few of them in an earlier response; please discard those and use this file.
Register set range
PEX1 LUT, PEX1 PCIe
PEX2 LUT, PEX2 PCIe
SerDes 2 register set
Please note that we saw the error on the PEX2 controller and the corresponding SerDes 2 lanes.
--
Krish
Here are my comments:
1) LNaSSCR1[RXEQ_BST_1] is set, according to the SerDes register dump provided. For PCI Express Gen3 (8.0 GT/s), it is recommended to clear the Rx Equalization Boost bit to 0b for all lanes in use during the Pre-boot Initialization (PBI) stage. The default value is not suitable for normal channel loss. LNaSSCR1[RXEQ_BST_1] should be set only if the customer has simulated their channel and determined that it is in a high-loss condition. Refer to "Optimal setting for the SerDes channel Rx Equalization Boost bit" in the LS1046A design checklist (AN5252) for more details.
2) Please set the Active State Power Management (ASPM) Control (ASPM_CTL) field of the link partner to 2'b00.
3) "Try your test with gen2 and gen1. Let us know whether error occurs or not." Is there any update on this?
4) The workaround for A-008851 ("Invalid transmitter/receiver preset values are used in Gen3 equalization phases during link training for RC mode") is not applied in PBI. Please apply the workaround as described in the LS1046A chip errata, and check the other PCIe-related errata as well.
Yiping,
It took some time for us to implement and test this, so my response was delayed. The test is under way.
We were able to implement 1 and 4 with an RCW change. The test is in progress.
Regarding 2, Active State Power Management: how do we implement it? Can you please elaborate? What would be the impact of not implementing it?
Regarding 3, we did run in Gen2 mode, and we aren't hitting the issue.
--
Krish
Yiping,
Can you please provide your feedback?
Also, in the test we ran with the two errata fixes implemented, LNaSSCR1[RXEQ_BST_1] and A-008851, we are still seeing the errors. I am attaching the register dump. I can see that the errata fixes are applied, but the issue is still seen.
So is "Please set the Active State Power Management (ASPM) Control (ASPM_CTL) field of the link partner to 2'b00" critical to this? I am looking for your input on how to implement it. Should this be done on the link partner side or the NXP side?
We are at a critical stage of the release. Your expertise is valuable; please respond.
--
Krish
There is an erratum A-010053 regarding ASPM in LS1046ACE. Its workaround is to disable ASPM through PCI Express link control register [ASPM_CTL].
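As an illustration of what disabling ASPM through the Link Control register involves at the register level, here is a minimal Python sketch. It assumes the standard PCIe layout: the PCI Express capability has ID 0x10, Link Control sits at capability base + 0x10, and ASPM Control occupies bits 1:0. The function names are hypothetical, and the sketch only computes the value to write; it does not perform the write, and it does not settle which side of the link the change belongs on.

```python
import struct

PCI_CAP_PTR = 0x34      # Capabilities Pointer in the config header
PCI_CAP_ID_EXP = 0x10   # PCI Express capability ID
LNKCTL_OFFSET = 0x10    # Link Control, relative to the PCIe capability
ASPM_CTL_MASK = 0x0003  # ASPM Control field, bits 1:0 of Link Control


def find_cap(cfg, cap_id):
    """Walk the standard capability list; return the offset or None."""
    offset, seen = cfg[PCI_CAP_PTR] & 0xFC, set()
    while offset and offset + 2 <= len(cfg) and offset not in seen:
        seen.add(offset)
        if cfg[offset] == cap_id:
            return offset
        offset = cfg[offset + 1] & 0xFC  # next-capability pointer
    return None


def lnkctl_with_aspm_disabled(cfg):
    """Return (register offset, current value, value with ASPM_CTL = 00b)."""
    cap = find_cap(cfg, PCI_CAP_ID_EXP)
    if cap is None:
        raise ValueError("no PCI Express capability found")
    reg = cap + LNKCTL_OFFSET
    current = struct.unpack_from("<H", cfg, reg)[0]
    return reg, current, current & ~ASPM_CTL_MASK & 0xFFFF
```

Given a config-space image `cfg` (for example, the first 256 bytes read from a device's sysfs `config` file on Linux), the returned offset and value could then be written back with whatever config-space access method the platform provides.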
Yiping,
Sorry for the late response. We have been testing the other PCIe lane, x1 Gen3, to see whether the error occurs. So far we haven't hit the issue, so we are moving the setup back to test the other lanes (x2). I will share the register dump as soon as the error is observed. In the meantime, I would like confirmation of whether the register set below is enough or whether anything else needs to be included.
As mentioned earlier, we are using the PCIe lanes on SerDes2 with the PEX2 controller.
1. Serdes Registers : 0x1EB00000 - 0x1EB0199C
2. PCIe register set : 0x3600000 – 0x3601038
--
Krish