T2080 PCIe transfer error with PAMU enabled

Toumou · 07-20-2021

Hi Folks

I am using a linux-qoriq 5.10 kernel from Freescale on a T2080RDB with an AMD e8860 GPU. I am trying to use the AMD open-source driver (the radeon and/or amdgpu DRM driver). I made some big-endian patches and hit an issue with the IOMMU. Here are the symptoms:

On modprobe, the driver runs a GPU ring test. It fills a command buffer in system memory, then the GPU fetches the buffer and executes it. As a result, a scratch register of the GPU is updated and the GPU advances its read pointer into the ring.
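To make the expected behaviour and the failure mode concrete, here is a toy Python model of such a ring test. It is only a sketch and not the radeon/amdgpu code: `ToyGpu`, `ring_test`, the `WRITE_SCRATCH` opcode, the sentinel values and the `dma_works` flag are all hypothetical names chosen for illustration. The `dma_works=False` case stands in for a GPU whose reads from system memory never return usable data.

```python
class ToyGpu:
    """Toy model of a GPU consuming a command ring that lives in system memory."""

    def __init__(self, ring, dma_works=True):
        self.ring = ring            # list standing in for the ring buffer in system memory
        self.rptr = 0               # GPU read pointer into the ring
        self.scratch = 0xCAFEDEAD   # scratch register, pre-loaded with a sentinel
        self.dma_works = dma_works  # False models fetches that never deliver usable data

    def process(self, wptr):
        # Fetch and execute commands until rptr catches up with the driver's wptr.
        while self.rptr != wptr:
            if not self.dma_works:
                return              # fetch fails: nothing executes, rptr does not move
            opcode, value = self.ring[self.rptr]
            if opcode == "WRITE_SCRATCH":
                self.scratch = value
            self.rptr = (self.rptr + 1) % len(self.ring)


def ring_test(dma_works):
    """Return True if the GPU executed the command and advanced its read pointer."""
    ring = [None] * 8
    gpu = ToyGpu(ring, dma_works)
    ring[0] = ("WRITE_SCRATCH", 0xDEADBEEF)  # driver queues one command
    wptr = 1                                  # and advances its write pointer
    gpu.process(wptr)
    return gpu.scratch == 0xDEADBEEF and gpu.rptr == wptr


print("DMA fetch works:", ring_test(True))
print("DMA fetch fails:", ring_test(False))
```

In this model a failed fetch leaves both the scratch register and the read pointer untouched, which is the same pair of symptoms a ring test checks for.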

Without fsl_pamu enabled (the Linux PAMU driver), this test works fine without triggering any access error, which indicates that all address translations are set up properly (outbound windows, inbound windows and LAWs).

But with fsl_pamu enabled, this test fails. The command buffer isn't executed: neither the scratch value nor the read pointer into the ring is updated. Moreover, the fsl_pamu driver doesn't report any access error.

I dug a little further and found something strange. The AMD e8860 GPU supports PCIe AER. With fsl_pamu enabled, after the ring test, the GPU logs a TLP error with the following header:

Capabilities: [150 v2] Advanced Error Reporting
    UESta: DLP- SDES- TLP+ FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
    CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
    CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
    AERCap: First Error Pointer: 0c, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
    HeaderLog: 4a004010 00000040 11002000 00000000

As you can see, the header log tells us:

0x4a004010: Fmt 0x2 / Type 0x0a -> completion with data (CplD) - EP 1 -> poisoned TLP - Length 0x10 (16 DWORDs)

0x00000040: Completer ID 0x0000 (my PCIe root port on the T2080) - Byte count 0x40 (64 bytes)

0x11002000: Requester ID 0x1100 (my AMD e8860 GPU at 11:00.0) - Tag 0x20 - Lower address 0x00

0x00000000: Data 0x0
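For anyone who wants to double-check the manual decode, here is a small Python sketch that extracts the same fields from the first three HeaderLog DWORDs. The field positions follow the standard PCIe completion header layout; the helper names `decode_completion_header` and `format_bdf` are made up for this example.

```python
def decode_completion_header(dw0, dw1, dw2):
    """Decode the first three DWORDs of a PCIe completion TLP header."""
    return {
        "fmt": (dw0 >> 29) & 0x7,              # DW0 bits 31:29
        "type": (dw0 >> 24) & 0x1F,            # DW0 bits 28:24
        "td": (dw0 >> 15) & 0x1,               # TLP digest (ECRC) present
        "ep": (dw0 >> 14) & 0x1,               # 1 = poisoned payload
        "length_dw": dw0 & 0x3FF,              # payload length in DWORDs
        "completer_id": (dw1 >> 16) & 0xFFFF,  # DW1 bits 31:16
        "status": (dw1 >> 13) & 0x7,           # 0 = successful completion
        "byte_count": dw1 & 0xFFF,             # remaining bytes for the request
        "requester_id": (dw2 >> 16) & 0xFFFF,  # DW2 bits 31:16
        "tag": (dw2 >> 8) & 0xFF,
        "lower_address": dw2 & 0x7F,
    }


def format_bdf(rid):
    """Format a 16-bit requester/completer ID as bus:device.function."""
    return f"{rid >> 8:02x}:{(rid >> 3) & 0x1F:02x}.{rid & 0x7}"


# Values taken from the HeaderLog line in the lspci output.
hdr = decode_completion_header(0x4A004010, 0x00000040, 0x11002000)
for name, value in hdr.items():
    print(f"{name:14s} 0x{value:x}")
print("requester BDF ", format_bdf(hdr["requester_id"]))
print("completer BDF ", format_bdf(hdr["completer_id"]))
```

For this HeaderLog the fields come out as Fmt 0x2, Type 0x0a, TD 0, EP 1, 16 DWORDs of payload, byte count 64, completion status 0, requester 11:00.0 and tag 0x20, which agrees with the decode above. Note that TD is 0, so this completion carries no ECRC.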

So because the EP bit is set (poisoned TLP), this is logged as a poisoned TLP error (TLP+ in UESta) and the GPU seems to silently discard it (no errors are reported through the EDAC layer). Usually the EP bit is set when there is an ECRC error due to a link issue, but without fsl_pamu I don’t get any errors, which makes me think the physical integrity of the link is not the cause.
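As a cross-check on which error was latched first, the First Error Pointer in AERCap is a bit index into the Uncorrectable Error Status register. The Python sketch below maps bit positions to the short names lspci prints. The bit positions are the standard AER ones, while the table and function names are just for this example, and the raw value 0x1000 is derived from the UESta line (only TLP+ set) rather than read from the register.

```python
# Bit positions in the AER Uncorrectable Error Status register,
# with the abbreviations lspci uses in the UESta/UEMsk/UESvrt lines.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error (DLP)",
    5: "Surprise Down (SDES)",
    12: "Poisoned TLP (TLP)",
    13: "Flow Control Protocol Error (FCP)",
    14: "Completion Timeout (CmpltTO)",
    15: "Completer Abort (CmpltAbrt)",
    16: "Unexpected Completion (UnxCmplt)",
    17: "Receiver Overflow (RxOF)",
    18: "Malformed TLP (MalfTLP)",
    19: "ECRC Error (ECRC)",
    20: "Unsupported Request (UnsupReq)",
    21: "ACS Violation (ACSViol)",
}


def first_error_name(first_error_pointer):
    """Translate the AERCap First Error Pointer into an error name."""
    return UNCORRECTABLE_BITS.get(first_error_pointer, "reserved/unknown")


def decode_status(status):
    """List every error whose bit is set in a raw status value."""
    return [name for bit, name in sorted(UNCORRECTABLE_BITS.items())
            if status & (1 << bit)]


print(first_error_name(0x0C))   # First Error Pointer from the AERCap line
print(decode_status(0x1000))    # raw value assumed from "TLP+" being the only flag set
```

A First Error Pointer of 0x0c is bit 12, i.e. Poisoned TLP, which is consistent with the EP bit seen in the HeaderLog and with the ECRC flag staying clear.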

Any ideas?

Thomas

yipingwang · 07-23-2021

AFAIK, the T2080RDB has not been tested with the 5.10 kernel, so could you verify with the LSDK-2012 or Yocto 3.1 release?
Does it work if you replace the GPU card with another PCIe card, such as an e1000?

Please check the above first.

Toumou · 08-09-2021

Hi

Sorry for the late answer, summer holidays!

The GPU is behind a PCIe switch. Behind the same PCIe switch, alongside the GPU, there is a V4L2 video device which also performs DMA to/from system memory. This V4L2 device has no issue with its DMA.

I also tested multiple kernel versions, from 4.14 to 5.10 (LSDK or not), and still get the same issue.

Unfortunately I don't have another GPU to test with...

yipingwang · 08-24-2021

What is the command buffer address used for your GPU access?

Can you try this GPU on a T2080RDB board?