Hi folks,
I am using a linux-qoriq 5.10 kernel from Freescale on a T2080RDB with an AMD e8860 GPU. I am trying to use the AMD open-source driver (radeon and/or amdgpu DRM driver). I applied some big-endian patches and hit an issue with the IOMMU. Here are the symptoms:
On modprobe, the driver runs the GPU ring test. It fills a command buffer in system memory, then the GPU fetches the buffer and executes it. As a result, a GPU scratch register is updated and the GPU advances its read pointer into the ring.
Without fsl_pamu enabled (the Linux PAMU driver), this test works fine without triggering any access error, which indicates that all address translations are set up properly (outbound, inbound and LAW).
But with fsl_pamu enabled, this test fails. The command buffer isn't executed: neither the scratch value nor the ring read pointer is updated. Moreover, the fsl_pamu driver doesn't report any access error.
I dug a little further and found something strange. The AMD e8860 GPU supports PCIe AER. With fsl_pamu enabled, after the ring test, the GPU logs a TLP error with the following header:
Capabilities: [150 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP+ FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 0c, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 4a004010 00000040 11002000 00000000
As you can see, the header log tells us:
0x4a004010: fmt 0x2 / type 0xa -> completion with data - Ep 1 -> poisoned TLP - Length 0x10
0x00000040: Completer ID 0x0000 (my PCIe root port on the T2080) - Byte count : 0x40
0x11002000: Requester ID 0x1100 (my AMD e8860 GPU 11:0.0) - Tag 0x20 - Lower address 0x00
0x00000000: Data 0x0
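The field-by-field decode above can be reproduced with a short script. This is only a minimal sketch, assuming the standard PCIe completion header layout (3 DWORDs plus data); the function names `decode_completion_header` and `format_bdf` are hypothetical helpers, not part of any driver or tool.

```python
def decode_completion_header(dw0, dw1, dw2):
    """Decode the first three DWORDs of a PCIe completion TLP header."""
    return {
        "fmt": (dw0 >> 29) & 0x7,         # 0x2 = 3DW header with data
        "type": (dw0 >> 24) & 0x1F,       # 0x0a = completion
        "td": (dw0 >> 15) & 0x1,          # TLP digest (ECRC) present
        "ep": (dw0 >> 14) & 0x1,          # poisoned data
        "length_dw": dw0 & 0x3FF,         # payload length in DWORDs
        "completer_id": (dw1 >> 16) & 0xFFFF,
        "status": (dw1 >> 13) & 0x7,      # 0 = successful completion
        "byte_count": dw1 & 0xFFF,
        "requester_id": (dw2 >> 16) & 0xFFFF,
        "tag": (dw2 >> 8) & 0xFF,
        "lower_address": dw2 & 0x7F,
    }


def format_bdf(rid):
    """Format a 16-bit requester/completer ID as bus:device.function."""
    return f"{rid >> 8:02x}:{(rid >> 3) & 0x1F:02x}.{rid & 0x7}"


# Values taken from the HeaderLog line in the lspci output.
hdr = decode_completion_header(0x4A004010, 0x00000040, 0x11002000)
for name, value in hdr.items():
    print(f"{name:14s} 0x{value:x}")
print("requester     ", format_bdf(hdr["requester_id"]))
print("completer     ", format_bdf(hdr["completer_id"]))
```

Note that the byte count (0x40) matches the payload length (0x10 DWORDs), so apart from the EP bit the completion looks well formed.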
So, because the EP bit is set, this is logged as a Poisoned TLP error (UESta TLP+) and the GPU seems to silently discard the completion (no errors are reported through the EDAC layer). Usually the EP bit is set when there is an ECRC error due to a link issue, but without fsl_pamu I don't get any errors, which makes me think the physical integrity of the link is not the cause.
Any ideas?
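The "First Error Pointer: 0c" field in the AER dump can be cross-checked the same way: it is the bit index, within the Uncorrectable Error Status register, of the first error recorded. Below is a minimal sketch, assuming the bit positions defined by the PCIe AER capability; the table and the function name `first_error_name` are illustrative, and the short names in parentheses are the labels lspci prints.

```python
# Bit positions in the AER Uncorrectable Error Status register.
UNCORRECTABLE_ERROR_BITS = {
    4: "Data Link Protocol Error (DLP)",
    5: "Surprise Down Error (SDES)",
    12: "Poisoned TLP (TLP)",
    13: "Flow Control Protocol Error (FCP)",
    14: "Completion Timeout (CmpltTO)",
    15: "Completer Abort (CmpltAbrt)",
    16: "Unexpected Completion (UnxCmplt)",
    17: "Receiver Overflow (RxOF)",
    18: "Malformed TLP (MalfTLP)",
    19: "ECRC Error (ECRC)",
    20: "Unsupported Request (UnsupReq)",
    21: "ACS Violation (ACSViol)",
}


def first_error_name(pointer_hex):
    """Map the hex First Error Pointer printed by lspci to an error name."""
    bit = int(pointer_hex, 16)
    return UNCORRECTABLE_ERROR_BITS.get(bit, f"reserved/unknown bit {bit}")


print(first_error_name("0c"))
```

With the value from the dump, the pointer resolves to bit 12, the Poisoned TLP status bit, which agrees with the TLP+ flag in UESta and the EP bit in the logged header.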
Thomas
Please check the above test results first.
Hi,
Sorry for the late answer, summer holidays!
The GPU is behind a PCIe switch. Behind the same PCIe switch, alongside the GPU, there is a V4L2 video device which also performs DMA to/from system memory. This V4L2 device has no issues with its DMA.
I also tested multiple kernel versions, from 4.14 to 5.10 (LSDK or not), and still get the same issue.
Unfortunately I don't have another GPU to test with...
What is the command buffer address used for your GPU access?
Can you try this GPU on a T2080RDB board?