T2080 PCIe transfer error with PAMU enabled

Toumou
Contributor I

Hi Folks,

I am using a linux-qoriq 5.10 kernel from Freescale on a T2080RDB with an AMD e8860 GPU. I am trying to use the AMD open-source driver (the radeon and/or amdgpu DRM driver). I applied some big-endian patches and hit an issue with the IOMMU. Here are the symptoms:

 

On modprobe, the driver runs a test of the GPU command ring. It fills a command buffer in system memory, then the GPU fetches the buffer and executes it. As a result, a scratch register of the GPU is updated and the GPU advances its read pointer in the ring.

 

Without fsl_pamu enabled (the Linux PAMU driver), this test works fine and does not trigger any access error, which indicates that all address translations are set up properly (outbound windows, inbound windows and LAWs).

 

But with fsl_pamu enabled, this test fails. The command buffer isn't executed: neither the scratch value nor the ring read pointer is updated. Moreover, the fsl_pamu driver doesn't report any access error.

 

I dug a little further and found something strange. The AMD e8860 GPU supports PCIe AER. With fsl_pamu enabled, after the ring test, the GPU logs a TLP error with the following header:

 

Capabilities: [150 v2] Advanced Error Reporting
    UESta: DLP- SDES- TLP+ FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
    CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
    CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
    AERCap: First Error Pointer: 0c, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
    HeaderLog: 4a004010 00000040 11002000 00000000

 

As you can see, the header log tells us:

0x4a004010: Fmt 0x2 / Type 0x0a -> completion with data (CplD); EP = 1 -> poisoned TLP; Length 0x10 (16 DW)

0x00000040: Completer ID 0x0000 (my PCIe root port on the T2080); Byte Count 0x40

0x11002000: Requester ID 0x1100 (my AMD e8860 GPU at 11:00.0); Tag 0x20; Lower Address 0x00

0x00000000: first data DW, 0x0
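
For anyone who wants to re-check the decoding above, here is a minimal Python sketch that extracts the same fields from the four HeaderLog dwords. It assumes the log holds a 3DW completion header with the first header byte in the most significant byte of each dword, as lspci prints it. The function names and dictionary keys are illustrative only and do not come from any driver.

```python
# Sketch: decode an AER HeaderLog that holds a 3DW completion header.
# Bit positions follow the PCIe completion header layout; all names
# here are made up for illustration.

def format_bdf(requester_id):
    """Render a 16-bit requester/completer ID as bus:device.function."""
    bus = (requester_id >> 8) & 0xFF
    device = (requester_id >> 3) & 0x1F
    function = requester_id & 0x7
    return f"{bus:02x}:{device:02x}.{function}"


def decode_completion_header(dwords):
    """Decode the four HeaderLog dwords of a Cpl/CplD TLP into a dict."""
    dw0, dw1, dw2, dw3 = dwords
    requester_id = (dw2 >> 16) & 0xFFFF
    return {
        "fmt": (dw0 >> 29) & 0x7,         # 0x2 = 3DW header with data
        "type": (dw0 >> 24) & 0x1F,       # 0x0a = completion
        "td": (dw0 >> 15) & 0x1,          # TLP digest (ECRC) present
        "ep": (dw0 >> 14) & 0x1,          # 1 = poisoned data
        "length_dw": dw0 & 0x3FF,         # payload length in dwords
        "completer_id": (dw1 >> 16) & 0xFFFF,
        "status": (dw1 >> 13) & 0x7,      # 0 = successful completion
        "byte_count": dw1 & 0xFFF,
        "requester_id": requester_id,
        "requester_bdf": format_bdf(requester_id),
        "tag": (dw2 >> 8) & 0xFF,
        "lower_address": dw2 & 0x7F,
        "first_data_dw": dw3,
    }


if __name__ == "__main__":
    header_log = [0x4A004010, 0x00000040, 0x11002000, 0x00000000]
    for name, value in decode_completion_header(header_log).items():
        print(name, value if isinstance(value, str) else hex(value))
```

Note that the Length field (16 DW, i.e. 64 bytes) matches the Byte Count of 0x40, and the completion status field is 0 (successful completion), so the only abnormal thing in this header is the EP bit.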

 

So because the EP bit is set (poisoned TLP), the GPU logs this completion as a poisoned TLP error (TLP+ in UESta) and seems to silently discard it (no errors are reported through the EDAC layer). As far as I know, the EP bit is usually set when there is an ECRC error due to a link issue, but without fsl_pamu I don't get any errors, which makes me think the physical integrity of the link is not the cause.
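
The "First Error Pointer: 0c" field in the AER dump can be cross-checked the same way. The sketch below maps the pointer to a bit of the Uncorrectable Error Status register and looks up its severity in the UESvrt line. The bit numbers are the standard AER positions and the flag names are the ones lspci prints; the helper names are hypothetical.

```python
# Sketch: interpret the AER First Error Pointer against the
# Uncorrectable Error Status bit layout. Helper names are illustrative.

# bit number -> (lspci flag name, description)
UNCORRECTABLE_ERRORS = {
    4: ("DLP", "Data Link Protocol Error"),
    5: ("SDES", "Surprise Down Error"),
    12: ("TLP", "Poisoned TLP Received"),
    13: ("FCP", "Flow Control Protocol Error"),
    14: ("CmpltTO", "Completion Timeout"),
    15: ("CmpltAbrt", "Completer Abort"),
    16: ("UnxCmplt", "Unexpected Completion"),
    17: ("RxOF", "Receiver Overflow"),
    18: ("MalfTLP", "Malformed TLP"),
    19: ("ECRC", "ECRC Error"),
    20: ("UnsupReq", "Unsupported Request"),
    21: ("ACSViol", "ACS Violation"),
}


def parse_flags(line):
    """Turn 'DLP+ SDES- ...' into {'DLP': True, 'SDES': False, ...}."""
    flags = {}
    for token in line.split():
        if token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return flags


def describe_first_error(first_error_pointer, severity_flags):
    """Return (lspci flag, description, severity) for the first error."""
    flag, description = UNCORRECTABLE_ERRORS[first_error_pointer]
    severity = "fatal" if severity_flags.get(flag, False) else "non-fatal"
    return flag, description, severity


UESTA = ("DLP- SDES- TLP+ FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- "
         "MalfTLP- ECRC- UnsupReq- ACSViol-")
UESVRT = ("DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ "
          "MalfTLP+ ECRC- UnsupReq- ACSViol-")

if __name__ == "__main__":
    print(describe_first_error(0x0C, parse_flags(UESVRT)))
```

With the values from my dump, pointer 0x0c selects bit 12, which is the only status bit set in UESta, and its severity bit in UESvrt is clear. That would be consistent with the error being treated as non-fatal and not being escalated.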

 

Any ideas?

Thomas

yipingwang
NXP TechSupport
  1. AFAIK, the T2080RDB has not been tested with the 5.10 kernel, so could you verify with the LSDK-2012 or Yocto 3.1 release?
  2. Does it work if you replace the GPU with another PCIe card, such as an e1000?

 

Please check the results of the above tests first.

Toumou
Contributor I

Hi

 

Sorry for the late answer, summer holidays!

 

The GPU is behind a PCIe switch. Behind the same switch, alongside the GPU, there is a V4L2 video device which also performs DMA to/from system memory, and this V4L2 device has no issue with its DMA.

 

I also tested multiple kernel versions, from 4.14 to 5.10 (LSDK or not), and I still get the same issue.

 

Unfortunately I don't have another GPU to test with...

yipingwang
NXP TechSupport

What is the command buffer address used for your GPU access?

Can you try this GPU on the T2080RDB board?
