LX2160A RDB + PCIe cache coherency issue


KevinM
Contributor I

Hello,

I believe I'm having a cache coherency issue between my LX2160A RDB and my PCIe device. The PCIe device is a cryptographic accelerator ASIC called "Virgo" that we're developing in house. The LX2160A is the root complex.

The host sends requests to the Virgo and the Virgo responds. The requests are sent using DMA: the Virgo driver prepares a chain of descriptor tables for the Virgo to read, where each descriptor table (DT) corresponds to a single request for the Virgo to process. A DT contains an end-of-chain (EOC) bit that indicates whether it holds a valid request or is the dummy DT at the end of the chain. The Virgo should only process a DT if its EOC bit is 0. The whole scheme is very similar to how network cards use DMA and descriptors to send/receive packets.
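To make the layout concrete, here is a rough sketch of what a DT looks like. The field names and offsets are illustrative only, not the real Virgo format; the relevant points are the 96-byte size and the EOC bit.

```c
#include <linux/types.h>

/* Rough sketch of a 96-byte descriptor table (DT). Field names and
 * offsets are illustrative, not the real Virgo format; what matters is
 * that the DT is 96 bytes and therefore spans two 64-byte cache lines. */
struct virgo_dt {
    __le64 next_dt_addr;    /* bus address of the next DT in the chain  */
    __le64 src_addr;        /* request payload fields (illustrative)    */
    __le64 dst_addr;
    __le32 len;
    __le32 flags;           /* bit 0: EOC. 1 = dummy end-of-chain DT,
                             *              0 = valid request           */
    u8     opaque[64];      /* rest of the request; pads the DT to 96 B */
} __packed;                 /* 32 + 64 = 96 bytes total                 */
```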

The driver performs the following steps when appending a DT to the chain (a simplified sketch of this path is shown after the list and the note below).

1. Allocate a new DT, which will become the new dummy DT at the end of the chain.

2. Update the old dummy EOC DT with the real request data and point it at the new dummy EOC DT. The driver is careful to leave EOC=1 while updating the DT.

3. Issue a wmb() to ensure that all of these writes happen before the next step.

4. Clear EOC to 0. At this point, the DT contains a valid request and can be processed by the Virgo.

5. Set the "refetch" bit on the Virgo to indicate that a new DT is available.

Note that the Virgo can see the new request at any point because it's just processing requests in the chain until it hits the EOC DT.
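Here is a simplified sketch of that append path. The structure and helper names (struct virgo_chain, virgo_fill_dt, VIRGO_REFETCH_REG, the EOC bit position) are placeholders rather than the real driver code, but the ordering matches the steps above.

```c
#include <linux/bits.h>
#include <linux/dmapool.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/io.h>

#define VIRGO_DT_EOC       BIT(0)  /* EOC flag -- placeholder bit position       */
#define VIRGO_REFETCH_REG  0x10    /* "refetch" doorbell -- placeholder offset   */

struct virgo_request;              /* opaque here; real layout omitted           */

struct virgo_chain {
    struct dma_pool *dt_pool;
    void __iomem    *regs;
    struct virgo_dt *eoc_dt;       /* current dummy end-of-chain DT              */
    dma_addr_t       eoc_dt_bus;
};

/* Placeholder: copies the request fields into the DT, leaving EOC set. */
void virgo_fill_dt(struct virgo_dt *dt, const struct virgo_request *req);

static int virgo_append_request(struct virgo_chain *chain,
                                const struct virgo_request *req)
{
    struct virgo_dt *old_eoc = chain->eoc_dt;    /* current dummy DT */
    struct virgo_dt *new_eoc;
    dma_addr_t new_eoc_bus;

    /* 1. Allocate a new dummy DT that becomes the new end of the chain. */
    new_eoc = dma_pool_zalloc(chain->dt_pool, GFP_ATOMIC, &new_eoc_bus);
    if (!new_eoc)
        return -ENOMEM;
    new_eoc->flags = cpu_to_le32(VIRGO_DT_EOC);

    /* 2. Fill the old dummy DT with the real request and link it to the
     *    new dummy DT; its EOC bit stays set while this happens.        */
    virgo_fill_dt(old_eoc, req);
    old_eoc->next_dt_addr = cpu_to_le64(new_eoc_bus);

    /* 3. Order all of the above writes before the EOC clear below.      */
    wmb();

    /* 4. Clear EOC: from this point the DT is a valid request.          */
    old_eoc->flags = cpu_to_le32(le32_to_cpu(old_eoc->flags) & ~VIRGO_DT_EOC);

    /* 5. Ring the "refetch" doorbell so the Virgo re-reads the chain.   */
    writel(1, chain->regs + VIRGO_REFETCH_REG);

    chain->eoc_dt     = new_eoc;
    chain->eoc_dt_bus = new_eoc_bus;
    return 0;
}
```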

The size of a DT is 96 bytes so it doesn't fit in a single cache line (64 bytes). I'm careful to allocate the memory in such a way that a DT gets 2 cache lines to itself. The issue I'm seeing is that sometimes the Virgo will see good data in the first 64 bytes and bad data in the last 32 bytes. The Virgo will see a DT where EOC=0 (indicating that the DT is ready to be processed) but the last 32 bytes are still the dummy DT. This leads to the Virgo choking on this bad DT. The driver is very careful to update EOC=0 last so it should not be possible for the Virgo to see a DT where EOC=0 and parts of the DT are still the dummy values programmed in the dummy EOC DT. This makes me think it's some kind of cache coherency issue.
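For reference, I give each DT its own two cache lines roughly like this; the real allocation code differs, but the point is the 128-byte element size and alignment.

```c
#include <linux/dmapool.h>
#include <linux/errno.h>

#define VIRGO_DT_SIZE    96
#define VIRGO_DT_STRIDE  128   /* 2 x 64-byte cache lines per DT */

/* Create a pool whose element size and alignment are both 128 bytes so
 * that no two DTs ever share a cache line. "dev" is the PCIe function's
 * struct device; error handling trimmed. */
static int virgo_create_dt_pool(struct virgo_chain *chain, struct device *dev)
{
    chain->dt_pool = dma_pool_create("virgo-dt", dev,
                                     VIRGO_DT_STRIDE,   /* element size */
                                     VIRGO_DT_STRIDE,   /* alignment    */
                                     0);                /* no boundary  */
    return chain->dt_pool ? 0 : -ENOMEM;
}
```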

I hooked up a PCIe analyzer to see if anything was wrong there. The Virgo sends a single read request for 96 bytes. The NO_SNOOP bit is 0 and the relaxed ordering bit is 0. The root complex (LX2160A) responds with a single completion, but the data contains the bad DT. I don't think anything is going wrong at the PCIe level. I could see this issue being caused by the Virgo issuing multiple read requests to read a single DT, but that's not the case here. I've captured several bad DTs with the PCIe analyzer and they all follow the same pattern: the first 64 bytes are good and the last 32 bytes contain dummy DT values.

The Virgo supports SR-IOV. My test case stress tests 4 VFs with crypto requests. The issue happens quite frequently and takes less than 5 minutes to reproduce. I'm currently working around it by commenting out "dma-coherent" in the device tree, which causes the DTs to be allocated in non-cacheable memory, and the problem no longer appears. We have high performance requirements in our product (100,000 operations/second), so marking the PCIe devices as not dma-coherent isn't a good long-term solution. I would like to take advantage of the CCN architecture so that incoming PCIe reads can simply snoop the data from the CPU caches.

I'm at a loss as to how to proceed from here. I've tried many different combinations of barriers (wmb() and dma_wmb()) when writing the DT, and nothing helps. I've compared my driver to several network drivers that also work with DMA descriptors, and I believe my code is correct. I can work around the issue by manually cleaning the cache lines where the DT resides before writing EOC=0 (see the sketch below), but again, this isn't ideal. The CCN architecture should allow the PCIe controller to snoop the CPU caches.
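For completeness, the manual clean I'm referring to is along these lines (open-coded arm64 DC CVAC for the experiment; the exact mechanism in my test hack may differ, and a real driver would use a proper cache maintenance helper):

```c
/* Clean the two 64-byte cache lines that hold one 96-byte DT to the
 * point of coherency (DC CVAC) before EOC is cleared. arm64-specific,
 * workaround/test code only. */
static inline void virgo_clean_dt_lines(void *dt)
{
    unsigned long line = (unsigned long)dt & ~63UL;

    asm volatile("dc cvac, %0" : : "r"(line)      : "memory");
    asm volatile("dc cvac, %0" : : "r"(line + 64) : "memory");
    asm volatile("dsb sy"      : : : "memory");
}

/* Usage in the append path: call this right before step 4 (clearing EOC). */
```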

I attached the output of lspci -vvv in case there are any hints there. I also attached my kernel config, because I had to modify some of the IOMMU options to get stronger isolation between the VFs. In my use case, it's essential that the VFs can't access each other's memory on the host.

Any help on what the issue might be would be greatly appreciated,

Kevin
