Hello,
I am using an i.MX6qdl processor on a customer platform. An FPGA is connected to the PCIe interface and communicates with its internal IP cores using legacy interrupts. During long-term tests we found that the interrupt mapping sometimes changes for some reason. After booting, the FPGA interrupt is mapped to irq 300. Once the long-term tests start generating traffic on the PCIe bus, it can happen within a few days (it does not occur very often) that the FPGA interrupt shows up on irq 299, where no handler is registered and nobody cares about the interrupt:
 72:        0          0          0          0       GPC 114 Level     mmdc_1
 73:        0          0          0          0       GPC   8 Level     2800000.ipu
 74:        0          0          0          0       GPC   7 Level     2800000.ipu
241:        0          0          0          0  gpio-mxc   6 Edge      ad7606
299:   100000          0          0          0       GPC 123 Level     PCIe PME
300:   544352          0          0          0       GPC 121 Level     16Z087
301:       16          0          0          0     GIC-0 137 Level     2101000.jr0
[171521.126091] irq 299: nobody cared (try booting with the "irqpoll" option)
…
[171521.135735] Disabling IRQ #299
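For background: on a shared legacy (INTx) line, every registered handler has to read its own device's status and return IRQ_NONE when the interrupt was not raised by its device; if essentially all of a long run of interrupts on the line go unclaimed, the kernel prints the "nobody cared" message above and disables the IRQ. A minimal sketch of such a handler (names and the register offset are illustrative, not taken from our actual driver):

#include <linux/interrupt.h>
#include <linux/io.h>

/* Illustrative device context with a mapped register window. */
struct my_ip_core {
	void __iomem *regs;
};

#define MY_IRQ_STATUS	0x10	/* hypothetical interrupt status register */

static irqreturn_t my_ip_isr(int irq, void *data)
{
	struct my_ip_core *ip = data;
	u32 status = readl(ip->regs + MY_IRQ_STATUS);

	if (!status)
		return IRQ_NONE;	/* not our device; the core counts this as unhandled */

	writel(status, ip->regs + MY_IRQ_STATUS);	/* acknowledge in the device */
	/* ... handle the event ... */
	return IRQ_HANDLED;
}

/* Each IP core on the line registers with IRQF_SHARED, e.g. in probe():
 *	request_irq(pdev->irq, my_ip_isr, IRQF_SHARED, "my-ip-core", ip);
 */

In our failure case the FPGA interrupt arrives on irq 299, where only the PCIe PME handler is registered, so nothing claims it and the line gets disabled.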
We already know that we have to disable MSI in order to use legacy interrupts, otherwise nothing works at all. After the first misrouted interrupt the IRQ gets disabled and communication with the FPGA stops working. We also checked the interrupt behavior manually after the error occurred, and it really shows that the FPGA interrupt appears on a different bit of the GIC:
Reading the CPU IRQ status register before the error:
root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76ff7024): 0x82000000
Reading the CPU IRQ status register after the error:
root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76f16024): 0x84000000
Disabling the IRQ within the FPGA and reading the CPU status register again:
root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76f0b024): 0x80000000
Re-enabling the IRQ within the FPGA and reading the CPU status register again:
root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76ff4024): 0x84000000
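For completeness, the same register read can be done without the memrw tool by mapping /dev/mem from userspace. A minimal sketch (run as root; 0x020dc024 is simply the address polled above, and whether it is the right register to watch is board-specific):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const off_t phys = 0x020dc024;		/* register address used above */
	long page = sysconf(_SC_PAGESIZE);
	off_t base = phys & ~(page - 1);	/* mmap() needs a page-aligned offset */

	int fd = open("/dev/mem", O_RDONLY | O_SYNC);
	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}

	volatile uint32_t *map = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, base);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	printf("Value at 0x%lx: 0x%08x\n", (long)phys, map[(phys - base) / 4]);

	munmap((void *)map, page);
	close(fd);
	return 0;
}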
- Did we miss something when disabling MSI?
- Or is this an already known issue with an existing workaround?
- What can be the reason for this behavior?
Regards,
Andreas
Hi Glen,
To reproduce the problem you must ensure that legacy interrupts are used. I am not familiar with the Intel NIC.
Check the following link for an example:
http://lists.infradead.org/pipermail/openwrt-devel/2017-April/007107.html
There, someone also has a card that supports only legacy interrupts. I had to apply the same patch to your BSP to get legacy interrupts working.
So I guess it should be possible to reproduce the problem with the ath9k after applying the given patch.
Yes, because there is no issue with MSI interrupts, and the PCIe controller inside the FPGA cannot send any vector other than INTA when using legacy interrupts (a hard IP from Altera is used, i.e. not our own development).
I don't think there is a general problem with PCIe; it only concerns legacy interrupts.
I have now been running a test with MSI for ~4-5 weeks and have not seen a single fault.
Regards,
Andreas
Hi Glen,
I guess that to reproduce the problem you will need a PCIe endpoint device with more than one source sharing a legacy interrupt. In our case there were 4 IP cores sharing the INTA PCIe legacy interrupt. You will then need to generate a high interrupt load on all interrupt sources.
We ran iperf on 2 Ethernet IP cores and our own UART test on 2 high-speed UARTs (all IP cores were developed by us) to generate a high load.
But I guess it will be difficult to find a suitable PCIe endpoint that provides legacy interrupts with INTA shared between multiple devices.
Regards,
Andreas
Hi andreasgeißler,
R&D has set up a board with the PME and an Intel NIC sharing the same IRQ. R&D asks whether this is sufficient to reproduce the problem.
Also, has the FPGA been ruled out as a root cause?
Have you tested using one of our boards in a similar configuration?
We have tested the PCIe system on our boards (i.MX6Q Sabre) and have never seen this type of problem.
BR,
Glen
Hi Andreas,
I was just reading your post and found it interesting. I think you will not see the problem with MSI, since MSI bypasses the interrupt controller and interrupts the CPU with a memory write. I think there is a serious problem with the I/O interrupt controller on the i.MX6. I hope I am wrong, and I would be really happy if it turned out to be a software bug.
Regards,
Adeel
Hi Glen,
Sorry for the late reply, I am very busy at the moment:
Because the problem is very urgent, we decided to implement MSI vectors in our drivers and in the FPGA. So far (~2 weeks) the issue has not occurred with MSI support. We will continue with this solution and check whether the problem is also fixed in the real environment on the customer's side.
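For reference, the driver-side part of that change boils down to something like the following sketch (names are placeholders, not our actual driver; it assumes a kernel that already provides pci_alloc_irq_vectors()):

#include <linux/interrupt.h>
#include <linux/pci.h>

static irqreturn_t fpga_isr(int irq, void *data)
{
	/* With MSI the vector belongs to this device alone, so there is
	 * nothing to claim or reject as on the shared INTA line. */
	return IRQ_HANDLED;
}

static int fpga_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int ret;

	ret = pcim_enable_device(pdev);
	if (ret)
		return ret;

	/* Request exactly one MSI vector instead of the shared legacy INTA. */
	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSI);
	if (ret < 0)
		return ret;

	ret = devm_request_irq(&pdev->dev, pci_irq_vector(pdev, 0),
			       fpga_isr, 0, "fpga-msi", pdev);
	if (ret)
		pci_free_irq_vectors(pdev);
	return ret;
}

The FPGA side additionally has to advertise an MSI capability and generate the MSI memory writes.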
If the urgency subsides I will check what happens with pci=nomsi, but currently I have only one system, and I need it to verify the solution with MSI vectors.
Regards,
Andreas
Hi Glenn,
I'm experiencing difficulties using a WPEA-121N Wi-Fi PCIe board with the ath9k driver. The connection doesn't stay up for more than a few seconds. I stumbled upon this thread and disabled MSI, which solved my problem. If this problem is related, it could be a very easy way to reproduce it, as the issue only takes seconds to appear.
I'm using a TEP1560IMX6 board from TechNexion with https://github.com/TechNexion/linux/tree/tn-imx_4.9.88_2.0.0_ga-test, but I've also reproduced the issue with https://source.codeaurora.org/external/imx/linux-imx/tag/?h=rel_imx_4.9.88_2.0.0_ga. I expect you could reproduce this issue on a Sabre board.
Best regards,
Adrien
Hi Glen,
Thanks for the very fast reply and for forwarding the results. In parallel, I will check the behavior when PME is completely removed from the kernel.
Thanks,
Andreas
Hi Andreas,
No new info. I'll make R&D aware that it didn't work.
Can you try one thing? I have a sneaking suspicion that the MSI code may be the culprit.
Add 'pci=nomsi' (without quotes) to the U-Boot kernel command line and see if that changes anything.
https://en.wikipedia.org/wiki/Message_Signaled_Interrupts
BR,
Glen
Hi Glen,
I implemented the changes today and started the test again, but I am not sure whether this is really enough to get new information. The driver has only a few debug messages. Also, I could not build the driver as a kernel module, because the kernel config does not provide that option; it seems it can only be built statically into the kernel?
Is there anything else I could do to get more detailed debug messages for this issue?
Regards,
Andreas
Hi andreasgeißler,
R&D replied to the query about building PCIe into the kernel:
What I meant was not to compile the PCIe driver as a kernel module (KLM).
Just don't build the PCIe PME into the kernel:
config PCIE_PME
	def_bool y
	depends on PCIEPORTBUS && PM
BR,
Glen
Hello Glen,
I will give it a try. By 'separately' do you mean as a loadable kernel module?
And by the way, I tried to reproduce the problem again, and this time it took 5 days. The occurrence can also be very rare.
It will take some time to come back with the result of the DEBUG messages. I will keep you informed.
Thanks,
Andreas