Hello,
I am using an i.MX6qdl processor on a customer platform. An FPGA is connected to the PCIe interface and communicates with its internal IP cores using legacy interrupts. During long-term tests we found that the interrupt mapping sometimes changes for some reason. After booting, the FPGA interrupt is mapped to irq 300. Once the long-term tests start generating traffic on the PCIe bus, it can happen within a few days (it does not occur very often) that the FPGA interrupt shows up on irq 299, where no handler is registered and nobody cares about the interrupt:
 72:        0          0          0          0       GPC 114 Level     mmdc_1
 73:        0          0          0          0       GPC   8 Level     2800000.ipu
 74:        0          0          0          0       GPC   7 Level     2800000.ipu
241:        0          0          0          0  gpio-mxc   6 Edge      ad7606
299:   100000          0          0          0       GPC 123 Level     PCIe PME
300:   544352          0          0          0       GPC 121 Level     16Z087
301:       16          0          0          0     GIC-0 137 Level     2101000.jr0
[171521.126091] irq 299: nobody cared (try booting with the "irqpoll" option)
…
[171521.135735] Disabling IRQ #299
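For background: on a shared legacy (INTx) line, every registered handler has to read its own device's status and return IRQ_NONE when the interrupt was not raised by its device; if essentially all of a long run of interrupts on the line go unclaimed, the kernel prints the "nobody cared" message above and disables the IRQ. A minimal sketch of such a handler (names and the register offset are illustrative, not taken from our actual driver):

#include <linux/interrupt.h>
#include <linux/io.h>

/* Illustrative device context with a mapped register window. */
struct my_ip_core {
	void __iomem *regs;
};

#define MY_IRQ_STATUS	0x10	/* hypothetical interrupt status register */

static irqreturn_t my_ip_isr(int irq, void *data)
{
	struct my_ip_core *ip = data;
	u32 status = readl(ip->regs + MY_IRQ_STATUS);

	if (!status)
		return IRQ_NONE;	/* not our device; the core counts this as unhandled */

	writel(status, ip->regs + MY_IRQ_STATUS);	/* acknowledge in the device */
	/* ... handle the event ... */
	return IRQ_HANDLED;
}

/* Each IP core on the line registers with IRQF_SHARED, e.g. in probe():
 *	request_irq(pdev->irq, my_ip_isr, IRQF_SHARED, "my-ip-core", ip);
 */

In our failure case the FPGA interrupt arrives on irq 299, where only the PCIe PME handler is registered, so nothing claims it and the line gets disabled.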
We already know that we have to disable MSI in order to use legacy interrupts, otherwise nothing works at all. After the first misrouted interrupt the IRQ gets disabled and communication with the FPGA stops working. We also checked the interrupt behavior manually after the error occurred, and it really shows that the FPGA interrupt appears on a different bit of the GIC:
Reading the CPU IRQ status register before the error:
root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76ff7024): 0x82000000
Reading the CPU IRQ status register after the error:
root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76f16024): 0x84000000
Disabling the IRQ within the FPGA and reading the CPU status register again:
root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76f0b024): 0x80000000
Re-enabling the IRQ within the FPGA and reading the CPU status register again:
root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76ff4024): 0x84000000
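For completeness, the same register read can be done without the memrw tool by mapping /dev/mem from userspace. A minimal sketch (run as root; 0x020dc024 is simply the address polled above, and whether it is the right register to watch is board-specific):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const off_t phys = 0x020dc024;		/* register address used above */
	long page = sysconf(_SC_PAGESIZE);
	off_t base = phys & ~(page - 1);	/* mmap() needs a page-aligned offset */

	int fd = open("/dev/mem", O_RDONLY | O_SYNC);
	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}

	volatile uint32_t *map = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, base);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	printf("Value at 0x%lx: 0x%08x\n", (long)phys, map[(phys - base) / 4]);

	munmap((void *)map, page);
	close(fd);
	return 0;
}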
- Did we miss something when disabling MSI?
- Or is this an already known issue with an existing workaround?
- What can be the reason for this behavior?
Regards,
Andreas
Hi Glen,
To reproduce the problem you must ensure that legacy interrupts are used. I am not familiar with the Intel NIC.
Check the following link for an example:
http://lists.infradead.org/pipermail/openwrt-devel/2017-April/007107.html
There, someone also has a card that supports only legacy interrupts. I had to apply the same patch to your BSP to get legacy interrupts working.
So I guess it should be possible to reproduce the problem with the ath9k after applying the given patch.
Yes, because there is no issue with MSI interrupts, and the PCIe controller inside the FPGA cannot send any vector other than INTA when using legacy interrupts (a hard IP from Altera is used, i.e. not our own development).
I don't think there is a general problem with PCIe; it only concerns legacy interrupts.
I have now been running a test with MSI for ~4-5 weeks and have not seen a single fault.
Regards,
Andreas
Hi Glen,
I guess that to reproduce the problem you will need a PCIe endpoint device with more than one source sharing a legacy interrupt. In our case there were 4 IP cores sharing the INTA PCIe legacy interrupt. You will then need to generate a high interrupt load on all interrupt sources.
We ran iperf on 2 Ethernet IP cores and our own UART test on 2 high-speed UARTs (all IP cores were developed by us) to generate a high load.
But I guess it will be difficult to find a suitable PCIe endpoint that provides legacy interrupts with INTA shared between multiple devices.
Regards,
Andreas
Hi andreasgeißler,
R&D has set up a board with the PME and an Intel NIC sharing the same IRQ. R&D asks whether this is sufficient to reproduce the problem.
Also, has the FPGA been ruled out as a root cause?
Have you tested using one of our boards in a similar configuration?
We have tested the PCIe system on our boards (i.MX6Q Sabre) and have never seen this type of problem.
BR,
Glen
Hi Andreas,
I was just reading your post and found it interesting. I think you will not see the problem with MSI, since MSI bypasses the interrupt controller and interrupts the CPU with a memory write. I think there is a serious problem with the I/O interrupt controller on the i.MX6. I hope I am wrong, and I would be really happy if it turned out to be a software bug.
Regards,
Adeel
Hi Glen,
Sorry for the late reply, I am very busy at the moment:
Because the problem is very urgent, we decided to implement MSI vectors in our drivers and in the FPGA. So far (~2 weeks) the issue has not occurred with MSI support. We will continue with this solution and check whether the problem is also fixed in the real environment on the customer's side.
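For reference, the driver-side part of that change boils down to something like the following sketch (names are placeholders, not our actual driver; it assumes a kernel that already provides pci_alloc_irq_vectors()):

#include <linux/interrupt.h>
#include <linux/pci.h>

static irqreturn_t fpga_isr(int irq, void *data)
{
	/* With MSI the vector belongs to this device alone, so there is
	 * nothing to claim or reject as on the shared INTA line. */
	return IRQ_HANDLED;
}

static int fpga_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int ret;

	ret = pcim_enable_device(pdev);
	if (ret)
		return ret;

	/* Request exactly one MSI vector instead of the shared legacy INTA. */
	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSI);
	if (ret < 0)
		return ret;

	ret = devm_request_irq(&pdev->dev, pci_irq_vector(pdev, 0),
			       fpga_isr, 0, "fpga-msi", pdev);
	if (ret)
		pci_free_irq_vectors(pdev);
	return ret;
}

The FPGA side additionally has to advertise an MSI capability and generate the MSI memory writes.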
If the urgency subsides I will check what happens with pci=nomsi, but currently I have only one system, and I need it to verify the solution with MSI vectors.
Regards,
Andreas
Hi Glenn,
I'm experiencing difficulties using a WPEA-121N Wi-Fi PCIe board with the ath9k driver. The connection doesn't stay up for more than a few seconds. I stumbled upon this thread and disabled MSI, which solved my problem. If this problem is related, it could be a very easy way to reproduce it, as the issue only takes seconds to appear.
I'm using a TEP1560IMX6 board from TechNexion with https://github.com/TechNexion/linux/tree/tn-imx_4.9.88_2.0.0_ga-test, but I've also reproduced the issue with https://source.codeaurora.org/external/imx/linux-imx/tag/?h=rel_imx_4.9.88_2.0.0_ga. I expect you could reproduce this issue on a Sabre board.
Best regards,
Adrien
Hi Glen,
Thanks for the very fast reply and for forwarding the results. In parallel, I will check the behavior when PME is completely removed from the kernel.
Thanks,
Andreas
Hi Andreas,
No new info. I'll make R&D aware that it didn't work.
Can you try one thing? I have a sneaking suspicion that the MSI code may be the culprit.
Add 'pci=nomsi' (without quotes) to the U-Boot kernel command line and see if that changes anything.
https://en.wikipedia.org/wiki/Message_Signaled_Interrupts
BR,
Glen
Hi Glen,
I implemented the changes today and started the test again, but I am not sure whether this is really enough to get new information. The driver has only a few debug messages. Also, I could not build the driver as a kernel module, because the kernel config does not provide that option; it seems it can only be built statically into the kernel?
Is there anything else I could do to get more detailed debug messages for this issue?
Regards,
Andreas
Hi andreasgeißler,
R&D replied to the query about building PCIe into the kernel:
What I meant was not to compile the PCIe driver as a kernel module (KLM).
Just don't build the PCIe PME into the kernel:
config PCIE_PME
	def_bool y
	depends on PCIEPORTBUS && PM
BR,
Glen
Hello Glen,
I will give it a try. By 'separately' do you mean as a loadable kernel module?
And by the way, I tried to reproduce the problem again, and this time it took 5 days. The occurrence can also be very rare.
It will take some time to come back with the result of the DEBUG messages. I will keep you informed.
Thanks,
Andreas