i.MX6 PCIe interrupt mapping changes during runtime


6,675 Views
andreasgeißler
Contributor II

Hello,

I am using an i.MX6qdl processor on a customer platform. An FPGA is connected to the PCIe interface to communicate with the internal IP cores, using legacy interrupts. We found that during long-term tests the interrupt mapping sometimes changes for some reason. After booting, the FPGA's interrupt is mapped to irq 300. After starting the long-term tests, which generate traffic on the PCIe bus, within a few days (it does not occur very often) the FPGA's interrupt shows up on irq 299, where no handler is registered and nobody cares about the interrupt:

72:          0          0          0          0       GPC 114 Level     mmdc_1

73:          0          0          0          0       GPC   8 Level     2800000.ipu

74:          0          0          0          0       GPC   7 Level     2800000.ipu

241:          0          0          0          0  gpio-mxc   6 Edge      ad7606

299:     100000          0          0          0       GPC 123 Level     PCIe PME

300:     544352          0          0          0       GPC 121 Level     16Z087

301:         16          0          0          0     GIC-0 137 Level     2101000.jr0

 

[171521.126091] irq 299: nobody cared (try booting with the "irqpoll" option)

[171521.135735] Disabling IRQ #299

We already know that we have to disable MSI for legacy interrupts, otherwise nothing works at all. After the first misrouted interrupt is received, the IRQ gets disabled and communication with the FPGA stops working. We also checked the interrupt behavior manually after the error occurred, and it really shows that the FPGA's interrupt appears on a different bit of the GIC:


Reading the CPU IRQ status register before the error:

root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76ff7024): 0x82000000


Reading the CPU IRQ status register after the error:

root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76f16024): 0x84000000

Disabling the IRQ within the FPGA and reading the status register again:


root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76f0b024): 0x80000000

Re-enabling the IRQ within the FPGA and reading the status register again:


root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76ff4024): 0x84000000

- Did we miss something when disabling the MSI?

- Or is this an already known issue with an existing workaround?

- What could be the reason for this behavior?

Regards,

Andreas

34 Replies

andreasgeißler
Contributor II

Hi Glen,

To reproduce the problem you must ensure that legacy interrupts are used. I am not familiar with the Intel NIC. 

Check the following link for an example:

http://lists.infradead.org/pipermail/openwrt-devel/2017-April/007107.html

Here someone also has a card with only legacy interrupts. I had to apply the same patch to your BSP to get legacy interrupts working.

So I guess with the ath9k it should be possible to reproduce the problem by applying the given patch.

Yes, because there is no issue with MSI interrupts, and it is not possible for the PCIe controller inside the FPGA to send vectors other than INTA when using legacy interrupts (a Hard IP from Altera is used => not our development).

I don't think there is a general problem with PCIe; it is only about legacy interrupts.

I have now been running a test for ~4-5 weeks with MSI and have not had a single fault.

Regards,

Andreas

karina_valencia
NXP Apps Support

gfine, did you get the last comment from R&D?

gfine
NXP Employee

Already handled via e-mail.

andreasgeißler
Contributor II

Hi Glen,

I guess that to reproduce the problem you will need a PCIe endpoint device with more than one device sharing a legacy interrupt. In our case there were 4 IP cores sharing the INTA PCIe legacy interrupt. Then you will need to produce a high interrupt load on all interrupt sources.

We ran iperf on 2 Ethernet IP cores and our own UART test on 2 high-speed UARTs (all IP cores were developed by us) to generate a high load.

But I guess it will be difficult to find a suitable PCIe endpoint that provides legacy interrupts with INTA shared among multiple devices.

Regards,

Andreas

gfine
NXP Employee

Hi andreasgeißler,

R&D has set up a board with the PME and an Intel NIC sharing the same IRQ. R&D asks whether this is sufficient to reproduce the problem.

Also has the FPGA been ruled out as a root cause?  

Have you tested using one of our boards in a similar configuration?

We have tested the PCIe system on our boards (i.MX6Q Sabre) and have never seen this type of problem. 

BR,


Glen

adeel
Contributor III

I think the important point is that it should be a shared interrupt at the source, i.e. on the FPGA side.

gfine
NXP Employee

Hi Andreas,

Thank you. I'll pass this on to the developers.

BR,


Glen

adeel
Contributor III

Hi Andreas,

I was just reading your post and found it interesting. I think you will not see the problem with MSI, since it bypasses the interrupt controller and interrupts the CPU with a memory write. I think there is a serious problem with the I/O interrupt controller on the i.MX6. I hope I am wrong, and I would be really happy if we found a software bug.

Regards,

Adeel

andreasgeißler
Contributor II

Hi Glen,

Sorry for the late reply; I am very busy at the moment.

Because the problem is very urgent, we decided to implement MSI vectors in our drivers and in the FPGA. We found that with MSI support the issue has not occurred again so far (~2 weeks). We will go on with this solution and check whether the problem is also fixed in the real environment at the customer's site.

If the urgency subsides, I will check what happens with pci=nomsi. But currently I have only one system, and on it I have to verify the solution with MSI vectors.

Regards,

Andreas

gfine
NXP Employee

Hi Andreas,

I need a good recreation scenario to reproduce the problem. I know it takes 2 weeks or so to reproduce.  What are you running on the target system during the test?

BR,


Glen

adrien_bruant
Contributor I

Hi Glenn,

I'm experiencing difficulties using a WPEA-121N Wi-Fi PCIe board with the ath9k driver. The connection doesn't stay up for more than a few seconds. I stumbled upon this thread and disabled MSI, which solved my problem. If this problem is related, this could be a very easy way to reproduce it, as it only takes seconds for the issue to appear.

I'm using a TEP1560IMX6 board by TechNexion, with https://github.com/TechNexion/linux/tree/tn-imx_4.9.88_2.0.0_ga-test, but I've also reproduced the issue with https://source.codeaurora.org/external/imx/linux-imx/tag/?h=rel_imx_4.9.88_2.0.0_ga. I expect that you could reproduce this issue on a Sabre board.

Best regards,

Adrien

andreasgeißler
Contributor II

Hi Glen,

Thanks for the very fast reply and for forwarding the results. In parallel, I will check the behavior when PME is completely removed from the kernel.

Thanks,

Andreas

andreasgeißler
Contributor II

Hi Glen,

Unfortunately, even after setting the PCIE_PME config to 'n', the behavior seems the same.

Is there any news from R&D?

Regards,

Andreas

gfine
NXP Employee

Hi andreasgeißler,

R&D is asking about the current status. Did the problem recur with 'pci=nomsi'?

BR,

Glen

gfine
NXP Employee

Hi Andreas,

No new info. I'll make R&D aware that it didn't work.

Can you try one thing? I have a sneaking feeling that the MSI code may be the culprit.

Add 'pci=nomsi' (without quotes) to the u-boot kernel command line, and see if that changes anything.

https://en.wikipedia.org/wiki/Message_Signaled_Interrupts
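For example, from the U-Boot prompt (the exact environment variable that carries the kernel arguments is board-specific; on many i.MX6 boards it is mmcargs or similar rather than bootargs, so check printenv first):

```
=> setenv bootargs "${bootargs} pci=nomsi"
=> saveenv
=> boot
```

After booting, the argument should be visible in /proc/cmdline on the target.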

BR,


Glen

andreasgeißler
Contributor II

Hi Glen,

I have implemented the changes today and started the test again. But I am not sure if this is really enough to get new information. The driver has only a few debug messages. And I could not build the driver as a kernel module, because this is not provided by the kernel config; it seems to be built statically?

Could I do anything else to get more detailed debug messages for the issue?

Regards,

Andreas

gfine
NXP Employee

Hi andreasgeißler,

R&D replied to the query about building PCIe into the kernel.

What I meant was not to compile the PCIe driver as a loadable kernel module.

Just don't build the PCIe PME support into the kernel:

config PCIE_PME
def_bool y
depends on PCIEPORTBUS && PM
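Since PCIE_PME is a def_bool without a prompt, it cannot be switched off from menuconfig; one way is to change the default in the Kconfig entry itself (the file path is assumed here, e.g. drivers/pci/pcie/Kconfig on 4.9-era kernels) and rebuild:

```
config PCIE_PME
	def_bool n
	depends on PCIEPORTBUS && PM
```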

BR,

Glen

andreasgeißler
Contributor II

Hello Glen,

I have added DEBUG to the PCIe PME driver and could reproduce the problem again. I have attached the dmesg log, but could not find anything new or anything else that would help to find the reason for the problem.

Regards,

Andreas

gfine
NXP Employee

Hi Andreas,

Pushed the dmesg to R&D. Should have a reply in a few days.

I can see the IRQ assignments (and reassignments) starting at timestamp [ 0.597072].

BR,


Glen

andreasgeißler
Contributor II

Hello Glen,

I will give it a try. By "separately", do you mean as a loadable kernel module?
And by the way, I tried to reproduce the problem again, and this time it took 5 days. The occurrence can also be very rare.

It will take some time to come back with the results of the DEBUG messages. I will keep you informed.

Thanks,

Andreas
