i.MX6 PCIe interrupt mapping changes during runtime

andreasgeißler · ‎08-22-2018

Hello,

I am using an i.MX6qdl processor on a customer platform. An FPGA is connected to the PCIe interface to communicate with the internal IPs, using legacy interrupts. During long-term tests we found that the interrupt mapping sometimes seems to change for some reason. After booting, the FPGA interrupt is mapped to irq 300. After starting the long-term tests, which generate traffic on the PCIe bus, within a few days (it does not occur very often) the FPGA interrupt shows up on irq 299, where no handler is registered and nobody cares about the interrupt:

72: 0 0 0 0 GPC 114 Level mmdc_1

73: 0 0 0 0 GPC 8 Level 2800000.ipu

74: 0 0 0 0 GPC 7 Level 2800000.ipu

241: 0 0 0 0 gpio-mxc 6 Edge ad7606

299: 100000 0 0 0 GPC 123 Level PCIe PME

300: 544352 0 0 0 GPC 121 Level 16Z087

301: 16 0 0 0 GIC-0 137 Level 2101000.jr0

[171521.126091] irq 299: nobody cared (try booting with the "irqpoll" option)

…

[171521.135735] Disabling IRQ #299

We already know that we have to disable MSI for legacy interrupts; otherwise nothing would work. After the first wrongly routed interrupt is received, the IRQ gets disabled and communication with the FPGA stops working. We also checked the interrupt behavior manually after the error, and it really shows that the FPGA interrupt appears on a different bit of the GIC:

Reading CPU irq status register of CPU before error:

root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76ff7024): 0x82000000

Reading CPU irq status register of CPU after error:

root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76f16024): 0x84000000

Disable irq within FPGA and read status register of CPU again:

root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76f0b024): 0x80000000

Reenable irq within FPGA and read status register of CPU again:

root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76ff4024): 0x84000000
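
The four readings above can be compared bit by bit. The following is a minimal Python sketch, not part of the original test setup, that lists the set bits in each 32-bit value and XORs two readings to show which bits differ. The helper names are hypothetical; the values are copied from the memrw output above. It makes no assumption about which interrupt number each bit maps to.

```python
def set_bits(value):
    """Return the positions of all set bits in a 32-bit register value."""
    return [bit for bit in range(32) if value & (1 << bit)]


def changed_mask(reading_a, reading_b):
    """Return a mask of the bits that differ between two 32-bit readings."""
    return (reading_a ^ reading_b) & 0xFFFFFFFF


# Values copied from the memrw readings of 0x020dc024 in this post.
before_error = 0x82000000   # after boot, FPGA interrupt working
after_error = 0x84000000    # after "nobody cared", FPGA IRQ enabled
fpga_irq_off = 0x80000000   # after error, FPGA IRQ disabled in the FPGA

print("set before error:", set_bits(before_error))
print("set after error: ", set_bits(after_error))
# Bit driven by the FPGA after the error (enabled vs. disabled reading).
print("FPGA bit after error:", hex(changed_mask(after_error, fpga_irq_off)))
# Bits that moved between the working state and the error state.
print("moved bits:", hex(changed_mask(before_error, after_error)))
```

With these values the pending bit moves up by one position, while the top bit stays set in all four readings.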

- Did we miss something when disabling MSI?

- Or is this an already known issue with an existing workaround?

- What can be the reason for this behavior?

Regards,

Andreas

gfine · ‎09-11-2018

Hi andreasgeißler‌,

Yes, as an LKM with DEBUG turned on.

Let me know as soon as the error recurs.

Cheers,

Glen

gfine · ‎09-20-2018

Hi andreasgeißler‌,

It has been a week. Any luck in recreating the problem?

BR,

Glen

andreasgeißler · ‎08-28-2018

Hi Yuri and Glen,

Sorry, I replied by email and it seems that there is a problem with our signature. Here is what I sent:

Thanks for the fast reply.

We are using the latest version of Yocto (sumo branch) with the corresponding meta-freescale layer. The kernel version is v4.9.88, built with the recipe from meta-freescale.

Regards,

Andreas

gfine · ‎08-28-2018

Hi andreasgeißler,

What did you use as your source? The one below?

https://source.codeaurora.org/external/imx/imx-manifest/tree/?h=imx-linux-rocko

I was not aware we have a 'sumo' build that we support at this time. I think 'sumo' is reserved for a future release.

Can you try using the 'rocko' release with 4.9.88 and get back to me ASAP?

BR,

Glen

andreasgeißler · ‎08-29-2018

Hi Glen,

We have used the sumo branch version from:

http://git.yoctoproject.org/cgit.cgi/meta-freescale/?h=sumo

to build the Yocto project.

I will try to reproduce the issue with the version from:

https://github.com/Freescale/fsl-community-bsp-platform

but it seems that this rocko branch version only supports kernel version v4.9.11.

It will take some time to get back with new results because I have to adapt our layers and recipes. In addition, the issue does not happen very often: it occurs about once within 3 days of running the iperf tests.

Regards,

Andreas

gfine · ‎08-30-2018

Hi andreasgeißler,

As mentioned we do not support anything in the sumo release.

Our 4.9.88 is available on

https://source.codeaurora.org/external/imx/imx-manifest/tree/README?h=imx-linux-rocko

The fsl-community BSP (which you linked to in your reply) is supported by Octavio Salvador, and we, NXP, have no control over its content. Hence we do not support it.

The content we do support is on the Code Aurora link above. If you can reproduce the problem with the NXP supported BSP we can work on it.

Anyhow, a drifting interrupt indicates there is something amiss in that kernel's interrupt handler. You may want to make Octavio aware of what you found.

BR,

Glen

andreasgeißler · ‎09-05-2018

Hi Glen,

I could also reproduce the problem with the Code Aurora BSP. The FPGA interrupt was disabled after 21 h while running the long-term load tests. It looks a little different this time, but I guess the source of the problem is the same:

The IRQ gets disabled because no handler cared about it:

[75800.350797] irq 300: nobody cared (try booting with the "irqpoll" option)
[75800.356296] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.9.88-imx_4.9.88_2.0.0_ga #2
[75800.363865] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[75800.369119] [<801108b4>] (unwind_backtrace) from [<8010c1b0>] (show_stack+0x10/0x14)
[75800.375573] [<8010c1b0>] (show_stack) from [<803ccb2c>] (dump_stack+0x78/0x8c)
...
[75800.585058] [<8010cb8c>] (__irq_svc) from [<8076ca28>] (cpuidle_enter_state+0x13c/0x2cc)
[75800.591855] [<8076ca28>] (cpuidle_enter_state) from [<8016a2a0>] (cpu_startup_entry+0x168/0x228)
[75800.599351] [<8016a2a0>] (cpu_startup_entry) from [<80f00c58>] (start_kernel+0x374/0x380)
[75800.606225] handlers:
[75800.607209] [<7f0604c0>] z77_irq [men_lx_z77]
[75800.610288] [<7f0604c0>] z77_irq [men_lx_z77]
[75800.613365] [<7f057930>] men_z135_intr [men_lx_z135]
[75800.617045] [<7f057930>] men_z135_intr [men_lx_z135]
[75800.620721] Disabling IRQ #300

This time it seems that another interrupt, not the FPGA's, occurred on the same interrupt vector:

241:          0          0          0          0 gpio-mxc   6 Edge      ad7606
299:          0          0          0          0       GPC 123 Level     PCIe PME
300: 1283785018          0          0          0       GPC 121 Level     16Z087, 16Z087, men_z135_intr, men_z135_intr
301:         16          0          0          0     GIC-0 137 Level     2101000.jr0
302:          0          0          0          0     GIC-0 138 Level     2102000.jr1

What is interesting is that a second interrupt is still active in the error case:

root@imx6qsabresd:~/log# memrw 0x020dc014 l
Value at 0x20dc014 (0x76efd014): 0xF738F7FF
root@imx6qsabresd:~/log# memrw 0x020dc024 l
Value at 0x20dc024 (0x76fbd024): 0x83000000

When the FPGA interrupt is disabled, the CPU register looks as follows (meaning the FPGA IRQ is 0x01000000):

root@imx6qsabresd:~/log# memrw 0x020dc024 l
Value at 0x20dc024 (0x76fae024): 0x82000000

After a reboot the FPGA interrupt appears on a different interrupt than in the error case (FPGA IRQ is 0x02000000):

Value at 0x20dc024 (0x76fd5024): 0x82000000

Unfortunately, I forgot to check which interrupt the FPGA was on before the error appeared. So yesterday I started a second test to ensure that the FPGA interrupt is really changing. No issue so far on this iteration.
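
One way to capture which interrupt line the FPGA is on before the next failure is to log /proc/interrupts periodically during the test. The following is a minimal Python sketch under that assumption; it was not used in the tests described here, and the function names and the 60-second interval are hypothetical choices. It parses the /proc/interrupts text and reports only the lines whose counters changed between two snapshots.

```python
import time


def parse_interrupts(text):
    """Map IRQ label -> (count summed over all CPUs, description)."""
    table = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # header line (CPU0 CPU1 ...) or blank line
        label, rest = line.split(":", 1)
        fields = rest.split()
        n = 0
        # Leading numeric fields are the per-CPU counters.
        while n < len(fields) and fields[n].isdigit():
            n += 1
        total = sum(int(f) for f in fields[:n])
        table[label.strip()] = (total, " ".join(fields[n:]))
    return table


def changed_lines(old, new):
    """Return {irq: (delta, description)} for counters that changed."""
    result = {}
    for irq, (count, desc) in new.items():
        previous = old.get(irq, (0, ""))[0]
        if count != previous:
            result[irq] = (count - previous, desc)
    return result


def watch(path="/proc/interrupts", interval=60, rounds=None):
    """Print changed IRQ lines every `interval` seconds (forever if rounds is None)."""
    with open(path) as f:
        prev = parse_interrupts(f.read())
    done = 0
    while rounds is None or done < rounds:
        time.sleep(interval)
        with open(path) as f:
            cur = parse_interrupts(f.read())
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for irq, (delta, desc) in sorted(changed_lines(prev, cur).items()):
            print(stamp, irq, delta, desc)
        prev = cur
        done += 1
```

Calling `watch()` on the target and redirecting the output to a file would leave a record of which line was counting up to the moment the IRQ was disabled. On a busy system the timer and Ethernet lines will dominate the log, so filtering on the description may be useful.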

Please find attached the full dmesg log and the interrupt behavior.

Please let me know if you need other information.

Regards,

Andreas

gfine · ‎09-07-2018

Hi andreasgeißler,

Our developer asked :

According to the log, the interrupt that nobody cared about is GPC 123 (irq 299), which is used by the PCIe PME driver. Can the customer build out the PCIe PME driver and then run the tests again? The result would help in debugging this issue.

I think, in a nutshell, he is asking if you can rebuild the PCIe PME driver separately, compile with DEBUG turned on, and report your findings.

Is this possible?

BR,

Glen

gfine · ‎09-05-2018

Hi andreasgeißler,

I have created a defect ticket for this (MLK-19463). I know the developers are probably going to ask how to recreate, and ask for details about how the FPGA and PCIe are being mapped or if a custom driver is involved.

BR,

Glen

andreasgeißler · ‎09-03-2018

Hi Glen,

I have now started our test on the NXP supported BSP. It took a while to get everything working again. I will keep you informed about any results of the test. Let's see if it also occurs on this BSP.

Find attached the dmesg output with the NXP supported BSP.

Regards,

Andreas

gfine · ‎09-04-2018

Hi Andreas,

I can understand the process (and pains) of rebuilding the OS.

But now I can support you if you recreate the same problem.

Cheers,

Glen

gfine · ‎08-28-2018

Hi andreasgeißler‌,

We cannot proceed any further without the version and source of the Linux OS being used.

BR,

Glen

Yuri · ‎08-28-2018

Hello,

What Linux release is used in the case?

Regards,

Yuri.

andreasgeißler · ‎08-28-2018

Hello Yuri,

Thanks for the fast reply.

We are using the latest version of Yocto (sumo branch) with the corresponding meta-freescale layer. The kernel version is v4.9.88, built with the recipe from meta-freescale.

Regards,

Andreas

i.MX6_All

i.MX6Quad