Hello,
I am using an i.MX6qdl processor on a customer platform. An FPGA is connected to the PCIe interface to communicate with the internal IPs, using legacy interrupts. During long-term tests we found that the interrupt mapping sometimes seems to change for no apparent reason. After booting, the FPGA interrupt is mapped to irq 300. Once the long-term tests are started to generate traffic on the PCIe bus, it happens within a few days (it does not occur very often) that the FPGA interrupt shows up on irq 299, where no handler is registered for it and nobody handles the interrupt:
 72:          0          0          0          0  GPC      114 Level  mmdc_1
 73:          0          0          0          0  GPC        8 Level  2800000.ipu
 74:          0          0          0          0  GPC        7 Level  2800000.ipu
241:          0          0          0          0  gpio-mxc   6 Edge   ad7606
299:     100000          0          0          0  GPC      123 Level  PCIe PME
300:     544352          0          0          0  GPC      121 Level  16Z087
301:         16          0          0          0  GIC-0    137 Level  2101000.jr0
[171521.126091] irq 299: nobody cared (try booting with the "irqpoll" option)
…
[171521.135735] Disabling IRQ #299
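For a test that runs for days, it can help to snapshot /proc/interrupts periodically so that the last good state before the failure is on record. The following is a minimal sketch, assuming the four-CPU column layout shown above; `parse_interrupt_line`, `snapshot` and `changed_irqs` are hypothetical helper names, not part of any existing tool.

```python
def parse_interrupt_line(line, ncpus=4):
    """Parse one /proc/interrupts row such as
    '299: 100000 0 0 0 GPC 123 Level PCIe PME'.
    Returns None for rows that do not start with a numeric IRQ."""
    head, sep, rest = line.partition(":")
    if not sep or not head.strip().isdigit():
        return None
    fields = rest.split()
    if len(fields) < ncpus:
        return None
    try:
        counts = [int(f) for f in fields[:ncpus]]
    except ValueError:
        return None
    # Everything after the per-CPU counters: chip, hwirq, trigger, handler names.
    return {"irq": int(head), "counts": counts, "desc": " ".join(fields[ncpus:])}


def snapshot(text, ncpus=4):
    """Turn the full text of /proc/interrupts into {irq: row}."""
    rows = (parse_interrupt_line(line, ncpus) for line in text.splitlines())
    return {row["irq"]: row for row in rows if row}


def changed_irqs(before, after):
    """Return {irq: delta} for IRQs whose total count grew between snapshots."""
    deltas = {}
    for irq, row in after.items():
        old = sum(before[irq]["counts"]) if irq in before else 0
        delta = sum(row["counts"]) - old
        if delta > 0:
            deltas[irq] = delta
    return deltas
```

Taking a snapshot every few minutes and logging the result of `changed_irqs` would show on which IRQ line the counts start to accumulate.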
We already know that MSI has to be disabled for legacy interrupts, otherwise nothing works. After the first wrongly routed interrupt is received, the IRQ gets disabled and communication with the FPGA stops working. We also checked the interrupt behavior manually after the error occurred, and it really shows that the FPGA interrupt appears on a different bit of the GIC:
Reading the CPU irq status register before the error:
root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76ff7024): 0x82000000
Reading the CPU irq status register after the error:
root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76f16024): 0x84000000
Disable irq within FPGA and read status register of CPU again:
root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76f0b024): 0x80000000
Reenable irq within FPGA and read status register of CPU again:
root@men-cc10:~# memrw 0x020dc024 l
Value at 0x20dc024 (0x76ff4024): 0x84000000
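The four readings above can be compared bit by bit. This is a minimal sketch, assuming the `memrw` output format shown in this post (`Value at <address> (<mapping>): <value>`); the helper names `parse_memrw`, `set_bits` and `describe` are hypothetical.

```python
import re

# Matches lines such as: Value at 0x20dc024 (0x76ff7024): 0x82000000
MEMRW_RE = re.compile(
    r"Value at (0x[0-9a-fA-F]+) \((0x[0-9a-fA-F]+)\): (0x[0-9a-fA-F]+)"
)


def parse_memrw(line):
    """Return (physical_address, value) from one memrw output line, or None."""
    match = MEMRW_RE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(3), 16)


def set_bits(value):
    """List the bit positions (0..31) that are set in a 32-bit value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


# Readings copied from the post above.
readings = [
    ("before error", "Value at 0x20dc024 (0x76ff7024): 0x82000000"),
    ("after error", "Value at 0x20dc024 (0x76f16024): 0x84000000"),
    ("FPGA irq disabled", "Value at 0x20dc024 (0x76f0b024): 0x80000000"),
    ("FPGA irq re-enabled", "Value at 0x20dc024 (0x76ff4024): 0x84000000"),
]


def describe(items):
    """Format each reading as '<label>: <hex value> bits [<set bits>]'."""
    lines = []
    for label, text in items:
        _, value = parse_memrw(text)
        lines.append(f"{label}: {value:#010x} bits {set_bits(value)}")
    return lines


if __name__ == "__main__":
    print("\n".join(describe(readings)))
```

With these values the pending bit moves from bit 25 before the error to bit 26 afterwards, while bit 31 stays set in all four readings.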
- Did we miss something when disabling MSI?
- Or is this an already known issue with an existing workaround?
- What can be the reason for this behavior?
Regards,
Andreas
Hi andreasgeißler,
Yes, as an LKM with DEBUG turned on.
Let me know as soon as the error recurs.
Cheers,
Glen
Hi Yuri and Glen,
Sorry, I replied by email and it seems that there is a problem with our signature. Here is what I sent:
Thanks for the fast reply.
We are using the latest version of Yocto (sumo branch) with the corresponding meta-freescale layer. The kernel version used is v4.9.88, built with the recipe from meta-freescale.
Regards,
Andreas
Hi andreasgeißler,
What did you use as your source? The one below?
https://source.codeaurora.org/external/imx/imx-manifest/tree/?h=imx-linux-rocko
I was not aware we have a 'sumo' build that we support at this time. I think 'sumo' is reserved for a future release.
Can you try using the 'rocko' release with 4.9.88 and get back to me ASAP?
BR,
Glen
Hi Glen,
We have used the sumo branch version from:
http://git.yoctoproject.org/cgit.cgi/meta-freescale/?h=sumo
to build the Yocto project.
I will try to reproduce the issue with the version from:
https://github.com/Freescale/fsl-community-bsp-platform
but it seems that this rocko branch version only supports kernel version v4.9.11.
It will also take some time to get back with new results because I have to adapt our layers and recipes. In addition, the issue does not happen very often: it occurs about once within 3 days of running the iperf tests.
Regards,
Andreas
Hi andreasgeißler,
As mentioned we do not support anything in the sumo release.
Our 4.9.88 is available on
https://source.codeaurora.org/external/imx/imx-manifest/tree/README?h=imx-linux-rocko
The fsl-community BSP (which you linked to in your reply) is supported by Octavio Salvador and we, NXP, have no control over its content. Hence we do not support it.
The content we do support is on the Code Aurora link above. If you can reproduce the problem with the NXP supported BSP we can work on it.
Anyhow, a drifting interrupt indicates there is something amiss in that kernel's interrupt handler. You may want to make Octavio aware of what you found.
BR,
Glen
Hi Glen,
I could also reproduce the problem with the Code Aurora BSP. The FPGA interrupt was disabled after 21h of running the long-term load tests. It looks a little different this time, but I guess the source of the problem is the same:
The IRQ gets disabled because no handler claimed it:
[75800.350797] irq 300: nobody cared (try booting with the "irqpoll" option)
[75800.356296] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.9.88-imx_4.9.88_2.0.0_ga #2
[75800.363865] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[75800.369119] [<801108b4>] (unwind_backtrace) from [<8010c1b0>] (show_stack+0x10/0x14)
[75800.375573] [<8010c1b0>] (show_stack) from [<803ccb2c>] (dump_stack+0x78/0x8c)
...
[75800.585058] [<8010cb8c>] (__irq_svc) from [<8076ca28>] (cpuidle_enter_state+0x13c/0x2cc)
[75800.591855] [<8076ca28>] (cpuidle_enter_state) from [<8016a2a0>] (cpu_startup_entry+0x168/0x228)
[75800.599351] [<8016a2a0>] (cpu_startup_entry) from [<80f00c58>] (start_kernel+0x374/0x380)
[75800.606225] handlers:
[75800.607209] [<7f0604c0>] z77_irq [men_lx_z77]
[75800.610288] [<7f0604c0>] z77_irq [men_lx_z77]
[75800.613365] [<7f057930>] men_z135_intr [men_lx_z135]
[75800.617045] [<7f057930>] men_z135_intr [men_lx_z135]
[75800.620721] Disabling IRQ #300
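To catch this event automatically in a saved dmesg log, the two characteristic messages can be matched with regular expressions. The following is a minimal sketch based only on the message formats quoted in this thread; `find_irq_shutdowns` is a hypothetical helper name.

```python
import re

# Message formats as they appear in the kernel log quoted above.
NOBODY_CARED_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+irq (\d+): nobody cared")
DISABLING_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+Disabling IRQ #(\d+)")


def find_irq_shutdowns(log_text):
    """Return one dict per 'nobody cared' / 'Disabling IRQ' line in the log."""
    events = []
    for line in log_text.splitlines():
        for kind, pattern in (("nobody cared", NOBODY_CARED_RE),
                              ("disabled", DISABLING_RE)):
            match = pattern.search(line)
            if match:
                seconds = float(match.group(1))
                events.append({
                    "kind": kind,
                    "irq": int(match.group(2)),
                    "uptime_s": seconds,
                    # Kernel timestamps are seconds since boot.
                    "uptime_h": round(seconds / 3600, 1),
                })
    return events
```

For the log above this reports irq 300 being disabled at about 21.1 hours of uptime, which agrees with the 21h mentioned.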
This time it seems that another interrupt, not the one from the FPGA, occurred on the same interrupt vector:
241:          0          0          0          0  gpio-mxc   6 Edge   ad7606
299:          0          0          0          0  GPC      123 Level  PCIe PME
300: 1283785018          0          0          0  GPC      121 Level  16Z087, 16Z087, men_z135_intr, men_z135_intr
301:         16          0          0          0  GIC-0    137 Level  2101000.jr0
302:          0          0          0          0  GIC-0    138 Level  2102000.jr1
What is interesting is that a second interrupt is still active in the error case:
root@imx6qsabresd:~/log# memrw 0x020dc014 l
Value at 0x20dc014 (0x76efd014): 0xF738F7FF
root@imx6qsabresd:~/log# memrw 0x020dc024 l
Value at 0x20dc024 (0x76fbd024): 0x83000000
When the FPGA interrupt is disabled, the CPU register looks as follows (meaning the FPGA IRQ is 0x01000000):
root@imx6qsabresd:~/log# memrw 0x020dc024 l
Value at 0x20dc024 (0x76fae024): 0x82000000
After a reboot, the FPGA interrupt appears on a different bit than in the error case (FPGA IRQ is 0x02000000):
Value at 0x20dc024 (0x76fd5024): 0x82000000
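The FPGA bit in each situation can be isolated by XOR-ing the reading taken with the FPGA interrupt enabled against the reading taken with it disabled. The sketch below does this for the values quoted in this thread. The mapping from bit position to interrupt number is an assumption: it treats 0x020dc024 as a status register covering interrupt IDs 128 to 159 (that is, SPI 96 to 127), which should be checked against the i.MX6 reference manual. `isolate_bit` and `assumed_spi` are hypothetical helper names.

```python
def isolate_bit(enabled, disabled):
    """Return the single bit position that differs between two readings."""
    diff = enabled ^ disabled
    if diff == 0 or diff & (diff - 1):
        raise ValueError("expected exactly one differing bit")
    return diff.bit_length() - 1


# ASSUMPTION: bit 0 of this register is interrupt ID 128, and SPI = ID - 32.
ASSUMED_FIRST_SPI = 96


def assumed_spi(bit):
    """Map a bit position to an SPI number under the assumption above."""
    return ASSUMED_FIRST_SPI + bit


# First report: error state, FPGA irq enabled vs. disabled.
first_error_bit = isolate_bit(0x84000000, 0x80000000)
# Second report: error state, FPGA irq enabled vs. disabled.
second_error_bit = isolate_bit(0x83000000, 0x82000000)
# After reboot the FPGA bit was reported directly as 0x02000000.
after_reboot_bit = (0x02000000).bit_length() - 1

if __name__ == "__main__":
    for label, bit in (("first error", first_error_bit),
                       ("second error", second_error_bit),
                       ("after reboot", after_reboot_bit)):
        print(f"{label}: bit {bit}, assumed SPI {assumed_spi(bit)}")
```

Under this assumption bit 25 corresponds to SPI 121, which matches the `GPC 121` entry for irq 300 in the interrupt listing, while the two error cases land on bits 26 and 24.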
Unfortunately, I forgot to check which interrupt the FPGA was on before the error appeared. So yesterday I started a second test to make sure that the FPGA interrupt is really changing. No issue so far in this iteration.
Please find attached the full dmesg log and the interrupt behavior.
Please let me know if you need other information.
Regards,
Andreas
Hi andreasgeißler,
Our developer asked:
According to the log, irq 123 (299) is the one that nobody cared about, and it is used by the PCIe PME driver. Can the customer build out the PCIe PME driver and then run the tests again? The result would be helpful for debugging this issue.
I think, in a nutshell, he is asking if you can rebuild the PCIe PME driver separately, compile with DEBUG turned on, and report your findings.
Is this possible?
BR,
Glen
Hi andreasgeißler,
I have created a defect ticket for this (MLK-19463). I know the developers are probably going to ask how to recreate it, and ask for details about how the FPGA and PCIe are being mapped or whether a custom driver is involved.
BR,
Glen
Hi Andreas,
I can understand the process (and pains) of rebuilding the OS.
But now I can support you if you recreate the same problem.
Cheers,
Glen
Hi andreasgeißler,
We can not proceed any further without the version and source of the Linux OS being used.
BR,
Glen
Hello,
What Linux release is used in this case?
Regards,
Yuri.
Hello Yuri,
Thanks for the fast reply.
We are using the latest version of Yocto (sumo branch) with the corresponding meta-freescale layer. The kernel version used is v4.9.88, built with the recipe from meta-freescale.
Regards,
Andreas