i.MX7D processor sporadically freezing completely when both A-cores are in use


lehtor
Contributor I

Hello,

My team and I are having problems with an i.MX 7Dual processor (specifically the MCIMX7D5EVM10SC) running Linux built with Yocto.

The processor keeps freezing, seemingly at random, but only when both A-cores are in use. We are not using the M-core at all. The behavior appears to depend on the individual board: the same boards freeze at semi-regular intervals, while other boards from the same production batch freeze very rarely (1-5 times a week) or not at all. On the misbehaving boards the freeze most commonly occurs after 5-60 minutes; for example, one board may consistently freeze after 5-20 minutes and another after 30-40 minutes, so it seems to be somewhat hardware dependent. When the freeze happens there is no output from the attached COM port, no stack trace or anything, just a complete freeze. After the freeze the board reboots, most likely triggered by the watchdog running on the board. Taking the second core offline stops this behavior completely, even on the rarely freezing boards.
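For repeated test runs, taking the second core offline and bringing it back can be scripted. The following is only a minimal sketch, assuming the standard Linux sysfs CPU hotplug file /sys/devices/system/cpu/cpuN/online and root privileges; the function name set_cpu_online and its base parameter are hypothetical and only there for illustration.

```python
from pathlib import Path

# Standard sysfs location for per-CPU hotplug control on Linux.
CPU_SYSFS = Path("/sys/devices/system/cpu")


def set_cpu_online(cpu, online, base=CPU_SYSFS):
    """Write 1 or 0 to cpu<N>/online and report whether the state stuck.

    `base` is an illustrative parameter so the function can be pointed at
    a different directory (for example in a dry run).
    """
    wanted = "1" if online else "0"
    path = Path(base) / f"cpu{cpu}" / "online"
    path.write_text(wanted)
    # Read the file back to confirm the state that is now reported.
    return path.read_text().strip() == wanted


if __name__ == "__main__":
    # Take cpu1 offline; needs root on a real board.
    print("cpu1 offline:", set_cpu_online(1, False))
```

Calling set_cpu_online(1, True) later brings the core back, which corresponds to the re-enabling experiment described in this post.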

As mentioned, we've narrowed the behavior down to being somehow dependent on the processor cores being used. Here's a list of things we've tried and noted about the behavior:

  • The faulty behavior seems to stop completely if we disable the second core, either by running a Linux kernel built without CONFIG_SMP or by running echo 0 > /sys/devices/system/cpu/cpu1/online
  • The freeze condition lasts at most 1 minute, which is expected as our watchdog interval is set to 1 minute
  • While the freeze is ongoing we get no output from the attached COM port, and input we type into the same COM port is not echoed back
  • Attaching a JTAG-debugger to the CPU and running gdb via openocd attached to the kernel revealed the following:
    • gdb and openocd work as expected, and breakpoints work as intended
    • neither gdb nor openocd automatically detects the freeze condition
      • openocd detects the board rebooting, but nothing when the CPU is frozen
      • gdb gets thrown back to the last place it was stopped at when the board reboots, either to the attach point or the last breakpoint we stopped at
    • Manually breaking gdb with Ctrl+C when we are in the freeze condition shows the following:
      • the cpsr register is set to 0x200f019b, which according to the ARM manual indicates the CPU is in Undefined instruction mode
      • backtrace shows we are in function __vectors_lma under vector_und in the kernel, which seems to be some kind of kernel trap for undefined instructions
    • Deliberately triggering an undefined instruction from a kernel module (e.g. dividing by zero, calling __asm__("UDF #0") or casting random data to a function pointer and calling it) and invoking it manually from userspace generates a stack trace and kernel panic as expected, and the CPU is in Supervisor mode instead of Undefined instruction mode
      • This seems to indicate it's not a basic kernel bug, but I can't rule that out completely
  • Decreasing CPU load on the board to a minimum doesn't solve the problem, but makes the freezes somewhat rarer (the board works 1.5-2x longer, e.g. freezing after 30-40 min instead of 20 min)
  • Setting the CPU scaling governor to powersave and running the CPU at 800MHz doesn't seem to have any effect
  • Re-enabling the second core after running the board for a while with one core restores the faulty behavior, crashing after some time
  • Attaching the board to a different power supply changes the behavior, but doesn't solve the issue
    • The board is rated to run between 24-48 V DC
    • My personal lab power supply is 24 V and our testing environment uses 48 V
    • Running a board from 24 V seems to make the problem rarer, but does not solve the issue
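The cpsr reading in the list above can be cross-checked by decoding the register fields by hand. The following is a small sketch using the ARMv7-A CPSR bit layout (mode in bits 4:0, T in bit 5, F/I/A masks in bits 6-8, GE in bits 19:16, NZCV in bits 31:28); the names ARM_MODES and decode_cpsr are illustrative and not part of gdb, openocd or any library.

```python
# CPSR M[4:0] mode encodings for ARMv7-A.
ARM_MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10110: "Monitor",
    0b10111: "Abort",
    0b11010: "Hyp",
    0b11011: "Undefined",
    0b11111: "System",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its main fields."""
    flag_bits = (("N", 31), ("Z", 30), ("C", 29), ("V", 28))
    return {
        "mode": ARM_MODES.get(cpsr & 0x1F, "unknown"),
        "thumb": bool(cpsr & (1 << 5)),          # T bit
        "fiq_masked": bool(cpsr & (1 << 6)),     # F bit
        "irq_masked": bool(cpsr & (1 << 7)),     # I bit
        "abort_masked": bool(cpsr & (1 << 8)),   # A bit
        "ge": (cpsr >> 16) & 0xF,                # GE[3:0]
        "flags": "".join(n for n, b in flag_bits if cpsr & (1 << b)),
    }


if __name__ == "__main__":
    # Value captured in gdb during the freeze.
    for name, value in decode_cpsr(0x200F019B).items():
        print(f"{name}: {value}")
```

For the captured value 0x200f019b the mode field is 0b11011, i.e. Undefined, with IRQs and asynchronous aborts masked and FIQs unmasked, which matches the interpretation given in the list.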

We would appreciate any ideas about what could be causing this problem, what else we should check or measure, or whether we have missed some crucial debugging step that could shed more light on the issue. I am mainly a backend developer with only rudimentary knowledge of hardware and hardware debugging, so it is quite likely that I have simply missed something obvious.

- Riku

2 Replies

Bio_TICFSL
NXP TechSupport

Hello,

I was able to reproduce the issue on BSP 5.15 while debugging the kernel, and I found that it has been fixed in the new 6.1 BSP, so please download the new kernel.

https://www.nxp.com/design/design-center/software/embedded-software/i-mx-software/embedded-linux-for...

 

Regards


lehtor
Contributor I

Hello,

Thanks for the reply. It seems that our build pipeline uses the FSL Community BSP instead of the one you linked to. Can the build system you linked be used with the FSL Community BSP, or do I have to switch the whole build system over to it if I want to test that kernel?
