5.15.x+fslc: Compacting glibc code pages causes random process crashes

cschmitz1988
Contributor I

Hi, I am part of a team of Linux developers at HPE who work on various embedded boards, one of which is based on Freescale's i.MX6DL SoC. It has 1GB of main memory.

We have puzzled over this problem for many months now, so we are desperate to find a "clean" solution (not based on avoidance strategies). It seems that we have uncovered a bug/race in the kernel's mm/compaction.c and mm/migrate.c code that can cause random crashes in user-space applications, rather indiscriminately and with detectable probability. Processes that happen to be scheduled at the same time as the kernel thread kcompactd0 seem to be at risk.

Do any of you experience similar issues, especially running a contemporary kernel?

Due to a heightened focus on security, we started upgrading the kernel from our previous builds (based on 4.9.11 - https://github.com/Freescale/linux-fslc/tree/4.9-2.3.x-imx) to a much more contemporary build (based on 5.15.77 - https://github.com/Freescale/linux-fslc/tree/5.15.x+fslc).

This is where the trouble started: During weekly regression tests, we observed at least 2 core dumps in every run. Over many weeks, it became apparent that these were no ordinary stability issues:

* Core dumps affected LOTS of different processes, including normally rock-solid open-source ones such as gawk, python, and apache.

* Core dumps were due to either SIGSEGV (80%) or SIGILL (20%).

* Crashes tended to affect processes that were scheduled often ("CPU hogs") and seemed to prefer those running with elevated priorities (e.g. corosync at nice -20). Some processes even run under SCHED_FIFO at priority 90 at times (e.g. a proprietary Broadcom daemon).

* The core dumps were hard to analyze:

                - Stack content was often corrupted, making unwinding impossible. Compiling with frame pointers helped a little, but not always.

                - SIGILL never revealed any truly illegal instruction - the code always looked fine in the core files (and matched what we compiled).

                - Crash sites varied. The only common denominator was that they appeared in library code and tended to cluster around blocking synchronization primitives (e.g. pthread_mutex_lock, pthread_cond_wait, etc.).

We had noticed before that turning kernel tracing ON would aggravate the core dumps, but tracing gave us novel insight into what was going on right before the fatal signal was generated. We noticed that kcompactd0 was ALWAYS running right before a core dump was observed.

As an experiment, we turned compaction off (CONFIG_COMPACTION=n) and that FIXED the issue. Our stress test (firmware upgrade) would usually reproduce a core dump every 20 iterations or so, but now we ran 1500+ iterations without any issues. This, of course, puts us at risk of memory fragmentation ... (Avoidance Strategy 1)
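As a side note, whether compaction is compiled in can be verified from user space: the /proc/sys/vm/compact_memory trigger file is only registered when CONFIG_COMPACTION=y. A minimal check (assuming procfs is mounted at /proc):

    /* compaction_check.c - detect whether the running kernel was built
     * with CONFIG_COMPACTION=y by probing for the sysctl trigger file. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        if (access("/proc/sys/vm/compact_memory", F_OK) == 0)
            puts("compaction compiled in (CONFIG_COMPACTION=y)");
        else
            puts("compaction not available (CONFIG_COMPACTION=n)");
        return 0;
    }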


We then investigated why this had never been an issue with 4.9.11 (our previous kernel) and noticed two areas that had changed:

* New since kernel 5.9: “proactive compaction” (sysctl vm.compaction_proactiveness).

* New since kernel 5.0: “watermark_boost_factor”. This feature seemed to always provoke a huge compaction step of order 13 (pageblock_order) on our architecture.

We were able to show that setting vm.compaction_proactiveness=0 and vm.watermark_boost_factor=0 also fixed the issue for us (Avoidance Strategy 2); a sketch of applying these settings follows below.
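For reference, a minimal sketch of applying these two settings programmatically (equivalent to sysctl -w; must run as root, and both /proc/sys/vm files exist on 5.15):

    /* Apply Avoidance Strategy 2 at runtime.  Equivalent to:
     *   sysctl -w vm.compaction_proactiveness=0
     *   sysctl -w vm.watermark_boost_factor=0   */
    #include <stdio.h>

    static int write_sysctl(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f) {
            perror(path);
            return -1;
        }
        int rc = (fputs(val, f) < 0) ? -1 : 0;
        if (fclose(f) != 0)
            rc = -1;
        if (rc)
            perror(path);
        return rc;
    }

    int main(void)
    {
        int rc = 0;
        rc |= write_sysctl("/proc/sys/vm/compaction_proactiveness", "0");
        rc |= write_sysctl("/proc/sys/vm/watermark_boost_factor", "0");
        return rc ? 1 : 0;
    }

Note that these settings do not survive a reboot; for a persistent setup they would typically live in /etc/sysctl.conf or a file under /etc/sysctl.d/.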

None of these features existed in 4.9.11! This explains why compaction had never been an issue before (even though it was enabled in 4.9.11).

We also tried to root-cause the migration process:

* The dying process always ran in parallel with kcompactd (sometimes on the same core, via context switch, but most often on the other core).

* The dying process was always executing code pages in glibc that had been migrated just a split second, and only a few migration steps, earlier.

Locking glibc's memory (via mlockall()) and forbidding migration of locked pages fixes the issue as well (Avoidance Strategy 3); see the sketch below.
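For context, the user-space half of Avoidance Strategy 3 boils down to something like the sketch below. The kernel-side half (refusing to migrate VM_LOCKED pages) was a local patch and is not shown; note that an unmodified kernel will still migrate mlocked pages, since mlock() only guarantees residency, not a fixed physical address.

    /* User-space half of Avoidance Strategy 3: pin all current and
     * future mappings (including glibc's text pages, which are already
     * mapped by the time main() runs).  Requires CAP_IPC_LOCK or a
     * sufficient RLIMIT_MEMLOCK. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            exit(EXIT_FAILURE);
        }
        /* ... normal process lifetime ... */
        return 0;
    }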


Additional experiments:

* Tried to disable core 1 temporarily (cpu_remove/cpu_add) and pin kcompactd0 to boot core 0 (see the first sketch after this list). This is unfortunately not viable for us: we run real-time processes with tight scheduling constraints (corosync), so letting kcompactd0 run exclusively for many hundreds of milliseconds is not an option.

* Tried to invalidate the cached page/TLB very explicitly - I noticed that for our architecture update_mmu_cache() is a NOOP, so I added flush_cache_page() ... flush_tlb_page() around each "remove_migration_pte" step (see the second sketch after this list). This did not help (so possibly not a cache-coherency issue - that was my pet theory, based on https://gitlab.eclipse.org/eclipse/oniro-core/linux/-/commit/4774a369518091f46435e0539de6a45bf0681c7...).
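For completeness, the pinning experiment from the first bullet amounted to roughly the following (the pid of kcompactd0 has to be looked up beforehand, e.g. with ps; this is the programmatic equivalent of taskset):

    /* Sketch of the pinning experiment: confine a thread (e.g. the pid
     * of kcompactd0, looked up beforehand) to boot core 0. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid-of-kcompactd0>\n", argv[0]);
            return 1;
        }
        pid_t pid = (pid_t)atoi(argv[1]);

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);               /* CPU 0 only */

        if (sched_setaffinity(pid, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        return 0;
    }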
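And the cache/TLB experiment from the second bullet, shown as an illustrative (not compile-ready) hunk against the remove_migration_pte() loop in 5.15's mm/migrate.c - the exact placement is an approximation, not a reviewed patch:

    /* Illustrative hunk inside remove_migration_pte() (mm/migrate.c,
     * 5.15 series): flush explicitly around re-establishing the PTE of
     * the migrated page, in case the no-op update_mmu_cache() hides a
     * coherency problem.  (This experiment did NOT fix the crashes.) */
    while (page_vma_mapped_walk(&pvmw)) {
            /* ... pte for the new page is assembled here ... */
            flush_cache_page(vma, pvmw.address, pte_pfn(pte));  /* added */
            set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
            flush_tlb_page(vma, pvmw.address);                  /* added */
            /* ... rmap bookkeeping ... */
            update_mmu_cache(vma, pvmw.address, pvmw.pte);  /* NOOP on our config */
    }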


Any help, reply or tip would be greatly appreciated!

Regards,

Christoph Schmitz

Firmware Engineer

Hewlett Packard Enterprise

1 Reply

jimmychan
NXP TechSupport

The NXP official BSP is here:

https://github.com/nxp-imx/linux-imx

