Hi, a customer of ours is having a CMA-related issue on a custom board based on a Digi i.MX8QXP SOM, running a custom Yocto embedded OS based on the v5.4.70 2.3.3 BSP. The use case consists of a Qt application that is launched as a Weston desktop, providing a GUI on a 7- or 10-inch touchscreen. At the moment, there are boards with both B0 and C0 SOCs, all running the same software (except for the imx-boot binary, which requires a different image depending on the SOC revision).
The SOM has a RAM size of 2 GiB and a default CMA region size of 640 MiB.
The problem is, we can observe memory issues on the C0 boards (and only the C0 boards). When the application is launched, we can see constant "alloc_contig_range" messages:
[ 152.966630] alloc_contig_range: 1398 callbacks suppressed
[ 152.966645] alloc_contig_range: [a7b00, a7ee8) PFNs busy
[ 152.978177] alloc_contig_range: [a7c00, a7fe8) PFNs busy
[ 152.983834] alloc_contig_range: [a7c00, a80e8) PFNs busy
[ 152.989454] alloc_contig_range: [a7e00, a81e8) PFNs busy
[ 152.995137] alloc_contig_range: [a7f00, a82e8) PFNs busy
[ 153.000783] alloc_contig_range: [a8000, a83e8) PFNs busy
[ 153.006439] alloc_contig_range: [a8000, a84e8) PFNs busy
[ 153.012226] alloc_contig_range: [a8200, a85e8) PFNs busy
[ 153.033159] alloc_contig_range: [a7b00, a7ee8) PFNs busy
[ 153.039166] alloc_contig_range: [a7c00, a7fe8) PFNs busy
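In case it helps with reading the log above, here is a small Python sketch that decodes the PFN ranges. It assumes 4 KiB pages (I have not changed the kernel page size, but please treat that as an assumption); the `decode` helper and its field names are only illustrative.

```python
import re

PAGE_SIZE = 4096  # assumption: kernel built with 4 KiB pages

# Matches lines such as "alloc_contig_range: [a7b00, a7ee8) PFNs busy"
LINE_RE = re.compile(r"alloc_contig_range: \[([0-9a-f]+), ([0-9a-f]+)\) PFNs busy")


def decode(line, page_size=PAGE_SIZE):
    """Return start address, page count and byte size of a busy PFN range."""
    m = LINE_RE.search(line)
    if m is None:
        return None  # e.g. the "callbacks suppressed" line
    start, end = (int(g, 16) for g in m.groups())
    pages = end - start
    return {
        "start_addr": start * page_size,
        "pages": pages,
        "bytes": pages * page_size,
    }


if __name__ == "__main__":
    info = decode("[ 152.966645] alloc_contig_range: [a7b00, a7ee8) PFNs busy")
    print(hex(info["start_addr"]), info["pages"], info["bytes"])
```

Under that assumption, the first range starts at physical address 0xa7b00000 and spans 1000 pages (about 3.9 MiB). Most of the ranges in the log have that same span, while two of them ([a7c00, a80e8) and [a8000, a84e8)) span 1256 pages.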
Sometimes, the application even hangs indefinitely. Note that a board with a B0 SOC running the exact same software (kernel + device tree + rootfs + application) doesn't have these symptoms at all. When the CMA size is increased manually to 1280 MiB, the problem still appears.
However, during one of my tests, I tried the same CMA size as the one used on the i.MX8QXP MEK (960 MiB), and with this size the behavior is vastly improved. The alloc_contig_range messages aren't completely gone, but they are far less frequent than before, and I haven't seen the application hang since.
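To compare the B0 and C0 boards at each CMA size, the CMA counters in /proc/meminfo can be sampled while the application runs. This is a minimal sketch, assuming the kernel exposes the `CmaTotal` and `CmaFree` fields there (which requires CMA support to be enabled); the `cma_stats` helper is just an illustrative name.

```python
import re


def cma_stats(meminfo_text):
    """Extract CmaTotal and CmaFree from /proc/meminfo text, in MiB."""
    stats = {}
    for key in ("CmaTotal", "CmaFree"):
        m = re.search(rf"^{key}:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
        if m:
            stats[key] = int(m.group(1)) // 1024  # kB -> MiB
    return stats


if __name__ == "__main__":
    # Run on the target board while the Qt application is active.
    with open("/proc/meminfo") as f:
        print(cma_stats(f.read()))
```

Running this on both SOC revisions with the 640, 960 and 1280 MiB settings should show whether the configured region size is actually applied and how much of it remains free when the messages appear.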
My two main questions are the following:
- Why does a CMA size of 960 MiB seem to improve the issue while a size of 1280 MiB doesn't? This makes me think there is some kind of hardcoded alignment issue in the graphics libraries, but I'm not sure.
- Why does this only happen on a C0 SOC and not a B0 one, even though all of the relevant components (GPU driver, GPU libraries, Qt libraries, application) are the same in both cases? Do the GPU libraries do things differently depending on the SOC revision?
Many thanks in advance,
Gabriel