i.MX8X C0 CMA errors during Qt application, not present in B0

gabrielvalcazar · ‎07-15-2021

Hi, a customer of ours is having a CMA-related issue with a custom board based on a Digi i.MX8QXP SOM, running a custom Yocto embedded OS based on the v5.4.70 2.3.3 BSP. The use case consists of a Qt application that is launched as a weston desktop, providing a GUI on a 7- or 10-inch touchscreen. At the moment, there are boards with both B0 and C0 SOCs, both running the same software (except for the imx-boot binary, which requires a different image depending on the SOC revision).

The SOM has a RAM size of 2 GiB and a default CMA region size of 640 MiB.

The problem is, we can observe memory issues on the C0 boards (and only the C0 boards). When the application is launched, we can see constant "alloc_contig_range" messages:

[ 152.966630] alloc_contig_range: 1398 callbacks suppressed
[ 152.966645] alloc_contig_range: [a7b00, a7ee8) PFNs busy
[ 152.978177] alloc_contig_range: [a7c00, a7fe8) PFNs busy
[ 152.983834] alloc_contig_range: [a7c00, a80e8) PFNs busy
[ 152.989454] alloc_contig_range: [a7e00, a81e8) PFNs busy
[ 152.995137] alloc_contig_range: [a7f00, a82e8) PFNs busy
[ 153.000783] alloc_contig_range: [a8000, a83e8) PFNs busy
[ 153.006439] alloc_contig_range: [a8000, a84e8) PFNs busy
[ 153.012226] alloc_contig_range: [a8200, a85e8) PFNs busy
[ 153.033159] alloc_contig_range: [a7b00, a7ee8) PFNs busy
[ 153.039166] alloc_contig_range: [a7c00, a7fe8) PFNs busy
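The ranges in these messages are page frame numbers (PFNs) in hexadecimal, so the width of each range gives the span the kernel reported as busy. A minimal sketch of that arithmetic follows. It assumes 4 KiB pages, which is not confirmed anywhere in this thread, and `parse_busy_range` and `PAGE_SIZE` are names made up for illustration:

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages; check the kernel config of the actual build

# Matches lines such as "alloc_contig_range: [a7b00, a7ee8) PFNs busy"
BUSY_RE = re.compile(r"alloc_contig_range: \[([0-9a-f]+), ([0-9a-f]+)\) PFNs busy")


def parse_busy_range(line):
    """Return (start_pfn, end_pfn, pages, size_in_bytes) for a 'PFNs busy' line, or None."""
    match = BUSY_RE.search(line)
    if match is None:
        return None
    start = int(match.group(1), 16)
    end = int(match.group(2), 16)
    pages = end - start  # the range is half-open: [start, end)
    return start, end, pages, pages * PAGE_SIZE


sample = "[ 152.966645] alloc_contig_range: [a7b00, a7ee8) PFNs busy"
start, end, pages, size = parse_busy_range(sample)
print(f"{pages} pages, {size / 2**20:.2f} MiB, physical start {start * PAGE_SIZE:#x}")
```

Under that page-size assumption, the first line of the log above spans 0x3e8 = 1000 pages, i.e. roughly 3.9 MiB.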

Sometimes, the application even hangs indefinitely. Note that a board with a B0 SOC running the exact same software (kernel + device tree + rootfs + application) doesn't have these symptoms at all. When the CMA size is increased manually to 1280 MiB, the problem still appears.

However, during one of my tests, I tried using the same CMA size as the one used in the i.MX8QXP MEK (960 MiB) and with this size, the behavior is vastly improved. The alloc_contig_range messages aren't completely gone, but they are far less frequent than before and I haven't seen the application hang since.
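For reference, the three CMA sizes tried so far can be put side by side with the 2 GiB of RAM on the SOM. This is only a small arithmetic sketch; `describe_cma` is a made-up helper name, and the hexadecimal form is shown simply because sizes are often written that way in device trees:

```python
MIB = 1024 * 1024
RAM_BYTES = 2 * 1024 * MIB  # 2 GiB, as stated for this SOM


def describe_cma(size_mib):
    """Return (size in bytes as a hex string, share of total RAM in percent)."""
    size = size_mib * MIB
    return hex(size), 100 * size / RAM_BYTES


for size_mib in (640, 960, 1280):
    size_hex, share = describe_cma(size_mib)
    print(f"{size_mib:>4} MiB = {size_hex} bytes, {share:.3f}% of RAM")
```

Note that the 1280 MiB setting reserves more than half of the available RAM for CMA.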

My two main questions are the following:

Why does a CMA size of 960 MiB seem to improve the issue and a size of 1280 MiB doesn't? This makes me think this is a kind of hardcoded alignment issue in the graphical libraries, but I'm not sure.
Why does this only happen on a C0 SOC and not a B0 one, even though all of the relevant components (GPU driver, GPU libraries, Qt libraries, application) are the same in both cases? Do the GPU libraries do things differently depending on the SOC revision?

Many thanks in advance,
Gabriel

gabrielvalcazar · ‎10-19-2021

Hi @igorpadykov,

We've told our customer that there aren't any differences between B0 and C0 SOCs when it comes to CMA allocation, but they are insistent on understanding the results they're seeing in their tests.

Again, the results are:

B0 SOC + 800x480 display = No CMA warnings
C0 SOC + 800x480 display = No CMA warnings
B0 SOC + 1280x800 display = No CMA warnings
C0 SOC + 1280x800 display = Frequent CMA warnings

I mentioned in my original post that the device also freezes sometimes, but the customer is still investigating if it's related to the CMA warnings, so there's a chance that said warnings might not be as harmful as we thought at first.

Having said this,

Are there any hardware/software differences in the GPU/display subsystem that _might_ be related to the CMA warnings, even if indirectly? I'm aware that you've mentioned several times that there shouldn't be, but the client is interested in understanding the overall differences between the SOCs to make better sense of their issue.
Besides long duration tests, is there any general way to confirm that the CMA warnings are harmless? In case there is no workaround for the warnings.

Many thanks,
Gabriel

igorpadykov · ‎10-19-2021

Hi Gabriel

I sent all your comments to the internal team; no feedback so far.

Best regards
igor

gabrielvalcazar · ‎10-25-2021

Hi Igor,

Thanks for forwarding my information to the team. I have a small update: the customer has tested lowering the screen resolution from 1280 to 1200 and the warnings still appear. Find the boot log with memblock=debug enabled attached, in case it can shed some light on the issue.

igorpadykov · ‎10-26-2021

below answer from team:

------------------

I checked the log and found nothing more. However, it looks like the B0 board has PCIe/BT/WiFi enabled, while the C0 board does not. Could we align the hardware first?

You mentioned that the customer is wondering whether the CMA warning is harmful. It's not harmful. But if the customer has hit a freeze issue, please provide the details and the logcat for the freeze.

------------------

Best regards
igor

gabrielvalcazar · ‎11-29-2021

Hi @igorpadykov ,

The customer that is experiencing this issue did more thorough testing, using different SOC/display/resolution combinations (and with other components ruled out, such as PCIe). It seems like the issue also happens on B0 chips using smaller displays, the only difference being that the CMA messages are only visible via the syslog, not via the kernel log:

2021-10-18T15:12:13.727781+02:00 ccimx8x-sbc-express kernel: [260636.790555] cma: cma_alloc: alloc failed, req-size: 1000 pages, ret: -16
2021-10-18T15:12:13.728515+02:00 ccimx8x-sbc-express weston[881]: ioctl c0184900 failed with code -1: Cannot allocate memory
2021-10-18T15:12:13.729038+02:00 ccimx8x-sbc-express weston[881]: ion_alloc_fd failed.
2021-10-18T15:12:13.729278+02:00 ccimx8x-sbc-express weston[881]: Fail to allocate physical memory for tmp buffer!
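The `cma_alloc` line above carries both the request size and a negative errno value. A minimal sketch for decoding it follows, again assuming 4 KiB pages (not confirmed in this thread); `decode_cma_failure` is a hypothetical helper name:

```python
import errno
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Matches "cma_alloc: alloc failed, req-size: 1000 pages, ret: -16"
FAIL_RE = re.compile(r"cma_alloc: alloc failed, req-size: (\d+) pages, ret: (-?\d+)")


def decode_cma_failure(line):
    """Extract the request size and the symbolic errno name from a cma_alloc failure line."""
    match = FAIL_RE.search(line)
    if match is None:
        return None
    pages = int(match.group(1))
    ret = int(match.group(2))
    return {
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
        # Kernel return codes are negated errno values
        "errno": errno.errorcode.get(-ret, "unknown"),
    }


line = "kernel: [260636.790555] cma: cma_alloc: alloc failed, req-size: 1000 pages, ret: -16"
print(decode_cma_failure(line))
```

On Linux, errno 16 is `EBUSY`, whereas the "Cannot allocate memory" text that weston prints is the standard message for `ENOMEM`, so the kernel and userspace are reporting different error codes for the same event.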

These messages appear periodically and, sometimes, the Qt app crashes after running for some time (even when idle). Judging by the syslog and the fact that smaller resolutions are less prone to fail, it seems like the problem is in weston.

Given that we use the NXP fork of weston with no modifications, is this a known issue? It's also worth noting that the Qt app is being launched directly as a weston desktop, rather than launching the default desktop first and then launching the app, but I don't think this should influence CMA usage or how weston renders objects on the display.

Another thing worth noting is that I tried searching for the error message that appears ("Fail to allocate physical memory for tmp buffer!") in any package/repository that might be involved (weston, wayland, linux, even the pre-compiled imx-gpu-viv libraries), but I couldn't find it anywhere. Where could this error be triggered from? From what I understand, it has something to do with the ION allocator (which I've seen used in other packages like gstreamer), but I haven't found anything related to ION allocation in any weston-related code.
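One clue that can be extracted from the syslog itself is the ioctl request number, `c0184900`. The sketch below splits it using the generic Linux ioctl number layout (direction, argument size, type character, command number); `decode_ioctl` is a name chosen for illustration:

```python
def decode_ioctl(cmd):
    """Split an ioctl request number using the generic Linux layout:
    nr in bits 0-7, type in bits 8-15, size in bits 16-29, direction in bits 30-31."""
    direction = {0: "none", 1: "write", 2: "read", 3: "read/write"}[(cmd >> 30) & 0x3]
    return {
        "dir": direction,
        "size": (cmd >> 16) & 0x3FFF,    # size of the argument struct in bytes
        "type": chr((cmd >> 8) & 0xFF),  # driver "magic" character
        "nr": cmd & 0xFF,                # command number within that driver
    }


print(decode_ioctl(0xC0184900))
# {'dir': 'read/write', 'size': 24, 'type': 'I', 'nr': 0}
```

A read/write ioctl with type character 'I', command number 0 and a 24-byte argument would be consistent with an ION allocation request, which fits the `ion_alloc_fd failed` line. That identification is an assumption, though, and should be checked against the ION UAPI header of the kernel actually in use.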

Attached is a .zip file containing a log where the error can be seen and several logs for the different combinations (normal kernel logs, kernel logs with "memblock=debug" enabled and syslogs) to see if there are any notable differences between the different configurations. So far, I haven't been able to find any.

Thanks again and best regards,

Gabriel

igorpadykov · ‎11-29-2021

Hi Gabriel

AFAIK your FAE is also working internally on that case. From now on, please continue to work with him to avoid duplication.

Best regards
igor

alfonsomartin · ‎07-27-2021

Hi all:

Any recommendation/guidance to understand those messages from the kernel?

Regards

Alfonso

igorpadykov · ‎07-27-2021

Hi Alfonso

The team is currently investigating this issue. Could you please confirm that the rev. C0 boards passed the latest version of the DDR test (and that the image was updated accordingly, following the DDR test documentation) using the latest RPA tool and SCFW Porting Kit v.1.7.0 (necessary for Linux 5.4.70)?

https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/i-MX-8-8X-Family-DDR-Tools-Release/ta-p/...

https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/i-MX8QXP-DXP-DX-DDR-Register-Programming...

> both running the same software (except for the imx-boot binary, which requires a different image depending on the SOC revision).

Just as a test, one can try rebuilding the whole image from scratch for the rev. C0 option. It is also recommended to check that all the libraries in use are aligned with L5.4.70 2.3.3, as described in the documentation:

Qt 5.15, imx-seco-3.7.4, firmware-imx-8.10.bin, imx-gpu-viv-6.4.3.p1.0.. libraries

Best regards
igor

HectorPalacios · ‎08-02-2021

Hi @igorpadykov

> Could you please confirm that rev.C0 boards passed latest version of ddr test (and image was updated accordingly using ddr test documentation) using latest RPA tool and SCFW Porting Kit v.1.7.0

Yes, DDR was tested with params from RPA v16 and SCFW 1.7.1.1.

> Also recommended to check if all used libraries were aligned for L5.4.70 2.3.3:
> Qt 5.15, imx-seco-3.7.4, firmware-imx-8.10.bin, imx-gpu-viv-6.4.3.p1.0.. libraries

We're using:

Qt 5.11 (this was a requirement that we cannot change)
imx-seco-3.7.5
firmware-imx-8.11
imx-gpu-viv-6.4.3.p1.2

Thanks

igorpadykov · ‎08-02-2021

Hi Hector

"alloc_contig_range ... busy" is just a warning that the kernel failed to get contiguous free physical memory on its first pass over the CMA range; it is not an error. Only a cma_alloc failure in the kernel log would indicate a real problem.
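To tell the two cases apart in a long log, one can count the messages of each kind. This is a minimal sketch that matches on the message text seen in this thread; `classify_cma_messages` and the labels are made up for illustration:

```python
from collections import Counter

# Substrings taken from the messages quoted in this thread
PATTERNS = {
    "pfn_busy_warning": "PFNs busy",
    "cma_alloc_failure": "cma_alloc: alloc failed",
}


def classify_cma_messages(lines):
    """Count retry warnings and hard cma_alloc failures in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
    return counts


log = [
    "[ 152.966645] alloc_contig_range: [a7b00, a7ee8) PFNs busy",
    "[ 152.978177] alloc_contig_range: [a7c00, a7fe8) PFNs busy",
    "[260636.790555] cma: cma_alloc: alloc failed, req-size: 1000 pages, ret: -16",
]
print(classify_cma_messages(log))
```

A log with only `pfn_busy_warning` hits would match the harmless case described above, while any `cma_alloc_failure` hit means an allocation was actually refused.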

For details one can look at

https://bugzilla.redhat.com/show_bug.cgi?id=1387793

https://www.spinics.net/lists/arm-kernel/msg535191.html

https://source.codeaurora.org/external/imx/linux-imx/commit/?h=imx_4.19.35_1.0.0&id=8bb01644fd821134...

https://source.codeaurora.org/external/imx/linux-imx/commit/?h=imx_4.19.35_1.0.0&id=147947e73ae47dd0...

https://source.codeaurora.org/external/imx/linux-imx/commit/?h=imx_4.19.35_1.0.0&id=75dddef32514f7aa...

Best regards
igor

HectorPalacios · ‎08-03-2021

Hi @igorpadykov

> "alloc_contig_range ... busy" is just a warning that the kernel failed to get contiguous free physical memory on its first pass over the CMA range; it is not an error. Only a cma_alloc failure in the kernel log would indicate a real problem.

As @gabrielvalcazar explained in his initial post:

> Sometimes, the application even hangs indefinitely.

So yes, that's a problem, not a simple warning.

Also, we want to understand:

why this behavior is only reproducible with C0 silicon and not with B0.
why the issue is mitigated when the reserved CMA matches the exact default value of the MEK device tree (960M) but not when the CMA is less or more than that.

Thanks

igorpadykov · ‎08-03-2021

Hi Hector @HectorPalacios

Your issue was escalated internally. Below is the answer from the team:

------------------

There are no big differences in the kernel between B0 and C0, and none on the memory side in particular. I do not think this is caused by the different SoC version. Could you let me know what the customer's application is doing? The CMA log is just a warning that no contiguous memory was available in the CMA range, which may trigger page migrations.

------------------

Best regards
igor

HectorPalacios · ‎08-10-2021

Hi @igorpadykov

> There are no big differences in the kernel between B0 and C0, and none on the memory side in particular. I do not think this is caused by the different SoC version.

Still, the customer doesn't see these issues using the same software on the same hardware when the SOC is a B0. The issues only appear when the SOC is a C0. Maybe there's something in the way U-Boot or the SCFW configures things.

> Could you let me know what the customer's application is doing?

They are using a Qt-based application. The user interface has different controls and sliders. There are CAN communications and I think they occasionally play videos on a 10" display.

PS

(I apologize for the post showing at the end. I pressed the reply button under the thread but it didn't post it where it should).

igorpadykov · ‎08-10-2021

Hi Hector

from team:

---------------

Please add "memblock=debug" into the kernel command line and compare the boot log between B0 and C0, see if there's any differences on the memory layout and reservations.

---------------
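Comparing two boot logs by eye is error-prone because every line starts with a different timestamp. The sketch below strips the timestamps and produces a unified diff; the function names are made up for illustration, and the demo lines are placeholders rather than real boot output:

```python
import difflib
import re

# Leading kernel timestamp such as "[    0.000000] "
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_timestamps(lines):
    """Remove the leading kernel timestamp and trailing newline from each line."""
    return [TIMESTAMP_RE.sub("", line.rstrip("\n")) for line in lines]


def diff_boot_logs(b0_lines, c0_lines):
    """Return unified-diff lines between two boot logs, ignoring timestamps."""
    return list(
        difflib.unified_diff(
            strip_timestamps(b0_lines),
            strip_timestamps(c0_lines),
            fromfile="B0",
            tofile="C0",
            lineterm="",
            n=0,  # no context lines, only the differences
        )
    )


# Placeholder lines, NOT real boot output
b0 = ["[    0.000000] same line\n", "[    0.000100] only on B0\n"]
c0 = ["[    0.000000] same line\n", "[    0.000250] only on C0\n"]
for entry in diff_boot_logs(b0, c0):
    print(entry)
```

In practice the two lists would come from reading the captured B0 and C0 log files with `readlines()`.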

Best regards
igor

gabrielvalcazar · ‎09-09-2021

Hi all,

Apologies for the late reply. We obtained some logs for the B0 and C0 boards, and although there is a small offset between the two, the sizes used appear to be pretty much the same. Some messages are also dropped by the kernel. Is there a way to prevent this so we can see all of the output?

Find the logs attached to this comment.

Best regards,

Gabriel

igorpadykov · ‎09-09-2021

from team :

-----------

Not many differences can be found here. Could you please capture the kernel boot log without memblock=debug and check where (base, offset) the CMA region is mapped?

-----------

Best regards
igor

gabrielvalcazar · ‎09-22-2021

Hi Igor,

Apologies for the delay. Here are the logs from a B0 target and a C0 one, without memblock=debug.

At first glance, it seems to me like the CMA region is reserved at the same offset on both targets.

igorpadykov · ‎09-23-2021

answer from team:

---------------

I doubt that the two boards are the same. Could you double-check with the customer whether the B0 and C0 boards use the same LCD panel, the same resolution and the same eMMC?

From the boot log, it seems the eMMC on the C0 board cannot reach HS400 speed, but the one on the B0 board can.

---------------

Best regards
igor

gabrielvalcazar · ‎09-23-2021

Hi Igor,

You are correct, the customer was using different setups (eMMC and display, mainly). They came back to us with corrected logs: a B0 and a C0, but with the same setup.

Let me know if you need anything else with this setup (such as the logs with memblock=debug enabled). Now that both SOC revisions are running on the same environment, it would be interesting for the customer to try to reproduce the issue again, as it might be caused by the eMMC/display instead of by the SOC revision.

igorpadykov · ‎09-23-2021

team:

------------------

What are the differences in display resolution? Is the display resolution on C0 larger than on B0?

Generally speaking, memory usage is higher at higher resolutions, which makes it easier for CMA to run out of memory.

-----------------
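To put rough numbers on that, the sketch below estimates how many pages a single full-screen buffer needs at the two resolutions mentioned in this thread. It assumes a 32-bit pixel format and 4 KiB pages, neither of which is confirmed here, and `buffer_pages` is a hypothetical helper name:

```python
PAGE_SIZE = 4096      # assumption: 4 KiB pages
BYTES_PER_PIXEL = 4   # assumption: 32-bit pixel format


def buffer_pages(width, height):
    """Pages needed for one contiguous full-screen buffer, rounded up."""
    size = width * height * BYTES_PER_PIXEL
    return (size + PAGE_SIZE - 1) // PAGE_SIZE


for width, height in ((800, 480), (1280, 800)):
    pages = buffer_pages(width, height)
    print(f"{width}x{height}: {pages} pages ({pages * PAGE_SIZE / 2**20:.2f} MiB)")
```

Under these assumptions, an 800x480 buffer needs 375 pages and a 1280x800 buffer needs 1000 pages. The latter figure equals the 1000-page requests seen in the logs in this thread, but the arithmetic alone does not show which buffer the failing requests belong to.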

Best regards
igor
