imx8mp gpu memory allocation kernel panic on 6.1.22 and 6.1.55

KobusGvid · ‎05-29-2024

We are experiencing a kernel panic with a memory dump when running AI models on our imx8mp based board, similar to what is described in p1739413 and p1753940. The panic seems to happen when processes are already using a lot of system memory and I then start a process that needs GPU memory.

I can trigger the error quite reliably by first starting the VS Code server on the device (to use up a fair chunk of system memory), and then running the following GStreamer pipeline:

gst-launch-1.0 videotestsrc ! video/x-raw,format=YUY2,width=1920,height=1080 ! vpuenc_h264 ! vpudec ! queue ! imxvideoconvert_g2d rotation=2 ! video/x-raw,width=800,height=480 ! videoconvert ! perf ! fakesink

I've found that depending on system load, I can even get the panic when just running g2d_basic_test.

Note that when this error triggers, there is still ample system memory available, often more than 800M. I'm guessing it just can't get a contiguous region big enough...
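One way to check the "plenty of free memory, but no contiguous region" guess is to compare the general and CMA counters in /proc/meminfo just before the panic. The following is a minimal sketch, assuming the kernel is built with CONFIG_CMA so that the CmaTotal and CmaFree fields are present; the helper names parse_meminfo and cma_summary are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: integer value} (values are in kB where a unit is given)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cma_summary(fields):
    """Return MemAvailable, CmaTotal and CmaFree in MiB (None if a field is missing)."""
    def mib(key):
        return None if key not in fields else fields[key] / 1024
    return {key: mib(key) for key in ("MemAvailable", "CmaTotal", "CmaFree")}
```

On the target this could be called as cma_summary(parse_meminfo(open("/proc/meminfo").read())). A large MemAvailable next to a small CmaFree would be consistent with the guess above, but it would not prove it.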

Our board is based on the Karp QSXP module, but we run the standard NXP kernel. The module has 2GB LPDDR4.
From the posts it seems the issue relates to the CMA and GPU reserved memory. I've looked at the device-tree changes between the imx8mp-evk and ours, and they are minimal:

// OURS
//
memory@40000000 {
	device_type = "memory";
	reg = <0x0 0x40000000 0 0x40000000>,
	      <0x1 0x00000000 0 0x40000000>;
};

&resmem {
	/* overwrite freescale cma setting since it's not allocatable on qsxp */
	linux,cma {
		size = <0 0x1e000000>;
		/delete-property/ alloc-ranges;
	};

	gpu_reserved: gpu_reserved@100000000 {
		no-map;
		reg = <0x1 0x00000000 0 0x10000000>;
	};
};


// EVK
//
memory@40000000 {
	device_type = "memory";
	reg = <0x0 0x40000000 0 0xc0000000>,
	      <0x1 0x00000000 0 0xc0000000>;
};

resmem: reserved-memory {
	#address-cells = <2>;
	#size-cells = <2>;
	ranges;

	/* global autoconfigured region for contiguous allocations */
	linux,cma {
		compatible = "shared-dma-pool";
		reusable;
		size = <0 0x3c000000>;
		alloc-ranges = <0 0x40000000 0 0xC0000000>;
		linux,cma-default;
	};

	gpu_reserved: gpu_reserved@100000000 {
		no-map;
		reg = <0x1 0x00000000 0 0x10000000>;
	};
};
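To make the two snippets easier to compare, the cell values can be decoded into bases and sizes. This is a small sketch, assuming #address-cells = <2> and #size-cells = <2> as in the EVK resmem node; join_cells and decode_reg are illustrative helper names, and all numbers are copied from the snippets above.

```python
MIB = 1024 * 1024


def join_cells(words):
    """Combine big-endian 32-bit cells into one integer."""
    value = 0
    for word in words:
        value = (value << 32) | word
    return value


def decode_reg(cells, address_cells=2, size_cells=2):
    """Split a flat list of 32-bit cells into (base, size) tuples."""
    stride = address_cells + size_cells
    if len(cells) % stride:
        raise ValueError("cell count is not a multiple of address+size cells")
    return [
        (join_cells(cells[i:i + address_cells]),
         join_cells(cells[i + address_cells:i + stride]))
        for i in range(0, len(cells), stride)
    ]


# "OURS" memory node: two banks.
ours_memory = decode_reg([0x0, 0x40000000, 0, 0x40000000,
                          0x1, 0x00000000, 0, 0x40000000])
# gpu_reserved is identical in both trees.
gpu_reserved = decode_reg([0x1, 0x00000000, 0, 0x10000000])

for base, size in ours_memory:
    print(f"ours bank:    base={base:#x} size={size // MIB} MiB")
for base, size in gpu_reserved:
    print(f"gpu_reserved: base={base:#x} size={size // MIB} MiB")
print("ours CMA size:", join_cells([0, 0x1e000000]) // MIB, "MiB")
print("EVK CMA size: ", join_cells([0, 0x3c000000]) // MIB, "MiB")
```

With these inputs the board tree describes two 1024 MiB banks and a 480 MiB CMA pool, the EVK tree uses a 960 MiB CMA pool, and gpu_reserved is 256 MiB at 0x100000000 in both.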

What I've tried:

- Adding alloc-ranges to the CMA node. (No change)

- Adding ldb_phy = okay to the device tree. (No change)

- Increasing/decreasing the CMA size.
  - This just makes it more fragile, or makes the VPU operations fail to allocate memory.

- Removing CMA from the device tree and specifying the CMA size with kernel parameters.
  - This always breaks the VPU, irrespective of size.

- Replacing the memory@40000000 entry with a single 2G entry instead of two 1G entries.

- Making gpu_reserved larger. (No change)

Any help would be appreciated. Is this expected behaviour when the cma can't make space? Is it possible to dedicate memory for gpu_reserved? i.e. don't share the space?
Are the allocation ranges okay? Is there something else in the kernel config I need to enable, or some GPU service I need in the image?

KobusGvid · ‎06-13-2024

Okay, after weeks of searching, I have figured out that the difference between the 5.15 and 6.1 images is the gpu_reserved@100000000 section.
https://github.com/nxp-imx/linux-imx/commit/2d5c743aef9613a209b8256e54730ed5a3066a47
https://github.com/nxp-imx/linux-imx/commit/af10b4943109e588c82dc0aa5dfb58b2e620b551

I was able to verify (on the 6.6 and 6.1 kernels) that the memory issue is gone if I exclude that section:
&mix_gpu_ml {
	/delete-property/ memory-region;
};

/delete-node/ &gpu_reserved;

For now I can use it this way, without the gpu-memory region.

At this point I suspect that the issue is with U-Boot. I've noticed that the memory address range specified in the kernel device tree is not used. Instead, it seems to be inferred based on what is detected/currently being used:

/* devicetree */
memory@40000000 {
	device_type = "memory";
	reg = <0x00 0x40000000 0x00 0x40000000 0x01 0x00 0x00 0x40000000>;
};

/* /proc/device-tree */
memory@40000000 {
	device_type = "memory";
	reg = <0x00 0x40000000 0x00 0x80000000>;
};

So I'm guessing the kernel just takes the memory setup from U-Boot, instead of reconfiguring it. If so, NXP probably changed their U-Boot memory configuration at the same point this change was introduced in the kernel.
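The difference between the two memory descriptions can be checked with simple range arithmetic. The sketch below only uses the values posted above; the helper name covered is hypothetical, and the result says nothing on its own about whether this is the cause of the panic.

```python
def covered(banks, base, size):
    """True if [base, base + size) lies entirely inside one (base, size) bank."""
    end = base + size
    return any(b <= base and end <= b + s for b, s in banks)


# Source device tree: two 1 GiB banks.
dts_banks = [(0x40000000, 0x40000000), (0x1_0000_0000, 0x40000000)]
# Runtime tree as posted above: a single 2 GiB bank.
runtime_banks = [(0x40000000, 0x80000000)]
# gpu_reserved: 256 MiB at 0x100000000.
gpu_base, gpu_size = 0x1_0000_0000, 0x10000000

print(covered(dts_banks, gpu_base, gpu_size))      # True
print(covered(runtime_banks, gpu_base, gpu_size))  # False
```

The single runtime bank ends at 0xC0000000, so the gpu_reserved region at 0x100000000 falls outside the memory the running kernel reports, while it sits inside the second bank of the source tree.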

KobusGvid · ‎06-03-2024

@Bio_TICFSL I have tried resetting to default imx8mp-evk kernel config, but nothing changed.

Do you perhaps have any further tips I can try?

Bio_TICFSL · ‎05-29-2024

Hello,

Yes, this is because you are using a non-supported kernel, and the GPU has to be used as a module. For a better trace of the GPU, you can use the 6.6.3v kernel, which is the official one from NXP.

https://www.nxp.com/design/design-center/software/embedded-software/i-mx-software/embedded-linux-for...

Regards

KobusGvid · ‎05-30-2024

Thanks for the reply. I am using a supported kernel, the Linux 6.1.55_2.2.1 release.

I assume you mean CONFIG_MXC_GPU_VIV=m instead of =y? I checked the arch/arm64/configs/imx_v8_defconfig file, and it uses CONFIG_MXC_GPU_VIV=y
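Whether the option is built in (=y) or a module (=m) can be read from any kernel config text. A minimal sketch follows; the function name kconfig_value is made up for illustration.

```python
def kconfig_value(text, option):
    """Return the value assigned to option ('y', 'm', ...) or None if it is unset or absent."""
    for line in text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return None
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None
```

For example, kconfig_value(open("arch/arm64/configs/imx_v8_defconfig").read(), "CONFIG_MXC_GPU_VIV") reads the defconfig mentioned above. On a running board, the effective config can be read the same way from /proc/config.gz (via gzip.open in text mode), assuming the kernel was built with CONFIG_IKCONFIG_PROC.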