Hello,
This is a follow-up to https://community.nxp.com/t5/i-MX-Processors/libGAL-segfaults-when-it-s-PID1/m-p/1388607 which has been stale for a month. That thread got complicated and its tests were not run on the BSP, so I wanted to start fresh with a new post.
Hardware: imx8mp evk SCH-46370 REV A1 with 8MPLUSLPD4 CPU board (rev x1)
Software: LF_v5.10.72-2.2.0_images_IMX8MPEVK BSP with 5.10.72-lts-5.10.y+ga68e31b63f86 kernel as is
Running the following commands leads to a segfault on the second command:
unshare -f -p ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true > /dev/null &
unshare -f -p ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true
with the following backtrace:
Core was generated by `./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true'.
Program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
#0 0x0000ffff964177f0 in ?? () from /usr/lib/libGAL.so
#1 0x0000ffff96417a44 in ?? () from /usr/lib/libGAL.so
#2 0x0000ffff9644df9c in ?? () from /usr/lib/libGAL.so
#3 0x0000ffff964064bc in gcoVX_CreateHW () from /usr/lib/libGAL.so
#4 0x0000ffff964066b0 in gcoVX_Construct () from /usr/lib/libGAL.so
#5 0x0000ffff964068dc in gcoVX_SwitchContext () from /usr/lib/libGAL.so
#6 0x0000ffff975440d0 in ?? () from /usr/lib/libOpenVX.so.1
#7 0x0000ffff97798eb8 in vsi_nn_CreateContext () from /usr/lib/libovxlib.so.1.1.0
#8 0x0000ffff97bb6798 in nnrt::Execution::Execution(nnrt::Compilation*) () from /usr/lib/libnnrt.so.1
#9 0x0000ffff97cda76c in ANeuralNetworksExecution_create () from /usr/lib/libneuralnetworks.so
#10 0x0000ffff981dc684 in tflite::delegate::nnapi::NNAPIDelegateKernel::Invoke(TfLiteContext*, TfLiteNode*, int*) () from /usr/lib/libtensorflow-lite.so.2.6.0
#11 0x0000ffff981ce524 in tflite::Subgraph::Invoke() () from /usr/lib/libtensorflow-lite.so.2.6.0
#12 0x0000ffff98388590 in tflite::Interpreter::Invoke() () from /usr/lib/libtensorflow-lite.so.2.6.0
#13 0x0000aaaabb45fc6c in ?? ()
#14 0x0000aaaabb462c10 in ?? ()
#15 0x0000aaaabb45de68 in ?? ()
#16 0x0000ffff97d64994 in __libc_start_main (main=0xaaaabb45d780, argc=3, argv=0xfffff4999f88, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
stack_end=<optimized out>) at ../csu/libc-start.c:332
#17 0x0000aaaabb45dbb8 in ?? ()
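This is because the gpu-viv driver allocates resources based on the PID number, and it takes the PID as seen from the caller's own PID namespace (task_tgid_vnr). Two processes started with unshare -f -p are each PID 1 in their own namespace, so they share the same PID from the driver's point of view, and the driver doesn't handle that well.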
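To illustrate the collision, here is a minimal userspace sketch. It is not galcore code: the table, the function names and the PID values (4321, 4330) are all made up for illustration. It only models a resource table keyed by a bare PID number, and compares keying by the namespace-local PID with keying by the global PID.

```c
#include <assert.h>

/* Hypothetical model, NOT galcore code: a table of per-process
 * resources where the only key is a PID number. */
#define MAX_ENTRIES 8

struct entry { int pid; const char *owner; };
struct table { struct entry e[MAX_ENTRIES]; int count; };

/* Register an owner under a PID.
 * Returns 0 on success, -1 if the PID is already registered
 * (collision), -2 if the table is full. */
static int table_register(struct table *t, int pid, const char *owner)
{
    assert(t->count <= MAX_ENTRIES);
    for (int i = 0; i < t->count; i++)
        if (t->e[i].pid == pid)
            return -1;
    if (t->count == MAX_ENTRIES)
        return -2;
    t->e[t->count].pid = pid;
    t->e[t->count].owner = owner;
    t->count++;
    return 0;
}

/* One process as seen from both sides: its PID in the init
 * namespace and its PID inside its own PID namespace. */
struct proc { int global_pid; int ns_pid; const char *name; };

/* Register every process and count how many registrations collide.
 * use_global selects which of the two PIDs is used as the key. */
static int count_collisions(const struct proc *p, int n, int use_global)
{
    struct table t = { .count = 0 };
    int collisions = 0;

    for (int i = 0; i < n; i++) {
        int key = use_global ? p[i].global_pid : p[i].ns_pid;
        if (table_register(&t, key, p[i].name) == -1)
            collisions++;
    }
    return collisions;
}
```

With two processes described as { 4321, 1, "first" } and { 4330, 1, "second" }, count_collisions() reports one collision when keyed by ns_pid and none when keyed by global_pid. How the real driver reacts to the duplicate key is not modelled here.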
I was able to work around this issue with the following patch, also accessible here: https://github.com/atmark-techno/linux-5.10-at/commit/b4de9635b00ba52fafc35b953f20260eb78f593e
From b53c1a5bcc28db552dcc28fbb52289b5d043396c Mon Sep 17 00:00:00 2001
From: Dominique Martinet <dominique.martinet@atmark-techno.com>
Date: Tue, 8 Feb 2022 16:07:12 +0900
Subject: [PATCH] gpu-viv: make galcore functions use the global init pid
namespace
using the container namespace leads to crashes when multiple processes
have the same PID
diff --git a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
index 852b2f552460..2de1d984cc99 100644
--- a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
+++ b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
@@ -97,7 +97,7 @@ typedef va_list gctARGUMENTS;
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
# define gcmkGETPROCESSID() \
- task_tgid_vnr(current)
+ task_tgid_nr(current)
#else
# define gcmkGETPROCESSID() \
current->tgid
@@ -105,7 +105,7 @@ typedef va_list gctARGUMENTS;
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
# define gcmkGETTHREADID() \
- task_pid_vnr(current)
+ task_pid_nr(current)
#else
# define gcmkGETTHREADID() \
current->pid
diff --git a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
index a436edb11d9a..57b0629569aa 100644
--- a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
+++ b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
@@ -330,7 +330,7 @@ _GetProcessID(
)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
- return task_tgid_vnr(current);
+ return task_tgid_nr(current);
#else
return current->tgid;
#endif
diff --git a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_os.c b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_os.c
index 5532efadd1e1..a0b274a35288 100644
--- a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_os.c
+++ b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_os.c
@@ -109,7 +109,7 @@ _GetThreadID(
)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
- return task_pid_vnr(current);
+ return task_pid_nr(current);
#else
return current->pid;
#endif
I'd appreciate an acknowledgement of this issue, as well as further analysis of the possible side effects of my patch. I have no way of checking what the closed-source libGAL and other gpu-viv libraries do with that PID (e.g. there could be unwanted side effects if they notice somewhere that the PID doesn't match).
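For reference, the mismatch in question can be observed from userspace. With the patch the kernel side uses the global PID (task_tgid_nr), while getpid() in the process still returns the namespace-local one. The sketch below parses the NStgid line of /proc/self/status, which as far as I understand proc(5) lists the thread group ID in each nested PID namespace, outermost first (relative to the namespace that mounted that procfs). The helper names parse_nstgid and read_own_nstgid are my own, and this only shows the two numbers, not what libGAL does with them.

```c
#include <stdio.h>
#include <string.h>

/* Parse an "NStgid:" line as found in /proc/<pid>/status.
 * Stores up to max values in out[], outermost namespace first.
 * Returns the number of values stored, or 0 if the line is not
 * an NStgid line. */
static int parse_nstgid(const char *line, int *out, int max)
{
    static const char key[] = "NStgid:";
    int n = 0;

    if (strncmp(line, key, sizeof key - 1) != 0)
        return 0;
    line += sizeof key - 1;

    while (n < max) {
        int value, used;
        /* %d skips leading tabs/spaces; %n records how far we read. */
        if (sscanf(line, "%d%n", &value, &used) != 1)
            break;
        out[n++] = value;
        line += used;
    }
    return n;
}

/* Read the NStgid values of the calling process.
 * Returns the number of values found, 0 on error or if the field
 * is missing (it needs a kernel that provides NStgid). */
static int read_own_nstgid(int *out, int max)
{
    char line[256];
    int n = 0;
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return 0;
    while (n == 0 && fgets(line, sizeof line, f))
        n = parse_nstgid(line, out, max);
    fclose(f);
    return n;
}
```

Run inside unshare -f -p (with /proc still the one from the outer namespace), I would expect two values: the first being what the patched driver sees and the last being what getpid() returns, but I have not checked this against the closed-source libraries.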
Thank you.
Hello martinetd,
We know about this issue on previous BSPs, but it is supposed to be fixed in 5.10.72_2.2.0. If you still see the failure, please let us know.
Regards
Hello @Bio_TICFSL
The trace I gave here was on 5.10.72_2.2.0 BSP.
In the previous post, I incorrectly said this was fixed in this version. That was because the benchmark_model default changed from NPU to CPU, and the crash doesn't happen when using the CPU. Using the NPU still fails the same way.
Thank you!