gpu-viv bug when using PID namespaces


martinetd
Contributor IV

Hello,

This is a follow-up to https://community.nxp.com/t5/i-MX-Processors/libGAL-segfaults-when-it-s-PID1/m-p/1388607 which has been stale for a month. That thread had become complicated and its tests were not run on the stock BSP, so I wanted to start fresh with a new post.

 

Hardware: imx8mp evk SCH-46370 REV A1 with 8MPLUSLPD4 CPU board (rev x1)

Software: LF_v5.10.72-2.2.0_images_IMX8MPEVK BSP with 5.10.72-lts-5.10.y+ga68e31b63f86 kernel as is

 

Running the following commands leads to a segfault in the second command:

unshare -f -p ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true > /dev/null &

unshare -f -p ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true

with the following backtrace:

 

 

Core was generated by `./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true'.
Program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
#0  0x0000ffff964177f0 in ?? () from /usr/lib/libGAL.so
#1  0x0000ffff96417a44 in ?? () from /usr/lib/libGAL.so
#2  0x0000ffff9644df9c in ?? () from /usr/lib/libGAL.so
#3  0x0000ffff964064bc in gcoVX_CreateHW () from /usr/lib/libGAL.so
#4  0x0000ffff964066b0 in gcoVX_Construct () from /usr/lib/libGAL.so
#5  0x0000ffff964068dc in gcoVX_SwitchContext () from /usr/lib/libGAL.so
#6  0x0000ffff975440d0 in ?? () from /usr/lib/libOpenVX.so.1
#7  0x0000ffff97798eb8 in vsi_nn_CreateContext () from /usr/lib/libovxlib.so.1.1.0
#8  0x0000ffff97bb6798 in nnrt::Execution::Execution(nnrt::Compilation*) () from /usr/lib/libnnrt.so.1
#9  0x0000ffff97cda76c in ANeuralNetworksExecution_create () from /usr/lib/libneuralnetworks.so
#10 0x0000ffff981dc684 in tflite::delegate::nnapi::NNAPIDelegateKernel::Invoke(TfLiteContext*, TfLiteNode*, int*) () from /usr/lib/libtensorflow-lite.so.2.6.0
#11 0x0000ffff981ce524 in tflite::Subgraph::Invoke() () from /usr/lib/libtensorflow-lite.so.2.6.0
#12 0x0000ffff98388590 in tflite::Interpreter::Invoke() () from /usr/lib/libtensorflow-lite.so.2.6.0
#13 0x0000aaaabb45fc6c in ?? ()
#14 0x0000aaaabb462c10 in ?? ()
#15 0x0000aaaabb45de68 in ?? ()
#16 0x0000ffff97d64994 in __libc_start_main (main=0xaaaabb45d780, argc=3, argv=0xfffff4999f88, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=<optimized out>) at ../csu/libc-start.c:332
#17 0x0000aaaabb45dbb8 in ?? ()

 

 

 

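This is because the gpu-viv driver allocates resources based on the PID number. With unshare -f -p, each benchmark_model process runs as PID 1 inside its own PID namespace, so the driver sees two processes with the same PID holding resources at the same time, and it does not handle that case.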
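
To illustrate the failure mode, here is a minimal userspace sketch. It is not the galcore code: it only assumes that the driver keeps one record per client and looks it up by the PID the client reports. The names client_record, client_table, attach_client and count_collisions are hypothetical, and the PID values used with them are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CLIENTS 8

/* Hypothetical per-client record, looked up only by the reported PID. */
struct client_record {
    int in_use;
    int pid;
};

struct client_record client_table[MAX_CLIENTS];

/* Returns 1 if a record with this PID already exists (collision),
 * 0 if a fresh record was created, -1 if the table is full. */
int attach_client(int pid)
{
    size_t i;

    for (i = 0; i < MAX_CLIENTS; i++)
        if (client_table[i].in_use && client_table[i].pid == pid)
            return 1;

    for (i = 0; i < MAX_CLIENTS; i++) {
        if (!client_table[i].in_use) {
            client_table[i].in_use = 1;
            client_table[i].pid = pid;
            return 0;
        }
    }
    return -1;
}

/* Clears the table, attaches every PID in order, and returns how many
 * attaches landed on a record that another client already owned. */
int count_collisions(const int *pids, size_t n)
{
    size_t i;
    int collisions = 0;

    assert(n <= MAX_CLIENTS);
    memset(client_table, 0, sizeof(client_table));
    for (i = 0; i < n; i++)
        if (attach_client(pids[i]) == 1)
            collisions++;
    return collisions;
}
```

Calling count_collisions with the PIDs the two processes see inside their namespaces, {1, 1}, reports one collision. Calling it with two distinct global PIDs, for example {4321, 4322}, reports none.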

 

I was able to work around this issue with the following patch, also accessible here: https://github.com/atmark-techno/linux-5.10-at/commit/b4de9635b00ba52fafc35b953f20260eb78f593e

 

 

From b53c1a5bcc28db552dcc28fbb52289b5d043396c Mon Sep 17 00:00:00 2001
From: Dominique Martinet <dominique.martinet@atmark-techno.com>
Date: Tue, 8 Feb 2022 16:07:12 +0900
Subject: [PATCH] gpu-viv: make galcore functions use the global init pid
 namespace

using the container namespace leads to crashes when multiple processes
have the same PID

diff --git a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
index 852b2f552460..2de1d984cc99 100644
--- a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
+++ b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
@@ -97,7 +97,7 @@ typedef va_list gctARGUMENTS;
 
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
 #   define gcmkGETPROCESSID() \
-        task_tgid_vnr(current)
+        task_tgid_nr(current)
 #else
 #   define gcmkGETPROCESSID() \
         current->tgid
@@ -105,7 +105,7 @@ typedef va_list gctARGUMENTS;
 
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
 #   define gcmkGETTHREADID() \
-        task_pid_vnr(current)
+        task_pid_nr(current)
 #else
 #   define gcmkGETTHREADID() \
         current->pid
diff --git a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
index a436edb11d9a..57b0629569aa 100644
--- a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
+++ b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
@@ -330,7 +330,7 @@ _GetProcessID(
     )
 {
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
-    return task_tgid_vnr(current);
+    return task_tgid_nr(current);
 #else
     return current->tgid;
 #endif
diff --git a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_os.c b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_os.c
index 5532efadd1e1..a0b274a35288 100644
--- a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_os.c
+++ b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_os.c
@@ -109,7 +109,7 @@ _GetThreadID(
     )
 {
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
-    return task_pid_vnr(current);
+    return task_pid_nr(current);
 #else
     return current->pid;
 #endif
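
The patch swaps the namespace-relative helpers (task_tgid_vnr, task_pid_vnr) for the ones that report IDs as seen from the initial PID namespace (task_tgid_nr, task_pid_nr). The difference can be modelled in userspace as follows. This is a sketch only: struct model_task and the model_* functions are hypothetical stand-ins, not kernel code, and they cover only the case where the helper is called on the current task, as in the patch.

```c
#include <assert.h>

#define MAX_NS_DEPTH 4

/* Simplified stand-in for per-namespace PID numbering: index 0 is the
 * initial namespace, index `depth` is the task's own namespace. */
struct model_task {
    int depth;                  /* nesting level of the task's PID namespace */
    int numbers[MAX_NS_DEPTH];  /* the task's ID as seen from each level */
};

/* Analogue of task_tgid_nr(current): ID in the initial namespace. */
int model_tgid_nr(const struct model_task *t)
{
    return t->numbers[0];
}

/* Analogue of task_tgid_vnr(current): ID in the task's own namespace. */
int model_tgid_vnr(const struct model_task *t)
{
    assert(t->depth >= 0 && t->depth < MAX_NS_DEPTH);
    return t->numbers[t->depth];
}
```

For two tasks that are each PID 1 in their own namespace at depth 1, with made-up global IDs 4321 and 4322, model_tgid_vnr returns 1 for both, while model_tgid_nr returns 4321 and 4322. Only the second gives the driver a unique key.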

 

 

 

I'd appreciate acknowledgement of this issue, as well as further analysis of the possible side effects of my patch. I have no way of checking what the closed-source libGAL and other gpu-viv libraries do with that PID (e.g. there could be unwanted side effects if they notice somewhere that the PID doesn't match).

Thank you.

Bio_TICFSL
NXP TechSupport

Hello martinetd,

 

We know about this issue on previous BSPs, but it is supposed to be fixed in 5.10.72_2.2.0. If you still see the failure, please let us know.

 

Regards

martinetd
Contributor IV

Hello @Bio_TICFSL 

The trace I gave here was on 5.10.72_2.2.0 BSP.

In the previous post, I incorrectly said this was fixed in this version. The benchmark_model default changed from NPU to CPU, and the crash doesn't happen when using the CPU, but using the NPU fails the same way.

 

Thank you!

Bio_TICFSL
NXP TechSupport

Hi,

 

OK, this will need to go to the developers. It will take time to get an answer; it is possible that a fix will appear in the next release of the BSP.

 

Regards

martinetd
Contributor IV

Thank you! Please do not hesitate to reach out to me if you or the developers have any questions.
