Hello,
I've got a bit of a weird bug with no simple reproducer (yet), but I'd be interested if someone with access to the source could take a first look and check whether there is some odd check on which PID the current process has, as I cannot see anything else that would explain it.
My setup just starts a single tensorflow/libneuralnetwork application in a container like this, which segfaults 100% of the time during init:
podman run -ti --name camera --replace --env=XDG_RUNTIME_DIR=/run/xdg_home --device=/dev/video3 --device=/dev/galcore --volume=/tmp/xdg_home:/run/xdg_home localhost/demo python3 /root/demo_app/detect_object.py
Whereas just wrapping the command in sh works (note that using 'exec python3' here also segfaults, so it really seems related to being PID 1?):
podman run -ti --name camera --replace --env=XDG_RUNTIME_DIR=/run/xdg_home --device=/dev/video3 --device=/dev/galcore --volume=/tmp/xdg_home:/run/xdg_home localhost/demo sh -c "python3 /root/demo_app/detect_object.py"
Here's a backtrace and some limited traces (registers, disassembly around the fault address) -- note we're using imx-gpu-viv-6.4.3.p1.2-aarch64 (from the 5.10.9_1.0.0 release) on Debian, as newer versions require symbols from glibc 2.33 which is not available for us. The kernel is (similar to) lf-5.10.72-2.2.0.
#0 0x0000ffffa6f36c40 in ?? () from /usr/lib/aarch64-linux-gnu/libGAL.so
#1 0x0000ffffa6f36e94 in ?? () from /usr/lib/aarch64-linux-gnu/libGAL.so
#2 0x0000ffffa6f6cb6c in ?? () from /usr/lib/aarch64-linux-gnu/libGAL.so
#3 0x0000ffffa6f2595c in gcoVX_CreateHW () from /usr/lib/aarch64-linux-gnu/libGAL.so
#4 0x0000ffffa6f25b50 in gcoVX_Construct () from /usr/lib/aarch64-linux-gnu/libGAL.so
#5 0x0000ffffa6f25d7c in gcoVX_SwitchContext () from /usr/lib/aarch64-linux-gnu/libGAL.so
#6 0x0000ffffa804af20 in ?? () from /usr/lib/aarch64-linux-gnu/libOpenVX.so.1
#7 0x0000ffffa8290358 in vsi_nn_CreateContext () from /usr/lib/aarch64-linux-gnu/libovxlib.so.1.1.0
#8 0x0000ffffa86238d4 in nnrt::Execution::Execution(nnrt::Compilation*) () from /usr/lib/aarch64-linux-gnu/libnnrt.so.1.1.9
#9 0x0000ffffa8742454 in ANeuralNetworksExecution_create () from /usr/lib/aarch64-linux-gnu/libneuralnetworks.so.1
#10 0x0000ffffa8ce5204 in tflite::delegate::nnapi::NNAPIDelegateKernel::Invoke(TfLiteContext*, TfLiteNode*, int*) ()
from /usr/lib/python3/dist-packages/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper.so
#11 0x0000ffffa8d02e80 in tflite::Subgraph::Invoke() () from /usr/lib/python3/dist-packages/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper.so
#12 0x0000ffffa8c2c094 in tflite::Interpreter::Invoke() () from /usr/lib/python3/dist-packages/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper.so
#13 0x0000ffffa8c0ee3c in tflite::interpreter_wrapper::InterpreterWrapper::Invoke() ()
from /usr/lib/python3/dist-packages/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper.so
#14 0x0000ffffa8c134d8 in ?? () from /usr/lib/python3/dist-packages/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper.so
#15 0x0000ffffa8c251b8 in ?? () from /usr/lib/python3/dist-packages/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper.so
#16 0x00000000004cac54 in ?? ()
#17 0x00000000004a5300 in _PyObject_MakeTpCall ()
#18 0x00000000004c6cc8 in ?? ()
#19 0x000000000049c258 in _PyEval_EvalFrameDefault ()
#20 0x00000000004b1a48 in _PyFunction_Vectorcall ()
#21 0x0000000000498218 in _PyEval_EvalFrameDefault ()
#22 0x00000000004b1a48 in _PyFunction_Vectorcall ()
#23 0x0000000000498064 in _PyEval_EvalFrameDefault ()
#24 0x00000000004b1a48 in _PyFunction_Vectorcall ()
#25 0x0000000000498064 in _PyEval_EvalFrameDefault ()
#26 0x00000000004964f8 in ?? ()
#27 0x0000000000496290 in _PyEval_EvalCodeWithName ()
#28 0x00000000005976fc in PyEval_EvalCode ()
#29 0x00000000005c850c in ?? ()
#30 0x00000000005c2520 in ?? ()
#31 0x00000000005c8458 in ?? ()
#32 0x00000000005c7c38 in PyRun_SimpleFileExFlags ()
#33 0x00000000005b7afc in Py_RunMain ()
#34 0x0000000000587638 in Py_BytesMain ()
#35 0x0000ffffb0a96218 in __libc_start_main (main=0x587538 <_start+56>, argc=11, argv=0xffffc90bd6e8, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:308
#36 0x0000000000587534 in _start ()
(gdb) info proc map
Mapped address spaces:
...
0xffffa6ec8000 0xffffa7067000 0x19f000 0x0 /usr/lib/aarch64-linux-gnu/libGAL.so
0xffffa7067000 0xffffa7076000 0xf000 0x19f000 /usr/lib/aarch64-linux-gnu/libGAL.so
0xffffa7076000 0xffffa7078000 0x2000 0x19e000 /usr/lib/aarch64-linux-gnu/libGAL.so
0xffffa7078000 0xffffa7089000 0x11000 0x1a0000 /usr/lib/aarch64-linux-gnu/libGAL.so
(gdb) info reg
x0 0x801028a 134283914
x1 0x0 0
x2 0x28a 650
x3 0x2270b600 577811968
x4 0xffffc90bbba8 281474054732712
x5 0x4f3bf83c 1329330236
x6 0x8d4d90 9260432
x7 0x0 0
x8 0x21f5d010 569757712
x9 0x2270b7f0 577812464
x10 0x0 0
x11 0x0 0
x12 0x0 0
x13 0x1 1
x14 0x2 2
x15 0x20 32
x16 0xffffa7076de0 281473484025312
x17 0xffffa6ef19e0 281473482430944
x18 0x0 0
x19 0x22724940 577915200
x20 0x28a 650
x21 0x228f8650 579831376
x22 0x1 1
x23 0x1 1
x24 0x2286a1a0 579248544
x25 0xffff79cfc3a0 281472725402528
x26 0xffffc90bbbdc 281474054732764
x27 0x0 0
x28 0x0 0
x29 0xffffc90bbb20 281474054732576
x30 0xffffa6f36c10 281473482714128
sp 0xffffc90bbb20 0xffffc90bbb20
pc 0xffffa6f36c40 0xffffa6f36c40
cpsr 0x60001000 [ EL=0 SSBS C Z ]
fpsr 0x8000010 134217744
fpcr 0x0 0
(gdb) disass 0x0000ffffa6f36c40, 0x0000ffffa6f36c80
Dump of assembler code from 0xffffa6f36c40 to 0xffffa6f36c80:
=> 0x0000ffffa6f36c40: str w0, [x25], #4
0x0000ffffa6f36c44: adrp x26, 0xffffa703e000
0x0000ffffa6f36c48: add x1, x26, #0xe98
0x0000ffffa6f36c4c: add x0, sp, #0x90
0x0000ffffa6f36c50: add w24, w22, w20
0x0000ffffa6f36c54: sub x23, x5, #0x4
0x0000ffffa6f36c58: mov x26, x25
0x0000ffffa6f36c5c: mov w27, #0xc // #12
0x0000ffffa6f36c60: str x0, [sp, #104]
0x0000ffffa6f36c64: str x1, [sp, #120]
0x0000ffffa6f36c68: b 0xffffa6f36c94
0x0000ffffa6f36c6c: add x0, x21, x2
0x0000ffffa6f36c70: ldr w1, [sp, #100]
0x0000ffffa6f36c74: str w20, [x21, x2]
0x0000ffffa6f36c78: stp w1, w4, [x0, #4]
0x0000ffffa6f36c7c: ldr w1, [x19, #12]
When I have time I'll try to reproduce this on a Yocto build with updated versions, but that might take a while. I'm sorry I can't share the Python code either, so I'll have to start with a reproducer I can share...
Anyway, please let me know if you have an idea!
Hello martinetd,
Have you tried running other demos in this way, for example label_image.py that comes with the default BSP? Does this issue only happen in this case?
Regards
Hello @Bio_TICFSL ,
unfortunately it looks like my statement that this was fixed in the newer (5.10.72-2.2.0) BSP was flawed... because label_image.py used to be an NPU workload and no longer is on the newer BSP! And if the NPU isn't used, it no longer crashes... (I confirmed this with gmem_info now that I've heard about it, and the warm-up and running times are also totally different.)
Using ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true instead does run on the NPU and crashes the same way on both images.
Now, with gmem_info we can observe that the PID listed is the one from inside the container; below is a snapshot where PID 2 does not exist (so no command name is shown) outside the container where the command is running:
bash-5.1# gmem_info
Pid Total Reserved Contiguous Virtual Nonpaged Name
2 10,664,960 10,394,944 270,016 0 0
1 2,547,968 1,966,336 581,632 0 0 /sbin/init
------------------------------------------------------------------------------
2 13,212,928 12,361,280 851,648 0 0 Summary
- - 256,074,176 - - - Available
GPU Idle time: 0.000000 ms
That gave me the idea of running two containers, each with an NPU workload on PID 2, and sure enough the second process to start segfaults as well.
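To illustrate why every container reports the same small PID, here is a minimal standalone sketch (hypothetical demo code, not part of the BSP or of my application) that creates a new PID namespace, much like unshare -f -p or podman do, and prints the two views of the same process:

/* pidns_demo.c -- hypothetical illustration, build with: gcc -o pidns_demo pidns_demo.c
 * Run as root: the child prints its namespace-local PID (1), the parent
 * prints the global PID, which is the only one that is actually unique. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    if (unshare(CLONE_NEWPID) < 0) {   /* needs CAP_SYS_ADMIN */
        perror("unshare(CLONE_NEWPID)");
        return 1;
    }
    pid_t child = fork();              /* first child becomes PID 1 in the new namespace */
    if (child < 0) {
        perror("fork");
        return 1;
    }
    if (child == 0) {
        printf("inside namespace: getpid() = %d\n", getpid());
        return 0;
    }
    printf("from outside: child's global PID = %d\n", child);
    waitpid(child, NULL, 0);
    return 0;
}

Two containers started this way both hand out PIDs starting from 1, so if the driver records the namespace-local number, it will see the same value coming from different containers.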
So:
- presumably the problem isn't in userspace but rather some collision in a shared PID table. Looking at the gpu-viv kernel driver, the PID is used in various places, and updating these to use the global PID table instead of the local namespace's PID might fix the issue. Since the kernel source is available I might try messing with it when I have time later, but registering the issue would be appreciated.
- while updating doesn't fix this issue, I'd still like to update anyway; would you have an answer to my question about glibc?
Thanks!
> - presumably the problem isn't in userspace but rather some collision in a shared PID table. Looking at the gpu-viv kernel driver, the PID is used in various places, and updating these to use the global PID table instead of the local namespace's PID might fix the issue. Since the kernel source is available I might try messing with it when I have time later, but registering the issue would be appreciated.
So I can at least confirm this lead: the ugly patch below, which uses the init namespace PID instead of the namespace-local PID, allows my application to start as PID 1:
diff --git a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
index 852b2f552460..322045fbd2cb 100644
--- a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
+++ b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
@@ -97,7 +97,7 @@ typedef va_list gctARGUMENTS;
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
# define gcmkGETPROCESSID() \
- task_tgid_vnr(current)
+ task_tgid_nr(current)
#else
# define gcmkGETPROCESSID() \
current->tgid
diff --git a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
index a436edb11d9a..57b0629569aa 100644
--- a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
+++ b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
@@ -330,7 +330,7 @@ _GetProcessID(
)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
- return task_tgid_vnr(current);
+ return task_tgid_nr(current);
#else
return current->tgid;
#endif
Unfortunately it also breaks gmem_info (I'd have assumed the PIDs would just be a bit off, but nothing shows at all), so something a bit less absolute is required to map each PID to the right value in the right places...
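One direction that might be less absolute (only a sketch against the generic kernel PID API; the struct and helper names are hypothetical and I haven't wired this into the gpu-viv code): keep a reference to the task's struct pid instead of a raw number, key the internal tables on the globally unique value, and translate back to a namespace-local number only when formatting output such as gmem_info:

/* Sketch only -- not actual gpu-viv driver code. */
#include <linux/pid.h>
#include <linux/sched.h>

struct gc_process_record {
    struct pid *tgid;   /* reference-counted, namespace-agnostic handle */
};

static void gc_record_current(struct gc_process_record *rec)
{
    /* Take a reference to the thread-group pid of the calling task. */
    rec->tgid = get_task_pid(current, PIDTYPE_TGID);
}

static pid_t gc_record_global_pid(struct gc_process_record *rec)
{
    /* Globally unique number: safe to use as a lookup key. */
    return pid_nr(rec->tgid);
}

static pid_t gc_record_pid_for_reader(struct gc_process_record *rec)
{
    /* Number as seen from the PID namespace of the task asking,
     * e.g. the process reading the gmem_info debug file. */
    return pid_vnr(rec->tgid);
}

static void gc_record_release(struct gc_process_record *rec)
{
    put_pid(rec->tgid);
    rec->tgid = NULL;
}

Since pid_vnr() resolves relative to the namespace of the caller, gmem_info run on the host or inside a container should then both show numbers that make sense locally.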
Thanks for your suggestion!
I was able to run label_image.py from the lf-5.10.72-2.2.0 bsp with our kernel, so the update might have fixed it.
Unfortunately, the libGAL.so in lf-5.10.72-2.2.0 is not usable in our containers: it depends on glibc 2.33, which is too recent for Debian bullseye, which we use.
As these are shipped only as binaries I'm out of luck there; I'll try with an older BSP to confirm the problem lies in the old version and isn't something else in my environment.
Not being able to update is a problem regardless of this bug, though: would you happen to have any suggestion on how to run imx-gpu-viv-6.4.3.p2.4-aarch64 or similar newer versions on Debian? If sources were supplied we would be happy to rebuild everything, but there's not much we can do with binaries...
Hello,
Does this problem only happen on Debian? Debian is not supported, only Yocto is, and on Yocto it works.
Regards
Hello @Bio_TICFSL , thanks for the reply!
You've replied here to one of my older posts, and there are two issues at hand, let me recap:
- the missing glibc symbol when trying to use the latest gpu-viv userspace binaries on Debian is definitely a Debian problem (it could also happen on an older Yocto, but that would be just as unsupported). I'd appreciate help here, but it's a bit off topic and, as you pointed out, not supported; please consider it more of a user request to eventually broaden support, as there are users. Let's drop that issue here; I'll open a new thread after a bit more testing, as it's best to keep issues separate.
- (there's also a side bug that label_image.py in the BSP used to do an NPU computation but now is 100% CPU; that made me wrongly think the bug was not present in the newer BSP, but it definitely is when running with benchmark_model --use_nnapi=true. I don't particularly care about this bug, but you might, as performance is quite different and it might give a false impression of the NPU being slow.)
- the segfault when using NPU applications when there is a PID collision - I've been using the BSP image inside containers as is (imx-image-full-imx8mpevk.tar.bz2 from LF_v5.10.72-2.2.0_images_IMX8MPEVK). I have a recompiled kernel for full container support, but it's basically the same as the one provided. Now that I've trimmed down the issue, I think you should be able to reproduce it with the following command on the stock kernel:
cd /usr/bin/tensorflow-lite-2.6.0/examples && unshare -f -p ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true
which will end in a segmentation fault, because PID 1 already has resources associated with the NPU.
It's also possible to make another process fail when multiple processes using the NPU share the same namespace-local PID, for example by running the following command in the same directory twice, in two different terminals:
unshare -f -p sh -c './benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true & wait'
The first one will work and the second one will immediately segfault without any message, because they both use the same PID 2 inside their respective namespaces.
I provided a patch that makes the gpu-viv kernel driver use the init namespace PID (globally unique) instead of the current namespace's, which allows applications to run successfully in this case but breaks tools like gmem_info, so it is not acceptable as is. Please advise how to properly fix this issue.
Thank you,
Dominique Martinet