Hello,
I've got a bit of a weird bug with no simple reproducer (yet), but I'd be interested if someone with access to the source could take a first look and check whether there is some odd check on which PID the current process has, as I cannot see anything else that would explain it.
My setup just starts a single tensorflow/libneuralnetwork application in a container like this, which segfaults 100% of the time during init:
podman run -ti --name camera --replace --env=XDG_RUNTIME_DIR=/run/xdg_home --device=/dev/video3 --device=/dev/galcore --volume=/tmp/xdg_home:/run/xdg_home localhost/demo python3 /root/demo_app/detect_object.py
Whereas just wrapping the command in sh works (note that using 'exec python3' here also segfaults, so it really seems related to being PID 1?):
podman run -ti --name camera --replace --env=XDG_RUNTIME_DIR=/run/xdg_home --device=/dev/video3 --device=/dev/galcore --volume=/tmp/xdg_home:/run/xdg_home localhost/demo sh -c "python3 /root/demo_app/detect_object.py"
Here's a backtrace and some limited traces (registers, disassembly around the fault address) -- note we're using imx-gpu-viv-6.4.3.p1.2-aarch64 (from the 5.10.9_1.0.0 release) on Debian, as newer versions require symbols from glibc 2.33 which is not available for us. The kernel is (similar to) lf-5.10.72-2.2.0.
#0 0x0000ffffa6f36c40 in ?? () from /usr/lib/aarch64-linux-gnu/libGAL.so
#1 0x0000ffffa6f36e94 in ?? () from /usr/lib/aarch64-linux-gnu/libGAL.so
#2 0x0000ffffa6f6cb6c in ?? () from /usr/lib/aarch64-linux-gnu/libGAL.so
#3 0x0000ffffa6f2595c in gcoVX_CreateHW () from /usr/lib/aarch64-linux-gnu/libGAL.so
#4 0x0000ffffa6f25b50 in gcoVX_Construct () from /usr/lib/aarch64-linux-gnu/libGAL.so
#5 0x0000ffffa6f25d7c in gcoVX_SwitchContext () from /usr/lib/aarch64-linux-gnu/libGAL.so
#6 0x0000ffffa804af20 in ?? () from /usr/lib/aarch64-linux-gnu/libOpenVX.so.1
#7 0x0000ffffa8290358 in vsi_nn_CreateContext () from /usr/lib/aarch64-linux-gnu/libovxlib.so.1.1.0
#8 0x0000ffffa86238d4 in nnrt::Execution::Execution(nnrt::Compilation*) () from /usr/lib/aarch64-linux-gnu/libnnrt.so.1.1.9
#9 0x0000ffffa8742454 in ANeuralNetworksExecution_create () from /usr/lib/aarch64-linux-gnu/libneuralnetworks.so.1
#10 0x0000ffffa8ce5204 in tflite::delegate::nnapi::NNAPIDelegateKernel::Invoke(TfLiteContext*, TfLiteNode*, int*) ()
from /usr/lib/python3/dist-packages/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper.so
#11 0x0000ffffa8d02e80 in tflite::Subgraph::Invoke() () from /usr/lib/python3/dist-packages/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper.so
#12 0x0000ffffa8c2c094 in tflite::Interpreter::Invoke() () from /usr/lib/python3/dist-packages/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper.so
#13 0x0000ffffa8c0ee3c in tflite::interpreter_wrapper::InterpreterWrapper::Invoke() ()
from /usr/lib/python3/dist-packages/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper.so
#14 0x0000ffffa8c134d8 in ?? () from /usr/lib/python3/dist-packages/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper.so
#15 0x0000ffffa8c251b8 in ?? () from /usr/lib/python3/dist-packages/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper.so
#16 0x00000000004cac54 in ?? ()
#17 0x00000000004a5300 in _PyObject_MakeTpCall ()
#18 0x00000000004c6cc8 in ?? ()
#19 0x000000000049c258 in _PyEval_EvalFrameDefault ()
#20 0x00000000004b1a48 in _PyFunction_Vectorcall ()
#21 0x0000000000498218 in _PyEval_EvalFrameDefault ()
#22 0x00000000004b1a48 in _PyFunction_Vectorcall ()
#23 0x0000000000498064 in _PyEval_EvalFrameDefault ()
#24 0x00000000004b1a48 in _PyFunction_Vectorcall ()
#25 0x0000000000498064 in _PyEval_EvalFrameDefault ()
#26 0x00000000004964f8 in ?? ()
#27 0x0000000000496290 in _PyEval_EvalCodeWithName ()
#28 0x00000000005976fc in PyEval_EvalCode ()
#29 0x00000000005c850c in ?? ()
#30 0x00000000005c2520 in ?? ()
#31 0x00000000005c8458 in ?? ()
#32 0x00000000005c7c38 in PyRun_SimpleFileExFlags ()
#33 0x00000000005b7afc in Py_RunMain ()
#34 0x0000000000587638 in Py_BytesMain ()
#35 0x0000ffffb0a96218 in __libc_start_main (main=0x587538 <_start+56>, argc=11, argv=0xffffc90bd6e8, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:308
#36 0x0000000000587534 in _start ()
(gdb) info proc map
Mapped address spaces:
...
0xffffa6ec8000 0xffffa7067000 0x19f000 0x0 /usr/lib/aarch64-linux-gnu/libGAL.so
0xffffa7067000 0xffffa7076000 0xf000 0x19f000 /usr/lib/aarch64-linux-gnu/libGAL.so
0xffffa7076000 0xffffa7078000 0x2000 0x19e000 /usr/lib/aarch64-linux-gnu/libGAL.so
0xffffa7078000 0xffffa7089000 0x11000 0x1a0000 /usr/lib/aarch64-linux-gnu/libGAL.so
(gdb) info reg
x0 0x801028a 134283914
x1 0x0 0
x2 0x28a 650
x3 0x2270b600 577811968
x4 0xffffc90bbba8 281474054732712
x5 0x4f3bf83c 1329330236
x6 0x8d4d90 9260432
x7 0x0 0
x8 0x21f5d010 569757712
x9 0x2270b7f0 577812464
x10 0x0 0
x11 0x0 0
x12 0x0 0
x13 0x1 1
x14 0x2 2
x15 0x20 32
x16 0xffffa7076de0 281473484025312
x17 0xffffa6ef19e0 281473482430944
x18 0x0 0
x19 0x22724940 577915200
x20 0x28a 650
x21 0x228f8650 579831376
x22 0x1 1
x23 0x1 1
x24 0x2286a1a0 579248544
x25 0xffff79cfc3a0 281472725402528
x26 0xffffc90bbbdc 281474054732764
x27 0x0 0
x28 0x0 0
x29 0xffffc90bbb20 281474054732576
x30 0xffffa6f36c10 281473482714128
sp 0xffffc90bbb20 0xffffc90bbb20
pc 0xffffa6f36c40 0xffffa6f36c40
cpsr 0x60001000 [ EL=0 SSBS C Z ]
fpsr 0x8000010 134217744
fpcr 0x0 0
(gdb) disass 0x0000ffffa6f36c40, 0x0000ffffa6f36c80
Dump of assembler code from 0xffffa6f36c40 to 0xffffa6f36c80:
=> 0x0000ffffa6f36c40: str w0, [x25], #4
0x0000ffffa6f36c44: adrp x26, 0xffffa703e000
0x0000ffffa6f36c48: add x1, x26, #0xe98
0x0000ffffa6f36c4c: add x0, sp, #0x90
0x0000ffffa6f36c50: add w24, w22, w20
0x0000ffffa6f36c54: sub x23, x5, #0x4
0x0000ffffa6f36c58: mov x26, x25
0x0000ffffa6f36c5c: mov w27, #0xc // #12
0x0000ffffa6f36c60: str x0, [sp, #104]
0x0000ffffa6f36c64: str x1, [sp, #120]
0x0000ffffa6f36c68: b 0xffffa6f36c94
0x0000ffffa6f36c6c: add x0, x21, x2
0x0000ffffa6f36c70: ldr w1, [sp, #100]
0x0000ffffa6f36c74: str w20, [x21, x2]
0x0000ffffa6f36c78: stp w1, w4, [x0, #4]
0x0000ffffa6f36c7c: ldr w1, [x19, #12]
When I have time I'll try to reproduce this on a Yocto build with updated versions, but that might take a while. I'm sorry I can't share the Python code either, so I'll have to start with a reproducer I can share...
Anyway, please let me know if you have an idea!
Hello martinetd,
Have you tried running other demos in this way, for example label_image.py that comes with the default BSP? Does this issue only happen in this case?
Regards
Hello @Bio_TICFSL ,
unfortunately it looks like my statement that this was fixed in the newer (5.10.72-2.2.0) BSP was flawed... because label_image.py used to be an NPU workload and no longer is on the newer BSP! And if the NPU isn't used, it no longer crashes... (I confirmed this with gmem_info now that I've heard about it, and the warm-up and running times are also totally different.)
Using ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true instead does run on the NPU and crashes the same way on both images.
Now, with gmem_info we can observe that the PID listed is the one from inside the container; below is a snapshot where PID 2 does not exist (so no command name is shown) outside the container where the command is running:
bash-5.1# gmem_info
Pid Total Reserved Contiguous Virtual Nonpaged Name
2 10,664,960 10,394,944 270,016 0 0
1 2,547,968 1,966,336 581,632 0 0 /sbin/init
------------------------------------------------------------------------------
2 13,212,928 12,361,280 851,648 0 0 Summary
- - 256,074,176 - - - Available
GPU Idle time: 0.000000 ms
That gave me the idea of running two containers, each with an NPU workload on PID 2, and sure enough the second process to start segfaults as well.
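To illustrate why every container reports the same small PID, here is a minimal standalone sketch (hypothetical demo code, not part of the BSP or of my application) that creates a new PID namespace, much like unshare -f -p or podman do, and prints the two views of the same process:

/* pidns_demo.c -- hypothetical illustration, build with: gcc -o pidns_demo pidns_demo.c
 * Run as root: the child prints its namespace-local PID (1), the parent
 * prints the global PID, which is the only one that is actually unique. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    if (unshare(CLONE_NEWPID) < 0) {   /* needs CAP_SYS_ADMIN */
        perror("unshare(CLONE_NEWPID)");
        return 1;
    }
    pid_t child = fork();              /* first child becomes PID 1 in the new namespace */
    if (child < 0) {
        perror("fork");
        return 1;
    }
    if (child == 0) {
        printf("inside namespace: getpid() = %d\n", getpid());
        return 0;
    }
    printf("from outside: child's global PID = %d\n", child);
    waitpid(child, NULL, 0);
    return 0;
}

Two containers started this way both hand out PIDs starting from 1, so if the driver records the namespace-local number, it will see the same value coming from different containers.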
So:
- presumably the problem isn't in userspace but rather some collision in a shared PID table. Looking at the gpu-viv kernel driver, the PID is used in various places, and updating these to use the global PID table instead of the local namespace's PID might fix the issue. Since the kernel source is available I might try messing with it when I have time later, but registering the issue would be appreciated.
- while updating doesn't fix this issue, I'd still like to update anyway; would you have an answer to my question about glibc?
Thanks!
> - presumably the problem isn't in userspace but rather some collision in a shared PID table. Looking at the gpu-viv kernel driver, the PID is used in various places, and updating these to use the global PID table instead of the local namespace's PID might fix the issue. Since the kernel source is available I might try messing with it when I have time later, but registering the issue would be appreciated.
So I can at least confirm this lead: the ugly patch below, which uses the init namespace PID instead of the namespace-local PID, allows my application to start as PID 1:
diff --git a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
index 852b2f552460..322045fbd2cb 100644
--- a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
+++ b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
@@ -97,7 +97,7 @@ typedef va_list gctARGUMENTS;
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
# define gcmkGETPROCESSID() \
- task_tgid_vnr(current)
+ task_tgid_nr(current)
#else
# define gcmkGETPROCESSID() \
current->tgid
diff --git a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
index a436edb11d9a..57b0629569aa 100644
--- a/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
+++ b/drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
@@ -330,7 +330,7 @@ _GetProcessID(
)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
- return task_tgid_vnr(current);
+ return task_tgid_nr(current);
#else
return current->tgid;
#endif
Unfortunately it also breaks gmem_info (I'd have assumed the PIDs would just be a bit off, but nothing shows at all), so something a bit less absolute is required to map each PID to the right value in the right places...
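One direction that might be less absolute (only a sketch against the generic kernel PID API; the struct and helper names are hypothetical and I haven't wired this into the gpu-viv code): keep a reference to the task's struct pid instead of a raw number, key the internal tables on the globally unique value, and translate back to a namespace-local number only when formatting output such as gmem_info:

/* Sketch only -- not actual gpu-viv driver code. */
#include <linux/pid.h>
#include <linux/sched.h>

struct gc_process_record {
    struct pid *tgid;   /* reference-counted, namespace-agnostic handle */
};

static void gc_record_current(struct gc_process_record *rec)
{
    /* Take a reference to the thread-group pid of the calling task. */
    rec->tgid = get_task_pid(current, PIDTYPE_TGID);
}

static pid_t gc_record_global_pid(struct gc_process_record *rec)
{
    /* Globally unique number: safe to use as a lookup key. */
    return pid_nr(rec->tgid);
}

static pid_t gc_record_pid_for_reader(struct gc_process_record *rec)
{
    /* Number as seen from the PID namespace of the task asking,
     * e.g. the process reading the gmem_info debug file. */
    return pid_vnr(rec->tgid);
}

static void gc_record_release(struct gc_process_record *rec)
{
    put_pid(rec->tgid);
    rec->tgid = NULL;
}

Since pid_vnr() resolves relative to the namespace of the caller, gmem_info run on the host or inside a container should then both show numbers that make sense locally.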
Thanks for your suggestion!
I was able to run label_image.py from the lf-5.10.72-2.2.0 bsp with our kernel, so the update might have fixed it.
Unfortunately, the libGAL.so in lf-5.10.72-2.2.0 is not usable in our containers: it depends on glibc 2.33, which is too recent for Debian bullseye, which we use.
As these are shipped only as binaries I'm out of luck there; I'll try with an older BSP to confirm the problem lies in the old version and isn't something else in my environment.
Not being able to update is a problem regardless of this bug, though: would you happen to have any suggestion on how to run imx-gpu-viv-6.4.3.p2.4-aarch64 or similar newer versions on Debian? If sources were supplied we would be happy to rebuild everything, but there's not much we can do with binaries...
Hello,
Does this problem only happen on Debian? Debian is not supported, only Yocto is, and on Yocto it works.
Regards
Hello @Bio_TICFSL , thanks for the reply!
You've replied here to one of my older posts, and there are two issues at hand, let me recap:
- the missing glibc symbol when trying to use the latest gpu-viv userspace binaries on Debian is definitely a Debian problem (it could also happen on an older Yocto, but that would be just as unsupported). I'd appreciate help here, but it's a bit off topic and, as you pointed out, not supported; please consider it more of a user request to eventually broaden support, as there are users. Let's drop that issue here; I'll open a new thread after a bit more testing, as it's best to keep issues separate.
- (there's also a side bug that label_image.py in the BSP used to do an NPU computation but now is 100% CPU; that made me wrongly think the bug was not present in the newer BSP, but it definitely is when running with benchmark_model --use_nnapi=true. I don't particularly care about this bug, but you might, as performance is quite different and it might give a false impression of the NPU being slow.)
- the segfault when using NPU applications when there is a PID collision - I've been using the BSP image inside containers as is (imx-image-full-imx8mpevk.tar.bz2 from LF_v5.10.72-2.2.0_images_IMX8MPEVK). I have a recompiled kernel for full container support, but it's basically the same as the one provided. Now that I've trimmed down the issue, I think you should be able to reproduce it with the following command on the stock kernel:
cd /usr/bin/tensorflow-lite-2.6.0/examples && unshare -f -p ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true
which will end in a segmentation fault, because PID 1 already has resources associated with the NPU.
It's also possible to make another process fail when multiple processes using the NPU share the same namespace-local PID, for example by running the following command in the same directory twice, in two different terminals:
unshare -f -p sh -c './benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true & wait'
The first one will work and the second one will immediately segfault without any message, because they both use the same PID 2 inside their respective namespaces.
I provided a patch that makes the gpu-viv kernel driver use the init namespace PID (globally unique) instead of the current namespace's, which allows applications to run successfully in this case but breaks tools like gmem_info, so it is not acceptable as is. Please advise how to properly fix this issue.
Thank you,
Dominique Martinet