i.MX8MP NPU crash after some delay

Christophe_Couturier

Hello,

I use an nnstreamer pipeline to run a detection model on video streams on an i.MX8MP (scarthgap 6.6.23).

It works fine at first. But after a random delay (from a few tens of seconds to a few minutes), the inference output rate suddenly drops dramatically (typically from 15 to 0.3 FPS on our proprietary model).
At the same time, I observe that the load of the 2nd GPU (GC8000) rises to 100% and stays at 100% until the pipeline stops.
I also notice weird display glitches on the Weston desktop, which also disappear when the pipeline stops.

For IP reasons, I can't share my model, but I managed to reproduce the bug with a YOLOv5 TFLite model publicly available on Kaggle at https://www.kaggle.com/models/kaggle/yolo-v5?select=1.tflite.

Steps to reproduce the problem:

# Download the model from the Kaggle website (https://www.kaggle.com/models/kaggle/yolo-v5?select=1.tflite)
# This provides the 1.tflite file
# Sorry, I did not find a way to get a direct download URL on Kaggle's website. You'll have to download it manually from your browser (no need to register on Kaggle's website)

# Optionally: Enable VX Caching
export VIV_VX_CACHE_BINARY_GRAPH_DIR=/root/.cache/vxdelegate/
export VIV_VX_ENABLE_CACHE_GRAPH_BINARY=1

# Optionally: Enable nnshark
export GST_DEBUG="GST_TRACER:7"
export GST_TRACERS="live"

# Run the pipeline
gst-launch-1.0 videotestsrc \
! video/x-raw, format=YUY2, width=320, height=320, framerate=20/1 \
! queue max-size-buffers=1 max-size-bytes=0 max-size-time=0 leaky=downstream \
! videoconvert n-threads=4 \
! video/x-raw, format=RGB \
! tensor_converter set-timestamp=false \
! tensor_transform mode=dimchg option=0:2 \
! tensor_transform mode=arithmetic option=typecast:float32,div:255 \
! tensor_filter framework=tensorflow-lite model=1.tflite custom=Delegate:External,ExtDelegateLib:libvx_delegate.so \
! fakesink

With this setup, at the beginning, the pipeline infers at a rate of ~0.85 FPS (inference time of 1.195s reported in nnshark) and the GC8000 load is between 10 and 40%.
After a few minutes (about 2 min 30 s on my setup), the bug happens. The rate drops to ~0.05 FPS (inference time of 19.181s) and the GC8000 load is stuck at 100%.
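
As a quick sanity check, the frame rates above should be roughly the reciprocal of the inference times reported by nnshark. A minimal shell sketch of that arithmetic (the helper name fps_from_latency is hypothetical, chosen only for illustration; it is not part of nnshark or nnstreamer):

```shell
# Convert an inference time in seconds to frames per second (FPS = 1 / time).
# fps_from_latency is a hypothetical helper name used only for this example.
fps_from_latency() {
    awk -v t="$1" 'BEGIN { printf "%.3f\n", 1 / t }'
}

fps_from_latency 1.195    # inference time before the slowdown
fps_from_latency 19.181   # inference time after the slowdown
```

This gives about 0.84 and 0.05 FPS, close to the rates reported above, so the drop in output rate is explained by the inference time alone.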

This can be observed:

- either with nnshark
- or directly from the command line:
  - one can see that the GStreamer timestamp progresses much more slowly
  - and the GPU load can be followed in another terminal with: watch -n0.1 cat /sys/kernel/debug/gc/load

Thank you in advance for any help!

Christophe_Couturier

Replying to my own question:

The processor heats up quite a lot when the NPU is under heavy load.

For some reason, my development board was not equipped with a heat-sink.

Installing a heat-sink on the processor seems to have solved my problem. Adding a fan may also be necessary for production use; this remains to be checked.

So, for the moment, I consider my problem solved.
Hopefully, this post can help someone with the same problem.
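
For anyone who wants to check whether the slowdown on their board coincides with a temperature rise, here is a minimal sketch that logs the SoC temperature next to the GPU load. It assumes the standard Linux thermal sysfs interface and that thermal_zone0 is the relevant zone (an assumption; check the zone's type file on your board). The function names are hypothetical, and the GPU load file is printed as-is rather than parsed.

```shell
#!/bin/sh
# Assumption: thermal_zone0 is the SoC thermal zone on your board
# (check with: cat /sys/class/thermal/thermal_zone*/type).
THERMAL_ZONE="${THERMAL_ZONE:-/sys/class/thermal/thermal_zone0}"
GPU_LOAD_FILE="${GPU_LOAD_FILE:-/sys/kernel/debug/gc/load}"

# The kernel reports temperature in millidegrees Celsius; convert to whole degrees.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Print one timestamped sample: temperature, then the raw GPU load text on one line.
sample() {
    raw=$(cat "$1/temp") || return 1
    load=$(tr '\n' ' ' < "$2") || return 1
    echo "$(date +%T) temp=$(millideg_to_c "$raw")C load: $load"
}

# Usage, in a second terminal while the pipeline runs (debugfs needs root):
#   while sample "$THERMAL_ZONE" "$GPU_LOAD_FILE"; do sleep 1; done
```

If the temperature climbs steadily until the moment the GC8000 load sticks at 100%, insufficient cooling is a likely suspect.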

