libusb1 transfer stall when doing NPU computation on i.mx8mp

sundermeyer
Contributor I

On my custom i.MX8M+ board, I have two processes running. One continuously acquires images from a Basler USB3 Vision camera (da3840-45uc) using Pylon or Aravis (tested both). This process runs fine alone. Then we have another process that continuously executes a TensorFlow Lite model using the `libvx_delegate.so` TFLite delegate for NPU acceleration.

The problem:
Whenever those programs are executed concurrently, we first get a LIBUSB_TRANSFER_STALL and then only LIBUSB_TRANSFER_ERROR messages instead of successfully acquired image data. 

Neither program exchanges data with the other. Both are lightweight in terms of I/O bandwidth, memory bandwidth and CPU load (I have checked that). Both are written in Python (3.10). The kernel version is 6.1; the kernel source is the Freescale community BSP (meta-freescale).

I have played with the nice values of the two tasks, set different interrupt affinities, monitored the memory bandwidth using perf, and tried different kernel versions, different OpenEmbedded userspaces (kirkstone, mickledore), and different galcore versions.
---> No success so far.
The error doesn't occur on the NXP i.MX8M Plus LPDDR4 EVK.

I am now desperately seeking advice and new debugging ideas.


brian14
NXP TechSupport

Hi @sundermeyer

Thank you for contacting NXP Support.

Could you please share the GStreamer pipeline you use to implement your USB camera?

For debugging purposes, you can try to reduce the frame rate and resolution to the lowest possible values and run both tasks again.

Please try and tell me your results.

sundermeyer
Contributor I

Hi @brian14 ,

thank you for your quick reply.

I don't have a GStreamer pipeline implemented. For the video stream I currently use Aravis; Pylon/pypylon and GenICam Harvesters show the same behavior. It is really just a basic loop:

import gi
import signal

gi.require_version('Aravis', '0.8')

from gi.repository import Aravis

class SIGINT_handler():
    def __init__(self):
        self.SIGINT = False

    def signal_handler(self, signal, frame):
        print('You pressed Ctrl+C!')
        self.SIGINT = True

handler = SIGINT_handler()
signal.signal(signal.SIGINT, handler.signal_handler)

camera = Aravis.Camera.new(None)  # connect to the first available camera
if not camera:
    raise IOError("No camera found.")

camera.set_region(0, 0, 2160, 1620)
camera.set_frame_rate(20.0)

stream = camera.create_stream()
payload = camera.get_payload()

# Allocate 10 buffers
for i in range(10):
    stream.push_buffer(Aravis.Buffer.new(payload))

camera.start_acquisition()

while not handler.SIGINT:
    buffer = stream.timeout_pop_buffer(1000000)
    if buffer is not None:
        print("Buffer {0}x{1} {2}".format(buffer.get_image_width(), buffer.get_image_height(), buffer))
        stream.push_buffer(buffer)
    else:
        print("buffer is None")

camera.stop_acquisition()


For the inference task I also have just a basic program:

import signal, time
import numpy as np
import tflite_runtime.interpreter as tfl

class SIGINT_handler():
    def __init__(self):
        self.SIGINT = False

    def signal_handler(self, signal, frame):
        print('You pressed Ctrl+C!')
        self.SIGINT = True
        
handler = SIGINT_handler()
signal.signal(signal.SIGINT, handler.signal_handler)

interpreter = tfl.Interpreter(
    "mobilenet_v1_1.0_224_quant.tflite",  
    experimental_delegates=[tfl.load_delegate("/usr/lib/libvx_delegate.so")]  
)
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
interpreter.allocate_tensors()
images_shape = input_details[0]['shape']

def process_image():
    dummy_data = np.random.uniform(0.0, 255.0, images_shape).astype('uint8')
    interpreter.set_tensor(input_details[0]['index'], dummy_data)
    interpreter.invoke()
    for od in output_details:
        _ = interpreter.get_tensor(od['index'])  # fetch outputs, discard
# warm up
process_image()

while not handler.SIGINT:
    time.sleep(0.2)
    process_image()


It uses the model from the label_image.py example that is shipped with TensorFlow Lite.

The output of the camera thread with ARV_DEBUG="all:2" is:


[01:56:13.568] 🅸 stream> SIRM_INFO             = 0x02000000
[01:56:13.568] 🅸 stream> SIRM_REQ_PAYLOAD_SIZE = 0x00000000000d5930
[01:56:13.568] 🅸 stream> SIRM_REQ_LEADER_SIZE  = 0x00000400
[01:56:13.568] 🅸 stream> SIRM_REQ_TRAILER_SIZE = 0x00000400
[01:56:13.568] 🅸 stream> Required alignment    = 4
[01:56:13.570] 🅸 stream> SIRM_PAYLOAD_SIZE     = 0x000d5930
[01:56:13.570] 🅸 stream> SIRM_PAYLOAD_COUNT    = 0x00000001
[01:56:13.570] 🅸 stream> SIRM_TRANSFER1_SIZE   = 0x00000000
[01:56:13.570] 🅸 stream> SIRM_TRANSFER2_SIZE   = 0x00000000
[01:56:13.570] 🅸 stream> SIRM_MAX_LEADER_SIZE  = 0x00000400
[01:56:13.570] 🅸 stream> SIRM_MAX_TRAILER_SIZE = 0x00000400
[01:56:13.571] 🅸 stream-thread> Start async USB3Vision stream thread
[01:56:13.614] 🅸 device> [UvDevice::write_memory] Try 1/5: unexpected answer (0x0000)
[01:56:16.939] 🆆 stream-thread> Payload transfer failed (LIBUSB_TRANSFER_STALL)
[01:56:16.939] 🆆 stream-thread> Trailer transfer failed (LIBUSB_TRANSFER_ERROR)
[01:56:16.940] 🆆 stream-thread> Leader transfer failed (LIBUSB_TRANSFER_ERROR)
[01:56:16.941] 🆆 stream-thread> Payload transfer failed (LIBUSB_TRANSFER_ERROR)
[01:56:16.941] 🆆 stream-thread> Trailer transfer failed (LIBUSB_TRANSFER_ERROR)
[01:56:16.942] 🆆 stream-thread> Leader transfer failed (LIBUSB_TRANSFER_ERROR)
[01:56:16.942] 🆆 stream-thread> Payload transfer failed (LIBUSB_TRANSFER_ERROR)


More about the error messages from pypylon and the usbmon output can be found in this issue.

When I reduce the image resolution, the error doesn't occur immediately. E.g. at 300x300 @ 20 fps it takes 30 seconds until we receive a USB transfer stall while the inference process is running with a 200 ms sleep time. When we reduce the size further (100x100 @ 20 fps), we can run both tasks for 10 minutes without the error.
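To put the three configurations in perspective, here is a rough pixel-rate comparison (pure arithmetic; the absolute byte rate depends on the camera's pixel format):

```python
# Relative payload rate of the three tested configurations.
# Bytes/s depends on the pixel format (bytes per pixel), so this
# only shows how steeply the load drops with resolution.
fps = 20
for w, h in [(2160, 1620), (300, 300), (100, 100)]:
    print(f"{w}x{h} @ {fps} fps -> {w * h * fps / 1e6:.1f} Mpixel/s")
```

So the 100x100 case pushes roughly 350x less data than the full-resolution case, which fits the observation that the stall takes much longer to appear.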

I believe this is a timing issue or some resource contention between the USB and NPU drivers. How can I get more debugging information out of those drivers?
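One more knob I can turn on the USB side: libusb itself honors the LIBUSB_DEBUG environment variable, so the camera loop can be launched with maximum libusb and Aravis verbosity (a sketch; `camera_loop.py` is a placeholder name for the Aravis script above):

```python
import os
import subprocess

# Build an environment with verbose USB-level logging enabled.
env = dict(os.environ)
env["LIBUSB_DEBUG"] = "4"   # libusb log level: 4 = debug
env["ARV_DEBUG"] = "all:3"  # raise the Aravis log level as well

# Hypothetical script name -- replace with the actual camera loop:
# subprocess.run(["python3", "camera_loop.py"], env=env)
```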

Thanks and Regards.

brian14
NXP TechSupport

Hi @sundermeyer

Thank you for your detailed information.

Unfortunately, I'm not familiar with Aravis, so I can't give you specific advice regarding that library. However, I suggest you implement a GStreamer pipeline and test different resolutions and frame rates.
Based on my experience, this looks like a bandwidth issue. We usually see that kind of problem with USB cameras. For example, I have worked with two cameras streaming simultaneously through a GStreamer pipeline and noticed that after a short time the stream paused with the message "Failed to allocate required memory".
It could be a good exercise to run your application with a MIPI CSI camera while executing your machine learning model in parallel.

Please try and tell me if it works.

Have a great day!

sundermeyer
Contributor I

Hi @brian14 

Thank you for your help.

I installed the gst-plugin-pylon and tested a simple pipeline:


gst-launch-1.0 pylonsrc capture-error=skip ! "video/x-raw,width=1920,height=1080,framerate=20/1,format=RGB" ! videoconvert ! autovideosink



As in the previous tests, every two seconds we get one of the following pylon errors:

- Read operation failed.

- Payload data has been discarded. Payload data can be discarded by the camera device if the available bandwidth is insufficient.

- Empty error description

- The current block ID must be larger than the previous block ID.

This again happens only during NPU computation.
Regarding memory and I/O bandwidth:
- 1920 * 1080 * 3 * 20 B/s is not much, and the error also occurs with smaller resolutions like 480x640 @ 20 fps (though not as frequently).
- I have checked the performance counters for memory reads and writes with camera process + model process and with camera process + stress-ng. We can easily generate more load with stress-ng, yet the camera process works fine. Maybe I can do something similar for generating I/O load, but honestly, I have no idea about the internals of the i.MX processors.
- I also tested a UVC webcam with high resolution and frame rate, and it seems to do fine.
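For reference, the raw data rate of the pipeline above compared with the USB 3.0 SuperSpeed capacity (rough arithmetic; real throughput is lower after protocol overhead):

```python
# Raw RGB data rate of the 1920x1080 @ 20 fps pipeline above.
camera_rate = 1920 * 1080 * 3 * 20  # bytes/s, 3 bytes per RGB pixel

# USB 3.0 SuperSpeed: 5 Gbit/s signalling, 8b/10b encoding
# leaves 4 Gbit/s of payload capacity before protocol overhead.
usb3_payload = 5e9 * 8 / 10 / 8     # bytes/s

print(f"camera: {camera_rate / 1e6:.1f} MB/s")  # 124.4 MB/s
print(f"USB3:   {usb3_payload / 1e6:.0f} MB/s") # 500 MB/s
```

So even the full-HD RGB stream uses only about a quarter of the link's payload capacity, which supports the point that raw bandwidth alone shouldn't be the limit here.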


Regarding MIPI-CSI, there are other issues, so I can't test that currently.

It is also likely that the error is in our hardware, since the tests work on the NXP EVK. So, unless there are further ideas regarding the kernel as the cause, I would close this issue for now.

Regards
Robert
