Hi @woohyoungshin,
Sorry for the delayed reply.
I have been working on your case, and I found that with our latest BSP release the benchmarks for the iMX8MP show the following:
For CPU using 4 cores:

For NPU:

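In case you want to reproduce the comparison on your side, the TensorFlow Lite benchmark_model tool included in the BSP can run the model on both compute units. A minimal sketch of the two runs (the model file name and the tool location below are placeholders; adjust them to your image):

```bash
# CPU run using 4 threads
./benchmark_model --graph=model.tflite --num_threads=4

# NPU run through the VX delegate (library path as documented in the
# i.MX Machine Learning User's Guide; please verify it on your image)
./benchmark_model --graph=model.tflite \
  --external_delegate_path=/usr/lib/libvx_delegate.so
```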
With these results we can see a time decrease from 1.427 seconds on the CPU to 121.617 milliseconds on the NPU (the NPU is roughly 11.7x faster).
In addition, after the ONNX to TFLite conversion we can see that there are many TRANSPOSE and CONV_2D operators that significantly affect CPU inference time. Here is the op profiling:

We can see that CONV_2D and TRANSPOSE account for around 1 second of the inference time.
In contrast, those operators are fully supported on the NPU, where they run hardware-accelerated.
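If you would like to reproduce this per-operator breakdown yourself, benchmark_model can print it as well; a sketch, assuming the same placeholder model name as above:

```bash
# Per-operator profiling on the CPU: the summary section reports the
# time spent in each operator type (CONV_2D, TRANSPOSE, etc.)
./benchmark_model --graph=model.tflite --num_threads=4 \
  --enable_op_profiling=true
```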
Based on the release notes for iMX Machine Learning, fixes for those operators were introduced between your BSP version and the latest BSP release.
Therefore, I would like to suggest upgrading your BSP version to the latest release and testing your model.
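As a quick check before and after the upgrade, you can confirm which release the board is currently running, for example:

```bash
# Print the distribution/BSP release information of the running image
cat /etc/os-release
```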
I hope this information will be helpful.
Have a great day!