Dlib detector call is too slow on IMX8QXP

I'm trying to run the dlib "webcam_face_pose_ex" example on an IMX8QXP board.
The code below takes approximately 2161 milliseconds, and CPU usage is above 120%.

cv_image<bgr_pixel> cimg(temp);
std::vector<rectangle> faces = detector(cimg, 0);
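
For anyone who wants to reproduce the timing, here is a minimal sketch of how a number like this can be measured with std::chrono. The helper name time_ms is hypothetical (it is not part of dlib), and the usage comment assumes detector and cimg are set up as in the snippet above.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not part of dlib): runs `fn` once and returns the
// elapsed wall-clock time in milliseconds, using a monotonic clock.
template <typename Fn>
double time_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Usage with the detector call (assumes `detector` and `cimg` already exist):
//   std::vector<rectangle> faces;
//   double ms = time_ms([&] { faces = detector(cimg, 0); });
```

Timing a single call can be noisy, so averaging over several frames would probably give a steadier figure.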

On a Raspberry Pi 3, the same code takes 675 milliseconds (just for reference).

IMX8QXP:
CPU: 4 x Cortex-A35
RAM: 3 GB
OS: 64-bit Linux

Raspberry Pi 3:
CPU: 4 x Cortex-A53
RAM: 1 GB
OS: 32-bit Raspbian Buster

I have tried compiler flags such as -mcpu and others, but nothing helped.
I also tried the OpenBLAS library, with no improvement on the IMX8.

I have a high-end CPU, yet performance is still lagging.
How can I improve the performance?