The results I get using GPU (UMat) are worse than on CPU (Mat)

I'm trying to compare the performance of OpenCV algorithms on the CPU and on the GPU, using OpenCL capabilities on the IMX8M.
The results I get using the GPU (UMat) are worse than on the CPU (Mat).
While the tests were running I watched the gputop and top tools. gputop showed some GPU activity during the functions that use the GPU (UMat); however, CPU usage stayed between 95% and 100% for both the CPU and the GPU functions.

I added the following lines to my local.conf file:
IMAGE_INSTALL_append = " \
    gputop \
"
IMAGE_INSTALL_append = " imx-gpu-viv opencv-dev"
IMAGE_INSTALL_append = " opencv opencv-samples"

What could be the problem?

Am I missing some compilation flag?

NXP TechSupport
Can you please share which Linux BSP and which board you are using, so that we can see how much memory your system has? We would like to try to replicate this issue. Which version of OpenCV are you using with your CPU? Are you using any patches for the GPU?

Any details, even an example to reproduce the issue, would be great.

Contributor I


I'm using yocto warrior-fsl-4.19.35-mx8mq-v1.0 from:

My OpenCV version is 4.0.1

Here is the OpenCL info:


I don't have any patches for GPU. 

GPU memory is 256MB.

Code example: 


#include <chrono>
#include <iostream>
#include <opencv2/core.hpp>

int testUMAT()
{
    const int counter = 100;  // number of timed iterations

    bool isImshow = true;     // not used in this excerpt

    cv::Mat testMat(768, 1024, CV_8UC1);
    cv::Mat testNuc(768, 1024, CV_8UC1);

    // Defining GPU matrices
    cv::UMat testMatGpu, testNucGpu, testMatTarget;

    // Randomizing image (8-bit range, upper bound exclusive)
    cv::randu(testMat, 0, 256);
    cv::randu(testNuc, 0, 256);

    // Copy the host data into the UMats. This step does not appear in the
    // snippet as posted; without it the UMats passed to multiply are empty.
    testMat.copyTo(testMatGpu);
    testNuc.copyTo(testNucGpu);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < counter; i++)
        cv::multiply(testMatGpu, testNucGpu, testNucGpu);
    auto end = std::chrono::high_resolution_clock::now();

    // Average time per iteration, in milliseconds
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "End test:" << duration.count() / (1000.0 * counter) << std::endl;
    return 0;
}
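
One thing that may distort a comparison like the one above is one-time setup cost: the first UMat call can include OpenCL kernel compilation and buffer allocation, which would then be averaged into every iteration. A minimal sketch of a timing helper that runs a few untimed warm-up calls first is shown below. The name meanMillis and its parameters are hypothetical (they are not part of OpenCV), and the helper uses only the C++ standard library.

```cpp
#include <chrono>

// Hypothetical helper (not an OpenCV API): calls `fn` `warmup` times without
// timing, then `iters` times with timing, and returns the mean wall-clock
// time per timed call in milliseconds.
template <typename Fn>
double meanMillis(Fn&& fn, int warmup, int iters)
{
    // Untimed calls: absorb one-time costs such as lazy initialization.
    for (int i = 0; i < warmup; ++i)
        fn();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        fn();
    const auto stop = std::chrono::steady_clock::now();

    // Convert the elapsed time to fractional milliseconds.
    const std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / iters;
}
```

In the test above, the lambda passed to meanMillis would wrap the cv::multiply call, once with Mat arguments and once with UMat arguments. Since UMat operations may be queued asynchronously, it is probably also worth calling cv::ocl::finish() before reading the stop time, so that the measurement covers the GPU work itself and not only the time needed to enqueue it.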


