I'm trying to compare the performance of OpenCV algorithms on the CPU and on the GPU, using the OpenCL capabilities of the IMX8M.
The results I get on the GPU (UMat) are worse than on the CPU (Mat).
I watched gputop and top while the tests were running. There was some GPU activity while the UMat functions ran, but CPU load stayed between 95% and 100% for both the Mat and the UMat functions.
I added the following lines to my local.conf file:
IMAGE_INSTALL_append = " \
gputop \
"
IMAGE_INSTALL_append = " imx-gpu-viv opencv-dev"
IMAGE_INSTALL_append = " opencv opencv-samples"
What could be the problem?
Am I missing some compilation flag?
OpenCV in the i.MX Linux BSP uses the ARM NEON SIMD extension and not the GPU, which is why the CPU load is so high. I think that when you force the work onto the GPU with UMat, the GPU provides only limited support and the rest is still done on the CPUs. If you work with Mat, the CPUs and NEON are used, and this seems to be more efficient.
Hello,
Can you please share which Linux BSP and which board you are using, so that we can see how much memory your system has? We would like to try to reproduce this issue. Which version of OpenCV are you using on the CPU? Are you using any patches for the GPU?
Any details, even a short example to reproduce the issue, would be great.
Hi,
I'm using Yocto warrior-fsl-4.19.35-mx8mq-v1.0 from:
https://github.com/varigit/variscite-bsp-platform.git
My OpenCV version is 4.0.1
Here is the OpenCL info:
I don't have any patches for GPU.
GPU memory is 256MB.
Code example:
#include <chrono>
#include <cmath>
#include <iostream>
#include <opencv2/core.hpp>

int testUMAT()
{
    const int counter = 100;  // number of timed iterations

    // Host (CPU) matrices
    cv::Mat testMat(768, 1024, CV_8UC1);
    cv::Mat testNuc(768, 1024, CV_8UC1);
    // Defining GPU matrices
    cv::UMat testMatGpu, testNucGpu, testMatTarget;

    // Randomizing image (the upper bound 256 is exclusive)
    cv::randu(testMat, 0, (int)std::pow(2, 8));
    cv::randu(testNuc, 0, (int)std::pow(2, 8));
    testMat.copyTo(testMatGpu);
    testNuc.copyTo(testNucGpu);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < counter; i++)
    {
        cv::multiply(testMatGpu, testNucGpu, testNucGpu);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    // Average time per iteration, in milliseconds
    std::cout << "End test:" << duration.count() / (1000.0 * counter) << std::endl;
    return 0;
}
Thanks!
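A note on the measurement itself: the loop in the example times all 100 calls as one block, so any one-time cost on the first UMat call (OpenCV builds its OpenCL kernels at run time) and any work still queued on the device when the timer stops can both distort the average. The sketch below is one possible way to separate those effects. It is plain standard C++ with no OpenCV dependency, and the names TimingStats and timeOperation, as well as the warm-up count, are hypothetical choices for illustration, not part of OpenCV or the BSP.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper type (not an OpenCV API): per-iteration timings in ms.
struct TimingStats {
    double meanMs;
    double medianMs;
    double minMs;
};

// Runs `op` `warmup` times without timing it, then `iterations` times with
// timing. `op` must block until its work has really finished; otherwise only
// the cost of enqueueing the work is measured.
inline TimingStats timeOperation(const std::function<void()>& op,
                                 int warmup, int iterations)
{
    if (iterations <= 0) {
        return {0.0, 0.0, 0.0};  // nothing to measure
    }

    // Warm-up runs absorb one-time costs such as kernel compilation.
    for (int i = 0; i < warmup; ++i) {
        op();
    }

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        op();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    // Sort once so that minimum and median can be read off directly.
    std::sort(samples.begin(), samples.end());
    double sum = 0.0;
    for (double s : samples) {
        sum += s;
    }
    const std::size_t n = samples.size();
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);

    return {sum / static_cast<double>(n), median, samples.front()};
}
```

With the code from this thread, `op` could be a lambda that calls cv::multiply on the UMat operands and then cv::ocl::finish() (declared in opencv2/core/ocl.hpp) so that each sample includes the GPU work, for example timeOperation([&] { cv::multiply(testMatGpu, testNucGpu, testMatTarget); cv::ocl::finish(); }, 5, 100). Running the same helper on the Mat version gives numbers that are easier to compare. This only makes the measurement fairer; it does not by itself explain or remove the high CPU load.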