Hi,
since two weeks I'm trying to get OpenCL working on Sabre SDB Board. (SABRE for Smart Devices Reference Design)
There is one PDF "White Paper | Get started with OpenCL on i.MX6" (https://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CCQQFjAA&url=http...)
I tried to get through it: The board support package referenced there seems to be not existent any more and some dependencies are brocken. I took latest BSP instead.
I tried installing ltib on Ubuntu 14.04: After two days and many issues finally get it compiling. Failed at the very end and image building because some kernel files which are supposed to be on freescale server were not existent any more. The ltib script was not able to get them from other server either: they are just not there.
I have installed the Vivante SDK on Windows with emulator and can compile and run opencl example there.
I have a running Yocto on Sabre Board, kernel 3.10.53 and I have another "chroot" environment with gcc,cmake and other build stuff. I can compile and run arbitrary code there as long as it doesn't use OpenCL!
As soon as I include libOpenCL.so it seems to have many dependencies which are not clear: libVIVANTE.so, libGAL.so, and some wayland stuff.
After another couple of days I got the OpenCL "hello world" from white paper compiled (after fixing 5-6 issues there) but if I start it I get error message:
My question is rather simple: Is there anybody who could provide complete SDCARD image for the mentioned board with working OpenCL?. A filesystem for chroot using kernel 3.10.53 would also be a great help.
Just for reference here are some md5sum hashcodes of headers and libs I've tried:
ab7cd3a124af4381f3a66d8ac3bacbb0 ./media/mmcblk3p1/usr/include/CL/cl.hpp
801cd16dda62704b450865d4e752a9c5 ./media/mmcblk1p1/usr/include/CL_orig/cl.hpp
7625c9f38bbca0e97dbf7c272d53c219 ./media/mmcblk1p1/home/data/gpu_viv/opt/fsl/include/CL/cl.hpp
698841f89f5e50f942eb34e70b2de929 ./media/mmcblk3p1/usr/lib/libOpenCL.so
1fe9a4875d55bf041a0a382234cb77ea ./media/mmcblk1p1/home/data/gpu_viv/opt/fsl/lib/libOpenCL.so
Solved! Go to Solution.
Hi Alexander,
As you said the kernel is not optimized for GPU, since it generates a lot of overhead, so this is why you are facing a poor performance. Actually with my experience you need to get rid of all your for or while loops, you need to unroll the loops and take advantage of the paralelism. The limitations in our OpenCL profile is related to atomics, kernel size (number of instructions) and the copy when reading the data back from gpu memory to a buffer.
Regards,
Andre
Unfortunatelly I can't share the complete source code.
Hi Alexander,
As you said the kernel is not optimized for GPU, since it generates a lot of overhead, so this is why you are facing a poor performance. Actually with my experience you need to get rid of all your for or while loops, you need to unroll the loops and take advantage of the paralelism. The limitations in our OpenCL profile is related to atomics, kernel size (number of instructions) and the copy when reading the data back from gpu memory to a buffer.
Regards,
Andre
This code search local maxima in a grey image. Each kernel (work item) process a tile of size 64x64 pixels.
Parameters are set as follows:
tileSizeBits=6
int tileSize=(1<<tileSizeBits); //64 |
imgWidth=1280
imgHeight=1024
outTileSize=512
bg_limit=5
d=3
// intermediate params
global[0] = imgWidth/tileSize;
global[1] = imgHeight/tileSize;
totalTilesNum=(int)(global[0]*global[1]);
outBufferSize = totalTilesNum*outTileSize*sizeof(short);
input=pointer to memory of size 1280x1024 bytes (grey image)
output=pointer to memory of size outBufferSize
I suppose my kernel above is not optimal for OpenCL and GPU execution.
I've tested another kernel (see below) and it looks like GPU is involved.
The CPU load is 50% (2 cores out of 4 are full) with OpenCL computation and 100% (all 4 cores are full) with CPU computation.
Power consumption is also different: 19V,320mA with GPU, 250mA with CPU.
Performance depends on number of work items. Sometimes CPU is faster, sometimes GPU.
To run kernel below in GPU 256 times I need about 15 seconds. If running it on CPU then 26 seconds.
///----------------------------------------
__kernel void finddots ( int tileSizeBits,
int imgWidth,
int imgHeight,
int outTileSize,
int bg_limit,
int d,
__global uchar *input,
__global ushort *output
)
{
uint tileID = (uint)(get_global_size(0)*get_global_id(1)+get_global_id(0));
int i;
float4 dd1, dd2, dd3, dd4, dd5;
dd2.x=(float)tileID;
dd2.y=(float)tileID;
dd2.z=(float)tileID;
dd2.w=(float)tileID;
dd1=dd2;
dd3=dd2;
dd4=dd2;
dd5=dd2;
for(i=0; i<10000000;i++)
{
dd3 += dd2*dd5;
dd2 += dd1*dd4;
dd4 += dd3*dd1;
dd1 += dd3*dd2;
dd3 += dd2*dd5;
dd2 += dd1*dd4;
dd4 += dd3*dd1;
dd1 += dd3*dd2;
}
((__global float*)output)[tileID]=(dd1.x);
}
__kernel void finddots_2 (
int tileSizeBits,
int imgWidth,
int imgHeight,
int outTileSize,
int bg_limit,
int d,
__global uchar *input,
__global ushort *output
)
{
if (get_work_dim()!=2) return;
// Tile ID
uint tileID = (uint)(get_global_size(0)*get_global_id(1)+get_global_id(0));
#ifdef ECL
printf("tileID %d\n",tileID);
#endif
__global ushort * tileOutPtr=output+(tileID*outTileSize);
uint rowPitch=imgWidth;
// we start with 1 because we will store the length in the [0] component
uint valuesStored=1;
uint limitValuesStored=(outTileSize-sizeof(ushort))/2/sizeof(ushort)-1;
int x_begin=max((int)(get_global_id(0)<<tileSizeBits),(int)d);
int y_begin=max((int)(get_global_id(1)<<tileSizeBits),(int)d);
int x_end=min((int)((get_global_id(0)+1)<<tileSizeBits),(int)(imgWidth-d));
int y_end=min((int)((get_global_id(1)+1)<<tileSizeBits),(int)(imgHeight-d));
//printf("Range %d-%d,%d-%d\n",x_begin,x_end,y_begin,y_end);
uint tileMask=(1<<tileSizeBits)-1;
for (int y=y_begin;y<y_end;y++)
{
for (int x=x_begin;x<x_end;x++)
{
uchar xy0 = input[rowPitch*y+x];
bool candSearchOk=true;
if (xy0>bg_limit)
{
for (int yi=y-d;(yi<=y+d) && candSearchOk;yi++)
for (int xi=x-d;xi<=x+d;xi++)
{
uchar xyi=input[rowPitch*yi+xi];
if (xyi>xy0)
{
candSearchOk=false;
break;
}
else
if (xyi==xy0)
{
// depends where it is
if (!( // same pixel
((yi==y) && (xi==x)) ||
// its the direct bottom neighbor or
( (yi-y==1) && (xi==x) ) ||
// one of the three right neighbors
( ( (yi-y==-1) || (yi-y==0) || (yi-y==1) ) && ( xi-x==1) )
))
{
candSearchOk=false;
break;
}
}
}
if (candSearchOk)
{
// pixel found->refine and store
// useThisDot=true
// todo: refinement
int subPixelX=0;
int subPixelY=0;
#ifdef ECL
printf(" candidate at %d,%d \n",x,y);
#endif
tileOutPtr[valuesStored] = (ushort)(((x&tileMask)<<((sizeof(ushort)*8)-tileSizeBits))+subPixelX);
tileOutPtr[valuesStored+1] = (ushort)(((y&tileMask)<<((sizeof(ushort)*8)-tileSizeBits))+subPixelY);
valuesStored+=2;
if (valuesStored>=limitValuesStored)
{
x=x_end;
y=y_end;
break;
}
}
}
}
}
// store bytes stored by this work item
tileOutPtr[0] = (ushort)(valuesStored*sizeof(ushort));
}
My OpenCl test application works now but I'm really disappointed with the OpenCL performance. I even suppose that GPU is not involved in OpenCL execution. I have low CPU usage if I have trivial OpenCL kernel. But if put more code inside OpenCL kernel my CPU usage is going higher and higher.
Executing the same kernel on CPU (by compiling it like normal C code) gives even better performance with lower CPU usage. This doesn't make sense to me.
Also there is a strange behaviour with OpenCL kernel complexity. Even if I have one thread for OpenCL my CPU can achive 100% utilization of all 4 cores.
For me it's indicator that OpenCL performs kernel execution on CPU.
Is there a way to check GPU load? Or clearly identify that GPU is involved in OpenCL kernel execution?
Hi Alexander,
can you past your kernel source ? I want to take a look.
thanks,
Andre
Actually, if possible, share your entire application, I want to take a look and test myself.
cheers,
Andr
AndreSilva do you know if there is any sdcard image with this implementation available?.
thanks and regards
Hi there,
have you added the CORE_IMAGE_EXTRA_INSTALL += "imx-gpu-viv imx-gpu-viv-dev" in your local.conf ?
cheers.
Andre
Thanks for response. Where the local.conf file is supposed to be? It's not in my ltib folder and is not on my yocto sd card.
check out this how to in the imxcv blog, you will find the answer:
http://imxcv.blogspot.com.br/2014/08/onboard-camera-v4l-wrapper-with-yocto.html
regards,
Andre
Thanks for Your help. I'm new to kernel compiling so I didn't know what local.conf is for.
At the moment I'm trying another way: After booting I do chroot to the ubuntu oneric image provided by freescale. There I do the compilation of example OpenCL programm directly on the board.
But if I start the application it always crashes with segfault if program tries to enqueue write or read buffers for the opencl shader. I tested it with two different example "hello world" programms and both do this seg fault.
One of the this works perfectly on Windows with VIVANTE SDK and emulator, so the problem is not in the code. I suppose some GPU drivers are missing. Is there a way to check whether GPU drivers are installed?
The gpu driver should be already installed on kernel (built-in module), if not you can just modprobe galcore, other thing is to check if the CL libraries are in place. so you can check that by ls /usr/lib/*CL.so
regards
Andre
I've got CL working now. I've found another linux system on Sabre board which does not have dependencies to wayland and don't need wayland libraries. This works fine if used with chroot after booting kernel from yocto. The Ubuntu builds or Yocto builds seems to have strange dependencies of OpenCL to wayland and OpenCL crashes (at least on my board).
By the way I found some strange behaviour of "hello world" example given in the OpenCL white paper. The 2-dimensional buffer is allocated as array of pointers (one for each row) where each then points to column data. The OpenCL kernel seems to expect linear array and computes linear index from x,y. How this works together? In my tests every attempt to change OpenCL kernel ended with crash until I've changed the buffer to linear array.
the OpenCL kernel "converts" the two array input in a linear array. It is the easiest way to deal with the data.
cheers,
Andre