Need Help with SD card image with working gcc, OpenCL and example of using OpenCL shader

cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 

Need Help with SD card image with working gcc, OpenCL and example of using OpenCL shader

Jump to solution
2,395 Views
alexanderkramer
Contributor II

Hi,

since two weeks I'm trying to get OpenCL working on Sabre SDB Board. (SABRE for Smart Devices Reference Design)

There is one PDF "White Paper | Get started with OpenCL on i.MX6"  (https://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CCQQFjAA&url=http...)

I tried to get through it: The board support package referenced there seems to be not existent any more and some dependencies are brocken. I took latest BSP instead.

I tried installing ltib on Ubuntu 14.04: After two days and many issues finally get it compiling. Failed at the very end and image building because some kernel files which are supposed to be on freescale server were not existent any more. The ltib script was not able to get them from other server either: they are just not there.

I have installed the Vivante SDK on Windows with emulator and can compile and run opencl example there.

I have a running Yocto on Sabre Board, kernel 3.10.53 and I have another "chroot" environment with gcc,cmake and other build stuff. I can compile and run arbitrary code there as long as it doesn't use OpenCL!

As soon as I include libOpenCL.so it seems to have many dependencies which are not clear: libVIVANTE.so, libGAL.so, and some wayland stuff.

After another couple of days I got the OpenCL "hello world" from white paper compiled  (after fixing 5-6 issues there) but if I start it I get error message:

      • libgc_wayland_protocol.so.0 undefined symbol wl_registry_interface

My question is rather simple: Is there anybody who could provide complete SDCARD image for the mentioned board with working OpenCL?. A filesystem for chroot using kernel 3.10.53 would also be a great help.

Just for reference here are some md5sum hashcodes of headers and libs I've tried:

ab7cd3a124af4381f3a66d8ac3bacbb0  ./media/mmcblk3p1/usr/include/CL/cl.hpp

801cd16dda62704b450865d4e752a9c5  ./media/mmcblk1p1/usr/include/CL_orig/cl.hpp

7625c9f38bbca0e97dbf7c272d53c219 ./media/mmcblk1p1/home/data/gpu_viv/opt/fsl/include/CL/cl.hpp

698841f89f5e50f942eb34e70b2de929  ./media/mmcblk3p1/usr/lib/libOpenCL.so

1fe9a4875d55bf041a0a382234cb77ea ./media/mmcblk1p1/home/data/gpu_viv/opt/fsl/lib/libOpenCL.so

Tags (1)
0 Kudos
1 Solution
1,898 Views
andre_silva
NXP Employee
NXP Employee

Hi Alexander,

As you said the kernel is not optimized for GPU, since it generates a lot of overhead, so this is why you are facing a poor performance. Actually with my experience you need to get rid of all your for or while loops, you need to unroll the loops and take advantage of the paralelism. The limitations in our OpenCL profile is related to atomics, kernel size (number of instructions) and the copy when reading the data back from gpu memory to a buffer.

Regards,

Andre

View solution in original post

0 Kudos
16 Replies
1,898 Views
alexanderkramer
Contributor II

Unfortunatelly I can't share the complete source code.

0 Kudos
1,899 Views
andre_silva
NXP Employee
NXP Employee

Hi Alexander,

As you said the kernel is not optimized for GPU, since it generates a lot of overhead, so this is why you are facing a poor performance. Actually with my experience you need to get rid of all your for or while loops, you need to unroll the loops and take advantage of the paralelism. The limitations in our OpenCL profile is related to atomics, kernel size (number of instructions) and the copy when reading the data back from gpu memory to a buffer.

Regards,

Andre

0 Kudos
1,898 Views
alexanderkramer
Contributor II

This code search local maxima in a grey image. Each kernel (work item) process a tile of size 64x64 pixels.

Parameters are set as follows:

tileSizeBits=6

int tileSize=(1<<tileSizeBits); //64

imgWidth=1280

imgHeight=1024

outTileSize=512

bg_limit=5

d=3

// intermediate params

    global[0] = imgWidth/tileSize;

    global[1] = imgHeight/tileSize;

    totalTilesNum=(int)(global[0]*global[1]);

   

    outBufferSize = totalTilesNum*outTileSize*sizeof(short);

input=pointer to memory of size 1280x1024 bytes (grey image)

output=pointer to memory of size outBufferSize

0 Kudos
1,898 Views
alexanderkramer
Contributor II

I suppose my kernel above is not optimal for OpenCL and GPU execution.

I've tested another kernel (see below) and it looks like GPU is involved.

The CPU load is 50% (2 cores out of 4 are full) with OpenCL computation and 100% (all 4 cores are full) with CPU computation.

Power consumption is also different: 19V,320mA with GPU, 250mA with CPU.

Performance depends on number of work items. Sometimes CPU is faster, sometimes GPU.

To run kernel below in GPU 256 times I need about 15 seconds. If running it on CPU then 26 seconds.

///----------------------------------------

__kernel void finddots ( int tileSizeBits,

int imgWidth,

int imgHeight,

int outTileSize,

int bg_limit,

int d,

__global uchar *input,

__global ushort *output

)

{

     uint tileID = (uint)(get_global_size(0)*get_global_id(1)+get_global_id(0));

     int i;

    

     float4 dd1, dd2, dd3, dd4, dd5;

     dd2.x=(float)tileID;

     dd2.y=(float)tileID;

     dd2.z=(float)tileID;

     dd2.w=(float)tileID;

     dd1=dd2;

     dd3=dd2;

     dd4=dd2;

     dd5=dd2;

     for(i=0; i<10000000;i++)

     {

        dd3 += dd2*dd5;

        dd2 += dd1*dd4;

        dd4 += dd3*dd1;

        dd1 += dd3*dd2;

        dd3 += dd2*dd5;

        dd2 += dd1*dd4;

        dd4 += dd3*dd1;

        dd1 += dd3*dd2;

       

    }

    ((__global float*)output)[tileID]=(dd1.x);

}

0 Kudos
1,898 Views
alexanderkramer
Contributor II

__kernel void finddots_2 (

int tileSizeBits,

int imgWidth,

int imgHeight,

int outTileSize,

int bg_limit,

int d,

__global uchar *input,

__global ushort *output

)

{

    if (get_work_dim()!=2) return;

   

    // Tile ID

    uint tileID = (uint)(get_global_size(0)*get_global_id(1)+get_global_id(0));

#ifdef ECL

    printf("tileID %d\n",tileID);

#endif   

    __global ushort * tileOutPtr=output+(tileID*outTileSize);

   

    uint rowPitch=imgWidth;

   

    // we start with 1 because we will store the length in the [0] component

    uint valuesStored=1;

   

    uint limitValuesStored=(outTileSize-sizeof(ushort))/2/sizeof(ushort)-1;

       

    int x_begin=max((int)(get_global_id(0)<<tileSizeBits),(int)d);

    int y_begin=max((int)(get_global_id(1)<<tileSizeBits),(int)d);

    int x_end=min((int)((get_global_id(0)+1)<<tileSizeBits),(int)(imgWidth-d));

    int y_end=min((int)((get_global_id(1)+1)<<tileSizeBits),(int)(imgHeight-d));

    

    //printf("Range %d-%d,%d-%d\n",x_begin,x_end,y_begin,y_end);

    uint tileMask=(1<<tileSizeBits)-1;

   

    for (int y=y_begin;y<y_end;y++)

    {

       

        for (int x=x_begin;x<x_end;x++)

        {

   

            uchar xy0 = input[rowPitch*y+x];   

            bool candSearchOk=true;

            if (xy0>bg_limit)

            {

                for (int yi=y-d;(yi<=y+d) && candSearchOk;yi++)

                for (int xi=x-d;xi<=x+d;xi++)

                {

                    uchar xyi=input[rowPitch*yi+xi];

                    if (xyi>xy0)

                    {

                        candSearchOk=false;

                        break;

                    }

                    else

                    if (xyi==xy0)

                    {

                        // depends where it is

                                if (!(  // same pixel

                                        ((yi==y) && (xi==x)) ||

                                        // its the direct bottom neighbor or

                                        ( (yi-y==1) && (xi==x) ) ||

                                        // one of the three right neighbors

                                        ( ( (yi-y==-1) || (yi-y==0) || (yi-y==1) ) && ( xi-x==1) )

                                    ))

                                {

                                    candSearchOk=false;

                                    break;

                                }

                           

                    }

                }

           

                if (candSearchOk)

                {

                    // pixel found->refine and store

                    // useThisDot=true

                    // todo: refinement

                    int subPixelX=0;

                    int subPixelY=0;

    #ifdef ECL

                    printf("  candidate at %d,%d \n",x,y);

    #endif

                    tileOutPtr[valuesStored] = (ushort)(((x&tileMask)<<((sizeof(ushort)*8)-tileSizeBits))+subPixelX);

                    tileOutPtr[valuesStored+1] = (ushort)(((y&tileMask)<<((sizeof(ushort)*8)-tileSizeBits))+subPixelY);

                    valuesStored+=2;

                    if (valuesStored>=limitValuesStored)

                    {

                        x=x_end;

                        y=y_end;

                        break;

                    }

                }

            }

        }

    }

    // store bytes stored by this work item

    tileOutPtr[0] = (ushort)(valuesStored*sizeof(ushort));

}

0 Kudos
1,898 Views
alexanderkramer
Contributor II

My OpenCl test application works now but I'm really disappointed with the OpenCL performance. I even suppose that GPU is not involved in OpenCL execution. I have low CPU usage if I have trivial OpenCL kernel. But if put more code inside OpenCL kernel my CPU usage is going higher and higher.

Executing the same kernel on CPU (by compiling it like normal C code) gives even better performance with lower CPU usage. This doesn't make sense to me.

Also there is a strange behaviour with OpenCL kernel complexity. Even if I have one thread for OpenCL my CPU can achive 100% utilization of all 4 cores.

For me it's indicator that OpenCL performs kernel execution on CPU.

Is there a way to check GPU load? Or clearly identify that GPU is involved in OpenCL kernel execution?

0 Kudos
1,898 Views
andre_silva
NXP Employee
NXP Employee

Hi Alexander,

can you past your kernel source ? I want to take a look.

thanks,

Andre

0 Kudos
1,898 Views
andre_silva
NXP Employee
NXP Employee

Actually, if possible, share your entire application, I want to take a look and test myself.

cheers,

Andr

0 Kudos
1,898 Views
Bio_TICFSL
NXP TechSupport
NXP TechSupport

AndreSilva do you know if there is any sdcard image with this implementation available?.

thanks and regards

0 Kudos
1,898 Views
andre_silva
NXP Employee
NXP Employee

Hi there,

have you added the CORE_IMAGE_EXTRA_INSTALL += "imx-gpu-viv imx-gpu-viv-dev" in your local.conf ?

cheers.

Andre

0 Kudos
1,898 Views
alexanderkramer
Contributor II

Thanks for response. Where the local.conf file is supposed to be? It's not in my ltib folder and is not on my yocto sd card.

0 Kudos
1,898 Views
andre_silva
NXP Employee
NXP Employee

check out this how to in the imxcv blog, you will find the answer:

http://imxcv.blogspot.com.br/2014/08/onboard-camera-v4l-wrapper-with-yocto.html

regards,

Andre

0 Kudos
1,898 Views
alexanderkramer
Contributor II

Thanks for Your help. I'm new to kernel compiling so I didn't know what local.conf is for.

At the moment I'm  trying another way: After booting I do chroot to the ubuntu oneric image provided by freescale. There I do the compilation of example OpenCL programm directly on the board.

But if I start the application it always crashes with segfault if program tries to enqueue write or read buffers for the opencl shader. I tested it with two different example "hello world" programms and both do this seg fault.

One of the this works perfectly on Windows with VIVANTE SDK and emulator, so the problem is not in the code. I suppose some GPU drivers are missing. Is there a way to check whether GPU drivers are installed?

0 Kudos
1,898 Views
andre_silva
NXP Employee
NXP Employee

The gpu driver should be already installed on kernel (built-in module), if not you can just modprobe galcore, other thing is to check if the CL libraries are in place. so you can check that by ls /usr/lib/*CL.so

regards

Andre

0 Kudos
1,898 Views
alexanderkramer
Contributor II

I've got CL working now. I've found another linux system on Sabre board which does not have dependencies to wayland and don't need wayland libraries. This works fine if used with chroot after booting kernel from yocto. The Ubuntu builds or Yocto builds seems to have strange dependencies of OpenCL to wayland and OpenCL crashes (at least on my board).

By the way I found some strange behaviour of "hello world" example given in the OpenCL white paper. The 2-dimensional buffer is allocated as array of pointers (one for each row) where each then points to column data. The OpenCL kernel seems to expect linear array and computes linear index from x,y. How this works together? In my tests every attempt to change OpenCL kernel ended with crash until I've changed the buffer to linear array.

0 Kudos
1,898 Views
andre_silva
NXP Employee
NXP Employee

the OpenCL kernel "converts" the two array input in a linear array. It is the easiest way to deal with the data.

cheers,

Andre

0 Kudos