Hi All,
Is it possible to execute OpenCL ' Kernel from Gstreamer Plugin during running Gstreamer pipeline?
I want to use OpenCL from gstreamer plugin while running any video through gst-launch.
I am adding one sample Gstreamer Plugin (e.g. sampleocl-plugin) into the Gstreamer Pipeline. From 'sampleocl-plugin' i am creating OpenCL Context, creating two Image2D memory object using clCreateImage2D, creating & loading kernel, creating command queue, setargs for kernel (clSetKernelArg(....., &srcimage2D) , clSetKernelArg(...., &dstimage3D)).
Then executing kernel clEnqueueNDRangeKernel with 2 dimensional NDRangeKernel, global work size set to global[0]=640, global[1]=480. & wait for completion of execution of kernel.
My kernel is simply doing memcpy from SRC image to DST image.
code snippet of kernel is :
////////////// START
__constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
_kernel void copy_image (
__read_only image2d_t input,
__write_only image2d_t output)
{
const int2 pos = {get_global_id(0), get_global_id(1)};
float4 sum = read_imagef(input, sampler, pos);
write_imagef (output, (int2)(pos.x, pos.y), sum);
}
/////////////// END
What i am observed here that, after completion of my kernel execution, i received only one pixel from SRC image to DST image & i.e from (x=0,y=0) only.
All global Work Items {global[0]=640, global[0]=480 } MUST execute same Kernel & result MUST all pixels copied from SRC to DST.
I have done/ executed exactly same thing from one simple unit test (executable file), & its working fine. I received entire SRC image in DST image.
In gstreamer, only one pixel copied ? Is something wrong ?
Or any other way should i execute OpenCL from gstreamer ?
I am using YOCTO build system.
-Ankit.
Hi Ankit,
I tried your code and got the following result:
root@imx6qsabresd:/home/opencl_app# ./opencl_app | ||
platformIdCount: 1 | ||
platformId: 0x76e6c660 | ||
device_count: 1 | ||
deviceId: 0x76e6c4e0 | ||
clCreateContext: 0 | ||
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS SIZE: 4, iret:0 | ||
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS VAL: 3 | ||
CL_DEVICE_MAX_WORK_ITEM_SIZES SIZE: 12, iret:0 | ||
CL_DEVICE_MAX_WORK_ITEM_SIZES VAL: 1024 | ||
CL_DEVICE_MAX_WORK_ITEM_SIZES VAL: 1024 | ||
CL_DEVICE_MAX_WORK_ITEM_SIZES VAL: 1024 | ||
CL_DEVICE_MAX_WORK_GROUP_SIZE SIZE: 4, iret:0 | ||
CL_DEVICE_MAX_WORK_GROUP_SIZE VAL: 1024 | ||
CL_DEVICE_MAX_COMPUTE_UNITS SIZE: 4, iret:0 | ||
CL_DEVICE_MAX_COMPUTE_UNITS VAL: 4 | ||
CL_DEVICE_EXTENSIONS SIZE: 31, iret:0 | ||
CL_DEVICE_EXTENSIONS VAL: cl_khr_byte_addressable_store | ||
clCreateImage2D: 0 | ||
clCreateImage2D: 0 | ||
clCreateProgramWithSource: 0x6e3073e0 | ||
clBuildProgram: 0 | ||
clCreateCommandQueue: 0 | ||
clCreateKernel: 0 | ||
===IN FRAME | :3899000 | |
===IN FRAME2 | :1487667 | |
===KERNEL | :10623000 | |
===OUT FRAME | :12903000 | |
Display Result: | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb | ||
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb |
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
------------------------------------------
based in this function:
// This code will show only single channel data
void display_img_data(struct img_info *info)
{
int y, x;
unsigned int *int_ptr = (unsigned int *)info->data;
int shift = 24; // Only 1 channel out of ARGB shall be considered
for (y=0; y<16; y++) {
for (x=0; x<16; x++) {
int data = (((*int_ptr)>>(shift))&0xff);
printf("%x ",data);
}
printf("\n");
}
}
in image_data.c and this one:
void set_default_img_data(struct img_info *info, unsigned int val)
{
int y, x;
unsigned int fixval = 0x04040404;
unsigned int fixincr = 0x01010101;
unsigned int *int_ptr = (unsigned int *)info->data;
for (y=0; y<info->height; y++)
for (x=0; x<info->width; x++) {
*int_ptr = val; //(fixval + fixincr *x);
int_ptr++;
}
}
which you pass:
set_default_img_data(¶m.src, 0xbbbbbbbb); |
looks it is being correctly executed, right ?
let me know if I misunderstood it.
Cheers,
andre
Hi Andre,
This unit-test is working fine. But actual question is: when i integrate its code under gstreamer plug-in it is not working there.
Those functions (set_default_img_data & display_img_data) are for just verification whether its working or not in unit-test.
Actual behavior in gstreamer plug-in is what i mentioned in my above query ( i received only one pixel from SRC image to DST image i.e from (x=0,y=0) only).
In Gstreamer, What i have next experimented is that, Instead of using Image objects (e.g. 'clCreateImage2D' ), now i m using Buffer objects (e.g 'clCreateBuffer') for memory object creation.
Wrote new kernel api shown below. Executed that api & its working fine
/////////////////// Kernel code in imgcopy.cl ///
__kernel void copy_buffer ( __global uchar *input,
__global uchar *output,
int width,
int height)
{
int x = get_global_id (0);
int y = get_global_id (1);
if(1) {
int id = (y * width * 4) + (x * 4);
unsigned int *in = &input[id];
unsigned int *out = &output[id];
*out = (unsigned int)*in;
} else {
int id = (y * width * 4) + (x * 4);
//output[id] = input[id];
output[id] = input[id]; id++;
output[id] = input[id]; id++;
output[id] = input[id]; id++;
output[id] = input[id]; id++;
}
}
/////////////////////////////////////////////////////////
So, whether i m doing something wrong in creation of memory object ?
-Ankit.
Hi Ankit, I will check if there is any issue related and let you know. But I would recommend you stick with buffers instead, ,this is something I took from the Vivante documentation:
"Note that for improved performance with Vivante Embedded Profile cores, buffers should be used instead of read_image and write_image."
regards,
Andre
Hi Andre,
Thank you for your respond.
Now, in terms of Performance; When i execute copy_buffer (from above kernel code, for 800x480 width and height), it take approx 11.5 ms.
So, what will be approximate value (overhead for OpenCL in i.MX6Q ) OpenCL should take to Read from input memory and Write to output memory?
-Ankit
Hi Ankit, sorry for the late response, I will check that info and let you know.