Hi Anusree,
Are you using the GPU vectors to initialize CPU memory or GPU memory? Also, you can copy the output from GPU memory to CPU memory with the clEnqueueReadBuffer command; this should improve the performance.
In case you are using atomic functions, note that these are currently disabled in the OpenCL compiler. If a kernel uses them, the compiler will report an error.
Examples can be found in the Graphics reference manual in the BSP documentation.
Hope this helps.