Thank you for your response.
I didn't change any input parameters for now.
If the execution time results were treated as a signal, it could be decomposed into 3 major components:
- Consistent execution time: ~10µs ± 5µs
- Small spikes: ~100µs ± 50µs
- Big spikes: ~2000µs ± 1000µs
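To put numbers on each of these components, the measured samples could be binned by execution time. This is only a rough sketch: the cut-offs of 30µs and 500µs, as well as the names `spike_counts`, `classify_samples` and `percent_of`, are my own assumptions for illustration and would need tuning against the real measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-offs in microseconds, placed in the gaps between the
 * three clusters listed above. They are assumptions, not measured values. */
#define SMALL_SPIKE_MIN_US 30.0
#define BIG_SPIKE_MIN_US   500.0

typedef struct {
    size_t baseline;     /* samples below SMALL_SPIKE_MIN_US */
    size_t small_spikes; /* samples in [SMALL_SPIKE_MIN_US, BIG_SPIKE_MIN_US) */
    size_t big_spikes;   /* samples at or above BIG_SPIKE_MIN_US */
} spike_counts;

/* Count how many execution-time samples (in microseconds) fall in each bin. */
static spike_counts classify_samples(const double *samples_us, size_t n)
{
    spike_counts c = {0, 0, 0};
    assert(samples_us != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (samples_us[i] >= BIG_SPIKE_MIN_US) {
            c.big_spikes++;
        } else if (samples_us[i] >= SMALL_SPIKE_MIN_US) {
            c.small_spikes++;
        } else {
            c.baseline++;
        }
    }
    return c;
}

/* Share of `count` in `total`, in percent; 0 if there are no samples. */
static double percent_of(size_t count, size_t total)
{
    return total ? 100.0 * (double)count / (double)total : 0.0;
}
```

Calling `classify_samples` on the array of recorded times and then `percent_of(c.big_spikes, n)` gives the share of big spikes, which makes it easier to compare runs with and without clFinish.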
I am using clFinish in the following manner:
clFinish(..);
gettimeofday(&start, NULL);
err = clEnqueueNDRangeKernel(..., &hEvent);
clFinish(..);
if (err != CL_SUCCESS) { /* ..error check.. */ }
gettimeofday(&end, NULL);
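For completeness, the two gettimeofday samples can be turned into a microsecond value roughly like this. The helper name `elapsed_us` is made up for this sketch; it only assumes the standard `struct timeval` fields `tv_sec` and `tv_usec`.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() samples.
 * The seconds difference is scaled to microseconds and the (possibly
 * negative) microsecond difference is added on top. */
static long long elapsed_us(const struct timeval *start,
                            const struct timeval *end)
{
    assert(start != NULL && end != NULL);
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}
```

With the snippet above this would be called as `elapsed_us(&start, &end)`. Note that gettimeofday reports wall-clock time, so the result includes host-side overhead and not only the kernel run time.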
Using clFinish(..) eliminates most of the small spikes, those around ~100µs.
What remains is a mostly consistent signal with 15% big spikes at ~1000µs-1500µs.

Measuring the exec. time with your "direct" measurement approach (in contrast to the profiling information via cl_event "hEvent")
gives a similar picture of exec. times, but with a higher percentage of "big spikes" and an offset of around ~500µs on all values.
I will probably ignore the spikes for now since the max. exec. time is "only" 1.2ms, but if it stacks up with more complex functions I will need to find additional solutions.