I wrote a very simple kernel to check the performance of the GPU on a dual-core iMX6. The kernel performs a multiply operation a million times on variables held in private memory (registers). I ran it with different global work sizes and different numbers of work items per work group. The best performance came from launching 64 work groups with 64 work items each. The maximum I measured was slightly above 3 GFLOPS at 500 MHz, whereas the Vivante GPU is supposed to reach 24 GFLOPS. I used float16 for this test. Is there any way to increase GPU performance other than raising the clock frequency? A quick reply would be appreciated.
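
For reference, a GFLOPS figure for a test like this can be estimated roughly as follows. This is only a minimal sketch in Python, not the actual measurement code. It assumes each float16 multiply counts as 16 floating-point operations (one per lane) and that nothing else in the loop is counted. The function name `estimate_gflops` and the elapsed time of 20 seconds are hypothetical placeholders chosen for illustration; the 64 x 64 launch configuration and the 24 GFLOPS peak are the figures quoted above.

```python
def estimate_gflops(work_groups, items_per_group, iterations, lanes,
                    elapsed_s, flops_per_lane_op=1):
    """Estimate achieved GFLOPS for a kernel in which every work item
    performs `iterations` vector operations of width `lanes`."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    total_flops = (work_groups * items_per_group * iterations
                   * lanes * flops_per_lane_op)
    return total_flops / elapsed_s / 1e9


# Launch configuration from the post: 64 work groups x 64 work items,
# one million float16 multiplies per work item (16 lanes each).
# elapsed_s is a HYPOTHETICAL placeholder, not a measured value.
gflops = estimate_gflops(work_groups=64, items_per_group=64,
                         iterations=1_000_000, lanes=16, elapsed_s=20.0)

# Peak figure quoted in the post for the Vivante GPU.
quoted_peak_gflops = 24.0
fraction_of_peak = gflops / quoted_peak_gflops

print(f"achieved: {gflops:.2f} GFLOPS, "
      f"{fraction_of_peak:.1%} of the quoted peak")
```

Replacing `elapsed_s` with the kernel time actually reported by the OpenCL profiling events should show how far the result is from the quoted peak. If the kernel uses a multiply-add instead of a plain multiply, setting `flops_per_lane_op=2` would be the usual convention.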