Issue description:
I am seeing very poor performance with zero-copy buffers on the i.MX6, as the following benchmarks show:
# host and device use separate buffers
NEON framerate : 100.069185
OpenCL framerate : 751.673579
# host and device use same buffer
NEON framerate : 40.988317
OpenCL framerate : 48.976948
I believe the host pointer allocated by the Vivante driver is uncached, which would explain the slowdown: with a shared buffer, the NEON framerate drops from about 100 to about 41 and the OpenCL framerate from about 752 to about 49. This is unfortunate, because memory shared between the CPU and GPU could otherwise be leveraged for greater performance.
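
One way to check the uncached-memory suspicion is to time a plain CPU read sweep over the driver-provided host pointer and compare it with the same sweep over an ordinary `malloc` buffer of the same size. The sketch below is only a rough probe under stated assumptions: the names `read_pass` and `read_bandwidth_mb_s` are hypothetical helpers, not part of any Vivante or OpenCL API, and the pointer under test is assumed to be 4-byte aligned and obtained elsewhere (for example from `clEnqueueMapBuffer` on a buffer created with `CL_MEM_ALLOC_HOST_PTR`, if that is how the zero-copy buffer is set up).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Sum every 32-bit word; volatile access keeps the compiler from
 * eliding the loads we want to time. */
static uint32_t read_pass(const volatile uint32_t *buf, size_t words)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < words; i++)
        sum += buf[i];
    return sum;
}

/* Hypothetical helper: sweep the buffer `passes` times and return the
 * CPU read bandwidth in MB/s (0.0 if the run was too short to time).
 * `ptr` is assumed to be 4-byte aligned. The accumulated checksum is
 * written to *checksum when it is non-NULL. */
static double read_bandwidth_mb_s(const void *ptr, size_t bytes,
                                  int passes, uint32_t *checksum)
{
    assert(ptr != NULL && bytes >= sizeof(uint32_t) && passes > 0);

    size_t words = bytes / sizeof(uint32_t);
    uint32_t sum = 0;

    /* clock() reports CPU time, which is close to wall time for a
     * single-threaded busy loop like this one. */
    clock_t t0 = clock();
    for (int p = 0; p < passes; p++)
        sum += read_pass((const volatile uint32_t *)ptr, words);
    clock_t t1 = clock();

    if (checksum != NULL)
        *checksum = sum;

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0;
    return ((double)words * sizeof(uint32_t) * passes) / (secs * 1e6);
}
```

Calling `read_bandwidth_mb_s` once on the mapped pointer and once on a `malloc` buffer of equal size, each several megabytes with a handful of passes, should give two comparable figures. A mapped-buffer result that is many times lower would be consistent with an uncached mapping, though it would not prove it on its own.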