- Reference Board: SMARC-sAMX6i
- Kernel version: 3.10.17-rel1.0+g232293e
- Problem Description: Used together, CL_MEM_ALLOC_HOST_PTR, clEnqueueMapBuffer and clEnqueueUnmapMemObject should allow fast, copy-free sharing of data between host and device. Instead, access to the mapped memory is very slow.
- Intel provides some examples of using zero copy buffers in OpenCL: Getting the Most from OpenCL™ 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics …
- Always reproducible when using CL_MEM_ALLOC_HOST_PTR, clEnqueueMapBuffer and clEnqueueUnmapMemObject
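I am seeing very poor performance with zero-copy buffers on the i.MX6. In the benchmarks below, sharing one buffer between host and device cuts the OpenCL framerate by roughly 15x and the NEON framerate by roughly 2.4x: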
# host and device use separate buffers
NEON framerate : 100.069185
OpenCL framerate : 751.673579
# host and device use same buffer
NEON framerate : 40.988317
OpenCL framerate : 48.976948
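To put a number on the drop, the ratio between the separate-buffer and shared-buffer framerates can be computed directly from the figures above. The helper name `slowdown` is made up for illustration and is not part of my benchmark code:

```c
#include <assert.h>

/* Ratio of the separate-buffer framerate to the shared-buffer framerate.
   A value of 2.0 means the shared-buffer run is twice as slow.
   `slowdown` is a hypothetical helper name used only for this illustration. */
double slowdown(double separate_rate, double shared_rate)
{
    assert(separate_rate > 0.0 && shared_rate > 0.0);
    return separate_rate / shared_rate;
}
```

Calling `slowdown(100.069185, 40.988317)` gives about 2.44 for the NEON path, and `slowdown(751.673579, 48.976948)` gives about 15.35 for the OpenCL path.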
I believe the host pointer allocated by the Vivante driver is uncached, which would explain the slowdown. This is unfortunate, because memory shared between the CPU and GPU could otherwise be leveraged for greater performance.
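One way the uncached-memory hypothesis could be checked is to time plain CPU reads through the mapped pointer and compare them with reads from an ordinary malloc buffer of the same size. The following is a minimal sketch, not code from my benchmark: the function name `read_throughput_mb_s` and its parameters are made up for illustration, and it assumes a POSIX system with `clock_gettime` (older glibc versions may need `-lrt` at link time).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Read every byte of buf `passes` times and return the read throughput in
   MB/s (10^6 bytes per second). The byte sum is written to *sink, and the
   pointer is volatile, so the compiler cannot optimize the reads away.
   Hypothetical helper: the name and signature are illustrative only. */
double read_throughput_mb_s(const volatile uint8_t *buf, size_t len,
                            int passes, uint64_t *sink)
{
    struct timespec t0, t1;
    uint64_t sum = 0;

    assert(buf != NULL && len > 0 && passes > 0 && sink != NULL);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int p = 0; p < passes; ++p)
        for (size_t i = 0; i < len; ++i)
            sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *sink = sum;

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return 0.0; /* timer resolution too coarse for this buffer size */
    return ((double)len * (double)passes) / (secs * 1e6);
}
```

The idea would be to call it once on a malloc buffer and once on the pointer returned by clEnqueueMapBuffer (while the buffer is still mapped), using the same length and pass count. A large gap between the two results would be consistent with the mapping being uncached, although it would not prove it.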