i.MX6 OpenCL Zero Copy Buffer Usage is Slow

smrhein · 08-13-2015

Issue description:

Reference Board: SMARC-sAMX6i
Kernel version: 3.10.17-rel1.0+g232293e
Problem Description: Used together correctly, CL_MEM_ALLOC_HOST_PTR, clEnqueueMapBuffer and clEnqueueUnmapMemObject should allow fast sharing of data between host and device. Instead, access to the mapped memory is very slow.
Intel provides some examples of using zero copy buffers in OpenCL: Getting the Most from OpenCL™ 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel®...
Always reproducible when using CL_MEM_ALLOC_HOST_PTR, clEnqueueMapBuffer and clEnqueueUnmapMemObject.

I am seeing terrible performance with zero copy buffers on the i.MX6, as the following benchmarks show:

# host and device use separate buffers
NEON framerate : 100.069185
OpenCL framerate : 751.673579

# host and device use same buffer
NEON framerate : 40.988317
OpenCL framerate : 48.976948

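I believe the host pointer allocated by the Vivante driver is uncached, and that this is what causes the slowdown: with a shared buffer the OpenCL framerate drops to roughly 1/15 of the separate-buffer figure, and the NEON framerate to about 40% of it. This is highly unfortunate, because memory shared between CPU and GPU could otherwise be leveraged for greater performance.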
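One way to probe the uncached-memory suspicion from the host side is to time plain reads through a pointer. The following is only a minimal sketch, not the benchmark code behind the numbers above. The helper name `read_throughput_mbps` and its parameters are made up for illustration, and it assumes a POSIX system where `clock_gettime` is available.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper (not part of OpenCL or the Vivante driver):
 * reads every 32-bit word of buf `passes` times and returns the observed
 * read throughput in MB/s (1 MB = 1e6 bytes). The running sum is stored
 * in *sum_out, and buf is volatile-qualified, so the compiler cannot
 * drop the reads. */
static double read_throughput_mbps(const volatile uint32_t *buf, size_t bytes,
                                   int passes, uint32_t *sum_out)
{
    assert(buf != NULL && sum_out != NULL && passes > 0);

    struct timespec t0, t1;
    size_t words = bytes / sizeof(uint32_t);
    uint32_t sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int p = 0; p < passes; ++p) {
        for (size_t i = 0; i < words; ++i) {
            sum += buf[i];
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0) {
        secs = 1e-9; /* guard against a zero reading from a coarse timer */
    }

    *sum_out = sum;
    return ((double)bytes * (double)passes) / (secs * 1e6);
}
```

Calling this once on a buffer from `malloc` and once on the pointer returned by clEnqueueMapBuffer, with the same size, gives two figures to compare. A large gap would be consistent with an uncached mapping, though it would not prove it on its own.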

Labels: Graphics & Display, i.MX6Quad, Linux, Suspected Software Defect, Yocto Project