I am trying to figure out the best way to transfer the output of the VPU decoder into an OpenCL memory object. The VPU uses non-cacheable memory, so memcpy reads from it are very slow, e.g. 600 ms for 64 MB. I tried mapping OpenCL memory objects created with CL_MEM_ALLOC_HOST_PTR and filling them with memcpy, and I also tried creating the memory objects with CL_MEM_USE_HOST_PTR and with CL_MEM_COPY_HOST_PTR. All of these take the same 600 ms, just like a memcpy from the VPU buffer into a malloc'd buffer. What am I missing here? Is it possible to do a DMA transfer into the GPU?
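
To put the numbers above in perspective, here is a minimal sketch of how such a copy can be timed and converted to a throughput figure. The helper names `time_memcpy_ms` and `throughput_mib_per_s` are hypothetical and only for illustration, and the sketch assumes a POSIX system with `clock_gettime`. It does not touch the VPU or OpenCL at all: the source and destination pointers would be whatever buffers are being compared (the VPU output buffer, a mapped OpenCL buffer, or a malloc'd buffer).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: convert a byte count and an elapsed time
 * in milliseconds into a throughput in MiB/s. */
static double throughput_mib_per_s(size_t bytes, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    return ((double)bytes / (1024.0 * 1024.0)) / (elapsed_ms / 1000.0);
}

/* Hypothetical helper: time a single memcpy of `bytes` bytes from
 * src to dst and return the elapsed wall-clock time in milliseconds. */
static double time_memcpy_ms(void *dst, const void *src, size_t bytes)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec) * 1000.0
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

If "64MB" is taken to mean 64 MiB, `throughput_mib_per_s((size_t)64 * 1024 * 1024, 600.0)` comes out a little under 107 MiB/s. Timing each path with the same helper (VPU buffer to malloc'd buffer, VPU buffer to mapped OpenCL buffer) makes it easier to see whether the cost is in the uncached read itself or in the OpenCL side.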