Hello,
I am trying to figure out the best way to transfer the output of the VPU decoder into an OpenCL memory object. The VPU uses non-cacheable memory, so memcpy reads from it are very slow, e.g. 600ms for 64MB. I tried mapping OpenCL memory objects created with CL_MEM_ALLOC_HOST_PTR and filling them with memcpy, and I also tried CL_MEM_USE_HOST_PTR and CL_MEM_COPY_HOST_PTR when creating the memory objects, but all of them take the same 600ms as reading from the VPU buffer into a malloc'ed buffer with memcpy. What am I missing here? Is it possible to do a DMA transfer into the GPU?
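Schematically, the CL_MEM_ALLOC_HOST_PTR variant I tried looks like this (a minimal sketch with error handling omitted; ctx, queue, vpu_vaddr and size stand in for my actual variables):

#include <CL/cl.h>
#include <string.h>

/* Sketch of the CL_MEM_ALLOC_HOST_PTR + map + memcpy variant I tried.
   ctx, queue, vpu_vaddr and size stand in for whatever the application
   already has; error handling is omitted. */
static cl_mem upload_via_map(cl_context ctx, cl_command_queue queue,
                             const void *vpu_vaddr, size_t size)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err);
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, size, 0, NULL, NULL, &err);
    memcpy(p, vpu_vaddr, size);   /* reading the uncached VPU memory here is the slow part */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    return buf;
}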
Thanks,
Alex
I was able to speed it up 5.75x by creating a cacheable buffer with g2d_alloc(size, 1), doing a g2d_blit from the VPU buffer (which takes less than 1ms), and then doing
cl_mem clmem = clCreateBuffer(GPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, size, (void *)g2d_cached->buf_vaddr, NULL);
For some reason, the maximum allocation size for g2d_alloc with the cacheable flag set to 1 is 32MB, but with the flag set to 0 there is no such limitation. So for my desired 64MB copy size I would have to do the copy twice. While it helps, it does look like a hack, and I would like to know if there is a better solution, or why g2d_alloc has the 32MB limit. I tried increasing CONFIG_FORCE_MAX_ZONEORDER above 14 and tried the memchunk kernel command-line parameter, but that did not help.
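For reference, the whole path looks roughly like this (a sketch only, with error handling stripped; the RGBA8888 surface setup and the variable names are just illustrative placeholders, and cache maintenance may also be needed, see the EDIT below):

#include <CL/cl.h>
#include <g2d.h>

/* Sketch of the blit-through-a-cacheable-buffer path: let the 2D engine
   copy the uncacheable VPU frame into a cacheable g2d buffer, then hand
   that buffer to OpenCL with CL_MEM_COPY_HOST_PTR. vpu_paddr, width and
   height are placeholders; the RGBA8888 surface setup is only illustrative. */
static cl_mem vpu_to_clmem(cl_context ctx, int vpu_paddr, int width, int height)
{
    void *g2d_handle;
    g2d_open(&g2d_handle);

    int size = width * height * 4;
    struct g2d_buf *cached = g2d_alloc(size, 1 /* cacheable */);

    struct g2d_surface src = {0}, dst = {0};
    src.format = dst.format = G2D_RGBA8888;
    src.planes[0] = vpu_paddr;           /* physical address of the VPU output */
    dst.planes[0] = cached->buf_paddr;   /* physical address of the cacheable buffer */
    src.right  = dst.right  = width;
    src.bottom = dst.bottom = height;
    src.width  = dst.width  = width;
    src.height = dst.height = height;
    src.stride = dst.stride = width;
    src.rot = dst.rot = G2D_ROTATION_0;

    g2d_blit(g2d_handle, &src, &dst);
    g2d_finish(g2d_handle);   /* see the EDIT below: needed before the data is touched */
    /* a cache invalidate on 'cached' may also be required; see the EDIT below */

    cl_int err;
    cl_mem clmem = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                  size, cached->buf_vaddr, &err);

    g2d_free(cached);
    g2d_close(g2d_handle);
    return clmem;
}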
EDIT: I forgot to use g2d_finish() after g2d_blit. After I added it, I started getting kernel crashes (L3.10.17) intermittently and the copy time increased by 25%. After I added
g2d_cache_op(g2d_cached_buf, G2D_CACHE_INVALIDATE) before the g2d_blit, the copy time increased by another 90%, so the total gain was only 2x compared to the start, still with crashes (the sequence as it stands now is sketched after the trace below):
[ 380.817312] Backtrace:
[ 380.819786] [] (__delete_from_page_cache+0x0/0x100) from [] (delete_from_page_cache+0x38/0x74)
[ 380.830145] r5:9084f2fc r4:80627fc0
[ 380.833762] [] (delete_from_page_cache+0x0/0x74) from [] (truncate_inode_page+0x84/0xb8)
[ 380.843601] r7:00000000 r6:00001000 r5:9084f2fc r4:80627fc0
[ 380.849333] [] (truncate_inode_page+0x0/0xb8) from [] (shmem_undo_range+0x298/0x64c)
[ 380.858822] r7:00000000 r6:ffffffff r5:0000052c r4:80627fc0
[ 380.864551] [] (shmem_undo_range+0x0/0x64c) from [] (shmem_truncate_range+0x2c/0x50)
[ 380.874047] [] (shmem_truncate_range+0x0/0x50) from [] (shmem_evict_inode+0xc4/0x170)
[ 380.883622] r7:9084f228 r6:9084f2cc r5:00000000 r4:9084f210
[ 380.889358] [] (shmem_evict_inode+0x0/0x170) from [] (evict+0xb4/0x19c)
[ 380.897719] r9:9084f228 r8:9084f228 r7:803ca480 r6:9084f2cc r5:80526300 r4:9084f228
[ 380.905650] [] (evict+0x0/0x19c) from [] (iput+0xfc/0x184)
[ 380.912880] r7:803ca480 r6:900e4c00 r5:9084f278 r4:9084f228
[ 380.918611] [] (iput+0x0/0x184) from [] (d_kill+0x120/0x168)
[ 380.926015] r7:8e2ff6b4 r6:00000000 r5:9084f228 r4:8e2ff660
[ 380.931744] [] (d_kill+0x0/0x168) from [] (dput+0x120/0x210)
[ 380.939148] r6:00000000 r5:9084f228 r4:8e2ff660
[ 380.943820] [] (dput+0x0/0x210) from [] (__fput+0x11c/0x248)
[ 380.951225] r7:9008d910 r6:8e2ff660 r5:00000000 r4:905a93c0
[ 380.956951] [] (__fput+0x0/0x248) from [] (____fput+0x10/0x14)
[ 380.964544] [] (____fput+0x0/0x14) from [] (task_work_run+0xb8/0xf4)
[ 380.972658] [] (task_work_run+0x0/0xf4) from [] (do_exit+0x2b4/0x990)
[ 380.980844] r7:90a66020 r6:90a66000 r5:90462800 r4:90462b20
[ 380.986573] [] (do_exit+0x0/0x990) from [] (do_group_exit+0x54/0xd4)
[ 380.994672] r7:90a67edc
[ 380.997240] [] (do_group_exit+0x0/0xd4) from [] (get_signal_to_deliver+0x260/0x5d0)
[ 381.006642] r7:90a67edc r6:90a66038 r5:00000009 r4:90a66000
[ 381.012380] [] (get_signal_to_deliver+0x0/0x5d0) from [] (do_signal+0x94/0x468)
[ 381.021442] [] (do_signal+0x0/0x468) from [] (do_work_pending+0x6c/0xac)
[ 381.029896] [] (do_work_pending+0x0/0xac) from [] (work_pending+0xc/0x20)
[ 381.038430] r7:00000036 r6:00007530 r5:76b6c198 r4:00002710
[ 381.044158] Code: eb0056d1 e594300c e3530000 baffffdc (e7f001f2)
[ 381.050262] ---[ end trace a5ce7d4c43e2d135 ]---
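The sequence that produces the crash above is, schematically (a sketch; g2d_handle, src, dst and g2d_cached_buf are the same placeholders as in the earlier sketch):

g2d_cache_op(g2d_cached_buf, G2D_CACHE_INVALIDATE);  /* invalidate the cacheable destination before the DMA */
g2d_blit(g2d_handle, &src, &dst);
g2d_finish(g2d_handle);                               /* wait for the blit before the CPU touches the data */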
Hello,
You wrote: “VPU uses non-cacheable memory, so memcpy reads from it
are very slow […]” and “I was able to speed it up 5.75x times by creating
a cacheable buffer […]”. Basically, this is a general issue: a non-cacheable
copy is slow compared with a cacheable one. Your approach of using a
cacheable copy may be quite reasonable for your application.
Nevertheless, please analyze the approach carefully to make sure it does
not introduce coherency problems.
Have a great day,
Yuri
Hello Yuri,
Thank you for your feedback. I should have no coherency issues, since I control the processing pipeline and know exactly when VPU or OpenCL processing is done and when the memory can be accessed; and since the memory is not cached, it should be safe.
What I really wanted was a way to do DMA transfers between the VPU and the GPU, just like it is possible between the VPU and G2D or the IPU. There seems to be no official way to do this, and the Vivante OpenCL driver does not support zero copy with CL_MEM_ALLOC_HOST_PTR like other vendors do, which would have been an alternative (https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performanc...)
I have found a hack that seems to work great; here it is. An OpenCL memory buffer (cl_mem) is a pointer to an opaque structure _cl_mem. Using debug versions of the GPU libraries and stepping through the code in Eclipse, I found that _cl_mem stores the physical address of the buffer at offset 72 (decimal). It must be mentioned that the _cl_mem struct contains a union of a buffer struct and an image struct; I have not checked whether the offset is different when the cl_mem is created as an image, as I do not use images. Having the physical address of the OpenCL buffer, I can use g2d_blit to copy between the OpenCL buffer and VPU/G2D/IPU buffers that expose a physical address. Here is how I do it:
clEnqueueMapBuffer() to let OpenCL finish whatever it needs if anything
unsigned char *ptr = (unsigned char *) cl_mem_buffer;
ptr += 72;
int phys_addr = *((int *)ptr);
g2d_blit()
g2d_finish()
clEnqueueUnmapMemObject()
This code copies a 45MB buffer in 28ms with correct results. For comparison, the following code, which is the official way of doing this, takes 450ms:
clEnqueueMapBuffer()
memcpy()
clEnqueueUnmapMemObject()
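To make the hack concrete, here is roughly what the fast path looks like (a sketch only; the 72-byte offset is specific to the driver build I stepped through, and queue, clbuf, size, g2d_handle and vpu_src are placeholders for my actual variables):

#include <CL/cl.h>
#include <g2d.h>

/* Sketch of the hack: dig the physical address out of the opaque _cl_mem
   struct (offset 72 decimal in the driver build I stepped through -- this
   is not a stable ABI) and let the 2D engine DMA straight into the OpenCL
   buffer. queue, clbuf, size, g2d_handle and vpu_src are placeholders. */
static void vpu_to_clmem_dma(cl_command_queue queue, cl_mem clbuf, size_t size,
                             void *g2d_handle, struct g2d_surface *vpu_src)
{
    cl_int err;

    /* Map first so the driver finishes whatever it needs on this buffer. */
    void *mapped = clEnqueueMapBuffer(queue, clbuf, CL_TRUE,
                                      CL_MAP_READ | CL_MAP_WRITE,
                                      0, size, 0, NULL, NULL, &err);

    /* Physical address of the OpenCL buffer, read out of the opaque struct. */
    unsigned char *p = (unsigned char *)clbuf;
    int phys_addr = *((int *)(p + 72));

    struct g2d_surface dst = *vpu_src;   /* same geometry/format as the source */
    dst.planes[0] = phys_addr;

    g2d_blit(g2d_handle, vpu_src, &dst);
    g2d_finish(g2d_handle);

    clEnqueueUnmapMemObject(queue, clbuf, mapped, 0, NULL, NULL);
}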
I am wondering whether there are any drawbacks to this hack that I am not aware of; if not, then maybe there should be an OpenCL vendor extension available to do this kind of copying.
Regards,
Alex
Hi Alex,
I'm currently trying to implement an application using OpenCL on an i.MX6, and while doing research I found your post. I was wondering where you found the Vivante drivers with debugging symbols?
I would like to reproduce your test. In a quick test I ran, I create 2 cl_mem objects; when I read the physical address of the first one following your steps I get a non-zero value as expected, however for the second one it is always 0. I know both memory objects are created properly, as the application works correctly; my final goal is to reduce/eliminate the memory copies.
I'm using kernel 3.14.28 and imx-gpu-viv-5.0.11.p4.1-hfp.
Thanks for your help.
Regards,
Edison