opencl: imx8qxp with GPU vivante gc7000ul -> clCreateKernel too slow

4,765 views
terzibaschian
Contributor II

Hello everyone!

I cannot find anything about this online, nor adequate documentation for either the tools or the hardware, so I decided to ask in this forum, even though I am not sure this is the right place, as it is more of a Vivante GPU issue than an NXP one.

Anyways:

I am working on an OpenCL 1.2 project using the Vivante GPU on an i.MX8 board.

My project consists of a few hand-optimized OpenCL kernels, all precompiled to binaries on my development PC with the OpenCL compiler vcCompiler from the VTK, which you can find in the recent software section for the imx8qxp board. Luckily their runtime performance is very good, which frees the CPU to do other work.

Unfortunately, at program startup I need to call clCreateKernel, and even though the cl_program objects are created from prebuilt binaries, all 7 clCreateKernel calls take about 30 to 40 seconds (!). This is not only very annoying for debugging, it ultimately makes the whole software unusable. It needs to be working 1-2 seconds after the device starts, otherwise the use case is simply not there.
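To see which of the calls dominates the startup time, one option is to time each call separately with a monotonic clock. The following is only a minimal sketch, assuming a POSIX system that provides clock_gettime. The names time_call_ms and sleep_20ms are hypothetical, and the stand-in function sleeps instead of calling clCreateKernel so that the snippet builds without the OpenCL headers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: runs fn(arg) once and returns the wall-clock
 * duration in milliseconds, measured with the monotonic clock. */
static double time_call_ms(void (*fn)(void *), void *arg)
{
    struct timespec t0, t1;
    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Stand-in for the work being measured. In the real program this body
 * would be a single clCreateKernel(program, kernel_name, &err) call. */
static void sleep_20ms(void *unused)
{
    struct timespec d = { 0, 20 * 1000 * 1000 };
    (void)unused;
    nanosleep(&d, NULL);
}
```

Calling time_call_ms(sleep_20ms, NULL) returns the elapsed milliseconds for one call. Wrapping each of the 7 kernel creations this way would show whether the time is spread evenly or concentrated in one kernel.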

The issue looks similar on INTEGRITY and Linux.

So my question: Is there a way to store/precompile OpenCL 1.2 kernels to speed up clCreateKernel? Maybe there are even some native galcore functions that can be called to store the cl_kernel objects to disk after clCreateKernel has been called?

Thanks for any help in advance!

5 replies

3,717 views
Bio_TICFSL
NXP TechSupport

Hi,

from our documentation:

------------

The application running on the host uses the OpenCL API to create memory objects in global memory, and to enqueue memory commands that operate on these memory objects. The host and OpenCL device memory models are, for the most part, independent of each other. This is by necessity as the host is defined outside of OpenCL. They do, however, at times need to interact. This interaction occurs in one of two ways: by explicitly copying data from the host to the GPU compute device memory, or implicitly, by mapping and unmapping regions of a memory object.

 

• Explicit, using clEnqueueReadBuffer and clEnqueueWriteBuffer (or clEnqueueReadImage and clEnqueueWriteImage).
To copy data explicitly, the host enqueues commands to transfer data between the memory object and host memory. These memory transfer commands may be blocking or non-blocking. The OpenCL function call for a blocking memory transfer returns once the associated memory resources on the host can be safely reused. For a non-blocking memory transfer, the OpenCL function call returns as soon as the command is enqueued regardless of whether host memory is safe to use.
• Implicit, using clEnqueueMapBuffer and clEnqueueUnmapMemObject.
The mapping/unmapping method of interaction between the host and OpenCL memory objects allows the host to map a region from the memory object into its address space. The memory map command may be blocking or non-blocking. Once a region from the memory object has been mapped, the host can read or write to this region. The host unmaps the region when accesses (reads and/or writes) to this mapped region by the host are complete.

--------------

 

How are you checking whether it is blocking or not? Is clFlush or clFinish being called right after the clEnqueueRead/Map function?

 

I am just checking because we are not aware of any issue like that.

Regards

3,717 views
terzibaschian
Contributor II

Hi,

We see the longer clCreateKernel times both with the preinstalled Linux OpenCL stack (not sure which exact version it is, but it comes with the imx8qxp board) and with INTEGRITY 11.7.4 with the Vivante galcore kernel.

Our measurements confirm the 6 ms for clCreateKernel with a simple copy kernel.

Unfortunately the specific OpenCL code that takes longer cannot be made public, so all I can give you is some statistics: it is basically a program containing 4 special cases for convolution, each kernel ~170 lines of hand-optimized OpenCL code. Separating them into single programs resulted in a slight clCreateKernel speedup of 20%, but it still takes several seconds.

We are already using vcCompiler to precompile the sources. The precompiled kernel file (.clgcSL), which we pass to clCreateProgramWithBinary, is around 190 KB. Another file with the extension .gcPGM (~40 KB) is also generated, and we are not 100% sure what it is for. Despite the precompilation, clCreateKernel takes several seconds.
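For reference, loading such a precompiled file into memory before handing it to clCreateProgramWithBinary can be sketched roughly as follows. This is an illustration under assumptions, not our actual code: read_binary_file is a hypothetical helper name, and the OpenCL calls are left out so that the snippet builds without the OpenCL headers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads a whole file into a malloc'ed buffer.
 * Returns NULL on failure (including an empty file); on success the
 * byte count is stored in *size_out and the caller frees the buffer. */
static unsigned char *read_binary_file(const char *path, size_t *size_out)
{
    FILE *f;
    unsigned char *buf = NULL;
    long len = -1;

    assert(path != NULL && size_out != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    /* Determine the file size by seeking to the end. */
    if (fseek(f, 0, SEEK_END) == 0)
        len = ftell(f);

    if (len > 0 && fseek(f, 0, SEEK_SET) == 0) {
        buf = malloc((size_t)len);
        if (buf != NULL && fread(buf, 1, (size_t)len, f) != (size_t)len) {
            free(buf);   /* short read: treat as failure */
            buf = NULL;
        }
    }
    fclose(f);

    if (buf != NULL)
        *size_out = (size_t)len;
    return buf;
}
```

The returned buffer and size would then be passed as the binaries and lengths arguments of clCreateProgramWithBinary. Per the OpenCL 1.2 specification, clBuildProgram still has to be called on a program created from a binary before clCreateKernel can succeed.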

So any hints on what the kernel creation time really depends on, and how we could speed it up further, would be a big help, as a clCreateKernel that takes several seconds on startup can be a use-case killer here! (Is it possible to persist the state of the galcore, maybe? Or any other dirty tricks?)

Thanks in advance!

3,717 views
Bio_TICFSL
NXP TechSupport

Hi

The hint I was going to give, breaking the big kernel into smaller ones, is something you seem to have tried already. I will check internally whether we already have an open issue like this with VSI. We are aware of some performance or optimization issues with clCreateKernel, but I don't know what their status is with VSI.

Regards

3,717 views
terzibaschian
Contributor II

Hi,

Thanks for your time. We actually managed to get the overall clCreateKernel time down to ~3 seconds by splitting and restructuring the code. Unfortunately it is no longer as fast at runtime (140 ms with the long code vs. 300 ms with the shorter code), but that is the best compromise we could reach before deciding not to use the GPU at all.
For debugging purposes, I guess it is sufficient to generate a long foobar.cl kernel; it will probably give similar clCreateKernel times.

One last, loosely related question that came up:

clEnqueueRead and clEnqueueMap both seem to be blocking calls even though the OpenCL documentation says they should not be, so the implementation is basically ignoring the CL_TRUE/CL_FALSE value passed as the blocking flag.
Is this a hardware issue (the GPU has no DMA, so maybe it needs to use the CPU for that), or is it also related to the Vivante OpenCL implementation? (This again holds for both Yocto and INTEGRITY.) We could work around it by doing the read in a separate CPU thread; I am just interested in whether this might be a bug.
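The separate-thread workaround could look roughly like the sketch below, assuming POSIX threads (link with -pthread). The names async_job, async_job_start, async_job_wait and fake_blocking_read are hypothetical. The stand-in function only fills a buffer; in the real program it would wrap the read call that turns out to block. The OpenCL specification describes API calls other than clSetKernelArg as thread-safe, but how the Vivante driver behaves with this pattern is an assumption I have not verified.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical job descriptor: a blocking function, its argument,
 * a slot for its return code, and the worker thread handle. */
typedef struct {
    int (*blocking_fn)(void *);
    void *arg;
    int result;
    pthread_t thread;
} async_job;

/* Thread entry point: runs the blocking function and stores its code. */
static void *async_job_main(void *p)
{
    async_job *job = p;
    job->result = job->blocking_fn(job->arg);
    return NULL;
}

/* Starts fn(arg) on a worker thread; returns 0 on success. */
static int async_job_start(async_job *job, int (*fn)(void *), void *arg)
{
    assert(job != NULL && fn != NULL);
    job->blocking_fn = fn;
    job->arg = arg;
    job->result = -1;
    return pthread_create(&job->thread, NULL, async_job_main, job);
}

/* Waits for the worker to finish and returns the code from fn. */
static int async_job_wait(async_job *job)
{
    pthread_join(job->thread, NULL);
    return job->result;
}

/* Stand-in for the blocking read: fills a 4-element int buffer. */
static int fake_blocking_read(void *arg)
{
    int *dst = arg;
    int i;
    for (i = 0; i < 4; ++i)
        dst[i] = i * 10;
    return 0;
}
```

The main thread calls async_job_start, continues with other work, and calls async_job_wait only when it needs the data. The buffer must not be touched between those two calls.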

Best regards

3,717 views
Bio_TICFSL
NXP TechSupport

Hi Arvid,

Which BSP are you using? In our OpenCL tests on imx8qxp, clCreateKernel takes 6 ms for a simple copy kernel. clCreateKernel time can increase considerably depending on the kernel size and on whether it is optimized or not. Can you share the kernel so I can take a look?

Regards
