Sporadic delay when using GPU with OpenCL


1,023 Views
peter_eberl
Contributor I

I have an imx8mqevk board and I am developing GPU processing applications with OpenCL.

Sporadically I see big delays when executing kernels on the GPU.

The problem is also visible with the gtec-demo-framework. Running the FastFourierTransform DemoApp, in some runs I get values like these:

Kernel execution time on GPU (kernel 0): 0.000003 seconds
Kernel execution time on GPU (kernel 1): 0.000615 seconds
Kernel execution time on GPU (kernel 2): 0.000002 seconds
Kernel execution time on GPU (kernel 3): 0.000002 seconds
Total Kernel execution time on GPU: 0.000622 seconds

I would expect values like these:

Kernel execution time on GPU (kernel 0): 0.000003 seconds
Kernel execution time on GPU (kernel 1): 0.000001 seconds
Kernel execution time on GPU (kernel 2): 0.000002 seconds
Kernel execution time on GPU (kernel 3): 0.000002 seconds
Total Kernel execution time on GPU: 0.000008 seconds

I'm using:

repo init -u https://source.codeaurora.org/external/imx/imx-manifest -b imx-linux-sumo -mimx-4.14.98-2.0.0_ga.xml

DISTRO=fsl-imx-xwayland MACHINE=imx8mqevk source fsl-setup-release.sh -b build-xwayland

Are there any problems within the imx-gpu-viv driver?

Or are there any other limitations?

6 Replies

771 Views
Bio_TICFSL
NXP TechSupport

Hi, what is /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor on your board?

 

When I set the scaling_governor to performance or ondemand on my imx8qm board, like this:

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

The execution time is quite consistent. 

 

The default scaling_governor on my imx8qm board is schedutil. With it, over many executions I do see big spikes in execution time, from 1400 µs up to 2600 µs, but most of the time the values are consistent around 1500 µs. I think this is caused by the CPU speed, not the GPU; the performance governor ensures a consistently high CPU clock for the CPU/GPU interactions.

Regards


771 Views
Bio_TICFSL
NXP TechSupport

Also, please add the following line:

 

    clFinish (commandQueue);

 

at clutil.cpp line 498 (the end of the runKernelFFT function), before the closing statement. It will make the results more consistent (see the sketch after the timings below). In my case I don't have the MQ board to test, but I tested on an 8QXP with FFT length = 32:

 

Kernel execution time on GPU (kernel 0) : 0.000546 seconds
Kernel execution time on GPU (kernel 1) : 0.000655 seconds
Kernel execution time on GPU (kernel 2) : 0.000177 seconds
Kernel execution time on GPU (kernel 3) : 0.000650 seconds
Kernel execution time on GPU (kernel 4) : 0.000623 seconds
Total Kernel execution time on GPU : 0.002651 seconds

 

Kernel execution time on GPU (kernel 0) : 0.000544 seconds
Kernel execution time on GPU (kernel 1) : 0.000657 seconds
Kernel execution time on GPU (kernel 2) : 0.000176 seconds
Kernel execution time on GPU (kernel 3) : 0.000640 seconds
Kernel execution time on GPU (kernel 4) : 0.000621 seconds
Total Kernel execution time on GPU : 0.002638 seconds

 

Kernel execution time on GPU (kernel 0) : 0.000509 seconds
Kernel execution time on GPU (kernel 1) : 0.000644 seconds
Kernel execution time on GPU (kernel 2) : 0.000173 seconds
Kernel execution time on GPU (kernel 3) : 0.000628 seconds
Kernel execution time on GPU (kernel 4) : 0.000620 seconds
Total Kernel execution time on GPU : 0.002574 seconds

 

Kernel execution time on GPU (kernel 0) : 0.000541 seconds
Kernel execution time on GPU (kernel 1) : 0.000634 seconds
Kernel execution time on GPU (kernel 2) : 0.000180 seconds
Kernel execution time on GPU (kernel 3) : 0.000640 seconds
Kernel execution time on GPU (kernel 4) : 0.000621 seconds
Total Kernel execution time on GPU : 0.002616 seconds

 

Kernel execution time on GPU (kernel 0) : 0.000541 seconds
Kernel execution time on GPU (kernel 1) : 0.000643 seconds
Kernel execution time on GPU (kernel 2) : 0.000183 seconds
Kernel execution time on GPU (kernel 3) : 0.000631 seconds
Kernel execution time on GPU (kernel 4) : 0.000617 seconds
Total Kernel execution time on GPU : 0.002615 seconds

 

Kernel execution time on GPU (kernel 0) : 0.000540 seconds
Kernel execution time on GPU (kernel 1) : 0.000660 seconds
Kernel execution time on GPU (kernel 2) : 0.000180 seconds
Kernel execution time on GPU (kernel 3) : 0.000653 seconds
Kernel execution time on GPU (kernel 4) : 0.000627 seconds
Total Kernel execution time on GPU : 0.002660 seconds
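
For clarity, the placement I mean is roughly this; the surrounding lines are only an illustrative sketch, not the actual clutil.cpp source:

/* Sketch only: simplified end of runKernelFFT(); what matters is the added clFinish(). */
void runKernelFFT(cl_command_queue commandQueue, cl_kernel kernel,
                  size_t global, size_t local)
{
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,
                           &global, &local, 0, NULL, NULL);
    /* ... existing per-kernel work and error handling ... */

    clFinish(commandQueue);   /* added: block until the GPU has drained the queue */
}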

Regards


771 Views
peter_eberl
Contributor I

Thank you for your response.

I didn't change any input parameters for now.

If the execution time result were a signal, we could decompose it into 3 major components:

  1. Consistent execution time: ~10µs ± 5µs
  2. Small spikes: ~100µs ± 50µs
  3. Big spikes: ~2000µs ± 1000µs

When using clFinish in the following manner:

clFinish(..);
gettimeofday(&start, NULL);
err = clEnqueueNDRangeKernel(..., &hEvent);
clFinish(..);
if (err != 0) { /* ..error check.. */ }
gettimeofday(&end, NULL);

Using clFinish(..) mostly eliminates the small spikes, those around ~100 µs.

What remains is a mostly consistent signal with ~15% big spikes at ~1000-1500 µs.

[attached image: pastedImage_1.png]

Measuring exec. time with your "direct" measurement approach (in contrast to the profiling information via the cl_event "hEvent") results in a similar picture of exec. times, but with a higher percentage of "big spikes" and an offset of all values by around ~500 µs.
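
For reference, this is roughly how I read the two numbers for the same enqueue; a simplified sketch, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE (the names here are illustrative, not my exact test code):

#include <stdio.h>
#include <sys/time.h>
#include <CL/cl.h>

/* Sketch: time one kernel both with gettimeofday() around clFinish()
 * ("direct") and with the event profiling info (device timestamps). */
static void time_one_kernel(cl_command_queue queue, cl_kernel kernel,
                            size_t global, size_t local)
{
    cl_event hEvent;
    struct timeval start, end;

    clFinish(queue);                              /* start from an empty queue */
    gettimeofday(&start, NULL);
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global, &local, 0, NULL, &hEvent);
    clFinish(queue);                              /* wait for the kernel to finish */
    gettimeofday(&end, NULL);
    if (err != CL_SUCCESS)
        return;

    /* "direct" wall-clock time, includes enqueue/dispatch overhead */
    double wall_us = (end.tv_sec - start.tv_sec) * 1e6
                   + (end.tv_usec - start.tv_usec);

    /* device-side time from the event profiling info (nanoseconds) */
    cl_ulong t_start = 0, t_end = 0;
    clGetEventProfilingInfo(hEvent, CL_PROFILING_COMMAND_START,
                            sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(hEvent, CL_PROFILING_COMMAND_END,
                            sizeof(t_end), &t_end, NULL);
    double gpu_us = (t_end - t_start) / 1000.0;

    printf("wall: %.1f us, gpu: %.1f us\n", wall_us, gpu_us);
    clReleaseEvent(hEvent);
}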

I will probably ignore the spikes for now since the max. exec. time is "only" 1.2 ms, but if it stacks up with more complex functions I will need to find additional solutions.


771 Views
peter_eberl
Contributor I

Bio_TICFSL

[attached image: pastedImage_5.png]

The graph shows the kernel execution times of a simple vector-vector addition with vector size 1000.

The addition was executed 1000 times.

My problem is the variance of values:

Min. exec. time: 2e-6 seconds
Max. exec. time: 9.61e-4 seconds
Median: 9e-5 seconds

Even with 10e4 executions, the max. exec. times vary up to around 2 ms.

The execution values are extracted through the clGetEventProfilingInfo() method (end - start timestamps).
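
Roughly, the loop that produced these values looks like this; a simplified sketch with buffer and kernel-argument setup omitted, and with illustrative names (queue, addKernel):

/* Sketch of the measurement loop: 1000 runs of the vector addition,
 * reading the device timestamps of each run from the kernel event. */
enum { RUNS = 1000 };
double gpu_us[RUNS];
size_t global = 1000;                     /* vector length */

for (int i = 0; i < RUNS; ++i) {
    cl_event ev;
    cl_int err = clEnqueueNDRangeKernel(queue, addKernel, 1, NULL,
                                        &global, NULL, 0, NULL, &ev);
    if (err != CL_SUCCESS)
        break;
    clWaitForEvents(1, &ev);              /* make sure this run has completed */

    cl_ulong t0 = 0, t1 = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
    gpu_us[i] = (t1 - t0) / 1000.0;       /* end - start, in microseconds */

    clReleaseEvent(ev);
}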

Questions:

What causes this extreme variance of ~500%?

Is Yocto Linux a real-time operating system? If not, could interrupts be a cause of the enqueueing delays?

Am I measuring the timestamps correctly?


771 Views
Bio_TICFSL
NXP TechSupport

Hello,

Are you changing any of the arguments passed to the CL kernel for each kernel execution? Are the kernel parameters the same?

What do you get if you do not wait for the event when running the CL kernel, like this:

  gettimeofday(&start, NULL);
  ret = clEnqueueNDRangeKernel(cq, kernel, dimension, NULL, global, local, 0, NULL, NULL);
  if (ret == CL_SUCCESS)
  {
    printf("\nReading data from GPU memory = ..\n");
  }
  else
  {
    printf("\nKernel failed = ..\n");
  }
  // Should the barrier be here?
  clFinish(cq);
  gettimeofday(&end, NULL);

  // compute and print the elapsed time in milliseconds
  seconds  = end.tv_sec  - start.tv_sec;
  useconds = end.tv_usec - start.tv_usec;
  mtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
  printf("\n CL code = %ld ms\n", mtime);

Regards


771 Views
Bio_TICFSL
NXP TechSupport

Hello Peter,

I don't know why you are seeing these limitations, but try running the code again. On my i.MX8M, I get 0.000008 s.

Regards
