Hi community,
I have a question about i.MX6DQ OpenGL rendering with Yocto BSP (L3.10.17_1.0.0-ga).
The rendering speed is not sufficient at 1920x1080 resolution, although it is fine at XGA (1024x768).
Could you give me some advice on how to improve OpenGL rendering speed, or on how to investigate what is wrong?
Best Regards,
Satoshi Shimoda
I found the following OpenGL ES tips and tricks and general CPU optimization considerations:
1. General Tips for i.MX6:
Minimize state changes
Use compressed textures
Batch your calls as much as possible
Avoid glFinish.
Use Triangle Strip
Consider glDrawElements instead of glDrawArrays
Use VBOs instead of re-submitting vertices
Prevent uploads (VBOs, TexImage2D, etc.)
Optimize your shaders
reduce branching (if-else)
keep the code simple
avoid using functions
some math calls are costly
2. Other Graphics Tips and Tricks:
Keep an eye on your CPU load
In Linux, use ‘top’
Keep an eye on the Memory Bandwidth
In Linux, use ‘mmdc’ profiling tool
High CPU/BW load can be indicative of bad API usage
You should try to avoid data ‘uploads’ (either texture or vertex) on a frame-by-frame basis
Use VBOs instead of arrays in your draw calls
Use DirectVIV / PBOs / EGL Images instead of teximage data
3. CPU Optimizations.
3.1.
Before measuring any performance on ARM CPU(s), it is recommended to disable the dynamic frequency scaling:
# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Otherwise, whenever your computing threads are sleeping or waiting for an interrupt,
power management may switch to a lower CPU frequency, which degrades your performance.
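Before a benchmark run it can help to confirm that the governor setting actually took effect. The following is a minimal sketch, not part of any BSP: the helper name read_governor is hypothetical, and it simply reads whatever sysfs file path it is given.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the governor name stored in the file at `path`
 * (for example the cpu0 scaling_governor file shown above) into `out`.
 * The trailing newline is stripped.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
int read_governor(const char *path, char *out, size_t out_len)
{
    FILE *f;

    assert(path != NULL && out != NULL && out_len > 0);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    if (fgets(out, (int)out_len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* Cut the string at the first newline, if any. */
    out[strcspn(out, "\n")] = '\0';
    return 0;
}
```

A benchmark could call read_governor() on the scaling_governor path at startup and print a warning if the result is not "performance".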
3.2.
The ARM cores in i.MX6 have Neon units that can run SIMD instructions (Single Instruction Multiple Data).
Basic steps to enable Neon in gcc.
In compile flags:
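CFLAGS = -mfpu=neon -O3
This lets the compiler auto-vectorize the code using NEON.
More flags: -ffast-math, -funsafe-math-optimizations, -funsafe-loop-optimizations
(On 32-bit ARM, gcc only vectorizes floating-point loops with NEON when -funsafe-math-optimizations or -ffast-math is given, because NEON is not fully IEEE 754 compliant.)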
Best practices to help the compiler vectorize the data in your loops:
Have: countable loops, independent and contiguous data accesses
Ex: gather data in a struct of arrays, rather than an array of structs.
Avoid: break/continue, if-else, and manually unrolling loops.
Use C intrinsics
C function call interface to NEON operations
Supports all data types and operations supported by NEON
Full list http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
More info at http://armneon.blogspot.com/
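As an illustration of the two approaches above, here is a minimal sketch of the same loop written once for the auto-vectorizer and once with NEON intrinsics. The function names saxpy_auto and saxpy_neon are made up for this example. The intrinsics path is only compiled when the compiler defines __ARM_NEON (or __ARM_NEON__); elsewhere the scalar loop handles everything.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* y[i] = a * x[i] + y[i] over contiguous arrays.
 * Countable loop, independent iterations, no branches: a good candidate
 * for auto-vectorization. `restrict` tells the compiler x and y do not alias. */
void saxpy_auto(size_t n, float a, const float *restrict x, float *restrict y)
{
    size_t i;

    assert(n == 0 || (x != NULL && y != NULL));
    for (i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Same computation with explicit NEON intrinsics, 4 floats per iteration. */
void saxpy_neon(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;

    assert(n == 0 || (x != NULL && y != NULL));
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);   /* load 4 floats from x */
        float32x4_t vy = vld1q_f32(y + i);   /* load 4 floats from y */
        vy = vmlaq_n_f32(vy, vx, a);         /* vy = vy + vx * a */
        vst1q_f32(y + i, vy);                /* store 4 results to y */
    }
#endif
    /* Scalar tail (or the whole array when NEON is not available). */
    for (; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Comparing the two on the target (with the governor fixed as in 3.1) shows whether the intrinsics version is worth keeping over the compiler-generated one.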
3.3.
The ARM cores in i.MX6 Dual and Quad can run your algorithm in parallel by using OpenMP compiler directives.
Basic steps to enable OpenMP with gcc.
In compile flags:
CFLAGS = -fopenmp (also pass -fopenmp in LDFLAGS so that libgomp is linked)
Install on target the library libgomp.so
In source file:
#include <omp.h>
Put your code to parallelize into {}
Just before the {}, add:
#pragma omp parallel
If your code is a loop, just before the loop add:
#pragma omp parallel for
Disambiguate variable visibility across threads:
shared(var1,var2,…) private(var3,var4,…)
More info at http://openmp.org/wp/resources/#Tutorials
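The steps above can be sketched as follows. This is only an illustrative example: the function scale_pixels and its parameters are invented here, and the loop stands in for whatever per-element work your application does. When compiled without -fopenmp the pragma is ignored and the loop runs on one core.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _OPENMP
#include <omp.h>
#endif

/* Multiply every 8-bit pixel by gain_percent / 100 and clamp to 255.
 * Each iteration touches only its own element, so the loop can be split
 * across the cores with "parallel for". */
void scale_pixels(const unsigned char *src, unsigned char *dst,
                  size_t n, int gain_percent)
{
    long i;  /* loop counter: made private per thread by "parallel for" */
    int v;   /* scratch value: must be private so threads do not share it */

    assert(n == 0 || (src != NULL && dst != NULL));

    #pragma omp parallel for shared(src, dst, n, gain_percent) private(v)
    for (i = 0; i < (long)n; ++i) {
        v = src[i] * gain_percent / 100;
        dst[i] = (unsigned char)(v > 255 ? 255 : v);
    }
}
```

Declaring v inside the loop body would make it private automatically; it is kept outside here only to show the private() clause in use.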
Have a great day,
Yuri
-----------------------------------------------------------------------------------------------------------------------
Note: If this post answers your question, please click the Correct Answer button. Thank you!
-----------------------------------------------------------------------------------------------------------------------
Performance depends on many factors. One thing I would recommend is using the eglImage extension.
With conventional images and textures, each update involves a copy operation, which reduces performance. I suggest you try the eglImage extension. Note that the recipes in meta-browser now contain packageconfigs to enable EGL support, so you don't need to pass this parameter.
Have a great day,