Judging by the G2D API for the 2D GPU on the i.MX8 family of processors (specifically, those with the GC520L GPU), the hardware should be capable of performing interleaving in a cache-efficient (bandwidth-efficient) manner.
If I have 4 arrays of 128K int16 values each, and I wish to interleave them to produce one array of 128K groups of four int16 values (512K values in total), the 2D GPU blitter seems perfect: set up the problem with a 16-bit pixel format such as BGR565, treat the four source arrays as 1-wide, 128K-high images, and treat the output array as a 4-wide, 128K-high image (actually, due to the 32K-by-32K raster size limitation, we would have to split the problem into 4 parts, but the work could still be done off-CPU). However, the documentation IMXGRAPHICUG.pdf for g2d_multi_blit says "The key restriction is that the destination rectangle cannot be set, which means that the destination rectangle must be the same as the source rectangle."
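For concreteness, this is the transform I want to offload, written as a plain-C CPU reference (this is only a specification of the desired result, not the G2D call sequence):

```c
#include <stdint.h>
#include <stddef.h>

/* CPU reference for the desired one-pass transform: interleave four
 * n-element int16 arrays into one array of n groups of four int16s,
 * so that dst[4*i + c] == src[c][i]. This is what a 16-bpp blit of
 * four 1-wide, n-high source "images" into a 4-wide, n-high
 * destination "image" would produce. */
void interleave4(const int16_t *const src[4], size_t n, int16_t *dst)
{
    for (size_t i = 0; i < n; i++)
        for (int c = 0; c < 4; c++)
            dst[i * 4 + (size_t)c] = src[c][i];
}
```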
This seems to rule out that approach. An alternative - writing rows and then rotating - would take two passes, because the documentation also states that the destination surface cannot have a non-zero rotation for g2d_multi_blit. We would have to use g2d_multi_blit to assemble the four arrays into the rows of a 4-high, 128K-wide image, and then use g2d_blit to rotate that image by 90 degrees (again splitting into 4 parts to respect the 32K raster size limit). This method does keep the work off the CPU, but it needlessly uses twice as much memory bandwidth.
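As a sanity check on the two-pass idea (again plain C, not G2D calls): a 90-degree rotation reverses the channel order, so the sources would have to be packed into rows in reverse (or the rotation direction chosen to match). This sketch models the geometry of the two passes under that assumption:

```c
#include <stdint.h>
#include <stddef.h>

/* Pass 1 (the role g2d_multi_blit would play): pack source array c
 * into row (3 - c) of a 4-high, n-wide image. The row reversal
 * compensates for the clockwise rotation in pass 2. */
void pack_rows(const int16_t *const src[4], size_t n, int16_t *img)
{
    for (int c = 0; c < 4; c++)
        for (size_t x = 0; x < n; x++)
            img[(size_t)(3 - c) * n + x] = src[c][x];
}

/* Pass 2 (the role a rotating g2d_blit would play): rotate the
 * 4-high, n-wide image 90 degrees clockwise into an n-high, 4-wide
 * image: dst[y][x] = img[3 - x][y]. */
void rotate_cw(const int16_t *img, size_t n, int16_t *dst)
{
    for (size_t y = 0; y < n; y++)
        for (int x = 0; x < 4; x++)
            dst[y * 4 + (size_t)x] = img[(size_t)(3 - x) * n + y];
}
```

After both passes, dst[4*i + c] == src[c][i], i.e. the same interleaved layout as the one-pass version, at the cost of writing and re-reading the intermediate image.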
Is there a better way?