We are currently in the evaluation stage of development for our new image processing hardware and are testing different CPU architectures, with one possible candidate being the Freescale LS2xxx chips. Because integer SIMD throughput with hand-crafted assembly is a crucial factor for the workload we typically see, we have benchmarked the Cortex-A57 to measure its peak throughput.
The Cortex-A57 Optimization Guide states that most integer instructions operating on 128-bit vector data can be dual-issued (Page 24, integer basic F0/F1, logical F0/F1, execution throughput 2).
However, in our internal (synthetic) benchmarks, throughput seems to be limited to exactly one 128-bit NEON integer instruction per clock, even when plenty of instruction-level parallelism is available (the benchmark was written specifically to test whether 128-bit NEON instructions can be dual-issued, so we took care to provide it). When mixing 50% 128-bit with 50% 64-bit instructions, we were able to achieve 1.25 instructions per clock.
The benchmark is NEON/ASIMD only, does not measure loads/stores, and does not mix ASIMD with conventional (scalar) instructions.
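To illustrate the kind of test meant here, a benchmark of this shape can be sketched in C roughly as follows. This is a simplified illustration, not the hand-written assembly itself. It assumes GCC or Clang (vector extensions and inline asm), and the names `v4u32`, `KEEP`, `run_independent_adds` and `adds_per_ns` are made up for this sketch. Eight independent accumulators give the core enough parallel 128-bit integer adds that a dual-issue pipeline should never stall on a dependency.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* 128-bit vector of four 32-bit lanes (GCC/Clang vector extension). */
typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* Register constraint for the empty asm barrier below:
   "w" = AArch64 SIMD register; "x" = x86 SSE register (assumption:
   only there so the sketch also compiles on an x86-64 host). */
#if defined(__aarch64__)
#define VREG "w"
#else
#define VREG "x"
#endif

/* Empty asm that forces v into a vector register and hides its value
   from the optimizer, so each add stays a real instruction in the loop. */
#define KEEP(v) __asm__ volatile("" : "+" VREG(v))

/* Runs `iters` iterations of eight independent 128-bit integer adds.
   Returns lane 0 of the summed accumulators so the work is observable. */
uint32_t run_independent_adds(uint64_t iters)
{
    const v4u32 one = {1, 1, 1, 1};
    v4u32 a0 = {0, 0, 0, 0};
    v4u32 a1 = a0, a2 = a0, a3 = a0, a4 = a0, a5 = a0, a6 = a0, a7 = a0;

    for (uint64_t i = 0; i < iters; i++) {
        /* No add depends on another add in the same iteration. */
        a0 += one; KEEP(a0);
        a1 += one; KEEP(a1);
        a2 += one; KEEP(a2);
        a3 += one; KEEP(a3);
        a4 += one; KEEP(a4);
        a5 += one; KEEP(a5);
        a6 += one; KEEP(a6);
        a7 += one; KEEP(a7);
    }

    v4u32 sum = a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7;
    return sum[0];
}

/* Vector adds per nanosecond of wall-clock time. Dividing the result by
   the core clock in GHz gives adds per clock. */
double adds_per_ns(uint64_t iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile uint32_t sink = run_independent_adds(iters);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return (8.0 * (double)iters) / ns;
}
```

Calling `adds_per_ns` with a large iteration count (for example 100 million) and dividing by the clock frequency in GHz gives an estimate of 128-bit adds per clock. The loop counter and branch add some overhead, so the disassembly should be checked to confirm the compiler emitted the intended `add vN.4s` sequence. A cycle counter would be more precise than wall-clock time.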
Are there special measures which have to be taken in order to get dual-issue throughput when using 128-bit ASIMD/Neon instructions?
Thank you in advance, Clemens