We are currently in the evaluation stage of hardware development for our new image processing hardware and are testing different CPU architectures, with Freescale LS2xxx chips being one possible candidate. Because (integer) SIMD throughput with hand-crafted assembly is a crucial factor for the workload we typically see, we have benchmarked the Cortex-A57 to measure its peak throughput.
The Cortex-A57 Optimization Guide states that most integer instructions operating on 128-bit vector data can be dual-issued (Page 24, integer basic F0/F1, logical F0/F1, execution throughput 2).
However, in our internal (synthetic) benchmarks, throughput seems to be limited to exactly one 128-bit NEON integer instruction per clock, even when there is plenty of instruction-level parallelism available (the benchmark was written specifically to test whether 128-bit NEON instructions can be dual-issued, so we took care to ensure this). When mixing 50% 128-bit with 50% 64-bit instructions, we were able to achieve 1.25 instructions per clock.
The benchmark is NEON/ASIMD only, does not measure loads/stores, and does not mix ASIMD with conventional instructions.
Are there special measures which have to be taken in order to get dual-issue throughput when using 128-bit ASIMD/Neon instructions?
Thank you in advance, Clemens
Update: With the benchmark code below, I get an IPC of roughly 1.4-1.5. All instructions used are listed with an execution throughput of 2.
Making the loop shorter or longer doesn't change the throughput. Any idea what is going wrong here?
.loop:
vadd.i32 q2, q3, q2
vadd.i32 q3, q4, q3
vadd.i32 q4, q5, q4
vadd.i32 q5, q6, q5
vadd.i32 q6, q7, q6
vadd.i32 q7, q8, q7
vadd.i32 q8, q9, q8
vadd.i32 q9, q10, q9
vadd.i32 q10, q11, q10
vadd.i32 q11, q12, q11
vadd.i32 q12, q13, q12
vadd.i32 q13, q14, q13
vadd.i32 q14, q15, q14
vadd.i32 q15, q1, q15
vand q2, q3, q2
vand q3, q4, q3
vand q4, q5, q4
vand q5, q6, q5
vand q6, q7, q6
vand q7, q8, q7
vand q8, q9, q8
vand q9, q10, q9
vand q10, q11, q10
vand q11, q12, q11
vand q12, q13, q12
vand q13, q14, q13
vand q14, q15, q14
vand q15, q1, q15
subs r0, r0, #1
bne .loop
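For reference, the IPC figure quoted above can be related to cycles per loop iteration with a little arithmetic. The following is only a sketch: the function names and the example cycle count are hypothetical, and the bounds assume that subs and bne issue on other pipelines and cost no extra cycles.

```python
# Instruction counts taken from the loop above.
NEON_PER_ITER = 28    # 14 vadd.i32 + 14 vand
TOTAL_PER_ITER = 30   # plus subs and bne


def ipc(instructions_per_iter, iterations, cycles):
    """Instructions per clock from a measured cycle count."""
    return instructions_per_iter * iterations / cycles


def cycles_per_iter(measured_ipc, instructions_per_iter=TOTAL_PER_ITER):
    """Cycles one loop iteration takes at a given IPC."""
    return instructions_per_iter / measured_ipc


def bound_ipc(neon_per_cycle):
    """IPC if the loop were limited only by NEON issue rate.

    Assumption: subs/bne overlap completely with the NEON instructions.
    """
    return TOTAL_PER_ITER / (NEON_PER_ITER / neon_per_cycle)


if __name__ == "__main__":
    # Hypothetical measurement: 16M iterations in 320M cycles.
    print("example IPC:", ipc(TOTAL_PER_ITER, 16_000_000, 320_000_000))
    print("single-issue bound:", round(bound_ipc(1), 2))
    print("dual-issue bound:", round(bound_ipc(2), 2))
    print("cycles/iter at IPC 1.4:", round(cycles_per_iter(1.4), 2))
    print("cycles/iter at IPC 1.5:", round(cycles_per_iter(1.5), 2))
```

Under these assumptions, an IPC of 1.4-1.5 corresponds to roughly 20 to 21.4 cycles per iteration, which lies between the 14 cycles expected with full dual issue and the 28 cycles expected with single issue.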
A possible cause of the issue is instruction fetch overhead. To observe the dual-issue throughput properly, the test code and data need to be completely loaded and locked in a single L1 cache line, whose size is 64 bytes on the Cortex-A57. Please try it.
Have a great day,
Artur
Hello Artur,
Thanks for your answer. The loop is executed 16M times, so the code should be in L1I - and there are no loads/stores, just register operations.
Br, Clemens
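As a rough check of the code footprint discussed in the two posts above, the loop size can be compared against the cache line and L1I sizes. This is a sketch under stated assumptions: A32 encoding with fixed 4-byte instructions, and a 48 KiB L1 instruction cache for the Cortex-A57. The function name and the start offsets are hypothetical.

```python
import math

INSTR_BYTES = 4          # assumption: A32 encoding, 4 bytes per instruction
CACHE_LINE_BYTES = 64    # line size quoted in the reply above
L1I_BYTES = 48 * 1024    # assumption: 48 KiB L1 instruction cache


def loop_footprint(n_instructions, start_offset=0):
    """Return (size in bytes, cache lines touched) for a straight-line loop.

    start_offset is the byte offset of the first instruction within its
    cache line (0 means the loop starts on a line boundary).
    """
    size = n_instructions * INSTR_BYTES
    span = start_offset % CACHE_LINE_BYTES + size
    return size, math.ceil(span / CACHE_LINE_BYTES)


if __name__ == "__main__":
    size, lines_aligned = loop_footprint(30)                  # 28 NEON + subs + bne
    _, lines_unaligned = loop_footprint(30, start_offset=60)  # hypothetical misalignment
    print("loop size in bytes:", size)
    print("lines touched (aligned):", lines_aligned)
    print("lines touched (offset 60):", lines_unaligned)
    print("fits in L1I:", size <= L1I_BYTES)
```

With these assumptions the 30-instruction loop occupies 120 bytes, so it fits easily in L1I but spans at least two 64-byte lines, or three if its start is not suitably aligned.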