Can Cortex-A57 dual-issue 128-bit neon instructions?

clemenseisserer · ‎12-01-2015

We are currently in the evaluation stage of hardware development for our new image processing hardware and are testing different CPU architectures, with one possible candidate being Freescale LS2xxx chips. Because (integer) SIMD throughput with hand-crafted assembly is a crucial factor for the workload we typically see, we've tried to benchmark the Cortex-A57 in order to measure its peak throughput.

The Cortex-A57 Optimization Guide states that most integer instructions operating on 128-bit vector data can be dual-issued (Page 24, integer basic F0/F1, logical F0/F1, execution throughput 2).

However, with our internal (synthetic) benchmarks, throughput seems to be limited to exactly one 128-bit NEON integer instruction per clock, even when there is plenty of instruction-level parallelism available (the benchmark was written specifically to test whether 128-bit NEON instructions can be dual-issued, so we took care to ensure this). When mixing 50% 128-bit with 50% 64-bit instructions, we were able to achieve 1.25 instructions per clock.

The benchmark is neon/asimd only, does not measure load/stores and does not mix/match asimd with conventional instructions.

Are there special measures which have to be taken in order to get dual-issue throughput when using 128-bit ASIMD/Neon instructions?

Thank you in advance, Clemens

clemenseisserer · ‎12-02-2015

Update: With the benchmark code below, I get an IPC of roughly 1.4-1.5. All instructions used are listed with an execution throughput of 2.

Making the loop shorter or longer doesn't change the throughput. Any idea what is going wrong here?

.loop:
    @ 14 adds: each reads its own register and the next higher one
    vadd.i32 q2, q3, q2
    vadd.i32 q3, q4, q3
    vadd.i32 q4, q5, q4
    vadd.i32 q5, q6, q5
    vadd.i32 q6, q7, q6
    vadd.i32 q7, q8, q7
    vadd.i32 q8, q9, q8
    vadd.i32 q9, q10, q9
    vadd.i32 q10, q11, q10
    vadd.i32 q11, q12, q11
    vadd.i32 q12, q13, q12
    vadd.i32 q13, q14, q13
    vadd.i32 q14, q15, q14
    vadd.i32 q15, q1, q15
    @ 14 logical ANDs with the same register pattern
    vand q2, q3, q2
    vand q3, q4, q3
    vand q4, q5, q4
    vand q5, q6, q5
    vand q6, q7, q6
    vand q7, q8, q7
    vand q8, q9, q8
    vand q9, q10, q9
    vand q10, q11, q10
    vand q11, q12, q11
    vand q12, q13, q12
    vand q13, q14, q13
    vand q14, q15, q14
    vand q15, q1, q15
    @ r0 is the iteration counter
    subs r0, r0, #1
    bne .loop
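As a rough cross-check of the "plenty of parallelism" claim, the register dependencies of this loop can be fed through a toy scheduler. The sketch below is not a model of the real Cortex-A57: it assumes an idealised out-of-order core with perfect register renaming, an unlimited instruction window, a fixed result latency of 3 cycles (an assumed value, passed as a parameter) and either one or two NEON pipes. The `subs`/`bne` pair is left out, and the names `build_loop` and `simulate` are made up for this illustration.

```python
def build_loop():
    """Recreate the 28 NEON instructions of the posted loop as (op, dst, src1, src2)."""
    body = []
    for op in ("vadd.i32", "vand"):
        for n in range(2, 16):
            # Each instruction reads the next higher register; q15 reads q1 instead.
            src = n + 1 if n < 15 else 1
            body.append((op, f"q{n}", f"q{src}", f"q{n}"))
    return body


def simulate(body, iterations, latency, pipes):
    """Issue each instruction at the earliest cycle where both sources are
    ready and a pipe is free. Only true (read-after-write) dependencies are
    tracked, which corresponds to perfect renaming (an assumption)."""
    ready = {}    # register -> cycle at which its newest value is available
    issued = {}   # cycle -> number of instructions issued in that cycle
    cycles = 0
    for _ in range(iterations):
        for _op, dst, src1, src2 in body:
            t = max(ready.get(src1, 0), ready.get(src2, 0))
            while issued.get(t, 0) >= pipes:
                t += 1
            issued[t] = issued.get(t, 0) + 1
            ready[dst] = t + latency
            cycles = max(cycles, t + 1)
    return cycles


body = build_loop()
iterations = 100
for pipes in (1, 2):
    cycles = simulate(body, iterations, latency=3, pipes=pipes)
    ipc = len(body) * iterations / cycles
    print(f"pipes={pipes}: {cycles} cycles, IPC {ipc:.2f}")
```

Under these assumptions the model settles at 2 instructions per cycle with two pipes and 1 with a single pipe, so the data dependencies alone should not hold the loop at 1.4-1.5 IPC. It says nothing about front-end, rename or issue-queue limits, which a real core also has.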

art · ‎12-03-2015

A possible cause of the issue is instruction fetch overhead. To observe dual-issue throughput properly, the test code and data need to be completely loaded and locked in a single L1 cache line, which is 64 bytes on the Cortex-A57. Please try it.

Have a great day,
Artur


clemenseisserer · ‎12-03-2015

Hello Artur,

Thanks for your answer. The loop is executed 16M times, so the code should be in L1I - and there are no loads/stores, just register operations.
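The footprint side of this argument can be checked with a little arithmetic. The sketch below assumes fixed 4-byte A32 encodings for all 30 instructions of the loop (28 NEON instructions plus `subs` and `bne`) and uses the 64-byte line size quoted above; the start addresses are arbitrary examples, and `lines_touched` is a helper named for this illustration only.

```python
LINE_BYTES = 64    # L1 cache line size quoted in the thread
INSN_BYTES = 4     # assumption: every instruction uses a 4-byte A32 encoding
LOOP_INSNS = 30    # 14 vadd + 14 vand + subs + bne


def lines_touched(start_addr, size, line=LINE_BYTES):
    """Number of cache lines covered by `size` bytes starting at `start_addr`."""
    first = start_addr // line
    last = (start_addr + size - 1) // line
    return last - first + 1


loop_bytes = LOOP_INSNS * INSN_BYTES
print("loop size in bytes:", loop_bytes)
print("lines if line-aligned:", lines_touched(0, loop_bytes))
print("lines if starting at offset 60:", lines_touched(60, loop_bytes))
```

With these assumptions the loop body is 120 bytes, so it spans two or three 64-byte lines depending on alignment: too large for a single line, but tiny compared with any L1 instruction cache.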

Br, Clemens
