Can Cortex-A57 dual-issue 128-bit neon instructions?


1,352 Views
clemenseisserer
Contributor II

We are currently in the evaluation stage of hardware development for our new image-processing hardware and are testing different CPU architectures, with Freescale LS2xxx chips being one possible candidate. Because (integer) SIMD throughput with hand-crafted assembly is a crucial factor for our typical workload, we have tried to benchmark the Cortex-A57 to measure its peak throughput.

The Cortex-A57 Optimization Guide states that most integer instructions operating on 128-bit vector data can be dual-issued (Page 24, integer basic F0/F1, logical F0/F1, execution throughput 2).

However, in our internal (synthetic) benchmarks, throughput seems to be limited to exactly one 128-bit NEON integer instruction per clock, even when plenty of instruction-level parallelism is available (the benchmark was written specifically to test whether 128-bit NEON instructions can be dual-issued, so we took care of this). When mixing 50% 128-bit with 50% 64-bit instructions, we were able to achieve 1.25 instructions per clock.

The benchmark is NEON/ASIMD only: it does not measure loads/stores and does not mix ASIMD with conventional (non-SIMD) instructions.

Are there special measures which have to be taken in order to get dual-issue throughput when using 128-bit ASIMD/Neon instructions?

Thank you in advance, Clemens

0 Kudos
3 Replies

1,059 Views
clemenseisserer
Contributor II

Update: With the benchmark code below, I get an IPC of roughly 1.4-1.5. All instructions used are listed with an execution throughput of 2.

Making the loop shorter or longer doesn't change the throughput. Any idea what is going wrong here?

.loop:
        @ 14 128-bit integer adds; no add reads the result of an earlier add
        vadd.i32 q2, q3, q2
        vadd.i32 q3, q4, q3
        vadd.i32 q4, q5, q4
        vadd.i32 q5, q6, q5
        vadd.i32 q6, q7, q6
        vadd.i32 q7, q8, q7
        vadd.i32 q8, q9, q8
        vadd.i32 q9, q10, q9
        vadd.i32 q10, q11, q10
        vadd.i32 q11, q12, q11
        vadd.i32 q12, q13, q12
        vadd.i32 q13, q14, q13
        vadd.i32 q14, q15, q14
        vadd.i32 q15, q1, q15

        @ 14 128-bit logical ANDs; each reads results of the adds above,
        @ but no AND reads the result of an earlier AND
        vand q2, q3, q2
        vand q3, q4, q3
        vand q4, q5, q4
        vand q5, q6, q5
        vand q6, q7, q6
        vand q7, q8, q7
        vand q8, q9, q8
        vand q9, q10, q9
        vand q10, q11, q10
        vand q11, q12, q11
        vand q12, q13, q12
        vand q13, q14, q13
        vand q14, q15, q14
        vand q15, q1, q15

        @ loop counter and branch
        subs r0, r0, #1
        bne .loop
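For reference, a minimal C sketch of how an IPC figure for this loop can be derived from a cycle count. The loop body has 30 instructions per iteration (14 vadd, 14 vand, subs, bne). The function name `ipc` and the constant names are hypothetical, and how the cycle count is obtained (PMU, timer and known clock rate) is left open.

```c
#include <stdint.h>

/* Instructions per loop iteration in the listing above:
 * 14 vadd + 14 vand = 28 NEON instructions, plus subs and bne. */
enum { NEON_PER_ITER = 28, TOTAL_PER_ITER = 30 };

/* Instructions per cycle for `iterations` passes through a loop with
 * `per_iter` instructions that took `cycles` clock cycles in total.
 * Returns 0.0 for a zero cycle count instead of dividing by zero. */
static double ipc(uint64_t iterations, uint64_t per_iter, uint64_t cycles)
{
    if (cycles == 0)
        return 0.0;
    return (double)(iterations * per_iter) / (double)cycles;
}
```

As a purely illustrative input (not a measurement), `ipc(1000000, TOTAL_PER_ITER, 20000000)` evaluates to 1.5. Passing `NEON_PER_ITER` instead counts only the vadd/vand instructions, which is the number to compare against the guide's throughput of 2.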

0 Kudos

1,059 Views
art
NXP Employee

A possible cause of the issue is instruction fetch overhead. To observe dual-issue throughput properly, the test code and data need to be completely loaded and locked in a single L1 cache line, whose size is 64 bytes on the Cortex-A57. Please try it.


Have a great day,
Artur

0 Kudos

1,059 Views
clemenseisserer
Contributor II

Hello Artur,

Thanks for your answer. The loop is executed 16M times, so the code should be in L1I - and there are no loads/stores, just register operations.
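As a rough sanity check on the code footprint, here is a small C sketch. It assumes the loop is assembled as A32 code (4 bytes per instruction; a Thumb-2 build could be slightly smaller), and it uses the 64-byte line size mentioned above and the Cortex-A57's 48 KB L1 instruction cache. The helper names are hypothetical.

```c
#include <stddef.h>

enum {
    A32_INSN_BYTES   = 4,         /* assumption: loop built as A32 code */
    CACHE_LINE_BYTES = 64,        /* L1 line size quoted in this thread */
    L1I_BYTES        = 48 * 1024  /* Cortex-A57 L1 instruction cache size */
};

/* Size in bytes of a block of `insn_count` A32 instructions. */
static size_t loop_bytes(size_t insn_count)
{
    return insn_count * A32_INSN_BYTES;
}

/* Number of cache lines touched by `bytes` bytes of code that start
 * at byte address `start`. */
static size_t lines_touched(size_t start, size_t bytes)
{
    if (bytes == 0)
        return 0;
    size_t first = start / CACHE_LINE_BYTES;
    size_t last  = (start + bytes - 1) / CACHE_LINE_BYTES;
    return last - first + 1;
}
```

With 30 instructions, `loop_bytes(30)` is 120 bytes, so the loop spans two cache lines when it starts on a 64-byte boundary and three in the worst case. It cannot sit in a single line, but it is a tiny fraction of L1I.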

Br, Clemens

0 Kudos