
Neon vector stores slow depending on destination

Question asked by Philipp Sasse on Mar 14, 2017

Hello,

 

I have written a Neon-optimized box filter in assembler for my i.MX6 system. I know about the memory bandwidth limitations of this machine, but they don't explain what I am observing:

 

This part of my code (inline assembler)

 

        "loopSlide: \n\t"
        "vld1.16 {q0-q1}, [%[add]]! \n\t"
        "vld1.16 {q2-q3}, [%[add]]! \n\t"
        "vsra.u16 q6, q0, #5 \n\t"
        "vsra.u16 q7, q1, #5 \n\t"
        "vsra.u16 q8, q2, #5 \n\t"
        "vsra.u16 q9, q3, #5 \n\t"
        "vld1.16 {q0-q1}, [%[sub]]! \n\t"
        "vld1.16 {q2-q3}, [%[sub]]! \n\t"
        "vshr.u16 q0, q0, #5 \n\t"
        "vsub.u16 q6, q6, q0 \n\t"
        "vshr.u16 q1, q1, #5 \n\t"
        "vsub.u16 q7, q7, q1 \n\t"
        "vst1.16 {q6-q7}, [%[sub]]! \n\t"
        "vshr.u16 q2, q2, #5 \n\t"
        "vsub.u16 q8, q8, q2 \n\t"
        "vshr.u16 q3, q3, #5 \n\t"
        "vsub.u16 q9, q9, q3 \n\t"
        "vst1.16 {q8-q9}, [%[sub]]! \n\t"
        "add %[add], %[add], %[inc] \n\t"
        "add %[sub], %[sub], %[inc] \n\t"
        "add %[dst], %[dst], %[inc] \n\t"
        "cmp %[src], %[end] \n\t"
        "bne loopSlide \n\t"

takes 105 ms for the whole buffer, which works out to an insane 25 CPU cycles per instruction!
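
For reference, the per-lane arithmetic of one loop iteration can be written as scalar C. This is only a sketch under assumptions: the function name `slide_row` and its parameters are hypothetical, and it assumes the result is written to a separate output buffer, as the mention of `dst` suggests, even though the listing stores through `%[sub]`.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar sketch of the per-lane arithmetic in the Neon loop.
 * acc plays the role of q6..q9 (the running sums), add_row is the data
 * entering the window and sub_row the data leaving it.
 * Assumption: results go to a separate out buffer (hypothetical name). */
static void slide_row(uint16_t *acc, const uint16_t *add_row,
                      const uint16_t *sub_row, uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* vsra.u16: accumulate the incoming value shifted right by 5 */
        acc[i] = (uint16_t)(acc[i] + (add_row[i] >> 5));
        /* vshr.u16 + vsub.u16: subtract the outgoing value shifted right by 5 */
        acc[i] = (uint16_t)(acc[i] - (sub_row[i] >> 5));
        /* vst1.16: write the updated running sum */
        out[i] = acc[i];
    }
}
```

One call with `n = 32` corresponds to one pass through the loop body, since q6 to q9 together hold 32 unsigned 16-bit lanes. The arithmetic wraps modulo 2^16, as the non-saturating Neon instructions do.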

 

With only the vst instructions removed, the loop runs in 9.5 ms, which matches what I expect from the memory bandwidth.
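
The cycles-per-instruction figure can be reproduced with a small helper like the one below. It is a sketch, not the exact calculation used for the numbers in this post: the function name is hypothetical, and the core clock and iteration count must be supplied by the caller because neither is stated here.

```c
#include <stdint.h>

/* Average CPU cycles spent per executed instruction.
 * elapsed_s:      measured run time in seconds
 * clock_hz:       core clock in Hz (assumption: supplied by the caller)
 * iterations:     number of loop iterations executed
 * instr_per_iter: instructions in the loop body (23 in the listing,
 *                 21 once the two vst1.16 stores are removed) */
static double cycles_per_instruction(double elapsed_s, double clock_hz,
                                     uint64_t iterations,
                                     unsigned instr_per_iter)
{
    double total_cycles = elapsed_s * clock_hz;
    double total_instr  = (double)iterations * (double)instr_per_iter;
    return total_cycles / total_instr;
}
```

Independent of clock and buffer size, the two measurements differ by a factor of 105 / 9.5, roughly 11, so the two stores account for about ten times as much time as everything else in the loop.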

 

Then I tried exchanging `sub` and `dst` (which makes no sense algorithmically, of course) and was puzzled: 28 ms for the same number of loads and stores, with only the source and target buffers swapped.

 

Both buffers are 512-bit aligned and reside in the same memory region.
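
A 512-bit alignment corresponds to 64 bytes, which can be verified at run time with a check along these lines. The helper name `is_aligned_512bit` is hypothetical and only illustrates the test.

```c
#include <stdint.h>

/* Returns 1 if p is aligned to 64 bytes (512 bits), 0 otherwise.
 * The low six address bits must all be zero for 64-byte alignment. */
static int is_aligned_512bit(const void *p)
{
    return ((uintptr_t)p & 63u) == 0;
}
```

Calling it on each of the buffer base pointers before entering the loop confirms that the alignment claim holds for the actual allocations.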

 

Do you have any idea what could be causing this, or what I could try in order to examine it further? Thank you!
