i.MX6 EIM transfer speed when using NEON vld/vst instructions

torus1000
Contributor V

Hi,
I tried to estimate the EIM transfer rate by measuring the burst length and CS interval of a 32-bit EIM memory while the following NEON code was running.
The results are:
   VLDM (32-bit EIM mem read):  16 burst x 2,  700 ns from burst end to next burst
   VSTM (32-bit EIM mem write):  2 burst x 16,  455 ns from burst end to next burst

Could someone explain why the burst lengths and the above periods are different?
Is there any special EIM or NIC-301 setting needed for such NEON code?

---------- my code -------------

      NEONCopyPLD:
          PLD   [r1, #0x80]
          VLDM  r1!, {d0-d15}     @ EIM mem read, 128 bytes
          PLDW  [r0, #0x80]
          VSTM  r0!, {d0-d15}     @ EIM mem write, 128 bytes
          SUBS  r2, r2, #0x80
          BGT   NEONCopyPLD
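
For reference, the arithmetic behind the sizes in the comments can be checked with a short sketch (plain Python, my own illustration, not part of the original test):

```python
# Each VLDM/VSTM with {d0-d15} moves 16 double (64-bit) NEON registers,
# i.e. 128 bytes per loop iteration, matching the SUBS r2, r2, #0x80
# decrement. On a 32-bit EIM port that is 32 bus beats per direction.

D_REGS = 16          # d0-d15
BYTES_PER_D = 8      # each d register is 64 bits
BUS_WIDTH_BYTES = 4  # 32-bit EIM port

bytes_per_iter = D_REGS * BYTES_PER_D
beats_per_iter = bytes_per_iter // BUS_WIDTH_BYTES

print(bytes_per_iter)  # 128 bytes per iteration
print(beats_per_iter)  # 32 beats on the 32-bit EIM bus
```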

1 Solution
Yuri
NXP Employee

  As for timing / cycle estimations for different ARM Cortex-A9 NEON instructions, we should refer to section 3.1 (About instruction cycle timing) of the "Cortex-A9 NEON Media Processing Engine Technical Reference Manual":

"The complexity of the Cortex-A9 processor makes it impossible to guarantee precise timing information with hand calculations. The timing of an instruction is often affected by other concurrent instructions, memory system activity, and events outside the instruction flow."

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409i/Babbgjhi.html

Next, according to Table 3.3 (VFP load and store instruction timing), the timings for read / write operations with VLDM / VSTM generally do not differ. But note that data for 128-bit NEON operations are loaded from / stored to the caches. Therefore, first, the caches should be enabled, and second, the cache policy may affect the performance of cache-related operations.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388i/Babeaijd.html

  As for system bus arbitration, according to Chapter 45 [Network Interconnect Bus System (NIC-301)] of the i.MX6 DQ RM, "The NIC-301 default settings are configured by Freescale's board support package (BSP), and in most cases should not be modified by the customer. The default settings have gone through exhaustive testing during the validation of the part, and have proven to work well for the part's intended target applications. Changes to the default settings may result in a degradation in system performance."


Have a great day,
Yuri

-----------------------------------------------------------------------------------------------------------------------
Note: If this post answers your question, please click the Correct Answer button. Thank you!
-----------------------------------------------------------------------------------------------------------------------

4 Replies
igorpadykov
NXP Employee
NXP Employee

Hi torus1000,

BTW, what do you mean by "16 burst x 2"?
Is it 16 bursts of two beats each, or the opposite: 2 bursts of 16 beats each?

It may be recommended to try the test with the SDK, without an operating system, and to check with and without caches. In fact, the supported AXI transfers may be different for reads and writes, and with and without caches. You can check, for example, sect. 8.1.2 "Supported AXI transfers" in ARM document DDI0388I_cortex_a9_r4p1_trm.pdf.

Also, the NIC-301 itself can split bursts; it seems the NIC-301 cannot produce more than 16-beat bursts (WRAP16), according to sect. 2.3.2 "Downsizing data" in ARM document NIC-301 Network Interconnect r2p3.pdf. The NIC downsizes the 64-bit-wide bus from the core to the 32-bit-wide EIM bus.

Regarding the "period" between bursts: this seems not to be predictable. Since the NIC is a highly integrated structure, it can introduce latencies of hundreds of ns.

For some peripherals, such as the IPU, bus optimizations for passing big packets are provided, for example IOMUXC_GPR7 or IOMUXC_GPR4. However, for the EIM there are no such optimization settings, since it is considered a low-speed interface.
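
The downsizing arithmetic described above can be sketched roughly as follows (my own illustration; the exact splitting rules are in the NIC-301 TRM, sect. 2.3.2, and the assumption here is simply that each 64-bit beat becomes two 32-bit beats and bursts are capped at 16 beats):

```python
# Sketch of NIC-301 bus downsizing: a burst on the 64-bit core-side bus
# is re-expressed on the 32-bit EIM-side bus, and the NIC is assumed not
# to emit bursts longer than 16 beats.

def downsize(beats_64, max_beats=16):
    """Convert a 64-bit-bus burst into a list of 32-bit-bus burst lengths."""
    beats_32 = beats_64 * 2          # each 64-bit beat becomes two 32-bit beats
    bursts = []
    while beats_32 > 0:
        n = min(beats_32, max_beats)
        bursts.append(n)
        beats_32 -= n
    return bursts

# A 16-beat 64-bit burst (128 bytes) becomes two 16-beat 32-bit bursts:
print(downsize(16))   # [16, 16]
```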

sumeetdube
Contributor III

I am having the same problem. I used the VLDM and VSTM instructions:

vldm %[address], {d0-d15}    (address = fpga_address)
vstm %[address1], {d0-d15}   (address1 = local_buffer address)
(port_size = 32)

The problem I am facing is that I get a burst clock of 32 cycles on each chip select, and when I execute the above instructions I get two chip selects (16 words each).

I get data in my buffer on the 16th clock cycle of the 32 clock cycles (on each chip select), and thus I miss the starting 16 words of data.

What am I doing wrong here?

My EIM settings:

CS1GCR1: 0x31130335
CS1GCR2: 0x00000000
CS1RCR1: 0x10000000
CS1RCR2: 0x00000000
CS1WCR1: 0x01630c00
CS1WCR2: 0x00000000

torus1000
Contributor V

Dear Yuri,

Bingo! You are right.

Finally, I found the reason in sect. 8.1.2 "Supported AXI transfers", as you pointed out:

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf

INCR N (N: 1-16) for 32-bit read transfers
INCR N (N: 1-2) for 32-bit write transfers
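
These AXI limits line up with the measured patterns. A quick sketch of the arithmetic (my own illustration, not from the TRM): 128 bytes is 32 words on the 32-bit EIM bus; reads may be issued as bursts of up to 16 beats, writes only up to 2 beats.

```python
# Split a 32-word transfer into AXI INCR bursts capped at max_incr beats.
def split_words(total_words, max_incr):
    bursts = []
    while total_words > 0:
        n = min(total_words, max_incr)
        bursts.append(n)
        total_words -= n
    return bursts

print(split_words(32, 16))  # reads:  [16, 16]          -> "16 burst x 2"
print(split_words(32, 2))   # writes: sixteen 2-beat bursts -> "2 burst x 16"
```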