About NEON burst read/write by assembler instruction on i.MX6Q.

keitanagashima · ‎04-01-2014

Dear Sir or Madam,

Hi.

I would like to implement the following transfer on the i.MX6Q:

A 128-byte burst load/store using the 64-bit double-precision registers (d0-d15) of the ARM Cortex-A9, written in assembler.

[Observed Result]

Write:

(2-beat) x 16

Read:

(16-beat) x2

[Assembler Code]

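NEONCopyPLD:
    PLD   [r1, #0x80]      @ prefetch the source 128 bytes ahead
    PLDW  [r0, #0x80]      @ prefetch the destination with intent to write
    VSTM  r0!, {d0-d15}    @ store 16 D registers (128 bytes), advance r0
    SUBS  r2, r2, #0x80    @ 128 bytes handled per iteration
    BGT   NEONCopyPLD      @ loop while bytes remain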

[Question]

=Q1=

Is there an assembler instruction, other than VLDM/VSTM, that can produce a 32-beat burst transfer?

=Q2=

For the "Write" operation, the transfer was split into (2-beat) x 16.

What setting is necessary to get 16-beat burst writes?

Keita

Yuri · ‎04-06-2014

For timing and cycle estimates of the different ARM Cortex-A9 NEON instructions, we should refer to section 3.1 (About instruction cycle timing) of the “Cortex-A9 NEON Media Processing Engine Technical Reference Manual”:

“The complexity of the Cortex-A9 processor makes it impossible to guarantee precise timing information with hand calculations. The timing of an instruction is often affected by other concurrent instructions, memory system activity, and events outside the instruction flow.”

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409i/Babbgjhi.html

Next, according to Table 3.3 (VFP load and store instruction timing), the timings of read and write operations for VLDM / VSTM generally do not differ. Note, however, that data for 128-bit NEON operations is loaded from and stored to the caches. Therefore, first, the caches should be enabled and, second, the cache policy may affect the performance of cache-related operations.


Have a great day,
Yuri


keitanagashima · ‎04-09-2014

Dear Yuri,

Hello.

Thank you for your reply.

I enabled the L1/L2 caches, but the burst transfers still did not work as expected.

[Info 1]

The test performs a 32-bit burst write from LPDDR (r1) to EIM (r0).

[Info 2]

I tested the following two approaches, but in both cases the writes were issued as 2-beat bursts.

Suspecting an influence from LPDDR2, I also removed the "VLD1.64" and "LDMIA" loads.

That did not improve the result.

//Way 1
__asm__ volatile(
    "NEONCopyPLD_2:\n"
    "PLD [r1, #0xC0]\n"
    "VLD1.64 {d0,d1,d2,d3}, [r1]!\n"
    "VST1.64 {d0,d1,d2,d3}, [r0]\n"
    "SUBS r2, r2, #0x80\n"
    "BGT NEONCopyPLD_2\n"
);

//Way 2
__asm__ volatile(
    "PUSH {r3-r10}\n"
    "LDMloop2:\n"
    "PLD [r1, #0x80]\n"
    "LDMIA r1!, {r3 - r10}\n"
    "STMIA r0, {r3 - r10}\n"
    "SUBS r2, r2, #32\n" /* 8 registers (r3-r10) * 4 bytes (32 bit) = 32 bytes */
    "BGE LDMloop2\n"
    "POP {r3-r10}\n"
);

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
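As a hedged sketch, the "Way 1" loop can also be written with GCC extended-asm operands and a clobber list, so that the compiler is told which registers and memory the loop touches instead of relying on r0-r2 being preloaded. The function name `copy32_blocks`, the post-increment of the destination pointer, and the `memcpy` fallback for builds without NEON are assumptions for illustration and are not part of the original test. The sketch also assumes that `len` is a non-zero multiple of 32.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes from src to dst in 32-byte blocks.
 * Assumption: len is a non-zero multiple of 32. */
static void copy32_blocks(void *dst, const void *src, size_t len)
{
#if defined(__arm__) && defined(__ARM_NEON)
    __asm__ volatile(
        "1:\n"
        "pld      [%[s], #0xC0]\n"        /* prefetch the source ahead     */
        "vld1.64  {d0-d3}, [%[s]]!\n"     /* load 32 bytes, advance src    */
        "vst1.64  {d0-d3}, [%[d]]!\n"     /* store 32 bytes, advance dst   */
        "subs     %[n], %[n], #32\n"      /* 4 D registers * 8 bytes = 32  */
        "bgt      1b\n"
        : [d] "+r"(dst), [s] "+r"(src), [n] "+r"(len)
        :
        : "d0", "d1", "d2", "d3", "cc", "memory");
#else
    /* Hypothetical fallback so the sketch also builds on non-NEON targets. */
    memcpy(dst, src, len);
#endif
}
```

Whether the stores leave the core as longer bursts still depends on the memory attributes of the destination region, not on how the loop is written.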

[Info 3]

When transferring EIM→EIM for debugging purposes, 8-beat bursts are occasionally observed, but only rarely.

[Info 4]

The interval between burst transfers is 300 ns (from CS rising to CS falling).

When using SDMA, the interval becomes 700 ns.

*aclk_eim_slow_clk_root: 198 MHz, ipg_clk_root: 66 MHz, AXI_CLK (ACLK_CLK_ROOT): 396 MHz, ahb_clk_root: 132 MHz
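To put the gap from [Info 4] into perspective, the following rough sketch estimates an upper bound on write throughput from the burst size and the idle gap between bursts. The function name `burst_throughput_mbps` and the `active_ns` parameter (the time the burst itself occupies the bus, which is not given in this thread) are assumptions for illustration.

```c
#include <stdint.h>

/* Estimated throughput in MB/s (1 MB = 10^6 bytes) for back-to-back bursts.
 * gap_ns:    idle time between bursts (CS rising to CS falling)
 * active_ns: duration of the burst itself; assumed, pass 0 for an upper bound */
static double burst_throughput_mbps(uint32_t beats, uint32_t bytes_per_beat,
                                    double gap_ns, double active_ns)
{
    double bytes = (double)beats * (double)bytes_per_beat;
    double period_ns = gap_ns + active_ns;
    return bytes / period_ns * 1000.0;   /* bytes per ns -> MB/s */
}
```

For 2-beat, 32-bit bursts with a 300 ns gap, `burst_throughput_mbps(2, 4, 300.0, 0.0)` gives roughly 26.7 MB/s as an upper bound, since the real burst duration only adds to the period.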

[Question.1]

Can the i.MX6 perform a 32-beat burst write with a 32-bit data width?

[Question.2]

Is there any information on the settings required for a 32-beat burst write?

Or are there any test results?

[Question.3]

Please refer to [Info 4].

What causes this delay?

I would also like to know the relevant settings in detail.

(The large delay occurs even when the LPDDR2 accesses are removed, so we think the problem lies in a setting inside the CPU.)

Best Regards,

Keita

Yuri · ‎04-10-2014

torus1000 wrote :

"Finally I found the reason in 8.1.2 Supported AXI transfers as you pointed out.

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf

INCR N (N:1-16) 32-bit read transfers

INCR N (N:1-2) for 32-bit write transfers"

iMX6 EIM transfer speed with using NEON vld/vst instructions
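The quoted limits line up with the result reported at the top of this thread. As a minimal sketch, the number of bursts needed for a block can be computed from the beat width and the maximum INCR length; the function name `bursts_needed` is hypothetical.

```c
#include <stdint.h>

/* Number of AXI bursts needed to move total_bytes when each beat carries
 * bytes_per_beat and a single burst may contain at most max_beats beats. */
static uint32_t bursts_needed(uint32_t total_bytes, uint32_t bytes_per_beat,
                              uint32_t max_beats)
{
    uint32_t beats = (total_bytes + bytes_per_beat - 1) / bytes_per_beat;
    return (beats + max_beats - 1) / max_beats;   /* round up */
}
```

For a 128-byte block of 32-bit beats, `bursts_needed(128, 4, 2)` is 16 for writes (INCR N, N:1-2) and `bursts_needed(128, 4, 16)` is 2 for reads (INCR N, N:1-16), which matches the observed (2-beat) x 16 writes and (16-beat) x 2 reads.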

keitanagashima · ‎04-20-2014

Dear Yuri,

Hello.

Please send me your answer as soon as possible.

Keita

keitanagashima · ‎04-11-2014

Dear Yuri,

Thank you for your reply.

>INCR N (N:1-16) 32-bit read transfers

>INCR N (N:1-2) for 32-bit write transfers"

According to 8.1.2 Supported AXI transfers, I think the transfers above apply to non-cacheable transactions.

However, the same section also contains the following description.

----

For cacheable transactions:

• INCR4 64-bit for write transfers (evictions)

----

(I have enabled the L1 and L2 caches.)

[Q1]

When the cache is used for the burst write LPDDR2 --> EIM, can "INCR4 64-bit" be achieved?

(Even with the L1 and L2 caches enabled, I currently cannot get this burst write from LPDDR2 to EIM.)

[Q2]

I am looping over a routine that copies data from the LPDDR2 RAM area (data size = 0x20000) to the EIM area (0x0C000000 - 0x0C020000).

(L1 and L2 caches enabled.)

When the CPU writes to EIM, we expect the data to be stored in the cache and the cache controller to access EIM with "INCR4 64-bit" bursts at a suitable time.

Is my understanding right?

[Q3]

The transfer performance is also reduced by the factor described in [Info 4] below.

Why does this large delay occur?

Is this the expected performance for the i.MX6?

Are there settings that would improve the performance (e.g. peripheral, bus, clock)?

[Info 4]

The interval between burst transfers is 300 ns (from CS rising to CS falling).

When using SDMA, the interval becomes 700 ns.

*aclk_eim_slow_clk_root: 198 MHz, ipg_clk_root: 66 MHz, AXI_CLK (ACLK_CLK_ROOT): 396 MHz, ahb_clk_root: 132 MHz

Keita

Yuri · ‎04-21-2014

Please look at my comments below.

1.
> When using cache about the burst write to LPDDR2 --> EIM, can "INCR4 64-bit" be realized?
> (Even if I use L1, L2 cache, this burst write of LPDDR2→EIM can not be realized now.)

Basically, it is possible to use the core's cache-related burst options (cache -> EIM) of the i.MX6 if the EIM address range is configured as cacheable. Please take a look at Table 22-7 (AXI Burst Cycles Supported) of the i.MX6 DQ RM. Usually, however, devices connected via EIM should be configured as non-cacheable to avoid coherency issues.

2.
> When the CPU writes in EIM, we think that the data stores in cache and the cache controller should access
> EIM in "INCR4 64-bit" with the suitable timing.

First, "INCR4 64-bit" is intended for a 64-bit data port. The EIM supports only a 32-bit port, therefore “the controller splits the transaction when needed in some cases”. Next, the EIM address range would have to be configured as cacheable, which is not recommended.

3.
As for performance: delays between transactions may be caused by bus arbitration, especially if many other bus masters are involved.

4.
The ARM block transfer instructions are quite optimal for data transfer. The SDMA is not as fast, but it frees up (ARM) core resources.
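To illustrate point 2 with plain arithmetic: an "INCR4 64-bit" burst carries 4 x 8 = 32 bytes, so the same data needs 8 beats on a 32-bit port. The sketch below only does this conversion; the function name `beats_on_port` is hypothetical, and how the EIM controller actually splits such a transaction is not covered here.

```c
#include <stdint.h>

/* Beats needed on a port of port_bits width to carry a burst of
 * `beats` beats that are each src_bits wide. */
static uint32_t beats_on_port(uint32_t beats, uint32_t src_bits,
                              uint32_t port_bits)
{
    uint32_t total_bits = beats * src_bits;
    return (total_bits + port_bits - 1) / port_bits;   /* round up */
}
```

Here `beats_on_port(4, 64, 32)` evaluates to 8.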

Have a great day,
Yuri


keitanagashima · ‎04-06-2014

Could someone follow up the question?

Keita


i.MX6_All

i.MX6Dual

i.MX6Quad