EIM performance issue on iMX6

Rob_iMX6 · ‎08-18-2014

We have connected the EIM of an iMX6DL to an FPGA: BCLK at 99 MHz, 16-bit muxed mode, synchronous.

Data transfer works at 16 bits, with bursts of up to 16 words too. So we can read 32 bytes in less than 240 ns, and write them in only 170 ns. So far so good.

Clock setup is the standard one from the SDK, but with the CPU at 996 MHz and DDR at 396 MHz.

However, what puzzles me: after each burst the bus goes silent, even though there are requests waiting:

Situation 1:

Level 1&2 cache enabled for D&I.

EIM interface NOT cached.

SDK software, only one processor enabled, all GPUs etc. clock gated OFF (also tried with them on).

Running the NEON instruction vld1.64 {d0,d1,d2,d3}, [r0] in an endless loop, with r0 = EIM address (0x08000000, so no boundary issue here).

RESULT: bursts being generated, but approx 350ns pauses (CS high)... Performance: approx. 50MB/s or 40% of theoretical 128MB/s

QUESTION: What prevents the AXI bus, or whatever is responsible, from issuing accesses faster? There is absolutely no other traffic around!
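The numbers in Situation 1 can be cross-checked with a little arithmetic. The sketch below is not from the original post; the function name `eim_effective_mbps` is made up for illustration, and it assumes every burst moves 32 bytes in 240 ns and is followed by a fixed idle gap (1 MB taken as 10^6 bytes).

```c
#include <assert.h>

/* Effective throughput in MB/s (1 MB = 10^6 bytes) when each burst of
 * burst_bytes occupies the bus for burst_ns and is followed by idle_ns
 * of dead time (CS high). Bytes per ns equals GB/s, hence the factor 1000.
 * Hypothetical helper, not part of any SDK. */
static double eim_effective_mbps(double burst_bytes, double burst_ns,
                                 double idle_ns)
{
    assert(burst_ns + idle_ns > 0.0);
    return burst_bytes * 1000.0 / (burst_ns + idle_ns);
}
```

With the figures from the post, `eim_effective_mbps(32, 240, 0)` gives about 133 MB/s and `eim_effective_mbps(32, 240, 350)` gives about 54 MB/s, so the 350 ns gaps alone would roughly account for the measured ~50 MB/s.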

Situation 2:

same as above, but cache enabled on EIM-address space.

for write: using cache flush on L1&L2 (only way to generate write burst in full length)

for read: using cache invalidate

RESULT: identical, performance max. 50MB/s instead of 128MB/s.

Situation 3:

all data caches DISABLED

running SDK SDMA test mem_2_mem

RESULT 1: with setup DDR_2_DDR : 40764kB/s (we only have 16bit DDR3 RAM)

RESULT 2: with setup OCRAM: 28573 kB/s (terribly bad!!)

RESULT 3: Read EIM to write OCRAM: 24116kB/s, WHICH IS TERRIBLY BAD. Remember: Our EIM is clocked at 99MHz, 16bit. I would expect something close to 128MB/s, but we're at 19% of this speed...

RESULT 4: Read OCRAM and write to EIM: slightly better due to the shorter burst access to our FPGA, but still very poor performance.

On the scope I see idle phases (CS high) of 1050ns.
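The scope reading can be compared against the SDMA throughput by running the calculation in the other direction. This is only a sketch under assumptions the post does not state: that the SDMA reads use the same 32-byte, 240 ns bursts as the CPU case, and that kB means 1000 bytes. The name `implied_idle_ns` is hypothetical.

```c
#include <assert.h>

/* Idle time per burst (in ns) implied by a measured throughput.
 * measured_kbps is in kB/s, with 1 kB assumed to be 1000 bytes.
 * Hypothetical helper for illustration only. */
static double implied_idle_ns(double burst_bytes, double burst_ns,
                              double measured_kbps)
{
    assert(measured_kbps > 0.0);
    /* Time from the start of one burst to the start of the next, in ns. */
    double period_ns = burst_bytes / (measured_kbps * 1000.0) * 1e9;
    return period_ns - burst_ns;
}
```

For RESULT 3, `implied_idle_ns(32, 240, 24116)` comes out at roughly 1087 ns, which is close to the 1050 ns idle phases seen on the scope.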

I guess that the NIC-301 unit delays the accesses. Unfortunately, there is not enough information on how to change its settings. Does somebody know more?

Regards,

-Urs

nishad_kamdar · ‎03-30-2015

Hi Rob,

I too have interfaced an iMX6Q to an FPGA over the EIM bus, with a 32-bit data bus. I am operating in synchronous mode.

I have set the burst length to 32 words.

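asm volatile("pld [%[address]]\n\t"            /* prefetch hint for the source */
             "vldm %[address], {d0-d15}\n\t"   /* load 128 bytes from addr */
             "vstm %[address1]!, {d0-d15}\n\t" /* store 128 bytes, advance addr1 */
             : [address1] "+r" (addr1)         /* outputs: updated by the writeback (!) */
             : [address] "r" (addr)            /* inputs */
             : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
               "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15",
               "memory"                        /* clobbers: every register used */
);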

Above is the code.

The problem I am facing is that when I execute the instructions above, I get TWO chip-select pulses.

I get 32 burst cycles on each chip select. However, data only arrives in my buffer from the 16th of the 32 clock cycles (on each chip select), so I miss the first 16 words of data.

What am I doing wrong here?

My EIM setting:

CS1GCR1:0x31130335 //cs1 en , DSZ = 32 bit , PSZ = 64words, BL = 32 words

CS1GCR2:0x00000000

CS1RCR1:0x10000000

CS1RCR2:0x00000000

CS1WCR1:0x01630c00

CS1WCR2:0x00000000
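To check the comment on the CS1GCR1 line against the raw value, the fields can be extracted programmatically. The sketch below is not from the post: the bit positions (CSEN at bit 0, BL at bits 10:8, DSZ at bits 18:16, PSZ at bits 31:28) are assumptions that should be verified against the i.MX6 reference manual, and the helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Extract bits [msb:lsb] from a 32-bit register value. */
static uint32_t reg_field(uint32_t value, unsigned msb, unsigned lsb)
{
    assert(msb < 32 && lsb <= msb);
    unsigned width = msb - lsb + 1;
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (value >> lsb) & mask;
}

/* Assumed EIM_CSnGCR1 field positions; check them against the manual. */
static uint32_t gcr1_csen(uint32_t gcr1) { return reg_field(gcr1, 0, 0); }
static uint32_t gcr1_bl(uint32_t gcr1)   { return reg_field(gcr1, 10, 8); }
static uint32_t gcr1_dsz(uint32_t gcr1)  { return reg_field(gcr1, 18, 16); }
static uint32_t gcr1_psz(uint32_t gcr1)  { return reg_field(gcr1, 31, 28); }
```

Under those assumptions, 0x31130335 decodes to CSEN = 1, BL = 3 (binary 011), DSZ = 3 and PSZ = 3, which is at least consistent with the comment "cs1 en, DSZ = 32 bit, PSZ = 64words, BL = 32 words".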

nishad_kamdar · ‎03-31-2015

Got it!!

The problem was the RWSC setting: it was set to 16, so the first 16 values were being missed.

Now I have set RWSC to 1, and in fixed-latency mode I still miss the first value.

Is there any way to set RWSC to zero? (In the datasheet, 0 is reserved.)

Or do I have to use the WAIT signal?
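The RWSC value can be read out of, and patched into, the CS1RCR1 word shown earlier. This is a sketch rather than SDK code: it assumes RWSC is a 6-bit field at bits 29:24 of EIM_CSnRCR1 (to be checked against the reference manual), and the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed position of RWSC (read wait state control) in EIM_CSnRCR1. */
#define EIM_RCR1_RWSC_SHIFT 24u
#define EIM_RCR1_RWSC_MASK  0x3Fu

/* Read the RWSC field from a CSnRCR1 value. */
static uint32_t rcr1_get_rwsc(uint32_t rcr1)
{
    return (rcr1 >> EIM_RCR1_RWSC_SHIFT) & EIM_RCR1_RWSC_MASK;
}

/* Return rcr1 with RWSC replaced; 0 is rejected because the thread
 * says the datasheet marks it as reserved. */
static uint32_t rcr1_set_rwsc(uint32_t rcr1, uint32_t rwsc)
{
    assert(rwsc >= 1u && rwsc <= EIM_RCR1_RWSC_MASK);
    rcr1 &= ~(EIM_RCR1_RWSC_MASK << EIM_RCR1_RWSC_SHIFT);
    return rcr1 | (rwsc << EIM_RCR1_RWSC_SHIFT);
}
```

With that layout, `rcr1_get_rwsc(0x10000000)` returns 16, matching the setting that caused the first 16 values to be skipped, and `rcr1_set_rwsc(0x10000000, 1)` yields 0x01000000.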

baldevpatel · ‎07-08-2015

Hi Nishad

I am using the following EIM bus configuration:

Mode: Synchronous

BL: 32

Data width: 16 bits

I am able to generate a burst of 4 clock cycles (1 cycle = 2 bytes, 8 bytes in total) with the BL = 011 setting.

As I understand burst length, when BL is set to 011 (32 words memory wrap burst length) in the EIM_CSnGCR1 configuration register, the EIM controller should generate 32 clock cycles for a 64-byte transfer.

I think you are able to generate 32 clock cycles (a 64-byte transfer). Could you please suggest which settings are needed to get 32 clocks in one burst?

Yuri · ‎08-18-2014

First, note that the i.MX6 is not intended for real-time applications, say as controllers (Cortex-M). For EIM accesses, due to internal bus limitations, the bus turnaround time may be about 150 ns, because of the latency of going through a couple of PL301 crossbars and the AIPS peripheral bridge. Also, to use the D-cache, the MMU needs to be configured.

Have a great day,
Yuri


TomE · ‎08-19-2014

Even the old controllers don't run as fast as an initial read of the manual would suggest. There's always some unexpected hardware in the way.

Even dedicated controllers like the ColdFire series chips have slow SPI interfaces. The MCF5329 QSPI (Queued SPI) should be fast, but it has lots of "internal overhead" between bytes, meaning that at 20 MHz it can only reach a maximum of 62% of "theoretical throughput". In addition, because it takes 15 CPU clocks to read or write a peripheral register, it takes 750 CPU clocks of delays just to read and write the QSPI registers to launch a 16-byte transfer:

https://community.freescale.com/message/332628#332628

The memory interface on these parts is meant to be 128 MB/s, but in reality you can only get a fraction of that as you can't keep the memory controller "on page".

There are always unexpected delays between the CPU and the I/O blocks as they're on the other side of bridges and running on very slow clocks. I've had an ARM core taking 400 CPU clocks to read or write a GPIO pin. The manufacturer's recommendation was to use the DMA module to perform all I/O, including serial and even getting to the GPIO pins!

I was going to suggest using SDMA, but it seems you've already tried that. DMA might be a good idea (if it can burst), but you'll also get all the internal SDMA delays while it works out what to do next. At least you should have the CPU able to do something else while the SDMA is running, like processing the data it is getting from the FPGA.

> Performance: approx. 50MB/s or 40% of theoretical 128MB/s

What performance do you NEED for the hardware to do the job required? It may not be able to do the job at all as designed.

If you need high data throughput, use PCIe. It looks horribly complicated though.

> for write: using cache flush

Are you flushing individual lines? I've heard that a full cache flush takes hundreds of microseconds on these CPUs and locks out all cores.
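Flushing individual lines rather than the whole cache could look roughly like the sketch below. It is not from the thread and makes several assumptions: a 32-byte L1 line (as on the Cortex-A9), privileged bare-metal code, and a hypothetical build flag `EIM_BARE_METAL_PRIVILEGED` that enables the actual CP15 clean-by-address operation. Without that flag the per-line operation is a no-op, so only the address arithmetic is exercised. The L2 cache needs its own maintenance and is not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L1_LINE_BYTES 32u /* assumption: Cortex-A9 L1 data cache line size */

/* Clean one L1 data cache line by address (ARMv7-A DCCMVAC).
 * Only emitted when the hypothetical flag below is defined, because the
 * CP15 access requires a privileged mode. */
static void clean_dcache_line(uintptr_t line_addr)
{
#if defined(__arm__) && defined(EIM_BARE_METAL_PRIVILEGED)
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" : : "r"(line_addr) : "memory");
#else
    (void)line_addr;
#endif
}

/* Clean every line overlapping [start, start + len); returns the number
 * of lines touched. */
static size_t clean_dcache_range(uintptr_t start, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t mask = ~(uintptr_t)(L1_LINE_BYTES - 1u);
    uintptr_t addr = start & mask;             /* first line */
    uintptr_t last = (start + len - 1u) & mask; /* last line */
    size_t lines = 0;
    for (;;) {
        clean_dcache_line(addr);
        lines++;
        if (addr == last)
            break;
        addr += L1_LINE_BYTES;
    }
#if defined(__arm__) && defined(EIM_BARE_METAL_PRIVILEGED)
    __asm__ volatile("dsb" : : : "memory"); /* wait for the cleans to finish */
#endif
    return lines;
}
```

A line-aligned 32-byte buffer at the EIM base touches a single line, for example `clean_dcache_range(0x08000000, 32)`, while the same length starting 16 bytes later straddles two lines.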

Tom

TomE · ‎08-25-2014

> If you need high data throughput, use PCIe.

Or maybe SATA, which is meant to be 3Gb/s.

Except people aren't getting that sort of speed out of them either:

https://community.freescale.com/thread/310016

Tom
