IPU/PCIe throughput problem on i.MX6


stjepanhenc
Contributor I
Hello,

   

I am currently working on a project that transfers raw video from an i.MX6 Solo ARM processor
over a PCIe link to an FPGA. The i.MX6 serves as the root complex (RC) and the FPGA as the endpoint (EP).
We are using a PCIe x1 Gen1 link.

   

The firmware running on the i.MX6 is based on this demo: https://community.freescale.com/docs/DOC-95014
Basically, we want to use the IPU to transfer frames to memory connected to the FPGA and read them back.

   

I have attached screenshots from the FPGA's on-chip logic analyzer showing that writing data works fine, with
reasonable link utilization. In the screenshots, the x_st signals mark the first 32-bit doubleword
of a packet and x_end marks the last 32-bit doubleword of the packet.

   

However, we have a problem with reading data from the FPGA.

   

In my opinion, the problem is that the IPU does not read data the way the PCIe specification intends,
or the way PCIe is usually used. The specification allows the root complex
to request a large amount of data at once (for example 4 KB), which the endpoint then completes with multiple 128 B completion packets (TLPs).

   

When using the IPU with software inspired by your PCIe validation/throughput demo, we can see that the root complex
generates small requests of at most 64 bytes. This is made worse by the fact that, for some reason, our current
PCIe setup issues at most 4 memory read TLPs at a time.

   

You can see from the screenshots that this leads to very poor throughput.

   

I have also reproduced this problem in HDL simulation with a root complex simulation model,
so I believe this is a limitation of PCIe.
I am currently digging into the specification to verify this.

   

My question for you is whether there is a way to get the i.MX6's PCIe and IPU modules to generate
PCIe read requests that are at least 1 KB in size.
I believe this would greatly help with our bandwidth problem.

   

Please feel free to ask me for any additional details or clarifications about this problem,
so that it can be resolved.

   

Kind regards,
Stjepan Henc
11 Replies

eduardodelcasti
Contributor I

Hello Stjepan and Yuri,

I'm currently working on a scenario very similar to the one Stjepan described in his first post: I'm trying to transmit data from the i.MX6 to a Xilinx FPGA over a PCIe x1 link, using the IPU as a DMA engine to get reasonable performance, since the PCIe RC in the i.MX6 has no DMA of its own.

Write operations from the i.MX6 to the FPGA seem to work fine, but instead of reaching the 344 MB/s stated in the demo link (with a TLP data size of 64 bytes), I only get about 54 MB/s, because the TLPs I receive on the FPGA side carry only 16 bytes of data each.

Stjepan, you say that in your case the write operations had reasonable link utilization. Could you give more specific numbers? Did you experience the same performance degradation as me compared with the demo? If not, it would be very useful for me to check the differences between your setup and mine. Stjepan and Yuri, which PCIe configuration parameters should I adjust on the i.MX6 side to increase the size of the transmitted TLPs?

Thanks very much in advance for your answer.

Best Regards,

Eduardo.

Yuri
NXP TechSupport

Hello,

Please check whether your system settings follow the recommendations of

i.MX6Q PCIe EP/RC Validation System

In particular: use mem=768M

Regards,

Yuri.

carlpii
Contributor I

Hi Yuri,

I don't see how the mem=768M boot parameter would have any effect on performance; it simply tells the kernel not to use all of the DDR available on the board (leaving 256M available for the EP driver test).

After experimenting with the EP/RC (non-IPU) driver code, I have been able to replicate the "cache disabled" and "cache enabled" write and read performance numbers but I am concerned that the "cache enabled" approach may not be safe in a real system.  The "cache disabled" approach uses ioremap() to map the iATU region for PCIe writes and reads but the "cache enabled" approach uses ioremap_cache().  Can you confirm that marking this region as cacheable is actually safe to use?  It seems like it could create both ordering and coherency problems if PCIe writes and reads are going through the caches.

Thanks,

-Carl

Yuri
NXP TechSupport

Hello,

Strictly speaking, you are right: routing PCIe writes and reads through the caches improves performance, but coherency problems must be taken into account. The cache needs to be flushed at the end.
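As a sketch of this approach (illustrative only: the addresses, sizes, and symbol names below are made up, the exact flush API depends on the kernel version, and this is ARMv7 Linux kernel code, not runnable standalone):

```c
/* Hypothetical kernel-side sketch: map the PCIe iATU window cacheable
 * for speed, then explicitly clean/flush the affected lines around each
 * transfer to keep the caches coherent with the device.
 * The EXAMPLE_* values are invented for illustration. */
#include <linux/io.h>
#include <linux/sizes.h>
#include <asm/cacheflush.h>
#include <asm/outercache.h>

#define EXAMPLE_PCIE_MEM_BASE 0x01000000UL /* iATU window (illustrative) */
#define EXAMPLE_PCIE_MEM_SIZE SZ_16M

static void __iomem *pcie_mem;

static int example_map(void)
{
	/* Cacheable mapping: faster bursts, but coherency is now manual. */
	pcie_mem = ioremap_cache(EXAMPLE_PCIE_MEM_BASE, EXAMPLE_PCIE_MEM_SIZE);
	return pcie_mem ? 0 : -ENOMEM;
}

static void example_flush(void __iomem *virt, unsigned long phys, size_t len)
{
	/* Clean/invalidate L1 by virtual address, then the outer
	 * (PL310 L2) cache by physical address range. */
	__cpuc_flush_dcache_area((__force void *)virt, len);
	outer_flush_range(phys, phys + len);
}
```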

Regards,

Yuri.

carlpii
Contributor I

Hi Yuri,

I am now attempting to replicate the "cached" performance from a non-Linux OS (RTOS) running on an i.MX6Q endpoint.  The endpoint is attached to an SDB running Linux 3.14.28.  My Linux driver on the SDB can access endpoint memory with expected performance but accesses initiated by the endpoint to the SDB are significantly slower.  With the iATU mapped as Write-back, no Write-alloc to match the Linux ioremap_cache() case, I can only get around 32 MB/s reads and 240 MB/s writes versus the roughly 100 MB/s read and 300 MB/s write performance seen from the Linux side.  With the iATU mapped as Device memory, I get around 17 MB/s reads and 42 MB/s writes versus 29 MB/s and 110 MB/s for Linux using ioremap().

The endpoint initialization code is largely derived from the iMX6 Platform SDK.  I've attempted to compare settings for the L2 cache, SCU, and SCTLR register between Linux and our RTOS but can't find anything that seems to make a difference.  Can you provide any suggestions on how to find our missing performance?

Thanks,

-Carl

Yuri
NXP TechSupport

Hello,

Do you use load and store multiple (LDM/STM) register instructions on the cached area to achieve maximum throughput?

Regards,

Yuri.

carlpii
Contributor I

Hi Yuri,

I was reviewing that as well.  The memcpy() provided by our IDE is pretty well optimized but only uses 4 registers for the loads and stores.  I had previously tried using 8 registers and got a small performance gain, but not up to Linux speeds.  I just added a preload instruction to the loop and improved our PCIe read performance to 62 MB/s (was 32 MB/s).  Writes remain at 240 MB/s.  That is much closer to Linux speeds, but there is still room to improve.  I'll look more closely at the Linux implementation, as there may be more I can do.

Many of my PCIe transfers are effectively 4KB page copies - is there any way to use the IPU to transfer this data (and would it get better performance)?

I appreciate your help,

-Carl

Yuri
NXP TechSupport

Hello,

64 bytes is the maximum single-burst length on the i.MX6.

Please refer to the following thread, where similar considerations for burst length are provided:

i.MX6 maximum EIM burst length and performance


Have a great day,
Yuri

stjepanhenc
Contributor I

Hello,

Thank you for your answer.

So you are saying that this is a limitation of the i.MX6 architecture and its PCIe core, and that there is no way to issue a request for more than 64 bytes of data at a time.

I believe the high overhead of PCIe read requests means that a lot of data must be requested in a single request (which is why the specification allows up to 4 KB) to achieve speeds comparable to the write channel. I see that this is not easily achievable if the burst size is so limited.

But this is just one part of the problem. The biggest performance hit comes from the i.MX6 root complex issuing only 4 outstanding read requests at a time.
Can you confirm that this is not caused by your PCIe root complex?

The PCIe IP core from Lattice that we are using in the FPGA is also suspect, so I would like to eliminate one of the possibilities.

In the meantime, we have decided to replace i.MX6 reads from the FPGA with FPGA writes into the i.MX6, in the hope of achieving better performance.

Regards,

Stjepan

Yuri
NXP TechSupport

Hello,

According to section 48.4.1.3 (Features List) of the i.MX6 D/Q RM:

"Programmable and extended AXI burst lengths to support up to 4K read/write burst lengths over AXI master and slave interfaces."

But, because of independent maximum read request and transfer sizes between AXI and PCI Express, transfers can be split into multiple transfers.


Regards,

Yuri.

stjepanhenc
Contributor I

Well, yes.

The way to get decent performance on PCI Express is to have the PCI Express endpoint/device act as a bus master and run its own DMA, reading from and writing to the processor's memory. This way the endpoint has full control over packet size and count.

The processor (in this case the i.MX6) should only read and write control information on the device, for example configuring the DMA engine in the PCIe device to start a transfer.
