It helps if you limit your expectations of bandwidth to the actual protocol specifications.
The problem with quoting "PCIe bandwidth" and "USB bandwidth" from the marketing specifications (and those copied into the i.MX6 manual) is that they are raw wire-speed figures, not usable throughput.
PCIe 2.0 and USB 3.0 are actually 8b/10b encoded (so is SATA, for that matter, and WiFi carries its own coding overhead). That means your quoted wire speed needs to be (wire/10)*8 (yes, that is a 20% overhead) to get the actual bandwidth capable of being transmitted in bits or bytes per second (depending on your input value). PCIe 3.0 - if the cards and controller in the "desktop box" you used for testing support it - reduces the overhead by using 128b/130b encoding, which obviously speeds things up a bit. Are we sure the cards you're testing are PCIe 2.0 only and can only operate at PCIe 2.0 speeds, as they will on the i.MX6 PCIe controller?
The 5 GT/s specification for PCIe 2.0 means 5 billion transfers (one bit each) per second per lane; with 8b/10b encoding it takes 10 of those bits to carry one byte of data, so that resolves down to 500 million bytes per second, or 500 Megabyte/s - this is where that value comes from. (This is also in SI units and not IEC units, so there are 1,000,000 bytes in a Megabyte and not 1,048,576 as in a Mebibyte. Depending on what "dd" is using for the conversion, you have to factor that in..)
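As a sanity check, here is a tiny back-of-the-envelope calculation of the usable payload rate per lane from the raw transfer rate and the line-code efficiency - just the spec figures discussed above, nothing i.MX6-specific:

```c
#include <stdio.h>

/* Usable payload bandwidth per lane from raw line rate and line-code
 * efficiency. Spec values only; protocol/packet overhead comes on top. */
static double usable_mbytes_per_sec(double gigatransfers_per_sec,
                                    double code_data_bits,
                                    double code_total_bits)
{
    double bits_per_sec = gigatransfers_per_sec * 1e9;            /* SI giga */
    double data_bits    = bits_per_sec * (code_data_bits / code_total_bits);
    return data_bits / 8.0 / 1e6;                                 /* SI megabytes */
}

int main(void)
{
    /* PCIe 2.0: 5 GT/s, 8b/10b  -> 500 MB/s per lane */
    printf("PCIe 2.0 x1 : %.0f MB/s\n", usable_mbytes_per_sec(5.0, 8, 10));
    /* PCIe 3.0: 8 GT/s, 128b/130b -> ~985 MB/s per lane */
    printf("PCIe 3.0 x1 : %.0f MB/s\n", usable_mbytes_per_sec(8.0, 128, 130));
    /* USB 3.0 SuperSpeed: 5 GT/s, 8b/10b -> 500 MB/s before protocol overhead */
    printf("USB 3.0     : %.0f MB/s\n", usable_mbytes_per_sec(5.0, 8, 10));
    return 0;
}
```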
Add to that any protocol overhead. PCIe, USB and SATA are packet-based, which means there are headers in the way and informational traffic passing through on top of the wire encoding. WiFi, PCIe and USB all have framing/packet and handshaking overhead too: on PCIe this is somewhere between 24 and 32 bytes for every packet, with a maximum payload size of 4096 bytes; for USB the maximum packet size depends on the transfer type, but for bulk it's 1024 bytes, plus a header. SATA transfers might be of a disk block size, which is usually 512 bytes - at that payload size the PCIe framing alone is roughly a 6% overhead, not to mention the overheads on SATA and USB themselves.
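To put a number on that, here is a trivial efficiency calculation (payload over payload-plus-overhead) using the 24-32 byte PCIe figure above - the payload sizes are just illustrative:

```c
#include <stdio.h>

/* Rough link efficiency for a packetised bus: payload / (payload + per-packet
 * overhead). Overhead figures are the PCIe framing numbers discussed above. */
static double efficiency(double payload_bytes, double overhead_bytes)
{
    return payload_bytes / (payload_bytes + overhead_bytes);
}

int main(void)
{
    printf("512 B payload, 32 B overhead : %.1f%% efficient\n",
           100.0 * efficiency(512, 32));    /* ~94%, i.e. ~6% overhead */
    printf("4096 B payload, 32 B overhead: %.1f%% efficient\n",
           100.0 * efficiency(4096, 32));   /* large packets amortise it */
    printf("128 B payload, 24 B overhead : %.1f%% efficient\n",
           100.0 * efficiency(128, 24));    /* small packets hurt badly */
    return 0;
}
```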
That gives us a potentially much slower data transfer rate than wire speed. Obviously, if the drivers, controllers and all the interconnects between them are doing it right, they should issue multi-block reads and writes to the disk and pass those down the links, rather than splitting them into inefficient, small packets with large relative overheads. Are we sure that the drivers and hardware are sending multi-block reads all the way down the chain?
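For reference, this is roughly (and with made-up values) the sort of thing a block driver does to tell the block layer how big a single request can be, so reads and writes get merged into large multi-block transfers instead of trickling down one sector at a time - the helpers are the standard kernel ones, the limits are invented:

```c
#include <linux/blkdev.h>

/* Illustrative only: advertise generous per-request limits so the block
 * layer can merge I/O into large multi-block transfers. Values are made up,
 * not taken from the i.MX6 or any specific USB->SATA driver. */
static void my_setup_queue_limits(struct request_queue *q)
{
	/* Allow up to 1024 sectors (512 KiB) per request ... */
	blk_queue_max_hw_sectors(q, 1024);
	/* ... spread across at most 64 scatter/gather segments. */
	blk_queue_max_segments(q, 64);
}
```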
PCIe, USB and SATA are also serial buses, but the interfaces to the controllers are not serial. There will be a Serializer/Deserializer (SerDes) in the external PHY, and that serialization step adds to the latency of the transactions, too.
In this case you have the absolute worst of all worlds - a PCIe USB controller attaching a SATA drive. We aren't even adding in average seek time for a spinning-platter drive.. that can be anywhere from tenths to multiple tens of milliseconds. Have mercy :smileygrin:
That still does not explain why a so-called "150MB/s" drive (which, over a so-called "500MB/s" bus, should be *somewhat* achievable even with the overheads) performs so much worse here, especially if you test it on designs from other silicon vendors and get better numbers - one would expect to see a little bit better, but you can't hope for the "maximum wire bandwidth per lane" from the bus specification, and expecting to ever see 500 SI Megabytes per second from this test is irrational.
How is the Linux driver allocating memory for these controllers? For USB, PCIe and SATA drivers it is common for DMA buffers to be mapped as Device or Normal non-cacheable memory (to avoid the need for cache maintenance) - otherwise explicit DMA cache invalidate and clean operations have to be implemented around every transfer.
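A minimal sketch of the two usual options, assuming a generic driver ("dev", "buf" and "len" are placeholders, not from any specific i.MX6 driver):

```c
#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/sizes.h>

/* 1) Coherent allocation: the buffer is mapped non-cacheable (Normal
 *    non-cacheable or Device), so no cache maintenance is needed, but every
 *    CPU access goes straight out to the interconnect. */
static void *alloc_coherent_buf(struct device *dev, dma_addr_t *dma_handle)
{
	return dma_alloc_coherent(dev, SZ_64K, dma_handle, GFP_KERNEL);
}

/* 2) Streaming mapping of an ordinary cacheable buffer: CPU accesses stay
 *    cheap, but the DMA API has to clean/invalidate the caches around every
 *    transfer (that work happens inside map/unmap/sync). */
static void receive_into(struct device *dev, void *buf, size_t len)
{
	dma_addr_t busaddr = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, busaddr))
		return;

	/* ... program the controller with 'busaddr', wait for completion ... */

	dma_unmap_single(dev, busaddr, len, DMA_FROM_DEVICE); /* invalidate */
}
```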
Interconnects are usually designed to give their best bandwidth when transactions are cache-line sized and aligned (at the cost of latency). If your region is not cacheable, you are putting more pressure on the interconnect. If the region is Device memory you are also adding ordering restrictions to the mix (potentially stressing the write buffers at every step).
If the region is cacheable, the Cortex-A9 coupled with a PL310 puts another roadblock in the way, in that some non-atomic cache operations have to be waited for on the PL310 side of things, and sometimes you have to manually drain the store buffers in the PL310 for every one of those operations.
If the register accesses for these controllers use writel() in the kernel even where no barrier or ordering is needed, then unfortunately every writel() carries a full system barrier (and potentially an outer_cache->sync() on the PL310 to get the store buffers drained). That can significantly hamper the performance of your driver.
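A hedged illustration of the difference (the device, register offsets and bits are made up): batch the setup writes with writel_relaxed() and use one ordered writel() to kick the hardware, instead of paying the barrier on every access:

```c
#include <linux/bits.h>
#include <linux/io.h>
#include <linux/kernel.h>

/* Hypothetical DMA-style controller; offsets and bits are invented. */
#define REG_DESC_LO	0x00
#define REG_DESC_HI	0x04
#define REG_LEN		0x08
#define REG_CTRL	0x0c
#define CTRL_GO		BIT(0)

static void kick_transfer(void __iomem *base, u64 desc, u32 len)
{
	/* Plain register setup: no ordering needed between these stores,
	 * so avoid the full barrier that writel() would add to each one. */
	writel_relaxed(lower_32_bits(desc), base + REG_DESC_LO);
	writel_relaxed(upper_32_bits(desc), base + REG_DESC_HI);
	writel_relaxed(len, base + REG_LEN);

	/* One ordered write to start the transfer: writel() ensures the setup
	 * above (and any DMA descriptors in memory) are visible to the device
	 * before it starts. */
	writel(CTRL_GO, base + REG_CTRL);
}
```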
You have to quantify all of this together to determine exactly why you don't see the speed you expect, or at least narrow it down to a very specific point where things start to fall over when you swap one piece out for another (obviously you can't change the A9, the PL310, the PCIe controller or the PHY, but you can use a different register write function like writel_relaxed(), map memory differently or use a different cache maintenance strategy, pick a different USB controller, pick a different USB->SATA bridge..)