It helps if you limit your expectations of bandwidth to the actual protocol specifications.
The problem with quoting "PCIe bandwidth" and "USB bandwidth" from the marketing specifications (and those copied into the i.MX6 manual) is that they are raw wire-speed figures, not usable throughput.
PCIe 2.0 and USB 3.0 are actually 8b/10b encoded (so is SATA, for that matter, and WiFi carries its own coding overhead). That means your quoted wire speed needs to be (wire/10)*8 (yes, that is a 20% overhead) to get the actual bandwidth capable of being transmitted in bits or bytes per second (depending on your input value). PCIe 3.0 - if the cards and controller in the "desktop box" you used for testing support it - reduces the overhead by using 128b/130b encoding, which obviously speeds things up a bit. Are we sure the cards you're testing are PCIe 2.0 only and can only operate at PCIe 2.0 speeds, as they will on the i.MX6 PCIe controller?
The 5 GT/s specification for PCIe 2.0 means 5 billion transfers (one bit each) per second per lane; with 8b/10b encoding it takes 10 of those bits to carry one byte of data, so that resolves down to 500 million bytes per second, or 500 Megabyte/s - this is where that value comes from. (This is also in SI units and not IEC units, so there are 1,000,000 bytes in a Megabyte and not 1,048,576 as in a Mebibyte. Depending on what "dd" is using for the conversion, you have to factor that in..)
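As a sanity check, here is a tiny back-of-the-envelope calculation of the usable payload rate per lane from the raw transfer rate and the line-code efficiency - just the spec figures discussed above, nothing i.MX6-specific:

```c
#include <stdio.h>

/* Usable payload bandwidth per lane from raw line rate and line-code
 * efficiency. Spec values only; protocol/packet overhead comes on top. */
static double usable_mbytes_per_sec(double gigatransfers_per_sec,
                                    double code_data_bits,
                                    double code_total_bits)
{
    double bits_per_sec = gigatransfers_per_sec * 1e9;            /* SI giga */
    double data_bits    = bits_per_sec * (code_data_bits / code_total_bits);
    return data_bits / 8.0 / 1e6;                                 /* SI megabytes */
}

int main(void)
{
    /* PCIe 2.0: 5 GT/s, 8b/10b  -> 500 MB/s per lane */
    printf("PCIe 2.0 x1 : %.0f MB/s\n", usable_mbytes_per_sec(5.0, 8, 10));
    /* PCIe 3.0: 8 GT/s, 128b/130b -> ~985 MB/s per lane */
    printf("PCIe 3.0 x1 : %.0f MB/s\n", usable_mbytes_per_sec(8.0, 128, 130));
    /* USB 3.0 SuperSpeed: 5 GT/s, 8b/10b -> 500 MB/s before protocol overhead */
    printf("USB 3.0     : %.0f MB/s\n", usable_mbytes_per_sec(5.0, 8, 10));
    return 0;
}
```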
Add to that any protocol overhead. PCIe, USB and SATA are packet-based, which means there are headers in the way and informational traffic passing through on top of the wire encoding. WiFi, PCIe and USB all have framing/packet and handshaking overhead too: on PCIe this is somewhere between 24 and 32 bytes for every packet, with a maximum payload size of 4096 bytes; for USB the maximum packet size depends on the transfer type, but for bulk it's 1024 bytes, plus a header. SATA transfers might be of a disk block size, which is usually 512 bytes - at that payload size the PCIe framing alone is roughly a 6% overhead, not to mention the overheads on SATA and USB themselves.
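To put a number on that, here is a trivial efficiency calculation (payload over payload-plus-overhead) using the 24-32 byte PCIe figure above - the payload sizes are just illustrative:

```c
#include <stdio.h>

/* Rough link efficiency for a packetised bus: payload / (payload + per-packet
 * overhead). Overhead figures are the PCIe framing numbers discussed above. */
static double efficiency(double payload_bytes, double overhead_bytes)
{
    return payload_bytes / (payload_bytes + overhead_bytes);
}

int main(void)
{
    printf("512 B payload, 32 B overhead : %.1f%% efficient\n",
           100.0 * efficiency(512, 32));    /* ~94%, i.e. ~6% overhead */
    printf("4096 B payload, 32 B overhead: %.1f%% efficient\n",
           100.0 * efficiency(4096, 32));   /* large packets amortise it */
    printf("128 B payload, 24 B overhead : %.1f%% efficient\n",
           100.0 * efficiency(128, 24));    /* small packets hurt badly */
    return 0;
}
```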
That gives us a potentially much slower data transfer rate than wire speed. Obviously, if the drivers, controllers and all the interconnects between them are doing it right, they should issue multi-block reads and writes to the disk and pass those down the links, rather than splitting them into inefficient, small packets with large relative overheads. Are we sure that the drivers and hardware are sending multi-block reads all the way down the chain?
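For reference, this is roughly (and with made-up values) the sort of thing a block driver does to tell the block layer how big a single request can be, so reads and writes get merged into large multi-block transfers instead of trickling down one sector at a time - the helpers are the standard kernel ones, the limits are invented:

```c
#include <linux/blkdev.h>

/* Illustrative only: advertise generous per-request limits so the block
 * layer can merge I/O into large multi-block transfers. Values are made up,
 * not taken from the i.MX6 or any specific USB->SATA driver. */
static void my_setup_queue_limits(struct request_queue *q)
{
	/* Allow up to 1024 sectors (512 KiB) per request ... */
	blk_queue_max_hw_sectors(q, 1024);
	/* ... spread across at most 64 scatter/gather segments. */
	blk_queue_max_segments(q, 64);
}
```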
PCIe, USB and SATA are also serial buses, but the interfaces to the controllers are not serial. There will be a Serializer/Deserializer (SerDes) in the external PHY, and that serialization step adds to the latency of the transactions, too.
In this case you have the absolute worst of all worlds - a PCIe USB controller attaching a SATA drive. We aren't even adding in average seek time for a spinning-platter drive.. that can be anywhere from tenths to multiple tens of milliseconds. Have mercy :smileygrin:
That still does not explain why a so-called "150MB/s" drive (which, over a so-called "500MB/s" bus, should be *somewhat* achievable even with the overheads) performs so much worse here, especially if you test it on designs from other silicon vendors and get better numbers - one would expect to see a little bit better, but you can't hope for the "maximum wire bandwidth per lane" from the bus specification, and expecting to ever see 500 SI Megabytes per second from this test is irrational.
How is the Linux driver allocating memory for these controllers? For USB, PCIe and SATA drivers it is common for DMA buffers to be mapped as Device or Normal non-cacheable memory (to avoid the need for cache maintenance) - otherwise explicit DMA cache invalidate and clean operations have to be implemented around every transfer.
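A minimal sketch of the two usual options, assuming a generic driver ("dev", "buf" and "len" are placeholders, not from any specific i.MX6 driver):

```c
#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/sizes.h>

/* 1) Coherent allocation: the buffer is mapped non-cacheable (Normal
 *    non-cacheable or Device), so no cache maintenance is needed, but every
 *    CPU access goes straight out to the interconnect. */
static void *alloc_coherent_buf(struct device *dev, dma_addr_t *dma_handle)
{
	return dma_alloc_coherent(dev, SZ_64K, dma_handle, GFP_KERNEL);
}

/* 2) Streaming mapping of an ordinary cacheable buffer: CPU accesses stay
 *    cheap, but the DMA API has to clean/invalidate the caches around every
 *    transfer (that work happens inside map/unmap/sync). */
static void receive_into(struct device *dev, void *buf, size_t len)
{
	dma_addr_t busaddr = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, busaddr))
		return;

	/* ... program the controller with 'busaddr', wait for completion ... */

	dma_unmap_single(dev, busaddr, len, DMA_FROM_DEVICE); /* invalidate */
}
```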
Interconnects are usually designed to give their best bandwidth when transactions are cache-line sized and aligned (at the cost of latency). If your region is not cacheable, you are putting more pressure on the interconnect. If the region is Device memory you are also adding ordering restrictions to the mix (potentially stressing the write buffers at every step).
If the region is cacheable, the Cortex-A9 coupled with a PL310 puts another roadblock in the way, in that some non-atomic cache operations have to be waited for on the PL310 side of things, and sometimes you have to manually drain the store buffers in the PL310 for every one of those operations.
If the register accesses for these controllers use writel() in the kernel even where no barrier or ordering is needed, then unfortunately every writel() carries a full system barrier (and potentially an outer_cache->sync() on the PL310 to get the store buffers drained). That can significantly hamper the performance of your driver.
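A hedged illustration of the difference (the device, register offsets and bits are made up): batch the setup writes with writel_relaxed() and use one ordered writel() to kick the hardware, instead of paying the barrier on every access:

```c
#include <linux/bits.h>
#include <linux/io.h>
#include <linux/kernel.h>

/* Hypothetical DMA-style controller; offsets and bits are invented. */
#define REG_DESC_LO	0x00
#define REG_DESC_HI	0x04
#define REG_LEN		0x08
#define REG_CTRL	0x0c
#define CTRL_GO		BIT(0)

static void kick_transfer(void __iomem *base, u64 desc, u32 len)
{
	/* Plain register setup: no ordering needed between these stores,
	 * so avoid the full barrier that writel() would add to each one. */
	writel_relaxed(lower_32_bits(desc), base + REG_DESC_LO);
	writel_relaxed(upper_32_bits(desc), base + REG_DESC_HI);
	writel_relaxed(len, base + REG_LEN);

	/* One ordered write to start the transfer: writel() ensures the setup
	 * above (and any DMA descriptors in memory) are visible to the device
	 * before it starts. */
	writel(CTRL_GO, base + REG_CTRL);
}
```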
You have to quantify all of this together to determine exactly why you don't see the speed you expect, or at least narrow it down to a very specific point where things start to fall over when you swap one piece out for another (obviously you can't change the A9, the PL310, the PCIe controller or the PHY, but you can use a different register write function like writel_relaxed(), map memory differently or use a different cache maintenance strategy, pick a different USB controller, pick a different USB->SATA bridge..)