K60 and K70 USB Question: True high-speed operation (480 Mbps)


4,171 Views
CzechThunder
Contributor III

Hello Freescale Forum,

I have a TWR-K60 application that needs to transfer data at a high, fixed rate over USB.  I seem to be hitting a wall at around 2 Mbps (megabits per second) and would ideally like to see 5 Mbps or more.

I am using the CDC example project "USB_device" with the "Freescale_CDC_Driver_Kinetis.inf" file configuring a virtual COM port on the PC side.  I very much want to stay with this device class as it is straightforward and works with virtually any Windows XP / Win 7 system.  I know an audio type device with isochronous exchange can provide more throughput but I was hoping to avoid that complexity and don't want to risk dropping frames.

I know that on the hardware side a USB PHY chip is required to realize true high-speed (480 Mbps) operation.  The TWR-SER2 enhanced serial module provides this.  We have a very helpful Freescale rep. who has a TWR-SER2 board on order for us.

On the software side, the sample CDC driver is working very well and I was even able to modify the BULK endpoint to use 64 byte packets versus the 32 byte packets it was set up for.  My question is, can I use a modified version of this driver to make high-speed BULK USB function properly?  I would have to change the packet size to 512 bytes at a minimum.   Would there be changes required on the Windows side as well?

Any advice appreciated.

15 Replies

1,391 Views
vladimirkhusain
Contributor III

For whatever it is worth, we (Emcraft Systems) are getting a throughput of ~1MBytes/sec for a UDP test over a USB-based WiFi dongle. This is for a 150MHz Kinetis K70 (or K61, doesn't matter) running Linux, using the USB HS interface.

Using USB WiFi with K70 under uCLinux

I am unable to quantify the overhead of the Linux TCP/IP stack, as related to the raw USB throughput, but it is probably quite substantial.

We also have a sample project that mounts a USB Flash as a disk but I don't have performance figures for that handy.

For anyone willing to take a look, the Emcraft Linux kernel, including the full USB stack, is available from our repository at github.com:

EmcraftSystems/linux-emcraft · GitHub

0 Kudos

1,391 Views
mjbcswitzerland
Specialist V

Hi

 

With Full-Speed bulk I would expect that you will be able to achieve close to 8Mb/s throughput when not passing through a USB hub sharing the connection with other devices.

 

If you have a USB analyser you can monitor the traffic and see the data throughput (frames with data rather than NAKs) to see whether there are delays between data blocks or periods where the data stalls for a short time, thus decreasing the overall throughput.

 

If you are not achieving more than 2 Mb/s with full speed it is almost certainly due to SW and not the USB speed/PC host. Therefore I would recommend first solving this rather than moving to 480Mb/s bit rate with the added complexity of the external HS transceiver. The reason being that any SW limits will still be present no matter how fast the bit rate is - there is no point in having HS if the SW can't achieve more than FS is capable of.

 

Generally one has to question the real benefits of HS USB for general purpose processors because it is a heavy load on the CPU to try to keep up with the possible data rate of 60MByte/s throughput, whereby USB tends to be very interrupt intensive. Even the Kinetis will probably be hard pushed to get anywhere near this rate in real applications.

 

Regards

 

Mark

 

1,390 Views
CzechThunder
Contributor III

Thanks Mark.  Very good advice.   I will dig into this a little deeper and see what I can divine.  I know the topic of USB throughput comes up a lot so I'll try to share any useful revelations I have.  I know that on a slightly older (3 or 4 years old) Windows XP machine, CPU usage spikes up into the 70 or 80% area when this data is being sent & captured with a basic terminal program.  So I haven't ruled out the PC side or Windows driver as the bottleneck.

 

rgds,

 

Gary

0 Kudos

1,390 Views
mjbcswitzerland
Specialist V

Hi Gary

 

I just checked reading large files from an SD card to USB (USB-MSD) and saw about 6M bit/s speed (eg. a 20MByte file was being copied in about 27s). This includes delays in reading from the SD card via SDHC on a K60 at 100MHz (SD card at 25Mb/s) plus the USB-MSD overhead.

 

CDC will generally have rather less overhead than this (no SD card accesses and less USB overhead during the bulk-endpoint transfer) so 5Mb/s should not be any problem.

 

Regards

 

Mark

 

0 Kudos

1,391 Views
danielchai
Senior Contributor I

Hi Mark,

How did you test this? Could you share the test project  of USB-SD MSD with me?

Thank you.

-Daniel

0 Kudos

1,390 Views
CzechThunder
Contributor III

Hi again Mark.

 

Thanks for running this test and sharing the updated timing information.

 

Here's what I found out after digging into this a little further.  The non-MQX CDC project provided by Freescale is a very good starting point, but needs a few tweaks.  For starters, the routine that transmits from the device to the host (as a response to an IN request) doesn't packetize long transfers and doesn't check the OWN bit when writing to the BDT buffer.  In other words, it's really only set up for 64 byte maximum transfers. 
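
To make the two missing pieces concrete, here is a minimal sketch in Python. It is a toy model only, not the Freescale driver code: `packetize`, `BufferDescriptor`, `try_load` and `sie_done` are hypothetical names, and the `own` flag merely stands in for the OWN bit in a real BDT entry. It also assumes a transfer that ends exactly on a packet boundary is closed with a zero-length packet; whether that is needed depends on how the host sizes its IN requests.

```python
from typing import Iterator

MAX_PACKET = 64  # full-speed bulk maximum packet size in bytes


def packetize(data: bytes, max_packet: int = MAX_PACKET) -> Iterator[bytes]:
    """Yield the packets of one bulk IN transfer.

    Assumption: a transfer whose length is an exact multiple of max_packet
    is followed by a zero-length packet so the host can see that it ended.
    """
    for offset in range(0, len(data), max_packet):
        yield data[offset:offset + max_packet]
    if len(data) % max_packet == 0:
        yield b""


class BufferDescriptor:
    """Toy stand-in for one BDT entry; own=True means the SIE holds it."""

    def __init__(self) -> None:
        self.own = False
        self.payload = b""

    def try_load(self, packet: bytes) -> bool:
        """Load a packet only if the CPU owns the buffer; return success."""
        if self.own:
            return False      # SIE has not finished with the previous packet
        self.payload = packet
        self.own = True       # hand the buffer over to the SIE
        return True

    def sie_done(self) -> None:
        """Model the SIE completing the packet and releasing the buffer."""
        self.own = False
```

A transmit routine built this way would retry `try_load` (or wait for the transfer-complete interrupt) instead of overwriting a buffer the SIE still owns.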

 

After enhancing the transmit routine slightly, I was able to get 192 byte (3 packet) transfers being sent from the device to the host in every 1ms frame.  And then I hit a wall.  I'm assuming the host device driver (usbser.sys) was somehow throttling the transfers even though the IN requests it was sending were for 4096 byte packets.  I watched this on USBlyzer.com. I know that writing a custom device driver on the host side is an option and a lot of companies do that but I don't have the bandwidth to take that on.  I need something fairly "off the shelf".  

 

But the big show stopper with that approach is that the device has to babysit the packetized transfer, not unlike a UART.  I was seeing a 70 uSec delay between each of the 64 byte outgoing packets.  And I would assume that in the USB-MSD example you referenced, the K60 would be very highly taxed to achieve the transfer rates you were seeing.  This is simply too much CPU overhead for my application. 
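
As a quick sanity check on that wall, the payload rate for three 64 byte packets per 1ms frame can be worked out as below. This is only a back-of-the-envelope sketch; `throughput_mbps` is a made-up helper name and the figure ignores all protocol overhead.

```python
def throughput_mbps(bytes_per_interval: int, interval_us: float) -> float:
    """Payload throughput in Mb/s (bits per microsecond equals Mb/s)."""
    return bytes_per_interval * 8 / interval_us


# Three 64 byte packets in every 1 ms full-speed frame.
rate = throughput_mbps(3 * 64, 1000.0)
print(f"{rate:.3f} Mb/s")  # prints "1.536 Mb/s"
```

That lands close to the roughly 2 Mbps ceiling described in the first post.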

 

So my current approach is to use a K70 processor with a true high speed USB controller.  I need to fire off a single, 512 byte packet either every 4 or 8 microframes, where a microframe is a 125 uSec interval.  This will satisfy my transfer speed without burdening my CPU.  The problem is that the high speed module in the K70 is essentially brand new, is not fully documented (the bitfields in the USBHS_EPSETUPSR are not spelled out, for example) and there are very few sample projects for it.  I have a single HID example running at true high speed but it's a fairly crude example.  So my challenge is to port the existing full speed CDC project over to the K70's high speed controller or wait for Freescale to release more example classes.  Supposedly there are more examples being released on March 8th.
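
For reference, the payload rates implied by those two pacing options can be sketched as follows. The helper name `paced_rate_mbps` is hypothetical and the numbers cover payload bytes only, assuming exactly one 512 byte packet goes out per pacing interval.

```python
MICROFRAME_US = 125.0  # high-speed microframe length in microseconds


def paced_rate_mbps(packet_bytes: int, microframes_between_packets: int) -> float:
    """Payload rate in Mb/s when one packet is sent every N microframes."""
    period_us = microframes_between_packets * MICROFRAME_US
    return packet_bytes * 8 / period_us


for n in (4, 8):
    print(f"every {n} microframes: {paced_rate_mbps(512, n):.3f} Mb/s")
```

Every 4 microframes works out to about 8.2 Mb/s and every 8 microframes to about 4.1 Mb/s, so only the faster option clears the 5 Mbps target from the first post.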

 

A little long winded but that's where I'm at now.

 

rgds,

 

Gary

 

 

 

0 Kudos

1,391 Views
BlackNight
NXP Employee

Hi Gary,

could you share what you have changed in the transmit routine?

I have ported/transformed the Freescale USB Stack 3.1.1 into a Processor Expert component, and this works pretty well for CDC and ColdFire V2. I'm using two ring buffers for send and receive, but still I'm seeing some glitches in the packets. So this indeed could be because of the OWN bit.

 

Thanks,

BK

0 Kudos

1,391 Views
mjbcswitzerland
Specialist V

Hi Gary

 

Fact is that USB is quite interrupt intensive.

To achieve about maximum full speed throughput the driver will be handling an interrupt about once every 52us. This interrupt must copy data (64 bytes each time to fill an endpoint buffer descriptor). Since the endpoints are double buffered it is possible to do this fairly easily with V2 Coldfires (which have the same USB controller) running at 40MHz or so. The Kinetis run faster and so can do this even easier.

 

To obtain maximum HS throughput you would be transferring larger amounts of data in each interrupt. Maximum HS would be interrupted about every 10us which will need to copy 512 bytes to the next buffer descriptor. I don't know how much power is required to actually keep up with this rate but I doubt that a Kinetis will get very near to the best case.

 

I think that you are banking on missing a lot of the IN tokens and just preparing 512 byte frames once every 500us or so. This would achieve 8Mb/s with a 512 byte copy between each interrupt. As long as the application can prepare the data directly in the next buffer descriptor the copies (these are often the highest overhead) wouldn't be needed in the interrupt itself (same would be true in HS mode but the application would have to make more effort to pack in the shorter frames at a higher rate).

 

The fact that HS supports larger frame size does mean that there is less interrupt overhead (factor of about 8) but the amount of data shovelling to keep the next buffer descriptor filled to achieve a certain throughput will be approximately the same. I don't know whether the impact of the interrupt is that great in comparison to the data copy part though; I suppose your results will show this best.

 

Regards

 

Mark

 

0 Kudos

1,391 Views
CzechThunder
Contributor III

"I think that you are banking on missing a lot of the IN tokens and just preparing 512 byte frames once every 500us or so. This would achieve 8Mb/s with a 512 byte copy between each interrupt. As long as the application can prepare the data directly in the next buffer descriptor the copies (these are often the highest overhead) wouldn't be needed in the interrupt itself (same would be true in HS mode but the application would have to make more effort to pack in the shorter frames at a higher rate)."

Bingo.  This is exactly my strategy.  I will start out with a 500 uSec interrupt and may back off to 1ms or so.  My understanding is that the USB descriptor lets you tell the host the desired polling interval via the 'bInterval' setting. But the basic idea is to offload most of the heavy lifting to the SIE in the K70's USB controller.  The faster bus clock means it can manage larger packets more efficiently.  

And you are also correct about the data shoveling.  That part of the CPU burden stays the same with either a full speed or high speed USB port.  For reference, I timed a 512 byte memory copy operation at 38.4 uSec with a K70 tower development board clocked at 120MHz.
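
To put that copy time in perspective, here is a small sketch of the CPU share it would take at the two intervals mentioned above. The only measured input is the 38.4 uSec figure; `copy_load_percent` is a hypothetical helper and the result ignores interrupt entry/exit and everything else the stack does.

```python
COPY_US = 38.4      # measured time for one 512 byte copy on the K70 at 120MHz
PACKET_BYTES = 512


def copy_load_percent(copy_us: float, period_us: float) -> float:
    """Share of CPU time spent copying if one copy runs every period_us."""
    return 100.0 * copy_us / period_us


for period_us in (500.0, 1000.0):
    load = copy_load_percent(COPY_US, period_us)
    print(f"one copy per {period_us:.0f} us: {load:.2f}% CPU")

# Effective copy bandwidth implied by the measurement, in bytes per microsecond.
print(f"copy bandwidth: {PACKET_BYTES / COPY_US:.2f} bytes/us")
```

So the copy alone costs under 8% of the CPU at a 500 uSec interval and about half that at 1ms.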

Again, I really appreciate your thoughtful replies.

rgds,

Gary

0 Kudos

1,391 Views
mjbcswitzerland
Specialist V

Gary

 

bInterval is only used by interrupt and isochronous endpoints and should be set to 0 for bulk (as used by CDC). It is presumably also not looked at by the host when bulk is specified but may affect a hub (or a badly behaved hub ?).

 

When you use bulk the host will be trying to read as many IN frames as possible (limited by available bandwidth, which may be shared with other devices connected to the same host). If the USB controller has no data to send it will simply respond with a NAK so the fact is that you can simply send your data at any rate you like since it will be transferred when the next IN token happens to be received (which is generally almost immediately). If you only send one frame of data every 10 minutes it makes no difference and so there is no need to configure the host to know at what rate to request INs. IN NAKs don't cause the SW to be interrupted in any way since they are handled by the USB controller hardware - it simply sends the NAK immediately when there is no buffer descriptor waiting to be serviced.
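
The behaviour described here can be illustrated with a toy model. This is only a sketch of the idea in Python, with made-up names (`BulkInEndpoint`, `queue`, `on_in_token`); in a real device the NAK decision is taken by the USB controller hardware, not by software.

```python
from collections import deque
from typing import Deque, Tuple


class BulkInEndpoint:
    """Toy model of a bulk IN endpoint as seen from the bus."""

    def __init__(self) -> None:
        self._armed: Deque[bytes] = deque()  # packets waiting in buffer descriptors
        self.naks = 0                        # NAKs sent without CPU involvement

    def queue(self, packet: bytes) -> None:
        """Application side: arm a buffer descriptor whenever data is ready."""
        self._armed.append(packet)

    def on_in_token(self) -> Tuple[str, bytes]:
        """Bus side: answer an IN token with DATA if armed, otherwise NAK."""
        if self._armed:
            return ("DATA", self._armed.popleft())
        self.naks += 1
        return ("NAK", b"")
```

The device sets the pace simply by choosing when to call `queue`; the host keeps polling in the meantime and each unanswered poll costs the firmware nothing.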

 

Regards

 

Mark

 

 

 

0 Kudos

1,391 Views
CzechThunder
Contributor III

Very good to know that the bInterval setting doesn't apply for high speed bulk interrupts.  And I get what you're saying about the NAK and the buffer descriptor being used as a mechanism to control output response rate.  I'm thinking the IN requests must be timed on microframe intervals, i.e., some multiple of 125 uSec?

 

I'm very anxious to get this all going and report my timing.  Going to take some work to port this over to the K70.

 

Gary

0 Kudos

1,391 Views
CzechThunder
Contributor III

meant to write "high speed bulk endpoints". 

 

0 Kudos

1,394 Views
mjbcswitzerland
Specialist V

Gary

 

When there is a single IN endpoint on a FS USB connection and it is not shared by other devices (via hub) and there is no other traffic (such as OUT frames) the host sends an IN about once every 12us. When there is an immediate NAK because there is no IN data waiting, the host will immediately try again with the next IN token (this gives the approx 12us IN attempt period).

 

When there is IN data returned, the next IN token is sent as soon as the data has been transferred. When there are continuous maximum length responses to each IN it means that there are 19 IN requests with 64 byte data responses in a 1ms frame interval, giving the maximum bulk data throughput of 9.7Mb/s (the overhead is 23.7% giving the constant bit rate of 12Mb/s).
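
The full-speed figures can be reproduced with a few lines of arithmetic. This is only a sketch: the 19 packets per frame value is taken from the post above and the variable names are made up.

```python
FRAME_US = 1000.0        # full-speed frame length in microseconds
PACKETS_PER_FRAME = 19   # maximum 64 byte bulk packets per frame, as stated above
PACKET_BYTES = 64

payload_mbps = PACKETS_PER_FRAME * PACKET_BYTES * 8 / FRAME_US
packet_period_us = FRAME_US / PACKETS_PER_FRAME

print(f"payload rate:  {payload_mbps:.3f} Mb/s")
print(f"packet period: {packet_period_us:.1f} us")
```

This gives roughly 9.7 Mb/s of payload and one packet about every 52.6us, which lines up with the interrupt period of about 52us quoted earlier in the thread.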

 

HS will be faster and I expect (although I don't have a HS bulk device at hand to measure) that the host will be sending IN tokens at a rate of about 300ns or so - therefore there will be about 400 IN tokens in each 125us microframe assuming each is NAKed. As noted before, the USB controller handles this and the processor doesn't need to do anything (thankfully...;-). Note that the overhead for HS bulk data is 25% so, if the processor could prepare 512bytes of data for each IN token (at a rate of about 10.4us), the maximum data throughput of 384Mb/s would be achieved.
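
The same kind of arithmetic applies to the high-speed estimates. In this sketch the 300ns token spacing and the 25% overhead are the assumptions stated in the post, not measurements, and the variable names are made up.

```python
MICROFRAME_NS = 125_000     # high-speed microframe length in nanoseconds
TOKEN_SPACING_NS = 300      # assumed spacing of NAKed IN tokens (estimate above)
HS_BIT_RATE_MBPS = 480.0
OVERHEAD = 0.25             # assumed overhead relative to the payload
PACKET_BYTES = 512

tokens_per_microframe = MICROFRAME_NS // TOKEN_SPACING_NS
payload_mbps = HS_BIT_RATE_MBPS / (1.0 + OVERHEAD)
packet_period_us = PACKET_BYTES * 8 / payload_mbps

print(f"IN tokens per microframe: {tokens_per_microframe}")
print(f"payload rate:  {payload_mbps:.1f} Mb/s")
print(f"packet period: {packet_period_us:.2f} us")
```

Under those assumptions the numbers come out at a little over 400 tokens per microframe, 384 Mb/s of payload and one 512 byte packet roughly every 10 to 11us.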

 

Regards

 

Mark

 

0 Kudos

1,394 Views
CzechThunder
Contributor III

Long overdue reply to this thread.  Was finally able to get the results I was looking for by using the TWR-SER2 board combined with the TWR-K70 main board.

 

This took a bit of work but I was able to port the USB CDC demo project from Freescale's MQX early access release.  I'm not sure if this is publicly available yet but I got it through my Freescale rep.

 

Regardless, this is a very fully featured USB stack designed to be dynamically loaded and designed to support multiple USB devices simultaneously.  I had to remove the MQX hooks and changed the dynamic memory allocation to static.  I also simplified some of the handle passing and  typedef overloading to make it a little more debuggable for my needs.

 

Bottom line is that I'm seeing at least 12 Megabit per second throughput with usbser.sys (using dual-core i5 Win 7 laptop) and running up against the limit of the K70 processor itself.  The stack is also very efficient.   Writing a single BULK INPUT packet (up to 512 bytes for USB 2.0 High Speed) only takes 17uSec.   The stack itself does not appear to do a buffer copy operation but rather passes a user supplied buffer address to the EHCI.

 

I'm very early into my benchmarking and don't have a feel for long term stability yet but early indications are that this will be a workable solution.

 

 

0 Kudos

1,394 Views
Spikey_Mikey
Contributor II

Would you be willing to share your code as an example?  I have been struggling with the same USB throughput issues and would love to have some code that works.  At the end of the day I am trying to design a bare metal project that can push at least 8Mb/s, ideally using CDC to make my PC side software easier to develop.

 

Thanks!

0 Kudos