Mass Storage transfer speed

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Muis on Wed Nov 21 07:24:59 MST 2012
Using the LPC11U14 and LPCUSBlib 0.98, I created a Mass Storage device that reads data from an SD card. The problem is that I can't reach transfer rates above 250 kB/s, while a cheap off-the-shelf card reader easily manages 1 MB/s. I'm not sure whether the SPI rate, the USB clock, or the CPU speed is the bottleneck, so I have a couple of questions:

     - SPI is running at 12 MHz; am I right that this should be enough?
     - Would implementing multi-block SPI transfers make a large difference?
     - VIRTUAL_MEMORY_BLOCK_SIZE is 512 in the MassStorage example; would increasing it improve performance?
     - Is this CPU even capable of 1 MB/s transfers from MicroSD? If so, is there some example code?

Content originally posted in LPCWare by Gonzalo_Sipel on Wed Dec 31 06:06:10 MST 2014
Dear friends,

I'm trying to read and write an SD card using USBmassStoragedevice. I can read my SD card without problems, but when I try to write, the library goes into an infinite loop.
Transfer rate aside, have you been able to write files without problems?
I'm using an LPC1788, which doesn't have a USB ROM driver.
Any suggestions?

Regards

Content originally posted in LPCWare by Madara on Mon Nov 25 03:47:14 MST 2013
I am able to read from the SD card but cannot write to it. When I write to the card, it gets corrupted.

The logic I am using to write to the card is:
while(TotalBlocks--){
    offset = 0;
    length = 512;
    memset(Buffer,0,512);
    while(!Endpoint_IsReadWriteAllowed());
    while(length > 0){
        bytestowrite = USB_get_data(temp);
        TRACE("B%d",bytestowrite);
        if(!bytestowrite)
            break;
        memcpy(Buffer + offset, temp, bytestowrite);
        offset += bytestowrite;
        length -= bytestowrite;
    }
    sdmmc_write(Buffer, BlockAddress,1);
    if (MSInterfaceInfo->State.IsMassStoreReset)
        return 0;
    BlockAddress++;
}
if (!(Endpoint_IsReadWriteAllowed()))
    Endpoint_ClearOUT();
}

Is there a problem with my code? USB_get_data() is simply a wrapper around 16 calls to Endpoint_Read_8().

Content originally posted in LPCWare by Muis on Tue Dec 04 08:46:38 MST 2012
I finally solved the problem by:

- Applying the suggested bugfix to DcdDataTransfer
- Using memcpy to fill usb_data_buffer_IN instead of using Endpoint_Write_Stream_LE

I'm now able to get the same speed as the card reader (1 MB/s).

Content originally posted in LPCWare by Tsuneo on Mon Dec 03 01:34:26 MST 2012
** sigh **
Muis,

> I tested it on a USB 2.0 port

When a Full-Speed (FS) device is plugged in to a PC "USB 2.0 port", a companion FS host controller is assigned to the device, instead of a High-Speed (HS) host controller. Then, you see lower performance on the host side. You should use an external USB 2.0 hub.

> how can I get 1 MB/s when using a normal card reader on the same port?

The card reader should work on High-Speed, as I said in my first post.
Did you confirm it?

You should learn the USB basics first.

Tsuneo

Content originally posted in LPCWare by Muis on Sun Dec 02 20:41:44 MST 2012
I tested it on a USB 2.0 port, but if the host was the bottleneck, how can I get 1 MB/s when using a normal card reader on the same port? So I'm pretty sure the host is not the bottleneck.

However, I did receive an interesting message from NXP support about a bug in NXPUSBlib:

"There was some faulty logic in the code that sets up the DCD descriptors that was limiting the transfer buffer size to 64 bytes. When this is increased to 512 bytes, which is the largest size for a single descriptor, the hardware should be sending out data as fast as it can.

In Endpoint_LPC11Uxx.c in the function DcdDataTransfer() replace the first line, and if statement, with this line:

if (!IsOutEndpoint(EPNum)&&(EPNum==1)) // Control IN endpoint

This should at least double your throughput."

I'm going to try it out tomorrow. I must say it's strange that there are no NXP examples showing how to implement a MicroSD SPI --> Mass Storage application. I expected it to be among the default examples that come with the library. Now everyone has to write their own implementation of such a widely used task, and NXP can simply say that the performance problem lies in your own code.

Content originally posted in LPCWare by ngraum on Sun Dec 02 20:16:33 MST 2012
That's a great idea about modifying Endpoint_Write_Stream_LE(), I'll have to try it. Unfortunately in my application (dual host on the 1800) it is currently impossible for me to measure performance over a hub because nxpUSBlib does not currently support hubs.

Content originally posted in LPCWare by Tsuneo on Sun Dec 02 20:11:57 MST 2012
As I said in my first post, the major bottleneck of FS MSC (Full-Speed Mass Storage Class) lies on the host side. When the host does a single-sector read/write (for FAT and directory sectors), it takes 3 frames (one frame per stage) to complete a READ10/WRITE10 SCSI command over Bulk-Only Transport.
That is, 512 bytes / 3 ms = around 167 Kbytes/sec

When the host moves a 4K cluster in a single READ10/WRITE10 (for file sectors),
- CBW stage : 1 frame
- DATA stage : 4 frames
- CSW stage : 1 frame
4K bytes / 6 ms = around 667 Kbytes/sec

Over a USB 2.0 hub, which converts FS transactions into HS (High-Speed) ones,
Single sector read/write: 512 bytes / 6 micro-frames = around 667 Kbytes/sec
4K cluster read/write: 4K bytes / 31 micro-frames = around 1,032 Kbytes/sec

In this way, a USB 2.0 hub improves host-side performance significantly. Therefore, the speed of an FS MSC device should be measured over a USB 2.0 hub in order to find the bottleneck on the device side. Did you measure your performance over a hub?
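
The arithmetic above can be reproduced with a small helper. This is only a sketch: the function name and constants are made up for illustration, 1 Kbyte is taken as 1024 bytes, and the frame counts are the ones quoted in this post rather than measured values.

```c
#include <assert.h>
#include <stdint.h>

/* Period lengths in microseconds: a full-speed frame and a high-speed micro-frame. */
#define FS_FRAME_US       1000u
#define HS_MICROFRAME_US   125u

/*
 * Hypothetical helper: throughput in Kbytes/s (1 Kbyte = 1024 bytes),
 * rounded to nearest, for a transfer that moves `bytes` payload bytes
 * in `periods` frames (or micro-frames) of `period_us` microseconds each.
 */
static uint32_t msc_kbytes_per_sec(uint32_t bytes, uint32_t periods, uint32_t period_us)
{
    assert(periods > 0 && period_us > 0);
    uint64_t denom = (uint64_t)periods * period_us * 1024u;
    uint64_t numer = (uint64_t)bytes * 1000000u;
    return (uint32_t)((numer + denom / 2u) / denom);
}
```

For example, msc_kbytes_per_sec(512, 3, FS_FRAME_US) covers the single-sector case on a root port, and msc_kbytes_per_sec(4096, 31, HS_MICROFRAME_US) covers the 4K cluster case behind a hub.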

As for the nxpUSBlib code, you may optimize the Endpoint_Write_Stream_LE() routine (EndpointStream_LPC.c). This routine copies data from Buffer[][] to usb_data_buffer_IN[] one byte at a time. Using memcpy(), you'll get better performance.
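
To illustrate the difference, here is a minimal sketch of the two copy styles. It is not the nxpUSBlib source: copy_bytewise() and copy_block() are hypothetical stand-ins, and the destination would be the library's IN buffer in a real port. Both produce the same bytes; the memcpy() version simply lets the C library move the data in larger units where it can.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time copy, the pattern described above (hypothetical stand-in). */
static void copy_bytewise(uint8_t *dst, const uint8_t *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
    }
}

/* Same result with a single memcpy() call (hypothetical stand-in). */
static void copy_block(uint8_t *dst, const uint8_t *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
}
```

How much this gains depends on the compiler, the C library, and buffer alignment, so it is worth measuring on the target.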

Tsuneo

Content originally posted in LPCWare by ngraum on Sun Dec 02 16:18:13 MST 2012
I'm interested in this thread because I've seen very similar performance in a multi-host application I am working on using the LPC1830. I am inclined to believe the bottleneck is not on the SPI side. I posted a thread not long ago with performance benchmarks that were also in the 250 kB/s range: http://www.lpcware.com/content/forum/mass-storage-readwrite-performance-lpc18xx This leads me to believe the bottleneck is either in nxpUSBlib or in the hardware itself (unlikely).

Also, I've run the speed tests that come with the example code for the NGX LPC1800 Xplorer board, and I see speeds anywhere from 8-10 Mbit/s on a MicroSD card (which, as I'm sure you know, is just SPI).

I should note all of my testing was done on the Xplorer dev board so it is very unlikely the problem is due to incorrect PCB layout or pullup/pulldown resistor configuration.

Content originally posted in LPCWare by Muis on Wed Nov 28 05:37:03 MST 2012
>While the USB engine processes this transfer with the endpoint buffer, your firmware reads out next sector (block) to the "buffer".

Yes, but to read each byte from SPI, the code runs a while-loop. Could it be that the USB engine has lower priority than the SPI loop, so that it doesn't transfer any data while I'm waiting for the next SPI byte? The SPI clock frequency is 12 MHz.

Content originally posted in LPCWare by Tsuneo on Tue Nov 27 09:40:58 MST 2012
> It reads 512 byte blocks from SPI, and sends them to the USB endpoint. When waiting for the SPI block, it sends nothing to the USB endpoint, and when sending to the USB, it reads nothing from SPI.

Actually, in your "Multiple block read" code, part of the USB transfer overlaps with reading from the SD card. Endpoint_Write_Stream_LE() copies the data from the "buffer" to the internal endpoint buffer. Endpoint_ClearIN() arms the bulk IN endpoint, and the USB transfer starts in the background. While the USB engine processes this transfer with the endpoint buffer, your firmware reads out next sector (block) to the "buffer".
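
The effect of that overlap can be estimated with a simple timing model. The sketch below is an illustration only: the function names are hypothetical, every block is assumed to take the same fixed SPI read time and USB send time, and the copy into the endpoint buffer is treated as free.

```c
#include <assert.h>

/* Total time if every block is read over SPI and then sent over USB with no overlap. */
static unsigned long serial_us(unsigned long blocks, unsigned long spi_us, unsigned long usb_us)
{
    return blocks * (spi_us + usb_us);
}

/*
 * Total time when the USB engine sends block k in the background while the
 * firmware reads block k+1 over SPI. After the first read, each further
 * block costs only the slower of the two stages; the last send is not hidden.
 */
static unsigned long overlapped_us(unsigned long blocks, unsigned long spi_us, unsigned long usb_us)
{
    unsigned long slower = (spi_us > usb_us) ? spi_us : usb_us;
    assert(spi_us > 0 && usb_us > 0);
    if (blocks == 0) {
        return 0;
    }
    return spi_us + (blocks - 1) * slower + usb_us;
}
```

With made-up figures of 400 us per SPI read and 500 us per USB send, 8 blocks take 7200 us without overlap and 4400 us with it, so under this model the slower stage sets the sustained rate.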

I believe the speed bottleneck lies on the SPI side.
What is the frequency of the SPI clock? You may increase it up to 25 MHz after the first negotiation.

Tsuneo

Content originally posted in LPCWare by Muis on Tue Nov 27 02:10:45 MST 2012
I'm using a stock LPCXpresso board without modifications, so I assume it's running at FS.

Content originally posted in LPCWare by Tsuneo on Sun Nov 25 18:55:52 MST 2012
> Stupid question but are you ensuring that the USB interface itself is FS and not LS by using a pull-up resistor on D+ and not D-?

The USB engine on the LPC11Uxx doesn't support LS; it does only FS.
If the pull-up were attached to D-, it couldn't enumerate.

Tsuneo

Content originally posted in LPCWare by Scribe on Sun Nov 25 13:52:01 MST 2012
Stupid question but are you ensuring that the USB interface itself is FS and not LS by using a pull-up resistor on D+ and not D-?
The LPC11Uxx can run at LS without a crystal, lowering cost and size, so it's not unusual to find configurations for LS.

Once I've given 0.98a a try I'll let you know what I'm getting off my SPI-based Flash.

Content originally posted in LPCWare by Muis on Thu Nov 22 08:12:55 MST 2012
I implemented multi-block read, but it didn't make a big difference. My main code looks like this:

if (count == 1) { /* Single block read */
    if (send_cmd(CMD17, sector) == 0) { /* READ_SINGLE_BLOCK */

        if (rcvr_datablock((void *)buffer, 512))
        {
            count = 0;
        }

        while(!Endpoint_IsReadWriteAllowed());
        Endpoint_Write_Stream_LE((void *)buffer, 512, ((void *)0));
        Endpoint_ClearIN();

    }
}
else { /* Multiple block read */
    if (send_cmd(CMD18, sector) == 0) { /* READ_MULTIPLE_BLOCK */
        do
        {
            if (!rcvr_datablock((void *)buffer, 512))
            {
                //break;
            }

            while(!Endpoint_IsReadWriteAllowed());
            Endpoint_Write_Stream_LE((void *)buffer, 512, ((void *)0));
            Endpoint_ClearIN();

        } while (--count);
        send_cmd(CMD12, 0); /* STOP_TRANSMISSION */
    }
}

It reads 512 byte blocks from SPI, and sends them to the USB endpoint. When waiting for the SPI block, it sends nothing to the USB endpoint, and when sending to the USB, it reads nothing from SPI. Would there be a simple way to make the two operate simultaneously?

Content originally posted in LPCWare by wmues on Thu Nov 22 08:01:52 MST 2012
- SPI at 12 MHz is fast enough, because USB is only 12 Mbit/s.
- Multi-block SPI transfers will make a big difference.
Multi-block write transfers will also enhance wear leveling.
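
As a rough check of the first point, the raw SPI payload rate and the time to clock one sector can be worked out as below. This is a sketch with hypothetical helper names, and it ignores SD command, response, token, and CRC overhead, so real figures will be somewhat worse.

```c
#include <assert.h>
#include <stdint.h>

/* Raw payload rate in bytes/s for an SPI clock of `spi_hz` (one bit per clock). */
static uint32_t spi_bytes_per_sec(uint32_t spi_hz)
{
    return spi_hz / 8u;
}

/* Microseconds needed to clock `bytes` bytes at `spi_hz`, rounded up. */
static uint32_t spi_transfer_us(uint32_t bytes, uint32_t spi_hz)
{
    assert(spi_hz > 0);
    uint64_t bits = (uint64_t)bytes * 8u;
    return (uint32_t)((bits * 1000000u + spi_hz - 1u) / spi_hz);
}
```

At 12 MHz this gives 1,500,000 bytes/s raw and about 342 us per 512-byte sector, so the SPI clock alone does not rule out a rate near 1 MB/s.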

Content originally posted in LPCWare by Tsuneo on Thu Nov 22 04:47:55 MST 2012
> The problem is that I can't reach transfer rates above 250 kB/s, while a cheap off-the-shelf card reader easily manages 1 MB/s.

The USB SIE (engine) on the LPC11U14 is a full-speed (FS: 12 Mbps) one. Maybe your cheap card reader works at high-speed (HS: 480 Mbps).

Because of delay in the PC host controller, FS MSC (Mass-Storage Class) can't reach much speed. The major PC operating systems set up the host controller so that a transfer-completion interrupt occurs at the next SOF. This means each stage of Bulk-Only Transport takes at least 1 ms on an FS bus. This completion delay reduces MSC transfer speed considerably.

When the FS MSC device connects to a PC through a USB 2.0 hub, you may see better transfer speed. A USB 2.0 hub converts FS transactions into HS ones. The HS host controller (EHCI) runs on micro-frames (125 us interval) instead of 1 ms frames, so the delay on the host controller also decreases.

Tsuneo