Hi guys.
I am using an MCF51JM128 micro in my system. According to the datasheet, if I configure the micro to work as a full-speed device (12 Mbit/s, i.e. 1.5 MByte/s), that is the speed I expect from the USB bus.
In practice I get only 550 KByte/s, and this is not enough. I need at least 1 MByte/s. I am using ALL free endpoints for output. I have only one endpoint for input, so changing the input endpoint on the go (OTG) could improve the throughput, but it is not a good enough solution, because I still cannot reach 1 MByte/s.
Has anyone faced this problem before? If so, how did you solve it?
Thank you, Slava.
P.S. I am using the USB driver that I found in the Freescale CAN_USB bridge application. Maybe the problem is in this driver.
http://www.usb.org/developers/usbfaq#band1
From the above link, read Table 5.2 in the "Universal Serial Bus Specification Revision 2.0" document. As far as I can tell the maximum data packet size is 64 bytes with 45 bytes of protocol overhead, resulting in a maximum bandwidth of 832,000 bytes/second. That matches what I measured with HDTune.
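The 832,000 figure can be reproduced from the numbers quoted above. This is only a sketch of the arithmetic: it assumes a 1 ms frame carrying 1500 bytes (12 Mbit/s for 1 ms, ignoring bit stuffing), and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one 1 ms full-speed frame at 12 Mbit/s = 12,000 bits = 1500 bytes. */
#define FS_FRAME_BYTES    1500u
#define FRAMES_PER_SECOND 1000u

/* Whole transfers that fit in one frame for a given payload size and
 * per-transfer protocol overhead (both in bytes). */
uint32_t transfers_per_frame(uint32_t payload, uint32_t overhead)
{
    assert(payload + overhead > 0u);
    return FS_FRAME_BYTES / (payload + overhead);
}

/* Payload bytes per second when every frame is filled with such transfers. */
uint32_t max_payload_bytes_per_second(uint32_t payload, uint32_t overhead)
{
    return transfers_per_frame(payload, overhead) * payload * FRAMES_PER_SECOND;
}
```

With a 64-byte payload and 45 bytes of overhead, 13 transfers fit in a frame, giving 13 x 64 x 1000 = 832,000 bytes/second. The result is sensitive to the overhead figure, so it is worth checking which transfer type the table row applies to.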
http://en.wikipedia.org/wiki/Usb#Transfer_speeds_in_practice
"For USB 1.1, an average transfer speed of 880 KiB/s has been observed."
> P.S. I am using USB driver that i found in Freescale application CAN_USB bridge.
Find and read AN3631.pdf. It might refer to your stack.
I've just plugged a USB Stick into the USB1 hub that is part of the Keyboard on my desktop computer. I am running "hdtune" to measure the maximum read rate from the stick. The maximum I can get is 800 kbytes/second.
If I can't get to 1 MB/s on an 8-core multi-gigahertz desktop computer, then maybe you won't be able to get that speed on an embedded device either.
===
If you want to find out what it is doing, start measuring your execution timing.
First though, is your CPU running at the expected speeds? Is it running at the proper clock rate and is the Cache set up properly?
The easiest way to start measuring timing is to start a DMA timer (or equivalent) in free-running mode. I'm using an MCF5329, and we have a 32-bit DMA timer free-running at 1MHz. At the start of the function whose execution time you want to measure, read the timer count register; at the end of the function, read it again and subtract the two readings. Print the result on a serial console or stash it away in a variable you can later inspect with the debugger. You can work through the code trying to find where it is stuck and maybe what it is waiting on. That will also tell you if the CPU is executing code at the expected rate.
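A minimal sketch of that measurement, assuming a free-running 32-bit counter. The `timer_read_fn` hook is hypothetical: on real hardware it would simply return the timer count register, whose name depends on the chip.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hook: returns the current count of a free-running 32-bit
 * timer (1 tick = 1 us if the timer runs at 1 MHz). */
typedef uint32_t (*timer_read_fn)(void);

/* Ticks between two readings. Unsigned subtraction gives the right answer
 * even if the counter wrapped once between start and end. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Time a single call of work() and return the elapsed ticks. */
uint32_t time_call(timer_read_fn read_timer, void (*work)(void))
{
    assert(read_timer && work);
    uint32_t start = read_timer();
    work();
    return elapsed_ticks(start, read_timer());
}
```

The returned value can be printed on the console or stored in a global for the debugger. Keeping the subtraction in unsigned 32-bit arithmetic means a counter rollover during the measurement does not need special handling.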
You can also measure "how many times per second" you're running through the calls to the driver, or how often your idle loop is running. Are you calling the driver often enough?
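Counting calls per second can be done with the same timer. The sketch below is one possible way, assuming a 1 MHz tick; the `rate_counter` type and its functions are invented for this example and are not part of any driver.

```c
#include <assert.h>
#include <stdint.h>

#define TICKS_PER_SECOND 1000000u   /* assumption: 1 MHz free-running timer */

typedef struct {
    uint32_t window_start;  /* timer count when the current window began */
    uint32_t calls;         /* calls seen so far in the current window */
    uint32_t last_rate;     /* calls counted in the last completed window */
} rate_counter;

void rate_init(rate_counter *rc, uint32_t now)
{
    assert(rc);
    rc->window_start = now;
    rc->calls = 0u;
    rc->last_rate = 0u;
}

/* Call once per driver call (or idle-loop pass) with the current timer
 * count. Returns 1 when a one-second window has just closed and
 * last_rate has been updated, otherwise 0. */
int rate_tick(rate_counter *rc, uint32_t now)
{
    assert(rc);
    rc->calls++;
    if ((uint32_t)(now - rc->window_start) >= TICKS_PER_SECOND) {
        rc->last_rate = rc->calls;
        rc->calls = 0u;
        rc->window_start = now;
        return 1;
    }
    return 0;
}
```

When `rate_tick()` returns 1, `last_rate` holds the number of calls made during the window that just ended, which can be printed or inspected in the debugger.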
Another thing that can help is to read and/or write bigger blocks of data through each call to the driver, if it supports this.
For the next level of instrumentation (and yes, you really do have to do this sometimes):
I've just added two profiling systems to the code I'm working on. Our code runs different threads at different interrupt levels, mainly IPL0, 1 and 3. I've got code that is called when the mainline at IPL0 comes out of idle and also by all interrupt service routines at their start and end. This captures the microsecond timer and logs the execution time of all of the different priority levels (correcting for higher levels interrupting lower ones). Once per second it prints out the total execution time for all threads for that second.
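The per-level accounting can be sketched roughly as follows. This is not the code described above, only an illustration of the idea: every entry and exit charges the elapsed time to whichever level was running, so time spent in a higher-priority interrupt is never billed to the level it interrupted. All names are hypothetical, and on real hardware the hooks would have to run with interrupts masked.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_LEVELS 8   /* IPL0..IPL7 */

typedef struct {
    uint32_t total[NUM_LEVELS];  /* accumulated ticks per priority level */
    int      stack[NUM_LEVELS];  /* levels currently active, innermost last */
    int      depth;              /* number of active levels */
    uint32_t last_event;         /* timer count at the previous enter/exit */
} ipl_profile;

void prof_init(ipl_profile *p, uint32_t now)
{
    memset(p, 0, sizeof *p);
    p->last_event = now;
}

/* Charge the time since the last event to the level that was running. */
static void charge_running(ipl_profile *p, uint32_t now)
{
    if (p->depth > 0)
        p->total[p->stack[p->depth - 1]] += now - p->last_event;
    p->last_event = now;
}

/* Call when the mainline leaves idle or at the start of an ISR. */
void prof_enter(ipl_profile *p, int level, uint32_t now)
{
    assert(level >= 0 && level < NUM_LEVELS && p->depth < NUM_LEVELS);
    charge_running(p, now);
    p->stack[p->depth++] = level;
}

/* Call when the mainline goes idle or at the end of an ISR. */
void prof_exit(ipl_profile *p, uint32_t now)
{
    assert(p->depth > 0);
    charge_running(p, now);
    p->depth--;
}
```

Once per second the `total[]` array can be printed and cleared to get the execution time per level for that second.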
The other profiling system allocates a big block of RAM the same size as the code, and sets up a timer to interrupt at IPL7 every 13us. It looks back on the stack to get the interrupted program counter, and uses it as an index into the block of RAM which is used as a histogram of where the code was at. After a 30 second run it dumps the histogram to the console port. That tells me exactly where the code has been. Combine that with the MAP file (or <nm -n code.elf | grep " [tT] ">) to get symbols, merge with the histograms, sort and that gives total execution time by function.
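A rough sketch of the sampling histogram, under some assumptions of mine: one 32-bit counter per 16-bit instruction word (so the table is larger than the "same size as the code" block described above), and the interrupted program counter is passed in by the timer ISR. The type and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t  base;     /* first code address covered */
    uint32_t  size;     /* number of code bytes covered */
    uint32_t *bins;     /* size / 2 counters, one per 16-bit instruction word */
    uint32_t  outside;  /* samples that fell outside [base, base + size) */
} pc_histogram;

/* Called from the sampling timer ISR with the interrupted program counter. */
void pc_sample(pc_histogram *h, uint32_t pc)
{
    assert(h && h->bins);
    uint32_t offset = pc - h->base;   /* wraps to a huge value if pc < base */
    if (offset < h->size)
        h->bins[offset / 2u]++;
    else
        h->outside++;
}

/* Sum the samples for a function occupying [start, end), with the
 * addresses taken from the MAP file or nm output. */
uint32_t pc_count_range(const pc_histogram *h, uint32_t start, uint32_t end)
{
    uint32_t total = 0u;
    for (uint32_t addr = start; addr < end; addr += 2u) {
        uint32_t offset = addr - h->base;
        if (offset < h->size)
            total += h->bins[offset / 2u];
    }
    return total;
}
```

Dumping `bins[]` after the run gives the raw per-address histogram, and `pc_count_range()` applied to each symbol's address range gives the per-function totals.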
If you're using a CF chip with code in FLASH and a small amount of RAM then the latter probably isn't an option.
I'm getting information like the following:
sh prof.sh ../../firmware.elf profile.txt | sort -n -k 2 | tail
advmaths          79419    3.44%
memcpy_moveml     80368    3.48%
RllDecompress    107538    4.65%
pointCubic       200560    8.69%
CanvasCopy       364142   15.77%
main             955712   41.41%
"main()" is the idle loop. The code is 59% busy and 41% free. CanvasCopy is the function taking the most time. Looking at the histogram for code within that function:
0x4012D5F2 = 1
0x4012D5F4 = 5
0x4012D696 = 340
0x4012D698 = 82
0x4012D69C = 425
0x4012D6A0 = 123
0x4012D6A4 = 16582
0x4012D6A8 = 17562
0x4012D6AA = 86433
0x4012D6AC = 8556
0x4012D6B0 = 4326
0x4012D6B2 = 3774
0x4012D6B6 = 7171
0x4012D6B8 = 4585
Note that CanvasCopy registered 364,142 counts. One instruction above is taking 86,433 counts, which is 24%, and this is inside a loop that has 40 or more instructions in it. So 24% execution time from 2.5% of the loop? Worth looking at:
4012d6a0: 6000 0168    braw 4012d80a
    src.u32 = *(srcPtr.p32++);
4012d6a4: 206e fff8    moveal %fp@(-8),%a0
4012d6a8: 2610         movel %a0@,%d3        #### THIS ONE ####
4012d6aa: 5888         addql #4,%a0
4012d6ac: 2d48 fff8    movel %a0,%fp@(-8)
    if (a_eMode == xxxx)
4012d6b0: 4a8c         tstl %a4
4012d6b2: 6700 0128    beqw 4012d7dc <CanvasCopy+0x4f6>
The above is part of a function copying some picture data around. The instruction taking 24% of the whole execution time is stuck and waiting for the data to be read from main memory into the data cache. So it is being limited by the memory read time. If only the CPU had a "cache touch" instruction (it doesn't).