Hi all,
We have been battling issues with USB memory stick support for some time with the i.MX28 Windows CE BSP. We have implemented a number of 'fixes' ranging from Microsoft updates, Errata Work-arounds and general Windows CE fixes from other BSPs or user experiences found online and in these forums.
We have an issue where, after several insertions of a memory stick, the USB port appears to lock up, and on the next device attachment the 'AttachDevice' process fails at DEVICE_CONFIG_STATUS_SCHEDULING_GET_DEVICE_DESCRIPTOR_TEST. This is the first point in the attach process at which an attempt is made to communicate with the device; it uses endpoint 0 on a control pipe created during earlier stages of the attach process.
Once the port has failed it no longer responds to any other USB device and fails at this same point every time. The pipe has the halt flag set and is non-functional. The only way to recover seems to be to power cycle the unit.
We have found forum posts that recommend increasing the delay between ResetAndEnablePort (the previous step) and sending any data. By default in the BSP this is 10ms, but some devices may still be coming out of reset and not be ready to communicate at this point. We tried increasing it to 100ms and the results initially seemed promising, but we could eventually get it to fail again after repeated insertions.
We have used a USB logger to see what happens on the bus through several insertions to failure. For each successful attach and detach we see no bus errors and can decode the SCSI packets. The last successful transfer on removal was the last SCSI Test Unit Ready packet which is sent to poll for disk removal.
When we insert the disk and there is a failure we just see babbled data out of the USB port. There isn't a single valid packet but random data bytes being received by the analyser following device reset. The analyser cannot tell if the memory stick is babbling or the i.MX28 processor as it is just receiving random bytes constantly.
I have attached the last successful transfer below (note that NAKs and SOF are hidden). The memory stick was unplugged in the middle of this screenshot which is why the analyser detected suspend (no bus activity). The Reset following the suspend is when the stick was re-inserted. You can see that following this we just have babbled data. The bytes are random values.
On a successful remove / attach we see that after the suspend there are two resets and we have the initial 'get descriptor test' followed by the address set and the rest of the successful enumeration.
I have been adding debug messages at various points to try and figure out what could be going on. I am not sure why the USB port has the random data on it. Perhaps there is a corrupt qHead/qTD that is just running through memory sending out random data bytes but the data is not in any of the proper transfer types (SETUP/IN/OUT).
The transfer queue should be torn down on removal and cleared out and then recreated on the first transfer on attach.
Has anyone seen anything similar or could offer some debugging advice?
Thank you in advance, Mark
The issue can be closed
Further to the above, an automated test has proven over 2800 insertions and removals at random intervals with the code work-around (before test stopped) but fails after <10 repeatedly with the original code.
Further to this if a hub is plugged into the board there is a second function that is used to put the local port in suspend mode. This function also does not follow the process required in the erratum so could lead to the port getting into a stuck state:
C:\WINCE600\PLATFORM\COMMON\src\soc\COMMON_FSL_V2_PDK1_9\MS\USBH\USB2COM\cdevice.cpp
BOOL CHub::SuspendResumeOffStreamDevice(IN const UINT address, IN const BOOL fSuspend)
Avoid calling BSPUsbhPutPhySuspend() in this function.
Mark
Hi all,
I have had confirmation from our NXP FAE that Erratum ERR004535 USB Suspend and resume flow clarifications does also affect the iMX28 (despite it not being in the current Errata document) and the document will be updated at some point.
Given that ERR006308 also affects the iMX28 but is not yet documented, it may be safe to question every USB erratum listed for the iMX6 range that is not yet documented for the iMX28.
Anyway the next move for me is to either apply the work-around for ERR004535 in the Windows CE code or just prevent the PHY suspend as shown earlier.
The impact of this erratum is "Resume Error if the correct flow not adhered to". It seems an atomic sequence of writes is required without interrupts to put into low power suspend mode.
Mark
witekio please continue with the follow up.
I have an update thanks to a pointer from Peter Chen at NXP. Peter suggested a number of things - one of which was preventing the driver from putting the USB PHY in suspend. This must not be confused with system level suspend - it seems that when no device is connected the USB PHY is put into suspend and resumed on attach.
Within the CE BSP there is a function that determines whether this is supported:
//------------------------------------------------------------------------------
// Function: PowerDownSchemeExist
//
// Description: This function is called to check if we can stop the system clock
// and PHY clock, it is determined by IC capability, if VBUS and ID
// interrupt without clock can wake up the system, we can enter
// low power mode, else we can't
//
//------------------------------------------------------------------------------
BOOL PowerDownSchemeExist(void)
{
    return TRUE;
}
I just changed the above to return FALSE to prevent the driver issuing the PHY suspend/resume process. With this change the port no longer gets into the stuck state.
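For reference, the change described above amounts to the following. This is a minimal sketch: the local BOOL/TRUE/FALSE definitions are an assumption added only so the fragment compiles outside the Windows CE headers, and the comments are not from the BSP.

```c
/* Stand-ins for the Windows CE types, so the fragment builds on its own. */
typedef int BOOL;
#define TRUE  1
#define FALSE 0

/* Work-around variant of the BSP function quoted above. */
BOOL PowerDownSchemeExist(void)
{
    /* The original BSP returns TRUE here, which lets the host driver
       put the USB PHY into suspend when no device is attached.
       Returning FALSE stops the driver from issuing the PHY
       suspend/resume sequence at all. */
    return FALSE;
}
```

The trade-off is that the PHY and its clocks stay powered while the port is idle, which only matters if power consumption is a concern.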
So now I have to understand why the PHY suspend/resume process is not working for some products with some USB devices. Either that or determine whether we really need the PHY to be put into suspend mode. Power is not an issue.
So technically this issue is still not resolved but my work-around could be to prevent PHY suspend with the above change.
Mark
I am wondering if the following applies to the iMX28. It is an erratum for the i.MX6 and also appears on the Vybrid controllers. These all use the ChipIdea IP, and I am not sure if it is the same version as in the iMX28:
ERR004535 USB: USB suspend and resume flow clarifications
This seems to relate to resume issues under certain timing. Could anyone from Freescale identify if this would impact the iMX28?
Mark
As a quick test I also tried with SBUSCFG at 6 and SDIS turned off (streaming mode enabled) which is the default for the BSP but this has the same generated packets and same error. I had changed SBUSCFG to 0 and enabled SDIS to 1 after reading other threads on iMX28 USB errors but this made no difference.
Just another thought. As the controller doesn't seem to be able to generate the SOF packets correctly then surely the error condition can't be because of corrupted transfer descriptors as the controller creates the SOF itself.
I searched through a valid transfer and found the same data as the CRCs for valid SOFs so I am certain this is just generating Sync and CRC fields:
If you look below I have lined them up. On the left is a valid sequence of SOFs with rolling frame number. On the right is the data seen by the analyser in the error state when just the CRC seems to be output. You can see the 'Data' field in red on the right follows the CRC field sequence for valid SOF on the left.
In fact the 0x0F 'CRC' is misinterpreted by the analyser as a PID (last four bits complement the first), as is the 0x1E later.
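The PID rule mentioned above (the second four bits are the complement of the first four) can be checked with a few lines of C. This is only an illustrative sketch; the function name is made up here and is not part of any analyser or driver API.

```c
#include <stdint.h>
#include <stdbool.h>

/* A USB PID byte carries the 4-bit PID in its low nibble and the
   one's complement of that PID in its high nibble. Any byte that
   happens to satisfy this looks like a PID to a protocol analyser. */
static bool usb_pid_byte_is_self_consistent(uint8_t b)
{
    uint8_t pid   = (uint8_t)(b & 0x0F);
    uint8_t check = (uint8_t)((b >> 4) & 0x0F);
    return check == (uint8_t)(~pid & 0x0F);
}
```

Both 0x0F and 0x1E pass this check, which is consistent with the analyser decoding them as PIDs, whereas a byte such as 0x08 does not. Roughly one random byte in sixteen will pass by chance.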
So I am pretty confident now that in this state the controller is generating SYNC and CRC but not the data in between. This goes for SOF and other packets too.
I hope this additional info is useful to help with this issue.
Kind regards, Mark
Sorry for the update delay. I have been working on other projects while trying to get to the bottom of this USB issue!
I have been performing lots of different diagnostic tests, checking the asynchronous qHead list and the qTDs of each qHead. I have also been scanning through various USB protocol analyser logs, and we also bought a new oscilloscope with built-in USB decode which has helped.
As an update I can confirm that I can still reproduce the issue where the USB port seems to lock up after several insertions/removals of a USB device.
Once in this state attaching any USB device the port no longer generates valid packets. I have been able to see this on a USB protocol analyser but also seen it on the oscilloscope. The packets start with valid SOF but the data content is invalid. In fact I believe the SOF and CRC are generated but the data (including PID) is not valid. As the SOF packets contain a rolling frame number the CRC would be rolling too.
An example invalid packet captured on the scope - these start being generated the second the USB device is connected and port enabled:
You can see from the above that the packet starts with a sync, then a random byte (shown as PID error). I believe the random byte to be the CRC.
I noticed a pattern in my USB trace logs when the port is in the stuck state. Within the random data (which I believe to be attempted SOF with missing PID and frame number) I can see attempts at communication when the attach process attempts to send the initial setup packet for the first control transfer.
The reason I think it is this transfer is that the pattern I see is a reset condition followed by three attempts at sending the same data. This is repeated three times (the attach process tries three times) and eventually the port stops sending anything as the attach process disables the port on error.
This is the expected pattern of data transfers during the USB device attach process. It tries to send an initial control transfer which will try three times (USB CERR counter). If this fails the attach process resets the port and tries again. It tries comms three times and then disables the port.
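The retry pattern just described can be modelled in a few lines. This is a hypothetical simulation, not driver code: the names and the structure are invented here, and the two retry counts are taken from the description above.

```c
/* Retry counts as described in the post: three tries per control
   transfer (CERR), and three attach attempts, each starting with a
   port reset. */
enum { TRANSFER_TRIES = 3, ATTACH_TRIES = 3 };

typedef struct {
    int resets;        /* port resets issued */
    int packets_sent;  /* control transfer attempts seen on the bus */
    int port_enabled;  /* 1 if the port is left enabled at the end */
} AttachTrace;

/* device_responds: non-zero if the device answers the first control
   transfer; zero models the stuck port where nothing valid is sent. */
static AttachTrace simulate_attach(int device_responds)
{
    AttachTrace t = { 0, 0, 0 };
    for (int attempt = 0; attempt < ATTACH_TRIES; ++attempt) {
        t.resets++;
        t.port_enabled = 1;
        for (int retry = 0; retry < TRANSFER_TRIES; ++retry) {
            t.packets_sent++;
            if (device_responds)
                return t;          /* enumeration would continue */
        }
    }
    t.port_enabled = 0;            /* give up and disable the port */
    return t;
}
```

With `device_responds` set to zero the model yields three resets and nine transfer attempts before the port is disabled, which is the shape of the trace in the error state.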
An example of this is shown above. We see the pattern of 08 packet then three packets ending in D7 29. This process is attempted three times.
Interestingly when the port is functional this same initial control transfer looks as follows:
If you look at the expected transfer the Setup Packet should have CRC 8 and the following Data0 packet CRC D729. When the port is in its error state I see a data 8 (sync followed by 0x08). Then a data packet ending in D729.
So from this comparison I strongly believe that in the error state the port is trying to send the packet but just the CRC seems to be correct.
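As a cross-check on the 'CRC 8' value, the token CRC5 can be computed directly. The sketch below uses the USB 2.0 token CRC (generator x^5 + x^2 + 1, seed of all ones, result inverted) over the 11-bit address/endpoint field. The function names are made up for illustration, and the assumption that the analyser displays the five CRC bits in transmission order is mine.

```c
#include <stdint.h>

/* CRC5 over an 11-bit token field (7-bit address plus 4-bit endpoint,
   or an 11-bit SOF frame number), processed LSB first as on the wire. */
static uint8_t usb_crc5(uint16_t field11)
{
    uint8_t crc = 0x1F;                          /* seed: all ones */
    for (int i = 0; i < 11; ++i) {
        uint8_t bit = (uint8_t)((field11 >> i) & 1u);
        if ((crc ^ bit) & 1u)
            crc = (uint8_t)((crc >> 1) ^ 0x14);  /* 0x14 = bit-reversed 0x05 */
        else
            crc = (uint8_t)(crc >> 1);
    }
    return (uint8_t)((crc ^ 0x1F) & 0x1F);       /* invert the result */
}

/* Reverse the order of five bits, to convert between the value as it
   sits in the token byte and the value read in transmission order. */
static uint8_t reverse5(uint8_t v)
{
    uint8_t r = 0;
    for (int i = 0; i < 5; ++i)
        if (v & (1u << i))
            r |= (uint8_t)(1u << (4 - i));
    return r;
}
```

For address 0, endpoint 0 (the default control pipe during enumeration), `usb_crc5(0)` gives 0x02, and `reverse5` of that is 0x08, matching the CRC shown for the valid SETUP token. The same routine applied to successive frame numbers gives the rolling CRC sequence expected for SOF packets.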
Now analysing the data packet differences there seems to be some correlation between the data in the valid packet and when in the error state but this is not clear.
Error State: 60 00 80 00 00 10 00 D7 29
Working State: 80 06 00 01 00 00 08 00 with CRC D7 29.
From my analysis it definitely seems that the port can get into a state where the Sync and CRC appear to be generated but the data in between is not valid.
This shows in the fact the SOF are missing the PID and frame number and the attempted control transfers are incorrect too. But the CRCs still seem to be valid.
I am assuming that the controller generates the sync pattern itself as this is always the same but perhaps the PID and rest of the data being fetched is corrupted. Interestingly though (I can't believe this is just chance) the CRC seems to be valid too.
I have looked at the transfer being set up and this seems to be correct. This is probably why the CRC is correct. I am now puzzled as to what could be causing the resultant corrupted transfers.
From the slight correlation between the data in the error state and working state I am wondering if this is something else to do with how the data is transferred on the internal busses. For reference I have SDIS set and SBUSCFG at 0. I will retry with these set to my original settings of SDIS off and SBUSCFG at 6.
This may need help from Freescale as well as Adeneo.
Kind regards,
Mark
adeneo-embedded please continue with the follow up.
Hi Mark,
The scenario is very clear now, just one element is missing: whether the target image has Hcd_hsotg.DLL in it.
I was able to build two usable images from the iMX28-EVK BSP; I inspected BIB/REG files of both.
Both possible HOST-only selections are outlined below; it would be interesting to find out if OTG has any impact at all.
(1) USB HOST only build without OTG
One of the builds had just this selection from Freescale i.MX28 EVK: ARMV4I Catalog:
USB Devices
USB Host Device
[X] High Speed Host
It had just Hcd_hsh1.DLL in it, and the BIB/Registry files were set accordingly.
;------------------------------------------------------------------------------
; PURE HOST Registry
; @CESYSGEN IF CE_MODULES_USBHOST
; @CESYSGEN ENDIF CE_MODULES_USBHOST
; @CESYSGEN IF CE_MODULES_USBHOST
[HKEY_LOCAL_MACHINE\Drivers\BuiltIn\HCD_HSH1]
"Prefix"="HCD"
"Dll"="hcd_hsh1.dll"
"Order"=dword:15
"Class"=dword:0c
"SubClass"=dword:03
"ProgIF"=dword:20
"MemBase"=dword:80090000
"MemLen"=dword:00001000
"irq"=dword:5C
"HcdCapability"=dword:4 ;HCD_SUSPEND_ON_REQUEST
"OTGSupport"=dword:0
"OTGGroup"="01"
; @CESYSGEN ENDIF CE_MODULES_USBHOST
;
; END OF PURE HOST Registry
;------------------------------------------------------------------------------
(2) USB HOST build with pure host OTG
The other build had a radio button from the OTG group also checked in addition to the selection above:
USB Devices
USB High Speed OTG Device
( ) High Speed OTG Port Full OTG Function
( ) High Speed OTG Port Pure Client Function
(x) High Speed OTG Port Pure Host Function
It had both Hcd_hsotg.DLL and Hcd_hsh1.DLL in it, and the BIB/Registry files were set accordingly:
;------------------------------------------------------------------------------
; PURE HOST Registry
; @CESYSGEN IF CE_MODULES_USBHOST
[HKEY_LOCAL_MACHINE\Drivers\BuiltIn\HCD_HSOTG]
"Prefix"="HCD"
"Dll"="hcd_hsotg.dll"
"Order"=dword:15
"Class"=dword:0c
"SubClass"=dword:03
"ProgIF"=dword:20
"MemBase"=dword:80080000
"MemLen"=dword:00001000
"irq"=dword:5D
"HcdCapability"=dword:4 ;HCD_SUSPEND_ON_REQUEST
"OTGSupport"=dword:0
"OTGGroup"="01"
; @CESYSGEN ENDIF CE_MODULES_USBHOST
; @CESYSGEN IF CE_MODULES_USBHOST
[HKEY_LOCAL_MACHINE\Drivers\BuiltIn\HCD_HSH1]
"Prefix"="HCD"
"Dll"="hcd_hsh1.dll"
"Order"=dword:15
"Class"=dword:0c
"SubClass"=dword:03
"ProgIF"=dword:20
"MemBase"=dword:80090000
"MemLen"=dword:00001000
"irq"=dword:5C
"HcdCapability"=dword:4 ;HCD_SUSPEND_ON_REQUEST
"OTGSupport"=dword:0
"OTGGroup"="01"
; @CESYSGEN ENDIF CE_MODULES_USBHOST
;
; END OF PURE HOST Registry
;------------------------------------------------------------------------------
The suspicion of a runaway qTD is very reasonable; however, a time delay may be just a workaround. Nothing like that has ever been implemented in the MSFT or FSL USB port drivers.
Keep in mind that there are at least three threads (in order of low-to-high priority) which have access to queues & transfers:
USBD calling thread;
USBD event callback thread;
HC IRQ handler.
Each shared memory resource between these threads should be “volatile” qualified, to prevent optimizing compiler from caching a pointer or value on the local stack.
Another technique which might be applicable (albeit USB descriptors should always be allocated in non-cached memory) is a CacheRangeFlush() call after release of a queue descriptor.
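To illustrate the "volatile" point, here is a minimal sketch of a descriptor field that is written by one context and polled by another. The structure and function names are hypothetical and do not come from the FSL or MSFT driver sources; the callback merely stands in for the controller or IRQ handler updating the field.

```c
#include <stdint.h>

#define QTD_ACTIVE 0x80u   /* Active bit in the status byte of a qTD token */

/* The token is modified outside the polling thread, so it is declared
   volatile to force a fresh read from memory on every access. */
typedef struct {
    volatile uint32_t token;
} SharedQtd;

/* Polls until the Active bit clears. Returns the number of polls that
   were needed, or -1 if the bit was still set after max_polls. */
static int poll_until_inactive(SharedQtd *qtd, unsigned max_polls,
                               void (*between_polls)(SharedQtd *))
{
    for (unsigned i = 0; i < max_polls; ++i) {
        if ((qtd->token & QTD_ACTIVE) == 0)
            return (int)i;
        if (between_polls)
            between_polls(qtd);   /* models the other context running */
    }
    return -1;
}

/* Models the controller completing the transfer. */
static void fake_hw_complete(SharedQtd *qtd)
{
    qtd->token &= ~QTD_ACTIVE;
}
```

Note that volatile only prevents the compiler from caching the value; it does not make read-modify-write sequences atomic, so shared structures still need the driver's existing locking.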
= = = = =
All-in-all, I cannot imagine finding the root cause of the problem without debugging.
To confirm or deny the suspicion of time-related issue while freeing transfer descriptors, a debug session is a must.
My recommendation would be to focus on the function void CHub::AttachDevice() in <WEC7root>\Platform\CommonSRC\SoC\Common_FSL_V2\MS\USBH\usb2com folder.
Setting up ZONE_ATTACH and ATTACH_DETAIL to obtain traces during device attach process might also be helpful.
Needless to mention, the USB analyzer should be connected all the time and live results should be monitored while the attachment state machine is executing.
The state machine has only three stages before reaching DEVICE_DESCRIPTOR stages.
The stage immediately preceding DEVICE_CONFIG_STATUS_SCHEDULING_GET_DEVICE_DESCRIPTOR_TEST is none other than DEVICE_CONFIG_STATUS_RESET_AND_ENABLE_PORT.
Incomplete or improper reset of a port is very likely to have dire consequences, and this triage scenario may help:
There is exactly one pipe – the control pipe 0 – which is employed in the process; all its descriptor allocations may be observed while at breakpoint and any corruption of pointers/descriptors can be detected.
Before and after reset, a dump of HC port registers may also be useful for differential analysis.
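The differential analysis suggested above could be done with a small helper like the following. It is a generic sketch that compares two register snapshots already captured into arrays; how the snapshots are read from the controller is left out, and the function name is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Compares two snapshots of n registers. For each register, diff[i]
   receives the XOR of the two values, so set bits mark the bits that
   changed. Returns the number of registers that differ. */
static size_t diff_snapshots(const uint32_t *before, const uint32_t *after,
                             uint32_t *diff, size_t n)
{
    size_t changed = 0;
    for (size_t i = 0; i < n; ++i) {
        diff[i] = before[i] ^ after[i];
        if (diff[i] != 0)
            ++changed;
    }
    return changed;
}
```

Printing only the non-zero entries of `diff` over the debug serial port keeps the output short enough to compare a good attach against a failed one.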
Thanks,
Thanks adeneo for such a detailed response!
We are using the host-only port and the hcd_hsh1.dll. The usb memory sticks are being plugged into this port which is limited to full speed operation.
We have an internal micro usb connector hidden inside the product connected to the otg capable port for os loading. The driver for the otg port is included in the image but prevented from loading at start up. This was so that if we wanted to use the port our app could start the otg driver. However we do not use this.
So only the hcd_hsh1.dll should be active.
From my analysis of the driver I agree this will be difficult to track without debugging. Sadly the release product does not have an Ethernet port for connecting to VS. For debugging this I have been turning various debugmsg calls into retailmsg and printing out variables on the debug serial port - not ideal for this type of issue I know!
What I am unsure about is whether the error state is created on removal or when the device is next inserted, i.e. is it the dequeue that is failing or the setting up of the first control transfer? I was hoping to print out the qH list on attach after the port reset and before the first control transfer (when it fails to attach it doesn't get past the first control transfer and I do not see any valid data on my USB analyser).
As there are no other usb ports or devices simultaneously connected the async queue should be in an empty state on attachment of a new device. It would be good to verify this on connect. This won't necessarily point me to the root cause though.
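A check of the kind described here might look like the sketch below. It assumes the driver keeps one permanent head queue head in a circular asynchronous list, so an "empty" queue is a head that links back to itself. The structure uses plain C pointers in place of the physical link pointers of a real EHCI queue head, and all names are hypothetical.

```c
#include <stddef.h>

/* Simplified stand-in for a queue head; only the link is modelled. */
typedef struct QueueHead {
    struct QueueHead *next;   /* stand-in for the horizontal link pointer */
} QueueHead;

/* Walks the circular list starting at head and returns the number of
   queue heads in it, including head itself. Returns -1 if the list is
   broken (NULL link) or does not loop back within max_nodes entries. */
static int count_async_qheads(const QueueHead *head, int max_nodes)
{
    if (head == NULL)
        return -1;

    int count = 0;
    const QueueHead *qh = head;
    do {
        if (qh == NULL || count >= max_nodes)
            return -1;
        ++count;
        qh = qh->next;
    } while (qh != head);
    return count;
}
```

Under that assumption, a result of 1 straight after the port reset means the queue is empty, anything larger points to a queue head left behind by the previous device, and -1 points to a corrupted link.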
Annoyingly once in this state no other devices work. At the very least it would be good to ensure the async queue is fully reset on failure.
Thank you again for your help. I will keep this updated and work on this next week.
Mark
I have been continuing to debug but with no luck. I decided to search for WEC7 updates to do with USB. Although we use CE 6.0 I am sure the driver will have been used as a base in WEC7 so there may be fixes applied which haven't been released for CE 6.0.
I found this:
FIX: USB host cannot enumerate new device after unexpectedly removing an active USB device in Windows Embedded Compact 7
Symptoms:
Assume that you remove an active USB device from Windows Embedded Compact 7 unexpectedly. Then, the USB host cannot enumerate new devices any longer.
Cause:
The issue occurs because the USB host controller is constantly trying to finish a bulk transfer to the removed device. Because of this, further asynchronous transfers (Bulk, Configure) are not processed any longer. The USB host controller sets the Transfer Error Flag on the transfer, but neither stops the transfer nor triggers an interruption.
This seems to be fixed in Trans.cpp:
Trans.cpp | 56,043 | 15-Oct-2014 | 09:08 | Public\Common\Oak\Drivers\Usb\Hcd\Usb20\Ehci |
From the description above this describes my issue - I remove a USB device and sometimes end up with a locked up port sending data.
To save me setting up a new VM to download compact 7 and ultimately this update, could Adeneo kindly do me a favour and have a look to see what changes were made in trans.cpp for KB3009114 and let me know (assuming you already have a machine with Compact 7 and the latest updates!).
Thank you in advance! Hopefully the above issue also applies to CE 6.0 too and I can back port it.
Mark
I managed to compare the differences for the WEC7 fix detailed above. It is a single line in CQTransfer::PrepareQTD which, for the last qTD, sets CEER (CERR?) to 3.
This code is quite different to the CE 6.0 code but seems to be called from CQTransfer::AddTransfer which also exists in the CE 6.0 code, however different.
It looks like WEC7 and CE6.0 are quite different in their USB EHCI code so hard to compare. Frustrating as the fault description above matches my issue so I was hoping to apply the changes to the CE 6.0 code.
I can see why CERR is set to 3 - on multiple transfer errors this would get to 0 and trigger an error interrupt. Presumably before this the value of CERR was 0, causing unlimited retries, so a detach mid-transfer could leave the last qTD retrying forever.
In the CE 6.0 code it seems that all transfers have the CERR (actually CEER for some reason in the code) set to 3 in the CQTD::IssueTransfer function, unless there is somewhere I am missing.
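The CERR behaviour discussed above can be sketched as follows. The bit positions are those of the EHCI qTD token (CERR in bits 11:10, Halted in bit 6, Active in bit 7); the function names and the error model are a simplification written for this post and are not taken from the CE 6.0 or WEC7 sources.

```c
#include <stdint.h>

#define QTD_STATUS_ACTIVE  (1u << 7)
#define QTD_STATUS_HALTED  (1u << 6)
#define QTD_CERR_SHIFT     10
#define QTD_CERR_MASK      (3u << QTD_CERR_SHIFT)

static uint32_t qtd_set_cerr(uint32_t token, unsigned cerr)
{
    return (token & ~QTD_CERR_MASK) | ((cerr & 3u) << QTD_CERR_SHIFT);
}

static unsigned qtd_get_cerr(uint32_t token)
{
    return (unsigned)((token & QTD_CERR_MASK) >> QTD_CERR_SHIFT);
}

/* Simplified model of the controller's reaction to one transaction
   error. With CERR initialised to 0 errors are not counted and the
   qTD stays active, i.e. it is retried without limit. Otherwise CERR
   is decremented, and on reaching 0 the qTD is halted. */
static uint32_t model_transaction_error(uint32_t token)
{
    unsigned cerr = qtd_get_cerr(token);
    if (!(token & QTD_STATUS_ACTIVE) || cerr == 0)
        return token;

    --cerr;
    token = qtd_set_cerr(token, cerr);
    if (cerr == 0) {
        token &= ~QTD_STATUS_ACTIVE;
        token |= QTD_STATUS_HALTED;
    }
    return token;
}
```

In this model a qTD started with CERR = 3 is halted after three errors, while one started with CERR = 0 is still active after any number of errors, which is the "retrying forever" case the WEC7 fix appears to address.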
Mark
Hi Mark,
Here are the results of our investigation.
USB HOST CONTROLLERS
iMX28 has two USB host controllers, USB0 w/ OTG and USB1 with host support only.
It may be important to know which port is being used when the failure happens, specifically if OTG is involved.
FSL BINARIES and MSFT QFEs
The binaries in the image (as the call stack would be) are:
USBDISK6.DLL
– from …\PLATFORM\COMMON\src\SOC\common_FS_V2\MS\USBCLASS\USBDISK6\*
– from PUBLIC branch (therefore up-to-date via MSFT QFEs)
– from PUBLIC branch (therefore up-to-date via MSFT QFEs)
Hcd_hsh1.DLL corresponding to EHCI.DLL of PUBLIC branch
– from …\PLATFORM\iMC28-EVK\SRC\drivers\USBH\HSH1 and
– from …\PLATFORM\COMMON\src\SOC\common_FS_V2\MS\USBH\*
If OTG selection ”pure host” from OTG Catalog group is made
Hcd_hsotg.DLL
– from …\PLATFORM\iMC28-EVK\SRC\drivers\USBH\HSOTG
It makes sense to compare the folders in …\PLATFORM\COMMON\src\SOC\common_FS_V2\MS\USBH\ to those in …\PUBLIC\COMMON\OAK\DRIVERS\USB\HCD\USB20\ in order to detect if any QFEs from MSFT need to be imported to FSL’s branch; KB2516902 and KB980435 are involved.
Likewise …\PLATFORM\COMMON\src\SOC\common_FS_V2\MS\USBCLASS\USBDISK6 should be compared to …\PUBLIC\COMMON\OAK\DRIVERS\USB\CLASS\Storage\Disk\SCSI2 with regard to KB2635840 and KB980435, even if this binary is unlikely to be causing the damage.
Once this source analysis is complete, and MSFT QFEs ported to FSL branch, a new trial run shall be executed.
= = = = =
One can only speculate that the fix could be as simple as adding "volatile" qualifier to a variable in some USB structure which is used by two different threads yet considered never-changeable; alternatively it could be as complex as hunting single bits in the USB memory structures or USB Host Registers.
At this time the best suggestion would be to determine whether the fault is hardware or software related. The setup has a USB probe attached which sniffs the traffic. Further, it has been determined that the failure occurs at a specific moment in the USB attachment state machine, so a debug breakpoint could be inserted right there. If babble is observed on the wire while the host is in break state in the debugger, then it is the hardware at fault. If, on the other hand, the USB wires are calm while the target device is in break state, then the babble is definitely caused by bad software handling. As mentioned before, the odds are that SW is at fault and "volatility" may be applicable.
Thanks!
Hi,
Thank you for your update. I have already been through porting the Microsoft QFEs from the public folder to the MS folder within the BSP. This was something I noticed several years ago when I realized that the Freescale BSP had a clone of the Microsoft driver code and no QFEs would be applied. I painstakingly performed diff-merges on all of the cloned files. The only QFE I haven't ported is
When you use the suspend or the resume feature on a USB device, the Windows Embedded CE 6.0-based USB client driver is not notified of a remote wake-up event.
We are not using suspend/resume on windows CE. However I will look at porting this across just to be sure.
With regards to the driver this is on the host-only port.
Once the USB controller is in this state the random data bytes are produced for any USB device plugged into the port. I therefore believe that the host port is generating the sync (every ms) and then a random byte after it. A USB device should not send any data in response to the sync byte as it requires a PID after to instruct it what to do (IN/OUT/SETUP). So for all devices to have the same random data it is most likely that the USB host is generating it. It would be odd for a collection of devices to simultaneously start outputting a single byte of data following a sync byte.
In normal operation for the USB controller to be outputting data periodically it must have something scheduled to transfer, potentially with a corrupt pointer. I am wondering if printing out the transfer queue pointers/structures would help identify if the controller has a bad transfer queue set up?
There is another erratum that concerns me: Errata 2858 USB controller may access a wrong address for the dTD (endpoint transfer descriptor) and then hangs.
I have tried using other SBURSTCFG modes to work around this with no luck but perhaps there is something in postponing freeing the last TD in case memory could be re-used while the controller is still pointing to it. It seems to make sense to me that if the memory is re-used before the controller has finished with it then the controller could be issued with bad pointers in its transfer queue and start babbling out data.
The linux world has this 'postpone free last Td' fix. Have you implemented this for your versions of the Windows CE USB drivers in your BSPs? Could you advise on how to do this?
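As a rough illustration of the "postpone free" idea only, the sketch below holds back the most recently retired descriptor and releases it when the next one is retired, so its memory cannot be re-used immediately. This is not the Linux implementation and not code from any Windows CE BSP; every name here is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Holds at most one descriptor whose release has been postponed. */
typedef struct {
    void *pending;             /* last retired descriptor, not yet freed */
    void (*release)(void *);   /* underlying free routine, e.g. free() */
} DeferredFree;

/* Retires td: the previously retired descriptor (if any) is released
   now, and td takes its place as the postponed one. */
static void deferred_free(DeferredFree *d, void *td)
{
    if (d->pending != NULL)
        d->release(d->pending);
    d->pending = td;
}

/* Releases the postponed descriptor, for use once the controller is
   known to be stopped (for example on driver unload). */
static void deferred_flush(DeferredFree *d)
{
    if (d->pending != NULL) {
        d->release(d->pending);
        d->pending = NULL;
    }
}
```

In a driver, `release` would be whatever routine returns a descriptor to the driver's physical memory pool; `free()` is used in the comment only as a placeholder.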
Kind regards, Mark
Hi Mark,
Currently, we have one of our engineers assigned to this case. We will provide an update to this thread as soon as we can.
Thanks!
Thank you adeneo! I hope you can help!
I have been looking in more detail at the USB protocol analyser output. On a failed attach there is random data on the USB bus shown in my protocol analyser. But when I look at the timing and bit levels I can see that they are produced every 1ms (USB frame duration at full speed) and each attempt at a transfer starts with a valid SYNC packet of 0x01.
The PID that follows the SYNC is the random aspect and as the protocol analyser does not detect a valid PID it just shows 'data'.
I have included an example below and every packet starts with a valid SOF but then a random PID:
So to me the controller is consciously trying to send something but that something has an invalid PID which helps strengthen the fact that it might not be real data.
Occasionally (by chance) the PID field happens to have the second four bits as the complement of the first four and so matches a valid PID. My analyser then shows an attempt at decoding it, but it is not a real packet. The random data just happened to have something that matched a valid PID. The rest of the packet is missing because it wasn't intentional. Anyway in this case it happened to look like the start of a PING?!
So I now have to understand what puts the controller in this weird state and how to prevent it or have a recovery mechanism on device re-insertion to save having to power cycle.
Initially it could be one of the errata such as the one suggested by Yuri (although according to the iMX6 version of this erratum, ERR006308, the host controller only sends SOF packets in this state, but I am not seeing valid SOF).
The only other erratum is Errata 2858 USB controller may access a wrong address for the dTD (endpoint transfer descriptor) and then hangs. I have tried using other SBURSTCFG modes to work around this with no luck but perhaps there is something in postponing freeing the last TD in case memory could be re-used while the controller is still pointing to it.
Hopefully someone can help me understand this or if there are any fixes applied to other BSPs that I may need. Windows CE support is not as readily available as Linux though!
Mark
Deactivated user can you help to review this case?
I tried setting the USBMODE SDIS bit as per Errata ERR006308 but I was still able to get the USB port to hang in the same way with several insertions / removals of memory sticks.
To be honest the iMX6 errata for this same issue says that the controller outputs SOF in the stuck-state which is not the same as my issue where it is outputting random data.
I am therefore still looking for the root cause of this.
Mark