Bad USB device mode DMA performance in RT1064/1062


udoeb
Contributor II

Hi,

We experience data loss in USB device mode when isoch and bulk endpoints are used in parallel. We analyzed this issue in detail (we specialize in USB technology) and found the following:

1) If one isoch IN + one isoch OUT endpoint, each with an 800-byte packet size and a 125us transfer interval, are running in parallel, then there is no issue.

2) If we add one bulk IN + one bulk OUT endpoint, each with a 512-byte packet size, and continuously send data through these endpoints, then we experience data loss on the isoch endpoints. The bulk traffic seems to prevent the USB controller from priming the endpoint DMA in time. We observed that the prime operation for the isoch endpoints takes very long (up to 100us). This leads to two effects:

a) The controller discards isoch OUT packets because the endpoint is not ready yet when the next isoch OUT token arrives from the host. This causes data loss in the isoch OUT stream.

b) The controller responds to isoch IN tokens from the host with a zero-length packet because the endpoint is not ready yet when the token arrives. The controller then sends the isoch data in the subsequent microframe. The isoch data is therefore delayed, which is not acceptable.

We suspect that the root cause for the issue is poor USB DMA throughput and/or bus arbitration latency.
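To put the numbers above in perspective, here is a minimal sketch of the timing budget. It uses only the figures from this post (800-byte packets, 125us interval, up to 100us prime time); the constant and function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Figures quoted in the post; the names are hypothetical. */
#define MICROFRAME_US        125u  /* high-speed microframe length */
#define MICROFRAMES_PER_SEC  8000u /* 1 s / 125 us */

/* Payload rate of one isoch endpoint sending one packet per microframe. */
static unsigned long iso_bytes_per_sec(unsigned packet_bytes)
{
    return (unsigned long)packet_bytes * MICROFRAMES_PER_SEC;
}

/* Time left in a microframe once the prime has completed. */
static unsigned slack_after_prime_us(unsigned prime_us)
{
    assert(prime_us <= MICROFRAME_US);
    return MICROFRAME_US - prime_us;
}

/* Prints the budget for the case described in the post. */
static void print_iso_budget(void)
{
    printf("isoch payload per direction: %lu bytes/s\n", iso_bytes_per_sec(800u));
    printf("slack after a 100 us prime: %u us\n", slack_after_prime_us(100u));
}
```

With these inputs, each isoch direction carries 6.4 MB/s, and a 100us prime leaves at most 25us of slack within a 125us interval.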

Questions: 

Q1) How is the USBOTG1 controller connected to the internal bus fabric? 

Q2) Is it possible to configure arbitration priority for the USBOTG1 bus master?

Q3) Are there other ways to improve USBOTG1 DMA throughput or bus arbitration latency?

Thanks,

Udo

 

10 Replies

Hui_Ma
NXP TechSupport

Hi Udo,

Sorry for the late reply.

1> We ran a maximum USB throughput test on a bulk IN endpoint with a 512-byte packet size.

The calculated effective data bandwidth is 294 Mbps (36.8 MB/s).

You can see that each transaction takes 9.8us.

These figures show the USB DMA data transfer capability.

Please use them as a reference when checking your test result.
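As a cross-check of these figures, one reading that reconciles them (an inference, not something stated in the test report) is that about nine 512-byte packets complete per 125us microframe. A small sketch of that arithmetic, with made-up helper names:

```c
#include <assert.h>

#define MICROFRAMES_PER_SEC 8000u /* high-speed: one microframe every 125 us */

/* Packets per microframe implied by a payload rate, rounded to nearest. */
static unsigned packets_per_microframe(unsigned long bytes_per_sec, unsigned packet_bytes)
{
    unsigned long bytes_per_uframe = bytes_per_sec / MICROFRAMES_PER_SEC;
    assert(packet_bytes != 0u);
    return (unsigned)((bytes_per_uframe + packet_bytes / 2u) / packet_bytes);
}

/* Bus time used per microframe, in tenths of a microsecond. */
static unsigned busy_tenths_of_us(unsigned packets, unsigned tenths_per_transaction)
{
    return packets * tenths_per_transaction;
}
```

For 36.8 MB/s and 512-byte packets this gives 9 packets per microframe, and 9 transactions of 9.8us each occupy about 88.2us of every 125us microframe.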

pastedImage_1.png

2> Please check in your USB stack's interrupt service routine whether the isoch endpoints are handled with higher priority than the bulk endpoints. Please make sure the isoch endpoint status is checked first.

3> You mentioned "We observed that the prime operation for the isoch endpoints takes very long (up to 100us)." This is also described in the RT106x reference manual. Please see the picture below for details:

pastedImage_2.png

4> We recommend queuing more dTDs on the isoch endpoints, at least 2 dTDs per isoch endpoint.

pastedImage_3.png

Please let us know whether these suggestions work.

Hope it helps.

best regards,

Mike

udoeb
Contributor II

Hi Mike,

Thanks for your reply.

1> My complaint was not about the achievable maximum throughput in bulk mode. Bulk throughput is okay. The issue happens if bulk IN with high throughput and isoch IN run in parallel. In this case the controller does not serve the isoch endpoint in time. See my problem description above. Did you create a test setup that runs bulk and isoch in parallel?

2> We do handle isoch endpoints with higher priority. In fact, we do handle isoch DMA immediately in the ISR and handle bulk endpoint DMA deferred in an RTOS thread. We did also try to handle both in the ISR, with isoch first. This does not make a difference.

3> I know that statement from the reference manual. However, our analysis suggests that it is not true. If we run the isoch endpoint alone, then endpoint priming takes up to 20us, so it is not delayed until the next SOF. If we run the bulk IN in parallel, then endpoint priming on the isoch endpoint takes up to 120us. This seems to be caused by the bulk traffic, not by SOF. Hence we assume that, with respect to DMA behavior, the manual is incorrect or incomplete.

4> This leads to two problems:

a) Additional TDs increase latency on the audio stream. We are trying to create an ultra-low-latency design.

b) This requires that we append new TDs to the QH while the DMA is already active, which creates a bunch of race conditions. While the manual recommends an algorithm to avoid the races, we found that this algorithm does not work reliably under all conditions. This is a separate issue for which we could open up another thread...
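For context on b), the append sequence that the reference manual describes (the "add dTD tripwire" handshake) can be sketched roughly as below. This is a host-side illustration only, not driver code from this thread: the register struct is a mock so the logic can run on a PC, the bit position of ATDTW is an assumption, and the function name is made up. On real hardware the controller may clear the tripwire between the write and the read-back, which is what the retry loop is for.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the controller registers (names follow the
 * EPPRIME/EPSR naming used in this thread). */
typedef struct {
    volatile uint32_t USBCMD;
    volatile uint32_t EPPRIME;
    volatile uint32_t EPSR;
} usb_regs_t;

#define USBCMD_ATDTW (1u << 14) /* assumed position of the tripwire bit */

/* Call after linking a new dTD to the tail of a non-empty list.
 * Returns 1 if the caller must prime the endpoint itself,
 * 0 if the controller will pick up the new dTD on its own. */
static int needs_prime_after_append(usb_regs_t *usb, uint32_t ep_bit)
{
    uint32_t stat;

    assert(ep_bit != 0u);

    /* A prime is already pending: the controller will re-read the list. */
    if ((usb->EPPRIME & ep_bit) != 0u) {
        return 0;
    }

    /* Take a status snapshot that did not straddle a hardware update:
     * retry until the tripwire is still set after the read. */
    do {
        usb->USBCMD |= USBCMD_ATDTW;
        stat = usb->EPSR;
    } while ((usb->USBCMD & USBCMD_ATDTW) == 0u);

    usb->USBCMD &= ~USBCMD_ATDTW;

    /* Status bit set: the endpoint was still active and saw the new dTD. */
    return (stat & ep_bit) == 0u;
}
```

The sketch only decides whether a prime is needed; rewriting the QH overlay and writing EPPRIME in the "must prime" case is left out.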

Best,

Udo

Hui_Ma
NXP TechSupport

Hi Udo,

First of all, thank you for the patience.

We measured the dTD priming time with the RT1060 SDK <audio_speaker> demo. The result is 240ns.

The dTD prime code is located in the USB_DeviceEhciInterruptTokenDone() function in <usb_device_ehci.c>:

pastedImage_1.png

The test result is:

pastedImage_2.png

 

To check only the isoch endpoint priming frequency, I enabled just the first toggle and got the result below:

The isoch endpoint is primed every 1ms, along with each data frame, which matches the USB protocol.

pastedImage_3.png

pastedImage_4.png

I recommend referring to the SDK dTD priming code in the USB_DeviceEhciInterruptTokenDone() function.

Thanks for the attention.

best regards,

Mike

udoeb
Contributor II

Hi Mike,

Thanks for looking into this and for sharing your results. I really appreciate that.

Here are my comments:

1) You are referring to the prime code in USB_DeviceEhciInterruptTokenDone(). USB_DeviceEhciInterruptTokenDone is the TD completion processing. It's hard to understand (for us) why this code includes prime code (for the next active TD) at all. It should not be necessary there, and the manual does not state that the driver needs to prime the endpoint on TD completion. Probably, this prime code never runs because 

line 737: if (0U == (ehciState->registerBase->EPSR & primeBit))

always evaluates to false. This would explain the very short runtime of 240 ns.

2) Normally endpoint prime takes place when a new TD is attached to the endpoint. This is the prime action I was referring to in my original post. In your code this would be in USB_DeviceEhciTransfer() in line 1071:

pastedImage_3.png

It would be interesting to see how long the code between line 1069 and 1089 runs. 

3) Note that generally logic in USB_DeviceEhciInterruptTokenDone() like this 

pastedImage_2.png

is not race condition free. Consider the case where the TD was not completed yet when you execute the if statement, but DMA completes it just when you got past the if and execute line 722. Consequently, it's not clear under which condition the write to EPPRIME happens. In most cases it happens if the ACTIVE bit is still set in the TD, but sometimes it happens if the ACTIVE bit is clear. It depends on runtime of your driver code in relation to DMA timing which in turn depends on USB host timing. This is a classical race condition which may lead to unpredictable behavior.

Udo

Hui_Ma
NXP TechSupport

Hi Udo,

I ran a test in USB_DeviceEhciTransfer() with the toggle points shown below.

The measured time interval is about 1.2us.

pastedImage_1.png

pastedImage_2.png

best regards,

Mike

Hui_Ma
NXP TechSupport

Hi Udo,

Thanks for the reply.

1> I have not yet gotten a USB composite device with CDC and Audio working. There are still compile issues, and I am working on resolving them.

2> In parallel, I am double-checking your issue with an experienced local engineer. I will post an update as soon as there are any new ideas or comments.

Thanks for your patience.

best regards,

Mike

Hui_Ma
NXP TechSupport

Hi Udo,

Are you using a modified MCUXpresso SDK USB composite example for RT106x for the test?

If so, could you share the test code?

I will then double-check with the local i.MX RT product team to find the root cause.

Thanks.

best regards,

Mike 

udoeb
Contributor II

Hi,

Thanks for your reply. We don't use USB code from the SDK. We do have our own USB device stack implementation. I cannot share this code, sorry.

Meanwhile our analysis clearly shows that the issue is not caused by our code. It's caused by USB DMA behavior. High data throughput on bulk endpoints keeps the DMA busy so that it cannot serve isoch endpoints in time. Specifically, the problem occurs if we link TDs with 4KB data to the bulk IN endpoint. The DMA seems to keep loading TX FIFO and serving IN tokens for the bulk IN EP until this TD is exhausted, and does not serve the isoch IN EP in between.

We can think of two possible root causes for this:

a) The USB_OTG controller design has a flaw and allows bulk EPs to monopolize the DMA. We tried setting the SDIS (stream disable) bit in the USBMODE register. This resolves the issue with the 4KB TD but causes other issues (I can provide more details but probably this would be another thread...).

b) DMA performance and/or arbitration behavior is poor. This could be caused by the way the USB controller is integrated in IMXRT.
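For readers who want to try the experiment mentioned in a), setting stream disable amounts to a read-modify-write of USBMODE. The following is a minimal host-side sketch: the register struct is a mock, the bit position of SDIS is an assumption to be checked against the reference manual, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Mock of the one register involved, so the sketch runs on a host. */
typedef struct {
    volatile uint32_t USBMODE;
} usb_mode_regs_t;

#define USBMODE_SDIS (1u << 4) /* assumed bit position of "stream disable" */

/* Sets SDIS while leaving the other mode bits untouched. */
static void usb_disable_streaming(usb_mode_regs_t *usb)
{
    assert(usb != 0);
    usb->USBMODE |= USBMODE_SDIS;
}

/* Reports whether stream disable is currently set. */
static int usb_streaming_disabled(const usb_mode_regs_t *usb)
{
    return (usb->USBMODE & USBMODE_SDIS) != 0u;
}
```

Using a read-modify-write keeps the controller mode bits that were configured earlier.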

Hence my questions:

Q1) How is the USBOTG1 controller connected to the internal bus fabric? The reference manual does not have any details on this.

Q2) Is it possible to configure arbitration priority for the USBOTG1 bus master?

Udo

udoeb
Contributor II

Hi again,

You can probably reproduce the issue by using USB code from the SDK. However, I don't think it includes a composite example which combines bulk and isoch. You will need to create one with CDC/ACM (bulk IN/OUT) and Audio 2.0 (with isoch IN).

Make sure the following conditions are met:

- The isoch IN endpoint is set to bInterval=1 (125us packet interval).

- The TDs queued to bulk IN and OUT endpoints are at least 2KB in size.

If you transfer a contiguous data stream through the bulk endpoints, e.g. by feeding the bulk IN EP with dummy data and discarding the bulk OUT data, you should be able to observe that the isoch IN EP is not able to answer every IN token it receives from the host (at 125us intervals). So there will be gaps in the audio stream.
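As an illustration of the first condition, a high-speed isoch IN endpoint descriptor with bInterval=1 could look like the sketch below. The endpoint number, the synchronization type and the 800-byte packet size (taken from the original post) are assumptions for the example, and the helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Standard 7-byte endpoint descriptor for a high-speed isoch IN endpoint.
 * Endpoint number, sync type and packet size are assumptions. */
static const uint8_t iso_in_ep_desc[7] = {
    7u,    /* bLength */
    0x05u, /* bDescriptorType: ENDPOINT */
    0x81u, /* bEndpointAddress: IN, endpoint 1 (assumed number) */
    0x05u, /* bmAttributes: isochronous, asynchronous (assumed sync type) */
    0x20u, 0x03u, /* wMaxPacketSize: 800 bytes, one transaction per microframe */
    0x01u  /* bInterval = 1 */
};

/* High-speed isoch period: 2^(bInterval-1) microframes of 125 us each. */
static unsigned hs_iso_period_us(uint8_t b_interval)
{
    assert(b_interval >= 1u && b_interval <= 16u);
    return 125u << (b_interval - 1u);
}

/* Payload size encoded in bits 10..0 of wMaxPacketSize (little-endian). */
static unsigned ep_max_packet_bytes(const uint8_t *desc)
{
    return ((unsigned)desc[4] | ((unsigned)desc[5] << 8)) & 0x7FFu;
}
```

With bInterval=1 the host issues one IN token per 125us microframe; bInterval=4 would correspond to the 1ms period of a typical full-speed style audio stream.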

Also, if you can prepare such a composite example and provide it, we can help with finding a test setup that exhibits the issue. We are not familiar with the USB drivers from the SDK. Hence, creating the sample code on our side would be quite an effort.

Udo

Hui_Ma
NXP TechSupport

Hi Udo,

Sorry for the late reply.

I need to double-check this issue with the RT product team.

I will post an update when there is any feedback.

Thanks for the patience.

best regards,

Mike
