RT117x: ENET_1G performance


mastupristi
Senior Contributor I

hello,

I ran some tests using the lwip_iperf_enet_qos_bm example. I chose it because it exists in both an M7 and an M4 version. I don't know what differs from the non-QoS version.

Some data:
MCUXpresso IDE v24.9 [Build 25] [2024-09-26]
SDK 2.16.100
board RT1170-EVKB

My PC is:
cpu: AMD Ryzen Threadripper 2990WX 3.0GHz (4.3GHz turbo)
mem: 32GB DDR4
net: Intel I211 Gigabit Network
OS: xubuntu linux 24.04
iperf: 2.1.9

To test the PC NIC performance I ran iperf against another PC. The performance of my NIC when communicating with another PC is as follows (in Mbit/s):

mastupristi_0-1731772742982.png

To explain: I ran the server on my PC, and on the client I used the dualtest mode (simultaneous bidirectional, which I labelled COMBO) and the tradeoff mode (which runs the RX and TX tests sequentially). I verified that the tradeoff mode achieves the same performance as the stand-alone RX and TX modes.
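For reference, in iperf 2 the `-d` (`--dualtest`) flag runs both directions at once and `-r` (`--tradeoff`) runs them one after the other. A minimal sketch of how the client-side command lines for such runs might be assembled is shown here; the helper name, the peer address, the 10-second duration, and the 1000M UDP target rate are assumptions for illustration, not values taken from the tests above.

```python
def iperf2_client_args(server_ip, mode="tx", udp=False, seconds=10, udp_rate="1000M"):
    """Build an iperf 2 client command line for one test run (hypothetical helper)."""
    args = ["iperf", "-c", server_ip, "-t", str(seconds)]
    if udp:
        # UDP needs an explicit target rate; otherwise iperf 2 sends at a low default.
        args += ["-u", "-b", udp_rate]
    if mode == "dual":
        args.append("-d")  # --dualtest: both directions at the same time
    elif mode == "tradeoff":
        args.append("-r")  # --tradeoff: one direction, then the other
    elif mode != "tx":
        raise ValueError(f"unknown mode: {mode}")
    return args


PEER_IP = "192.168.0.102"  # placeholder address, adjust to the actual server

for mode in ("dual", "tradeoff"):
    print(" ".join(iperf2_client_args(PEER_IP, mode)))
    print(" ".join(iperf2_client_args(PEER_IP, mode, udp=True)))
```

The server side would simply be `iperf -s` for TCP or `iperf -s -u` for UDP.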

On the board I ran the M7 build from both flash and RAM, and the M4 build only from RAM because I could not get that example to run from flash. All the examples are attached (including the M4 flash one).

the results are as follows:

mastupristi_1-1731772765691.png

To explain: I measured TCP RX and TX using item 4 of the menu on the board, and TCP COMBO using item 3.

Similarly for UDP, using items 9 and 8.

I'm not sure about the COMBO data, but I think the TX and RX figures are reliable and pretty much in line with what we should expect from 1G Ethernet. My only doubt is about UDP on the M7, where TX is much lower than RX and much lower than TX on the M4.

The examples are attached. Can you confirm that you get numbers comparable to mine?

Can you give an explanation why UDP in TX is slower on M7 than on M4? And why on M7 UDP TX is 13 times slower than RX?

Is this performance the maximum achievable by the combination of this peripheral + bus + core, or is the bottleneck the SW and so working on it could achieve more? And how much more? A few or many percentage points?
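As a rough upper bound for the link itself (not for the core, bus, or driver), the best-case goodput on gigabit Ethernet follows from the per-frame overhead. The sketch below assumes standard 1500-byte MTU frames, IPv4 without options, no VLAN tag, and TCP without options; the function and constant names are illustrative only.

```python
# Per-frame overhead on the wire, in bytes (standard Ethernet, no VLAN tag).
PREAMBLE_SFD = 8      # preamble plus start-of-frame delimiter
ETH_HEADER = 14       # destination MAC, source MAC, EtherType
FCS = 4               # frame check sequence
INTERFRAME_GAP = 12   # minimum idle time between frames, expressed in bytes
IPV4_HEADER = 20      # assumes no IP options
UDP_HEADER = 8
TCP_HEADER = 20       # assumes no TCP options (timestamps would add 12 bytes)
MTU = 1500


def goodput_mbps(l4_header, link_mbps=1000.0):
    """Best-case payload rate for full-size frames on an otherwise idle link."""
    payload = MTU - IPV4_HEADER - l4_header
    wire_bytes = PREAMBLE_SFD + ETH_HEADER + MTU + FCS + INTERFRAME_GAP
    return link_mbps * payload / wire_bytes


udp_max = goodput_mbps(UDP_HEADER)
tcp_max = goodput_mbps(TCP_HEADER)
print(f"UDP wire limit: {udp_max:.1f} Mbit/s")
print(f"TCP wire limit: {tcp_max:.1f} Mbit/s")
```

Under these assumptions the wire limit is roughly 957 Mbit/s for UDP and 949 Mbit/s for TCP, so any measured figure can be read as a fraction of that ceiling.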

Note that the RAM used in the attached examples is exclusively TCM.

 

regards

Max

7 Replies

EdwinHz
NXP TechSupport

Hi @mastupristi,

From your projects attached, it is worth noting that you have enabled SW checksum, which has a high impact on performance.

Please run the iperf test using the default configuration of the latest SDK (2.16.100) to better understand the results.

BR,
Edwin.

mastupristi
Senior Contributor I

Hi @EdwinHz ,

thank you for your reply. However, I think there might have been some misunderstanding.

To begin with, as I clearly stated in my original message, I used the example from SDK 2.16.100 ("lwip_iperf_enet_qos_bm"). Specifically, I made minimal changes to ensure my tests ran as intended and even attached the projects to my post for your convenience. I am curious: are you suggesting that in the official SDK example, hardware checksum offloading is disabled by default? If that is the case, could you kindly clarify where exactly this setting is configured?

That said, I must point out that the main focus of my inquiry wasn’t about how to increase performance (though that could be interesting too). My main concerns were:

  1. The significant performance disparity I observed in certain scenarios, particularly UDP TX on M7 versus M4. Why does UDP TX on the M7 perform 13x worse than RX, and why does M7 TX lag so much behind M4?
  2. Theoretical performance limits of this combination of peripheral, bus, and core. Are the results I obtained close to the hardware's best-case scenario, or is there room for improvement by tweaking the code or using peripherals differently? And if there is room for improvement, would it be worth pursuing (e.g., are we talking about single-digit or double-digit percentage gains)?

While I appreciate the suggestion about hardware checksum offloading, it doesn't seem to address my specific concerns about performance disparities or theoretical limits. If you still believe it’s relevant, could you please explain how it might account for the discrepancies I observed?

Looking forward to your insights.

Best regards,
Max

EdwinHz
NXP TechSupport

Hi @mastupristi,

Thanks for your clarifications. The thing is, the disparity you are describing does not match the results we have previously obtained. These are the previous results we have for CM7:

CM7
Iperf UDP TX 954 Mbps
Iperf UDP RX 905 Mbps

 

There seems to be something wrong with the Iperf UDP RX in your tests. This is why I made the earlier suggestion to disable the checksum.

This can be done under Project Properties > C/C++ Build > Settings > Tool Settings > Preprocessor:

EdwinHz_0-1732730334211.png

BR,
Edwin.

 

mastupristi
Senior Contributor I

Hi @EdwinHz ,

thank you for your response, but I must admit I’m rather baffled.

First, you suggested I modify the preprocessor settings, pointing to a potential checksum issue. However, as I’ve stated (and as you could easily verify by opening the projects I attached to my original message), the settings in my projects are identical to the ones in the SDK example. Indeed, it could not be otherwise since my projects are almost identical to the SDK project from which I derived them. Did you actually check the projects? Or are you basing your suggestion solely on assumptions?

Here is a screenshot:

mastupristi_0-1732789181078.png

 

Second, you referenced “previous results” showing 954 Mbps for UDP TX and 905 Mbps for UDP RX on CM7. This is intriguing, as it’s the first time I’ve heard about these numbers. Could you please provide more details? For instance:

  1. What was the exact configuration of the test setup (SDK version, board, core, flash/TCM, etc.)?
  2. Were the results obtained with a release build or a debug build?
  3. Were hardware checksum offloading and other relevant optimizations enabled in those tests?

Finally, I feel like my core questions are still being sidestepped. Let me reiterate:

  1. Why does UDP TX on M7 perform so much worse than UDP RX in my tests?
  2. Why does M7 UDP TX underperform compared to M4?
  3. Are the results I obtained close to the theoretical maximum achievable by the hardware, or can they be improved further? If improvements are possible, are we talking about minor gains (e.g., 1–2%) or significant ones (e.g., >10%)?

These are critical questions for evaluating whether further development effort is worthwhile, yet I feel they’ve been overlooked in favor of generic suggestions.

I urge you to review the attached projects thoroughly before suggesting changes again. If something in my setup deviates from your “previous results,” I’d appreciate precise feedback based on actual analysis. Otherwise, this back-and-forth isn’t helping us reach a meaningful conclusion.

Looking forward to receiving a detailed and well-informed response.

Best regards,
Max

EdwinHz
NXP TechSupport

Hi @mastupristi,

It seems like the issue might lie in the SDK's drivers. These are the performance tests done on both SDK v2.16.100 and 2.16.000:

jia_guo_0-1732863579076.png

As you can see, performance in the previous 2.16.0 is much higher and in line with the expected results. This is probably the issue you are seeing, and what is causing both the bad performance on UDP TX and CM7's underperformance compared to CM4.

I will investigate this issue further with the internal SDK team, but in the meantime, if you wish to execute without those issues, please use SDK v2.16.0 instead.

BR,
Edwin.

 

mastupristi
Senior Contributor I

Hi @EdwinHz 

Just to look at things from another point of view:

mastupristi_0-1733178953363.png

This directly compares SDK 2.16.0 and 2.16.1.
You say:

"As you can see, performance in the previous 2.16.0 is much higher and in line with the expected results."

This is only true for ENET 1G.

"This is probably the issue you are seeing, and what is causing both bad performance on UDP TX, as well as CM7's underperformance compared to CM4."

How can you be sure of this statement?
What clues lead you to this conclusion?

Looking at the data, I see conflicting clues: in my case (UDP on CM7, ENET QoS, flash, -O3) RX is 13 times faster than TX, whereas in your case RX is 1.29 times faster than TX.

Also, you get much better performance with ENET 1G, but I get much better performance with QoS as well, whereas you see a very big difference between the two ENETs, other conditions being equal. Does this depend on the peripherals or the driver?

mastupristi_1-1733180055985.png

In conclusion, there are still several points that need to be clarified.

I find it hard to believe that the problems lie (only) in version 2.16.1.

 

best regards

Max

 

EdwinHz
NXP TechSupport

Hi @mastupristi,

"How can you be sure of this statement?
What clues lead you to this conclusion?"

The chart that I previously shared with you shows the expected performance for ENET 1G and ENET QoS with the default example code on both SDK 2.16.0 and 2.16.1, and neither shows a low TX like the one in your initial chart.

"Does this depend on the peripherals or the driver?"

This would have to depend on the peripherals, yes.

BR,
Edwin.
