How does the DPDK sender exit the congested state?

liushuangxin · ‎08-10-2023

Hello NXP, we use a 2160 CPU with LSDK 20.4, Linux 5.4.3, and DPDK 19.11.

We use dpmac4 as a 10G port and bind it to the DPDK driver to implement a user-mode protocol stack. We have a problem with the dpaa2 driver's packet transmit interface in DPDK: after the peer device of the 10G port restarts repeatedly, the dpaa2_dev_tx function in DPDK enters the abnormal branch shown below.

uint16_t
dpaa2_dev_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
{
    ...
    /* Check if the queue is congested */
    retry_count = 0;
    while (qbman_result_SCN_state(dpaa2_q->cscn)) {
        retry_count++;
        /* Retry for some time before giving up */
        if (retry_count > CONG_RETRY_COUNT)
            goto skip_tx;
    }
    ...
}

As a result, the send is skipped. I checked the relevant DPAA documentation and found that the queue had entered the congested state, which is what sends execution into the abnormal branch and makes the send fail.

The document provides the following description:
4.2.14.6 Congestion State Change Notifications (CSCN)
In addition to accepting or rejecting enqueues based on congestion, QMan is able to
notify certain producers when a congestion group's instantaneous count (I_CNT field in
the CG) exceeds the CS entrance threshold. When this threshold is exceeded, the CS bit
is set in the CGR, and the congestion group is said to have entered congestion. When the
group's I_CNT returns below the exit threshold (CS_THRES_X), the CS bit is cleared,
and the congestion group's state exits congestion.

I would like to ask how we can read the instantaneous count (I_CNT) value, and why the queue does not exit the congested state.
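
The entrance/exit behaviour in the excerpt above is a hysteresis on I_CNT, which can be sketched as a small model. This is only an illustration under stated assumptions: `cg_model`, `cg_init`, `cg_update`, `cg_enqueue`, and `cg_dequeue` are hypothetical names, the thresholds are arbitrary, and nothing here reads real QMan state or uses the DPDK/QBMan API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one congestion group. This is NOT the hardware
 * register layout; it only mirrors the rules quoted from the manual. */
struct cg_model {
    uint32_t i_cnt;        /* instantaneous count (I_CNT) */
    uint32_t entry_thres;  /* CS entrance threshold (illustrative name) */
    uint32_t exit_thres;   /* CS exit threshold (CS_THRES_X in the excerpt) */
    bool     cs;           /* congestion state (CS) bit */
};

static void cg_init(struct cg_model *cg, uint32_t entry_thres,
                    uint32_t exit_thres)
{
    /* Assumption of this sketch: the exit threshold is not above the
     * entrance threshold. */
    assert(exit_thres <= entry_thres);
    cg->i_cnt = 0;
    cg->entry_thres = entry_thres;
    cg->exit_thres = exit_thres;
    cg->cs = false;
}

/* Re-evaluate the CS bit after I_CNT has changed. */
static void cg_update(struct cg_model *cg)
{
    if (!cg->cs && cg->i_cnt > cg->entry_thres)
        cg->cs = true;   /* I_CNT exceeded the entrance threshold */
    else if (cg->cs && cg->i_cnt < cg->exit_thres)
        cg->cs = false;  /* I_CNT returned below the exit threshold */
}

/* Producer side: n more units are queued. */
static void cg_enqueue(struct cg_model *cg, uint32_t n)
{
    cg->i_cnt += n;
    cg_update(cg);
}

/* Consumer side: up to n units leave the queue. */
static void cg_dequeue(struct cg_model *cg, uint32_t n)
{
    cg->i_cnt -= (n > cg->i_cnt) ? cg->i_cnt : n;
    cg_update(cg);
}
```

In this model the CS bit clears only through `cg_dequeue`: if nothing drains the queue, I_CNT stays above the exit threshold and the group never leaves congestion, however long the producer waits.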

yipingwang · ‎08-22-2023

I checked the customer's log. The message "[W, DPNI] ID[0]: tail drop enabled may have conflicts with flow control settings" appears because the 10G port is restarted repeatedly. I don't think this warning caused the issue.

I also found some errors in the customer's log:
1. Please ask the customer to review their DPL file. The MC firmware skipped processing the rest of the DPL file because it detected an illegal configuration.
[E, open_device:2374, RESMAN] no device
[E, dpcon_probe_cb:554] Can't create DPCON 0000
[E, subnode_process:155] Probing module 'dpcon' return error code -19. Continue dpl processing...
[E, dpl_process:524] Error while parsing 'connections'. Skip processing the rest of DPL.
[E, main:186] DPL processing failed; continuing...

2. An "internal physical link down" event occurred on DPMAC[13] and DPMAC[14]. I think this is the cause of the issue: once the internal physical link went down, the Tx port stopped dequeuing frames from the Tx queue. After that, more and more frames stayed pending in the Tx queue, which caused the Tx congestion.

So I suggest the customer check these items:
1. The DPL file. Fix the configuration errors until the MC no longer reports "Skip processing the rest of DPL".
2. If the customer uses a fiber port on their board, they can try unplugging the fiber cable while the LX2 is sending and receiving traffic, to check whether this failure can be reproduced.
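
The failure mode described in item 2 of the log analysis (the Tx side stops dequeuing, so the congestion never clears and the bounded retry gives up) can be sketched as a toy model. Everything here is an assumption made for illustration: `txq_model`, `txq_drain`, `txq_send`, and `MODEL_CONG_RETRY_COUNT` are hypothetical, the thresholds and retry limit are arbitrary, and the code does not call into DPDK or the MC firmware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical retry limit; not the value of CONG_RETRY_COUNT in DPDK. */
#define MODEL_CONG_RETRY_COUNT 1000u

/* Toy Tx queue with entrance/exit congestion thresholds. */
struct txq_model {
    uint32_t pending;      /* frames waiting in the Tx queue */
    uint32_t entry_thres;  /* congestion entrance threshold */
    uint32_t exit_thres;   /* congestion exit threshold */
    bool     congested;    /* modelled congestion state */
    bool     link_up;      /* when false, the hardware side stops draining */
};

static void txq_update(struct txq_model *q)
{
    if (!q->congested && q->pending > q->entry_thres)
        q->congested = true;
    else if (q->congested && q->pending < q->exit_thres)
        q->congested = false;
}

/* One drain step by the hardware side; does nothing while the link is down. */
static void txq_drain(struct txq_model *q, uint32_t n)
{
    if (!q->link_up)
        return;
    q->pending -= (n > q->pending) ? q->pending : n;
    txq_update(q);
}

/* Bounded-retry send shaped like the loop in the question.
 * Returns the number of frames accepted, or 0 if it gave up. */
static uint32_t txq_send(struct txq_model *q, uint32_t nb_pkts)
{
    uint32_t retry_count = 0;

    while (q->congested) {
        txq_drain(q, 1);  /* hardware may drain while the sender spins */
        retry_count++;
        if (retry_count > MODEL_CONG_RETRY_COUNT)
            return 0;     /* same effect as "goto skip_tx" */
    }
    q->pending += nb_pkts;
    txq_update(q);
    return nb_pkts;
}
```

With `link_up` set to false, every call to `txq_send` on a congested queue returns 0 and `pending` never changes; once `link_up` is true again, the queue drains below the exit threshold and sends resume. In this model no amount of retrying on the sender side helps until the dequeue side is working again, which is why the advice targets the link and DPL errors rather than the retry loop.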


yipingwang · ‎08-21-2023

Any update?

liushuangxin · ‎08-21-2023

I have attached some serial port information, can you help analyze it, or provide some debug ideas? Thank you.

liushuangxin · ‎09-14-2023

Hi Yiping,
Thank you for your suggestion. The problem is now solved. I made the changes you suggested and also upgraded DPDK to 19.11.6 and MC to 10.29.x. Anyone with a similar problem can use this as a reference. I wish you a happy life.

yipingwang · ‎08-17-2023

I have one question: what is the interface mode of the 10G port? After the 10G port restarts, does the customer see a PCS link down event reported by the MC firmware?

Please provide the MC log from the MC console:

cat /dev/fsl_mc_console

liushuangxin · ‎08-21-2023

I am sorry that I did not reply in time; the problem takes a long time to reproduce. The 10G port mode is XFI. According to restool, the link status is up, but the receive count keeps growing while the send count does not. I checked mc_console and found many occurrences of "[W, DPNI] ID[0]: tail drop enabled may have conflicts with flow control settings". I wonder whether this has any impact on the problem. I have attached some serial port information. Can you help analyze it, or suggest some debugging ideas? Thank you.