[t1040d4rdb-64b][dpaa] "Err FD status = 0x00040000" log and kernel panic when the NIC is set to "duplex half"

lzhou · 12-10-2020

Hi,

On the t1040d4rdb board, we set the NIC to half duplex:

e.g. ethtool -s eth2 speed 10 duplex half

Then we sftp to the board from another PC and send a file to the board's USB card, for example:

put ./test.img

If the file is big enough (e.g. 5M), several errors like the following are always logged:

fsl_dpa fsl,dpaa:ethernet@4 eth2: Err FD status = 0x00040000

If the kernel on the board was started via kexec, e.g.

kexec -l kernel --initrd=rootfs.cpio.gz --append "root=/dev/ram rw console=ttyS0,115200 no_console_suspend ip=dhcp"
kexec -e

the same test usually ends in a panic as below (the "Err FD status" messages above still appear):

root@localhost:~# [ 331.561442] Unable to handle kernel paging request for data at address 0x6b6b6b6b6b6b6c33
[ 331.561505] Unable to handle kernel paging request for data at address 0x6b6b6b6b6b6b6c33
[ 331.561508] Faulting instruction address: 0xc000000000623a78
[ 331.561514] Oops: Kernel access of bad area, sig: 11 1
[ 331.561518] SMP NR_CPUS=4 CoreNet Generic
[ 331.561523] Modules linked in: mpc85xx_edac edac_core
[ 331.561530] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.1.21-rt13-WR8.0.0.32_preempt-rt #1
[ 331.561535] task: c0000001f22b0000 ti: c0000001fffac000 task.ti: c0000001f22b8000
[ 331.561540] NIP: c000000000623a78 LR: c000000000623a4c CTR: c000000000447b00
[ 331.561544] REGS: c0000001fffaf730 TRAP: 0300 Not tainted (4.1.21-rt13-WR8.0.0.32_preempt-rt)
[ 331.561553] MSR: 0000000080029000 <CE,EE,ME> CR: 88000224 XER: 00000000
[ 331.561574] DEAR: 6b6b6b6b6b6b6c33 ESR: 0000000000000000 SOFTE: 1
[ 331.561574] GPR00: c000000000622850 c0000001fffaf9b0 c000000000ec0a00 c0000001e91cf7d8
[ 331.561574] GPR04: 00000001e9d869fa 0000000000002780
[ 331.561575] Unable to handle kernel paging request for data at address 0x6b6b6b6b6b6b6c33
[ 331.561579] c0000001e9d869fa
[ 331.561579] Faulting instruction address: 0xc000000000623a78
[ 331.561623] 0000000000000000
[ 331.561623] GPR08: 00000001ff150000 0000000000000000 8000000000000000 00000001ff150000
[ 331.561623] GPR12: 7265677368657265 c00000000fffed80 0000000000000000 c000000000d362c0
[ 331.561623] GPR16: 0000000100001ebe 000000000000000a c000000000d362c0 c000000000d30510
[ 331.561623] GPR20: 000000000000012c 0000000000000001 0000000000000000 c000000000ecb180
[ 331.561623] GPR24: 0000000000000001 ffffffffffffff80 c000000000000000 6b6b6b6b6b6b6b6b
[ 331.561623] GPR28: c0000001f2621820 c0000001e9215a80 04000036d000011d 00000001e9d869fa
[ 331.561636] NIP [c000000000623a78] ._dpa_cleanup_tx_fd+0xb8/0x350
[ 331.561642] LR [c000000000623a4c] ._dpa_cleanup_tx_fd+0x8c/0x350
[ 331.561643] Call Trace:
[ 331.561650] [c0000001fffaf9b0] [c0000001ffe90d1c] 0xc0000001ffe90d1c (unreliable)
[ 331.561658] [c0000001fffafa70] [c000000000622850] .priv_tx_conf_default_dqrr+0xa0/0x270
[ 331.561667] [c0000001fffafb10] [c0000000006bcf20] .qman_p_poll_dqrr+0x1c0/0x2b0
[ 331.561674] [c0000001fffafbe0] [c000000000622274] .dpaa_eth_poll+0x34/0x90
[ 331.561682] [c0000001fffafc70] [c00000000070b820] .net_rx_action+0x240/0x3e0
[ 331.561692] [c0000001fffafd70] [c00000000004f06c] .__do_softirq+0x14c/0x480
[ 331.561699] [c0000001fffafe90] [c00000000004fc44] .irq_exit+0x94/0xb0
[ 331.561709] [c0000001fffaff00] [c000000000005844] .__do_irq+0xa4/0x1f0
[ 331.561716] [c0000001fffaff90] [c000000000016a40] .call_do_irq+0x14/0x24
[ 331.561722] [c0000001f22bb9d0] [c000000000005a20] .do_IRQ+0x90/0x140
[ 331.561731] [c0000001f22bba70] [c00000000001b888] exc_0x500_common+0xd8/0xdc

From my checking, the panic occurs because the skb address obtained from the frame descriptor is 0x6b6b6b6b6b6b6b6b, which is not a valid pointer.

This only happens with half duplex. Full duplex works fine, as do small files with half duplex.

Could you please help us with this? Thank you.
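
For what it's worth, the 0x6b byte pattern matches the kernel's POISON_FREE value (include/linux/poison.h), which the slab allocator writes into freed objects when slab poisoning is enabled. Assuming that is what happened here (the log alone does not prove it), the skb pointer was probably read from memory that had already been freed, and the faulting address is simply the poisoned pointer plus a field offset. A quick shell sketch of the arithmetic, using only values from the oops above:

```shell
#!/bin/sh
# Pointer-sized slab "freed object" poison: every byte is 0x6b (POISON_FREE).
POISON=0x6b6b6b6b6b6b6b6b
# Faulting data address (DEAR) copied from the oops above.
DEAR=0x6b6b6b6b6b6b6c33

# Offset of the failing access relative to the poisoned pointer value.
offset=$(( $DEAR - $POISON ))
printf 'offset = %d (0x%x)\n' "$offset" "$offset"
```

This prints offset = 200 (0xc8), which is consistent with a field access through a pointer whose value is the poison pattern (GPR27 in the register dump holds exactly 6b6b6b6b6b6b6b6b).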

lzhou · 12-14-2020

Thank you for your reply. I have tested this on QorIQ SDK 2.0.

I did not test the panic on the SDK because that test involves a reboot using kexec, and setting it up on the SDK needs more work.

As for the repeated "Err FD status = 0x00040000" errors, nothing is printed in the SDK 2.0 test because of the new commit <dpaa_eth: do not print debug message as error> in SDK 2.0. The same status is still present, because the log appears again when this commit is removed.
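
To make the status value easier to read, here is a small shell sketch that tests individual bits of it. The two mask names and values are an assumption, taken from my reading of the FMan frame descriptor error definitions (FM_FD_ERR_*) in the Linux FMan headers; please check them against the header in your own kernel tree before relying on them.

```shell
#!/bin/sh
# Status value copied from the "Err FD status" log line.
status=0x00040000

# ASSUMED mask names/values (verify against the FMan header in your tree).
FM_FD_ERR_PHYSICAL=0x00080000
FM_FD_ERR_SIZE=0x00040000

# Collect the name of every assumed mask whose bit is set in the status.
decoded=""
if [ $(( $status & $FM_FD_ERR_PHYSICAL )) -ne 0 ]; then
    decoded="$decoded FM_FD_ERR_PHYSICAL"
fi
if [ $(( $status & $FM_FD_ERR_SIZE )) -ne 0 ]; then
    decoded="$decoded FM_FD_ERR_SIZE"
fi
decoded=${decoded# }   # drop the leading space

echo "status $status: ${decoded:-no known bit set}"
```

With the value from the log, only bit 18 is set, so under the assumed names the script reports FM_FD_ERR_SIZE.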

My question is:
The panic occurs because the skb address that _dpa_cleanup_tx_fd() gets from the frame descriptor is 0x6b6b6b6b6b6b6b6b, which is not valid.
The frame descriptor is in the queue manager (QMan). Just before the panic, QMan reported several "Err FD status = 0x00040000" errors from the hardware about the frame descriptor. I strongly suspect that all of these issues are related to the hardware error status in QMan or to the dpa ethernet driver.
Could you please help check why the error status only occurs with half duplex when receiving big files?

Thanks.

yipingwang · 12-17-2020

The T1040RM says:

#####

6.5.4 Half Duplex Operation

The MAC can support half-duplex operation for MII and GMII modes of operation. The Half-Duplex function does not implement the Gigabit Half duplex algorithms (burst, carrier extension), but only supports half-duplex for 10/100 networks.

#####

Also, all DPAA-FMAN-dTSEC/mEMAC devices will support 10/100 Mbps Full-Duplex, but 10/100 Mbps Half-Duplex is only supported on devices with a physical MII interface (COL & CRS signals routed externally), i.e. T1040.

Although interfaces like RGMII and SGMII specify 10/100 Mbps Half-Duplex operation (via coding methods), NXP has not implemented Half-Duplex support for these interfaces.

Can you confirm which interface you are having the problem with, RGMII or GMII?

In the error case, can you check the mEMAC register Transmit Late Collision Counter Register (TLCOLn): frame corrupted / discarded (half-duplex only)?

One more note from the T1040RM:

Table 31-2. MAC Capabilities

3. For FMan, half duplex is supported for 10/100 Mbps SGMII, but some collisions may be recorded as late-collisions.

lzhou · 12-17-2020

The PHY connection type of the interface where the error occurs is "rgmii". Can I then conclude that NXP has not implemented Half-Duplex (10M/100M speed) support for it? Thanks.

yipingwang · 12-20-2020

Here is what I found in our documentation.

FMAN based devices will only support Half-Duplex if the SoC supports MII or GMII. All other interfaces are only Full-Duplex.

So if the customer wants 10/100 half duplex, please connect via GMII.

lzhou · 01-21-2021

Hi, Yiping:

Our customer easily hits panics after the link auto-negotiates to half duplex. We think this is related to their use of an sgmii PHY connection. Is there any way to disable auto-negotiation for duplex only? Thanks.
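
One approach that might be worth trying, assuming the PHY and its driver honour the advertised link modes: keep autonegotiation on but advertise only full-duplex modes through ethtool's advertise mask. The bit values below are the ones listed in the ethtool(8) man page; eth2 is simply the interface name used earlier in this thread. The sketch only builds and prints the command, so it can be reviewed before being run on the board.

```shell
#!/bin/sh
# Link-mode bits from the ethtool(8) man page, "advertise" option.
ADV_10_FULL=0x002     # 10baseT Full
ADV_100_FULL=0x008    # 100baseT Full
ADV_1000_FULL=0x020   # 1000baseT Full

# OR the full-duplex modes together; the half-duplex bits
# (0x001, 0x004, 0x010) are left clear on purpose.
mask=$(printf '0x%03x' $(( $ADV_10_FULL | $ADV_100_FULL | $ADV_1000_FULL )))

# eth2 is taken from the earlier example; adjust to the affected interface.
echo "ethtool -s eth2 autoneg on advertise $mask"
```

The resulting mask is 0x02a. Note that a link partner that can only do half duplex would then have no common mode to negotiate, so the link may stay down in that case.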

lzhou · 12-23-2020

Hi, Yiping:

Could you share the document that covers this point? We are having some difficulty convincing our customer of this. Thank you.

yipingwang · 12-13-2020

According to your log, it seems that you are not using the Linux kernel from an officially released NXP/Freescale SDK.

I uploaded the uImage and dtb files from the QorIQ SDK 2.0 release to the following link; please use them for verification.

https://drive.google.com/file/d/1m3N8VjyBpnySrPnxoLdTfcSQf56_zV1L/view?usp=sharing