Gianfar skb errors

rro · ‎08-10-2020

Hello,

CPU: e300c1, MPC8343A, Rev: 3.0

I am experiencing kernel panics on kernel 4.19.87 that appear to be related to socket buffer corruption. Several different crashes occur when traffic is sent or received. Some occur immediately; others take up to a minute or so after everything is up and configured.

1:

queue_mapping=1 skbaddr=cf929000 protocol=0x0800 ip_summed=0 len=303 data_len=0 network_offset=-78 transport_offset_valid=0 transport_offset=65457 tx_flags=43 gso_size=5 gso_segs=2 gso_type=0x0

[  125.564168] Unable to handle kernel paging request for data at address 0x81000000
[  125.571671] Faulting instruction address: 0xc03c42ec
[  125.576656] Oops: Kernel access of bad area, sig: 11 [#1]
[  125.582065] BE PREEMPT eMPC
[  125.584887] CPU: 0 PID: 0 Comm: swapper Not tainted kernel_upgrade #5
[  125.594567] NIP:  c03c42ec LR: c03c3f60 CTR: c02a6084
[  125.599636] REGS: cfff5ca0 TRAP: 0300   Not tainted  (kernel_upgrade )
[  125.609310] MSR:  00009032 <EE,ME,IR,DR,RI>  CR: 24028224  XER: 20000000
[  125.616054] DAR: 81000000 DSISR: 20000000
[  125.616054] GPR00: c03c3f60 cfff5d50 c067b420 0000001c d101c501 00000000 00000aac c06f1c1b
[  125.616054] GPR08: c06cac48 00000000 cfff4000 cfff5d10 44028224 00900000 cf97c878 00480020
[  125.616054] GPR16: 00480020 c0555a98 cf9290a8 00000000 c0607325 c0607313 c0555aa4 00000800
[  125.616054] GPR24: 00000001 cfff5e28 c0555874 c0555850 c0555810 81000000 cf929000 c0607325
[  125.653754] NIP [c03c42ec] skb_copy_ubufs+0x484/0x4d4
[  125.658827] LR [c03c3f60] skb_copy_ubufs+0xf8/0x4d4
[  125.663714] Call Trace:
[  125.666177] [cfff5d50] [c03c3f60] skb_copy_ubufs+0xf8/0x4d4 (unreliable)
[  125.672916] [cfff5da0] [c03d29a4] __netif_receive_skb_core+0x9c0/0xbd8
[  125.679472] [cfff5e20] [c03d2bf0] __netif_receive_skb_one_core+0x34/0x60
[  125.686208] [cfff5e40] [c03d77f0] netif_receive_skb_internal+0x7c/0xec
[  125.692766] [cfff5e50] [c03d8a24] napi_gro_receive+0xf8/0x124
[  125.698547] [cfff5e70] [c0338610] gfar_clean_rx_ring+0x640/0x674
[  125.704581] [cfff5f00] [c0338808] gfar_poll_rx_sq+0x48/0xdc
[  125.710180] [cfff5f20] [c03d92b0] net_rx_action+0x12c/0x308
[  125.715788] [cfff5f80] [c051b6b0] __do_softirq+0x230/0x32c
[  125.721313] [cfff5fe0] [c00237c0] irq_exit+0x80/0xa0
[  125.726309] [cfff5ff0] [c000e3a4] call_do_irq+0x24/0x3c
[  125.731566] [c06cde80] [c00069c0] do_IRQ+0xb8/0xe0

2:

queue_mapping=1 skbaddr=cf9290c0 protocol=0x0800 ip_summed=0 len=78 data_len=0 network_offset=-78 transport_offset_valid=0 transport_offset=65457 tx_flags=2 gso_size=0 gso_segs=0 gso_type=0x0

[ 63.970458] Unable to handle kernel paging request for data at address 0x008c001a
[ 63.977966] Faulting instruction address: 0xc03c32a8
[ 63.982951] Oops: Kernel access of bad area, sig: 11 [#1]
[ 63.988360] BE PREEMPT eMPC
[ 63.991183] CPU: 0 PID: 0 Comm: swapper Not tainted kernel_upgrade #9
[ 64.000862] NIP: c03c32a8 LR: c03c2fb0 CTR: 00000000
[ 64.005933] REGS: cfff5c30 TRAP: 0300 Not tainted (kernel_upgrade)
[ 64.015606] MSR: 00009032 <EE,ME,IR,DR,RI> CR: 44088224 XER: 00000000
[ 64.022350] DAR: 008c001a DSISR: 20000000
[ 64.022350] GPR00: c03c2fb0 cfff5ce0 c067b420 008c001a 000000ff fedac247 41434143 04010000
[ 64.022350] GPR08: 00000000 00000000 00000000 cfff5d00 44044224 00900000 cf97c878 00480020
[ 64.022350] GPR16: 00000054 00000000 00000000 ce30c078 c05c3782 00000003 c0a80019 c0a800ff
[ 64.022350] GPR24: 00000089 c06d3fbc 00000000 c06ba938 cec666a8 00000000 cec66680 cf9290c0
[ 64.060047] NIP [c03c32a8] kfree_skb_list+0x24/0x40
[ 64.064947] LR [c03c2fb0] skb_release_data+0xc8/0x208
[ 64.070009] Call Trace:
[ 64.072474] [cfff5ce0] [c03c59f8] skb_checksum+0x38/0x48 (unreliable)
[ 64.078944] [cfff5cf0] [c03c2fb0] skb_release_data+0xc8/0x208
[ 64.084715] [cfff5d10] [c03c3164] __kfree_skb+0x24/0x3c
[ 64.089973] [cfff5d20] [c044f3d8] __udp4_lib_rcv+0x6b4/0x8a8
[ 64.095665] [cfff5d80] [c041b940] ip_local_deliver_finish+0x118/0x244
[ 64.102132] [cfff5da0] [c041c534] ip_local_deliver+0x68/0xec
[ 64.107816] [cfff5de0] [c041c610] ip_rcv+0x58/0xc0
[ 64.112636] [cfff5e20] [c03d2c98] __netif_receive_skb_one_core+0x58/0x60
[ 64.119372] [cfff5e40] [c03d7874] netif_receive_skb_internal+0x7c/0xec
[ 64.125928] [cfff5e50] [c03d8aa8] napi_gro_receive+0xf8/0x124
[ 64.131707] [cfff5e70] [c033862c] gfar_clean_rx_ring+0x640/0x674
[ 64.137740] [cfff5f00] [c0338824] gfar_poll_rx_sq+0x48/0xdc
[ 64.143339] [cfff5f20] [c03d9334] net_rx_action+0x12c/0x308
[ 64.148945] [cfff5f80] [c051b7c0] __do_softirq+0x230/0x32c
[ 64.154470] [cfff5fe0] [c00237c0] irq_exit+0x80/0xa0
[ 64.159466] [cfff5ff0] [c000e3a4] call_do_irq+0x24/0x3c
[ 64.164721] [c06cde80] [c00069c0] do_IRQ+0xb8/0xe0

3:

queue_mapping=1 skbaddr=cf9290c0 protocol=0x0800 ip_summed=0 len=78 data_len=0 network_offset=-78 transport_offset_valid=0 transport_offset=65457 tx_flags=160 gso_size=53246 gso_segs=55488 gso_type=0xcffed940

[ 159.781432] BUG: Bad page state in process swapper pfn:0fd4e
[ 159.787209] page:cffed9c0 count:0 mapcount:0 mapping:cf57562c index:0x1
[ 159.793845] flags: 0x0()
[ 159.796408] raw: 00000000 00000100 00000200 cf57562c 00000001 00000000 ffffffff 00000000
[ 159.804518] page dumped because: non-NULL mapping
[ 159.809246] CPU: 0 PID: 0 Comm: swapper Not tainted kernel_upgrade #5
[ 159.818921] Call Trace:
[ 159.821400] [cfff5c60] [c00ae17c] bad_page+0x118/0x11c (unreliable)
[ 159.827698] [cfff5c80] [c00ae3c0] free_pcppages_bulk+0x1b8/0x440
[ 159.833734] [cfff5ce0] [c00afaec] free_unref_page+0x60/0x6c
[ 159.839340] [cfff5cf0] [c03c2f0c] skb_release_data+0xa8/0x208
[ 159.845111] [cfff5d10] [c03c30e0] __kfree_skb+0x24/0x3c
[ 159.850370] [cfff5d20] [c044f2c4] __udp4_lib_rcv+0x6b4/0x8a8
[ 159.856063] [cfff5d80] [c041b82c] ip_local_deliver_finish+0x118/0x244
[ 159.862532] [cfff5da0] [c041c420] ip_local_deliver+0x68/0xec
[ 159.868216] [cfff5de0] [c041c4fc] ip_rcv+0x58/0xc0

There seems to be an issue with how packets are being segmented, almost as if the drivers and subsystems involved disagree on whether GSO/TSO is enabled.
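One hedged way to read the dumps above: in crash 3, gso_size and gso_segs look like the two halves of a single 32-bit kernel address rather than real segmentation values. The sketch below is ordinary user-space C, not driver code. It assumes the two fields are adjacent 16-bit members read on a big-endian CPU, and that kernel virtual addresses start at 0xc0000000, as the addresses in the oops output suggest. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Join two adjacent 16-bit fields the way they would sit in one
 * big-endian 32-bit word: the first field is the high half. */
static uint32_t join_be16_pair(uint16_t first, uint16_t second)
{
	return ((uint32_t)first << 16) | second;
}

/* Rough heuristic, assuming kernel virtual addresses start at
 * 0xc0000000 (an assumption based on the addresses in the logs). */
static int looks_like_kernel_address(uint32_t word)
{
	return word >= 0xc0000000u;
}

/* Self-check using the values printed for crash 3 and crash 1. */
static void check_crash3_fields(void)
{
	/* crash 3: gso_size=53246 gso_segs=55488 gso_type=0xcffed940 */
	uint32_t word = join_be16_pair(53246, 55488);

	assert(word == 0xcffed8c0u);
	assert(looks_like_kernel_address(word));
	assert(looks_like_kernel_address(0xcffed940u));

	/* crash 1: gso_size=5 gso_segs=2 does not look like an address */
	assert(!looks_like_kernel_address(join_be16_pair(5, 2)));
}
```

If that reading is right, the fields in crash 3 hold stale pointer-like data (0xcffed8c0 and 0xcffed940 both sit close to the page:cffed9c0 reported by bad_page), which would point to memory corruption rather than a GSO/TSO configuration mismatch.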

Thank you in advance for any help or information.

rro · ‎01-30-2024

Hello

We are upgrading our kernel again and the same issue has reappeared, so I am reviving this post with some new information.

Commenting out the following line resolves the kernel panic and results in a functioning network. I have not performed any perf tests.

dma_sync_single_range_for_cpu(rx_queue->dev, rxb->dma, rxb->page_offset,
GFAR_RXB_TRUESIZE, DMA_FROM_DEVICE);

https://github.com/torvalds/linux/blob/v4.19/drivers/net/ethernet/freescale/gianfar.c#L3002

This is not a solution, just an observation

Do you have any idea what is going on here?

rro · ‎01-31-2024

moving

dma_sync_single_range_for_cpu(rx_queue->dev, rxb->dma, rxb->page_offset,
GFAR_RXB_TRUESIZE, DMA_FROM_DEVICE);

after

if (gfar_add_rx_frag(rxb, lstatus, skb, first)) {

also seems to "solve" the issue

rro · ‎01-31-2024

more new information...

dma_sync_single_range_for_cpu before the following line --> crash

https://github.com/torvalds/linux/blob/v4.19/drivers/net/ethernet/freescale/gianfar.c#L2955

dma_sync_single_range_for_cpu after this line --> ok

for some reason addressing the original offset (1st half of the page) results in the observed kernel panic

rro · ‎02-01-2024

moving the sync above the build_skb() call also resolves the issue...

Index: drivers/net/ethernet/freescale/gianfar.c
===================================================================
--- drivers/net/ethernet/freescale/gianfar.c	(revision 592823)
+++ drivers/net/ethernet/freescale/gianfar.c	(working copy)
@@ -3047,10 +3047,13 @@
 {
 	struct gfar_rx_buff *rxb = &rx_queue->rx_buff[rx_queue->next_to_clean];
 	struct page *page = rxb->page;
 	bool first = false;
 
+	dma_sync_single_range_for_cpu(rx_queue->dev, rxb->dma, rxb->page_offset,
+				      GFAR_RXB_TRUESIZE, DMA_FROM_DEVICE);
+
 	if (likely(!skb)) {
 		void *buff_addr = page_address(page) + rxb->page_offset;
 
 		skb = build_skb(buff_addr, GFAR_SKBFRAG_SIZE);
 		if (unlikely(!skb)) {
@@ -3059,13 +3062,10 @@
 		}
 		skb_reserve(skb, RXBUF_ALIGNMENT);
 		first = true;
 	}
 
-	dma_sync_single_range_for_cpu(rx_queue->dev, rxb->dma, rxb->page_offset,
-				      GFAR_RXB_TRUESIZE, DMA_FROM_DEVICE);
-

I don't see this change resulting in any unintended consequences or memory leaks, as all addresses passed to dma_sync_single_range_for_cpu() are the same before and after the build_skb() call...

nevertheless, I am still not sure why this is resulting in the observed kernel panics
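A possible explanation, which nothing in this thread confirms: build_skb() places and initialises struct skb_shared_info at the tail of the same buffer that the later sync covers. If the sync invalidates CPU cache lines on this platform, that initialisation could be thrown away before it reaches memory, leaving stale values in fields such as gso_size. The user-space sketch below only checks that the two byte ranges overlap. RXB_TRUESIZE, CACHE_LINE, the frag and shared-info sizes used in the self-check, and all helper names are assumptions for illustration; take the real values from gianfar.h and your kernel config.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only. */
#define RXB_TRUESIZE 2048u /* stands in for GFAR_RXB_TRUESIZE */
#define CACHE_LINE   32u   /* assumed L1 cache line size */

struct range { size_t start, end; }; /* half-open: [start, end) */

static size_t align_up(size_t v, size_t a)
{
	return (v + a - 1) / a * a;
}

/* build_skb(buf, frag_size) puts skb_shared_info at the tail of the
 * buffer, after rounding its size up to a cache line multiple. */
static struct range shinfo_range(size_t page_offset, size_t frag_size,
				 size_t shinfo_size)
{
	struct range r;

	r.start = page_offset + frag_size - align_up(shinfo_size, CACHE_LINE);
	r.end = page_offset + frag_size;
	return r;
}

/* Byte range covered by the sync call in the diff above. */
static struct range sync_range(size_t page_offset)
{
	struct range r = { page_offset, page_offset + RXB_TRUESIZE };
	return r;
}

static int ranges_overlap(struct range a, struct range b)
{
	return a.start < b.end && b.start < a.end;
}

/* Self-check with hypothetical sizes (frag 2048, shared info 160). */
static void check_first_half_of_page(void)
{
	struct range sh = shinfo_range(0, 2048, 160);

	assert(sh.start == 1888 && sh.end == 2048);
	assert(ranges_overlap(sync_range(0), sh));
}
```

With these assumed sizes the shared info always falls inside the synced range for the same half of the page, which would fit the observation that syncing before build_skb() is harmless while syncing after it is not.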

rro · ‎02-06-2024

@Pavel can you provide any insight into why the above changes are necessary?

rro · ‎09-15-2020

Hello,

We have narrowed the issue down to the following commit:

https://github.com/torvalds/linux/commit/75354148ce697266b57c13d051ddffa3bb75fc9e

Without these changes, we experience no crashes.

Specifically, we see changes to the tx_flags, gso_size, gso_segs, and gso_type fields in the SKB after the dma_sync_single_range_for_cpu() call in gfar_get_next_rxbuff(). As the kernel panics above show, these members sometimes contain what appear to be valid values and sometimes contain corrupted values.

Do you have an idea about why these changes might result in the kernel panics as described above?

Pavel · ‎08-11-2020

Perhaps these pages will be helpful for your problem:

https://community.nxp.com/thread/381758

https://bugzilla.kernel.org/show_bug.cgi?id=19692

https://lore.kernel.org/patchwork/patch/909398/

https://cateee.net/lkddb/web-lkddb/GIANFAR.html

Have a great day,
Pavel Chubakov

Pavel · ‎08-10-2020

Is there a problem under U-Boot on your board?

NXP offers an LTIB Linux BSP for the MPC8349 - MPC8343.

Is there a problem if this BSP is used on your board?

Perhaps NXP Professional Services can help you.

Use the following page to request testing and code changes for the new kernel version:

https://www.nxp.com/design/engineering-services/professional-engineering-services:PROFESSIONAL-ENGIN...

Have a great day,
Pavel Chubakov


rro · ‎08-11-2020

Thank you for your response.

No, there is no issue in U-Boot. The kernel has been upgraded from 3.14 to 4.19; with the same U-Boot, there are no issues under 3.14, only under 4.19.

Pavel · ‎09-15-2020

Perhaps NXP Professional Services can help you.

Use the following page to request testing and code changes for the new kernel version:

https://www.nxp.com/design/engineering-services/professional-engineering-services:PROFESSIONAL-ENGIN...

Have a great day,
Pavel Chubakov
