Page allocation failure when running iperf through a 5G module

Sean_Lin_Askey
Contributor I

Dear all,

I want to test 5G downlink UDP throughput with iperf3, but error messages always appear.

A page allocation failure always happens when testing above 950 Mbps.

It looks like an xhci problem.

I've tried increasing min_free_kbytes, but nothing changed. The default value is 22528.

To get higher throughput, I modified rps_cpus, xps_cpus and smp_affinity, and used the ECM dialer.

systemctl disable irqbalanced.service
echo 8 > /sys/class/net/usb0/queues/rx-0/rps_cpus
echo 4 > /sys/class/net/usb0/queues/tx-0/xps_cpus
echo 2 > /proc/irq/52/smp_affinity   # IMX8-WU 271 Edge 5b110000.usb3
udhcpc -i usb0
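
The values written to rps_cpus, xps_cpus and smp_affinity are hexadecimal CPU bitmasks, where bit n selects CPU n. As a minimal sketch (the helper name `mask_to_cpus` is made up for illustration and is not part of any tool), the masks can be decoded like this:

```python
def mask_to_cpus(mask_hex):
    """Return the CPU indices selected by a hex affinity bitmask string."""
    # Wide masks are printed as comma-separated 32-bit groups; join them first.
    value = int(mask_hex.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


if __name__ == "__main__":
    for name, mask in [("rps_cpus", "8"), ("xps_cpus", "4"), ("smp_affinity", "2")]:
        print(name, mask, "->", mask_to_cpus(mask))
```

With the values used here, receive packet steering goes to CPU3, transmit steering to CPU2, and the USB3 interrupt to CPU1.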

root@ctx0800-c0:~# lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 003: ID 05c6:9106 Qualcomm, Inc.
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub

 

Kernel version: Linux version 4.14.98+g5d3f4fe  

Please check the full log in the attachment.

Connecting to host 172.22.1.201, port 5201
Reverse mode, remote host 172.22.1.201 is sending
[  5] local 192.168.225.57 port 39004 connected to 172.22.1.201 port 5201
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec   121 MBytes  1.02 Gbits/sec  0.013 ms  0/87710 (0%)  
[  5]   1.00-2.00   sec   136 MBytes  1.14 Gbits/sec  0.005 ms  0/98432 (0%)  
[  5]   2.00-3.00   sec  26.6 MBytes   223 Mbits/sec  0.006 ms  0/19251 (0%)  
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec  0.006 ms  0/0 (0%)  
[  584.663946] swapper/1: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC), nodemask=(null)
[  584.673118] swapper/1 cpuset=/ mems_allowed=0
[  584.677492] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O    4.14.98+g58921c6 #1
[  584.685502] Hardware name: Freescale i.MX8QXP MEK (DT)
[  584.690646] Call trace:
[  584.693109] [<ffff000008089c48>] dump_backtrace+0x0/0x3c8
[  584.698523] [<ffff00000808a024>] show_stack+0x14/0x20
[  584.703587] [<ffff000008da41c0>] dump_stack+0x9c/0xbc
[  584.708643] [<ffff000008196178>] warn_alloc+0xe8/0x180
[  584.713786] [<ffff000008196d60>] __alloc_pages_nodemask+0xae8/0xb38
[  584.720056] [<ffff000008196f00>] page_frag_alloc+0x150/0x170
[  584.725731] [<ffff000008b771b0>] __netdev_alloc_skb+0xb0/0x138
[  584.731581] [<ffff00000885b264>] rx_submit+0x44/0x260
[  584.736640] [<ffff00000885b704>] rx_complete+0x1fc/0x228
[  584.741964] [<ffff00000886aea0>] __usb_hcd_giveback_urb+0x60/0xf0
[  584.748065] [<ffff00000886b0e8>] usb_hcd_giveback_urb+0xe8/0x120
[  584.754079] [<ffff0000088c8f5c>] xhci_giveback_urb_in_irq.isra.25+0x84/0xb0
[  584.761043] [<ffff0000088c9160>] xhci_td_cleanup+0xd0/0x118
[  584.766624] [<ffff0000088ccd90>] finish_td.isra.46+0xe8/0x120
[  584.772374] [<ffff0000088cd278>] xhci_irq+0x4b0/0x1528
[  584.777517] [<ffff00000886abbc>] usb_hcd_irq+0x2c/0x48
[  584.782661] [<ffff0000088ab908>] cdns3_host_irq+0x28/0x40
[  584.788063] [<ffff0000088a6370>] cdns3_irq+0xa0/0xa8
[  584.793035] [<ffff00000811ca74>] __handle_irq_event_percpu+0x5c/0x148
[  584.799488] [<ffff00000811cb7c>] handle_irq_event_percpu+0x1c/0x58
[  584.805681] [<ffff00000811cc00>] handle_irq_event+0x48/0x78
[  584.811262] [<ffff000008120a00>] handle_fasteoi_irq+0xa8/0x180
[  584.817108] [<ffff00000811bb94>] generic_handle_irq+0x24/0x38
[  584.822867] [<ffff00000811c214>] __handle_domain_irq+0x5c/0xb8
[  584.828703] [<ffff000008081960>] gic_handle_irq+0x78/0x174
[  584.834193] Exception stack(0xffff00000800bc50 to 0xffff00000800bd90)
[  584.840640] bc40:                                   0000000000000102 ffff800078392880
[  584.848474] bc60: 0000000000000000 0000000000000000 000000007459557a 0000000064172a21
[  584.856312] bc80: 0000000000000001 00000000c90116ac 0000000000000008 0000000000000000
[  584.864146] bca0: 0000000000004788 ffff800075e75c00 ffff00000800bd30 0000000000000004
[  584.871986] bcc0: 0000000000000000 0000000000000018 0000000000000000 0000ffffac47abc0
[  584.879828] bce0: 0000ffffac528a70 ffff80007df6ee80 ffff0000094bbe80 ffff80007df6efcc
[  584.887666] bd00: ffff0000094bb000 0000000000000140 ffff00000800be24 ffff80006799ec00
[  584.895508] bd20: ffff80007df6efb8 000000000000c8bd 0000000000000000 ffff00000800bd90
[  584.903347] bd40: ffff000008b87390 ffff00000800bd90 ffff000008b87394 0000000080000145
[  584.911189] bd60: 0000000000000000 0000000000000000 0000ffffffffffff ffff000008b8e290
[  584.919025] bd80: ffff00000800bd90 ffff000008b87394
[  584.923909] [<ffff000008083230>] el1_irq+0xb0/0x124
[  584.928793] [<ffff000008b87394>] enqueue_to_backlog+0x124/0x240
[  584.934718] [<ffff000008b8e2b8>] netif_rx_internal+0x100/0x1a8
[  584.940554] [<ffff000008b8e39c>] netif_rx+0xc/0x18
[  584.945353] [<ffff00000885a5e0>] usbnet_skb_return+0x68/0xb0
[  584.951022] [<ffff00000885b884>] usbnet_bh+0x154/0x238
[  584.956168] [<ffff0000080d3a2c>] tasklet_action+0x6c/0x108
[  584.961664] [<ffff000008081b8c>] __do_softirq+0x12c/0x228
[  584.967069] [<ffff0000080d3540>] irq_exit+0xc8/0x100
[  584.972036] [<ffff00000811c218>] __handle_domain_irq+0x60/0xb8
[  584.977875] [<ffff000008081960>] gic_handle_irq+0x78/0x174
[  584.983364] Exception stack(0xffff000009aebe20 to 0xffff000009aebf60)
[  584.989813] be20: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
[  584.997658] be40: ffff0000094b4388 ffff000009aebf50 0000800074a83000 16ebdf868f31edd4
[  585.005501] be60: 0000000000000002 ffff0000094c8580 0000000000000980 0000000000000001
[  585.013339] be80: 000000737920ae80 0000000000000000 0000000000000000 0000000000000018
[  585.021182] bea0: 0000000000000000 0000ffffac47abc0 0000ffffac528a70 ffff0000094ae018
[  585.029020] bec0: ffff0000094ca000 ffff0000094ca000 ffff0000094b9cc0 ffff0000094ca31c
[  585.036864] bee0: 0000000000000000 0000000000000000 ffff800078392880 0000000000000000
[  585.044701] bf00: 0000000000000000 ffff000009aebf60 ffff00000808581c ffff000009aebf60
[  585.052545] bf20: ffff000008085820 0000000000000145 0000000000000000 0000000000000000
[  585.060382] bf40: ffffffffffffffff ffff000008142a4c ffff000009aebf60 ffff000008085820
[  585.068225] [<ffff000008083230>] el1_irq+0xb0/0x124
[  585.073108] [<ffff000008085820>] arch_cpu_idle+0x10/0x18
[  585.078426] [<ffff00000810d8a0>] do_idle+0x120/0x1e0
[  585.083392] [<ffff00000810dafc>] cpu_startup_entry+0x24/0x28
[  585.089060] [<ffff00000808fac8>] secondary_start_kernel+0x110/0x120
[  585.095329] Mem-Info:
[  585.097616] active_anon:10717 inactive_anon:2112 isolated_anon:0
[  585.097616]  active_file:62 inactive_file:104 isolated_file:0
[  585.097616]  unevictable:729 dirty:0 writeback:0 unstable:0
[  585.097616]  slab_reclaimable:4086 slab_unreclaimable:21377
[  585.097616]  mapped:962 shmem:2310 pagetables:529 bounce:0
[  585.097616]  free:181224 free_pcp:225 free_cma:179237
[  585.131021] Node 0 active_anon:42868kB inactive_anon:8448kB active_file:248kB inactive_file:416kB unevictable:2916kB isolated(anon):0kB isolated(file):0kB mapped:3848kB dirty:0kB writeback:0kB shmem:9240kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 8192kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[  585.158580] Node 0 DMA free:724896kB min:22528kB low:28160kB high:33792kB active_anon:42868kB inactive_anon:8448kB active_file:248kB inactive_file:416kB unevictable:2916kB writepending:0kB present:1949696kB managed:1763028kB mlocked:2916kB kernel_stack:3856kB pagetables:2116kB bounce:0kB free_pcp:900kB local_pcp:656kB free_cma:716948kB
[  585.188325] lowmem_reserve[]: 0 0 0
[  585.191827] Node 0 DMA: 198*4kB (UMEHC) 157*8kB (MEHC) 390*16kB (UMHC) 0*32kB 1*64kB (C) 0*128kB 1*256kB (C) 1*512kB (C) 1*1024kB (C) 1*2048kB (C) 0*4096kB 1*8192kB (C) 1*16384kB (C) 21*32768kB (C) = 724896kB
[  585.210423] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  585.218860] 3112 total pagecache pages
[  585.222617] 0 pages in swap cache
[  585.225936] Swap cache stats: add 0, delete 0, find 0/0
[  585.231164] Free swap  = 0kB
[  585.234046] Total swap = 0kB
[  585.236924] 487424 pages RAM
[  585.239808] 0 pages HighMem/MovableOnly
[  585.243648] 46667 pages reserved
[  585.246877] 245760 pages cma reserved
[  585.250583] swapper/1: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC), nodemask=(null)
[  585.259737] swapper/1 cpuset=/ mems_allowed=0
[  585.264114] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O    4.14.98+g58921c6 #1
[  585.272124] Hardware name: Freescale i.MX8QXP MEK (DT)
[  585.277260] Call trace:
[  585.279714] [<ffff000008089c48>] dump_backtrace+0x0/0x3c8
[  585.285117] [<ffff00000808a024>] show_stack+0x14/0x20
[  585.290175] [<ffff000008da41c0>] dump_stack+0x9c/0xbc
[  585.295232] [<ffff000008196178>] warn_alloc+0xe8/0x180
[  585.300379] [<ffff000008196d60>] __alloc_pages_nodemask+0xae8/0xb38
[  5]   4.00-5.01   sec  0.00 Bytes  0.00 bits/sec  0.006 ms  0/0 (0%)  
[  585.306653] [<ffff000008196f00>] page_frag_alloc+0x150/0x170
[  585.317956] [<ffff000008b771b0>] __netdev_alloc_skb+0xb0/0x138

 

Thanks.

 

 

Sean_Lin_Askey
Contributor I

An update with more information:

I dumped /proc/pagetypeinfo while testing at 1.1 Gbps throughput; the output is below.

The "Unmovable" migrate type appears to be out of pages.

 

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10     11     12     13
Node    0, zone      DMA, type    Unmovable      0      0      1      1      1      3      4      1      1      0      0      0      0      0
Node    0, zone      DMA, type      Movable     55     31      7      4      1      1      0      0      1      0      1      1      0     10
Node    0, zone      DMA, type  Reclaimable      3      6      0      1      1      1      1      0      1      1      1      1      0      0
Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type          CMA      1     40      1      0      1      0      1      1      1      1      0      1      1     21
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
Node 0, zone      DMA          112          344           16            0          480            0
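
To turn the per-order counts above into total free pages per migrate type, each count at order n can be weighted by 2^n pages. The following is only a sketch; the function name `free_pages_by_type` is made up for illustration and the sample text is copied from the dump above.

```python
SAMPLE = """\
Node 0, zone DMA, type Unmovable 0 0 1 1 1 3 4 1 1 0 0 0 0 0
Node 0, zone DMA, type Movable 55 31 7 4 1 1 0 0 1 0 1 1 0 10
Node 0, zone DMA, type Reclaimable 3 6 0 1 1 1 1 0 1 1 1 1 0 0
Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type CMA 1 40 1 0 1 0 1 1 1 1 0 1 1 21
Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 0 0 0
"""


def free_pages_by_type(text):
    """Sum free pages per migrate type from /proc/pagetypeinfo-style rows."""
    totals = {}
    for line in text.splitlines():
        if ", type" not in line:
            continue  # skip the header and the "Number of blocks" section
        fields = line.split("type", 1)[1].split()
        name, counts = fields[0], [int(c) for c in fields[1:]]
        # A free block at order n covers 2**n pages.
        totals[name] = sum(count << order for order, count in enumerate(counts))
    return totals


if __name__ == "__main__":
    for name, pages in free_pages_by_type(SAMPLE).items():
        print(f"{name:12s} {pages:8d} pages")
```

For this dump the Unmovable type has 764 free pages in total (about 3 MB with 4 kB pages), none of them at order 0 or 1, while the CMA type has 179237 free pages, which matches the free_cma value in the first failure log.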

 

 

Sean_Lin_Askey
Contributor I

4> Could the customer try to reproduce this issue on an 8QXP MEK board, with or without their 5G module? For example, without the 5G module, connected directly to a PC over a USB network?

I don't have an 8QXP MEK board, so I tried a USB network on our product.

systemctl disable irqbalanced.service
echo 8 > /sys/class/net/usb0/queues/rx-0/rps_cpus
echo 4 > /sys/class/net/usb0/queues/tx-0/xps_cpus
echo 2 > /proc/irq/50/smp_affinity # IMX8-WU 267 Edge      5b0d0000.usb

The result: no OOM occurred.
My PC could not send an iperf stream above 1 Gbps, so I reduced the UDP packet size to 128 bytes in this case. (A short-packet test with the 5G module can make the system OOM.)


5> On their board, is there an Ethernet port? If so, run iperf only on the Ethernet port. What is the result?

systemctl disable irqbalanced.service
echo 8 > /sys/class/net/usb0/queues/rx-0/rps_cpus
echo 4 > /sys/class/net/usb0/queues/tx-0/xps_cpus
echo 2 > /proc/irq/112/smp_affinity # IMX8-WU 262 Edge 5b050000.ethernet

Same as question 4. The Ethernet port test is also fine with a short-packet stream.

Sean_Lin_Askey
Contributor I

Dear NXP support,

1> What is the test result when the commands below are not applied?

echo 8 > /sys/class/net/usb0/queues/rx-0/rps_cpus
echo 4 > /sys/class/net/usb0/queues/tx-0/xps_cpus
echo 2 > /proc/irq/52/smp_affinity # IMX8-WU 271 Edge 5b110000.usb3

No OOM occurred, but throughput is only about 450 Mbps.

2> Please confirm whether their CMA pool is almost half the size of total memory. Try reducing the CMA pool size and check whether the issue reproduces less often.

=> A stupid question: how do I configure the CMA pool size? By modifying CONFIG_CMA_SIZE_MBYTES in menuconfig?
The current CMA pool size and memory information are below:
root@ctx0800-c0:~# cat /proc/meminfo | grep Cma
CmaTotal: 983040 kB
CmaFree: 716948 kB
root@ctx0800-c0:~# free -ht
              total        used        free      shared  buff/cache   available
Mem:           1.7G        338M        1.3G         17M         38M        1.3G
Swap:            0B          0B          0B
Total:         1.7G        338M        1.3G

 

In this case, free contiguous memory drops suddenly and the i.MX8 cannot allocate new memory with GFP_ATOMIC.

root@ctx0800-c0:~# cat /proc/buddyinfo
Node 0, zone DMA 169 129 52 52 38 31 28 12 10 5 3 2 3 35
root@ctx0800-c0:~# iperf3 -c 192.168.2.230 -R -u -b 1100M -t 10&
[ 5] 0.00-1.23 sec 14.9 MBytes 102 Mbits/sec 34.335 ms 34640/45402 (76%)
Node 0, zone DMA 396 148 25 13 6 2 2 3 2 0 2 3 4 32
root@ctx0800-c0:~# [ 5] 1.23-2.00 sec 18.0 MBytes 195 Mbits/sec 0.017 ms 56931/69931 (81%)
Node 0, zone DMA 349 144 24 13 5 2 2 3 2 0 3 3 3 29
root@ctx0800-c0:~# [ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0.017 ms 0/0 (0%)
Node 0, zone DMA 348 145 23 14 6 3 3 4 2 0 2 2 4 24
root@ctx0800-c0:~# [ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 0.017 ms 0/0 (0%)
Node 0, zone DMA 329 145 24 13 6 2 3 3 2 1 2 3 1 21
root@ctx0800-c0:~# [ 173.272163] swapper/1: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC), nodemask=(null)
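
To put a number on that drop, each /proc/buddyinfo line can be converted to a total free page count by weighting the count at order n by 2^n. This is only a sketch: the function name `buddy_free_pages` is made up for illustration, the two sample lines are the first and last buddyinfo samples from the capture above, and 4 kB pages are assumed.

```python
def buddy_free_pages(line):
    """Total free pages described by one /proc/buddyinfo zone line."""
    # Fields after "zone" are the zone name, then one count per order.
    counts = [int(c) for c in line.split("zone", 1)[1].split()[1:]]
    return sum(count << order for order, count in enumerate(counts))


before = "Node 0, zone DMA 169 129 52 52 38 31 28 12 10 5 3 2 3 35"
after = "Node 0, zone DMA 329 145 24 13 6 2 3 3 2 1 2 3 1 21"

if __name__ == "__main__":
    for label, line in [("before test", before), ("last sample", after)]:
        pages = buddy_free_pages(line)
        # Assumes 4 kB pages, as shown by the "*4kB" order-0 entries in the logs.
        print(label, pages, "pages, about", pages * 4 // 1024, "MiB")
```

For these two samples the total goes from 317275 to 186899 free pages, roughly 1239 MiB down to 730 MiB within a few seconds of traffic.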

 

3,4,5> still in progress....

 

Thanks,

Sean

 

Sean_Lin_Askey
Contributor I

Before the OOM happened, available memory while testing at 1.1 Gbps was about 1.3 GB (of 1.7 GB total).

I think the issue is related to the USB3 IRQ, because I assigned it to CPU1 with the command echo 2 > /proc/irq/52/smp_affinity (IMX8-WU 271 Edge 5b110000.usb3)

and the page allocation failure is reported on CPU1.

[ 325.016123] swapper/1: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC), nodemask=(null)
[ 325.025266] swapper/1 cpuset=/ mems_allowed=0
[ 325.029631] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G O 4.14.98+g0a45016 #1
[ 325.037637] Hardware name: Freescale i.MX8QXP MEK (DT)

 

I also tried increasing min_free_kbytes, but that did not stop the issue from happening.

Sean_Lin_Askey
Contributor I

When the memory leak happened, the kernel log dumped the process list, but no process showed high memory usage.

 

Sean_Lin_Askey
Contributor I

Hi Bio,

I re-uploaded the log here. Please check again. 

Thanks,

Sean

Sean_Lin_Askey
Contributor I

Is there anyone who can help?

Sean_Lin_Askey
Contributor I

Full log here

Bio_TICFSL
NXP TechSupport

Hello,

The CMA size can be changed in the dts file:

https://source.codeaurora.org/external/imx/linux-imx/tree/arch/arm64/boot/dts/freescale/fsl-imx8dx.d...

Or the boot parameter "cma=" can be used.

        cma=nn[MG]@[start[MG][-end[MG]]]
                        [ARM,X86,KNL]
                        Sets the size of kernel global memory area for
                        contiguous memory allocations and optionally the
                        placement constraint by the physical address range of
                        memory allocations. A value of 0 disables CMA
                        altogether. For more information, see
                        include/linux/dma-contiguous.h

 

This issue only reproduces after enabling RPS and XPS in the network stack, which means memory is being allocated very frequently.

Also note that the failing allocation uses GFP_ATOMIC. GFP_ATOMIC means the allocation must not sleep.

Also note from the allocation failure log that it is an order-0 allocation with GFP_ATOMIC. The first error log shows:

Node 0 DMA: 181*4kB (UMEC) 60*8kB (UMEC) 484*16kB (UMEC) 2*32kB (U) 1*64kB (C) 0*128kB 1*256kB (C) 1*512kB (C) 1*1024kB (C) 1*2048kB (C) 0*4096kB 1*8192kB (C) 1*16384kB (C) 21*32768kB (C) = 725620kB

As these 181*4kB pages are marked as MIGRATE_CMA, I suspect that GFP_ATOMIC cannot allocate from MIGRATE_CMA. To find the real cause of this issue, you need to debug the function __alloc_pages_nodemask.
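
One way to sanity-check the MIGRATE_CMA idea is to redo the zone watermark arithmetic by hand. The sketch below is a simplified model, not kernel code. It assumes the 4.14 watermark check behaves as in mainline of that era: the min watermark is halved for __GFP_HIGH requests, reduced by a further quarter for atomic requests, and free CMA pages are not counted for requests that cannot use CMA. The function name `atomic_watermark_ok` is made up for illustration, and the input numbers come from the Mem-Info dump in the first post (free:724896kB, free_cma:716948kB, min:22528kB).

```python
def atomic_watermark_ok(free_kb, free_cma_kb, min_kb, lowmem_reserve_kb=0):
    """Approximate the order-0 watermark test for a GFP_ATOMIC request.

    Simplified model under the assumptions stated above; the kernel works
    in pages, but kB gives the same result for these inputs.
    """
    mark = min_kb
    mark -= mark // 2  # assumed: __GFP_HIGH may use half of the min reserve
    mark -= mark // 4  # assumed: atomic callers may dig a quarter deeper
    usable = free_kb - free_cma_kb  # assumed: free CMA pages do not count
    return usable > mark + lowmem_reserve_kb, usable, mark


if __name__ == "__main__":
    ok, usable, mark = atomic_watermark_ok(724896, 716948, 22528)
    print("usable non-CMA free:", usable, "kB")
    print("effective watermark:", mark, "kB")
    print("allocation allowed:", ok)
```

Under these assumptions only 7948 kB of non-CMA memory is free against an effective watermark of 8448 kB, so the request would be refused even though the zone reports more than 700 MB free. This only shows that the numbers in that dump are consistent with the suspicion; it does not prove the root cause.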

 

This issue should not be related to the NXP USB driver code.

Regards

Sean_Lin_Askey
Contributor I

Dear Support,

2> Please confirm whether their CMA pool is almost half the size of total memory. Try reducing the CMA pool size and check whether the issue reproduces less often.

I tried reducing the CMA pool size by modifying arch/arm64/boot/dts/freescale/fsl-imx8dx.dtsi, and also disabling CMA by turning off CONFIG_CMA and CONFIG_DMA_CMA.
The OOM still occurs.

The test command is below. This time I only configured xps_cpus.

echo f > /sys/class/net/usb0/queues/tx-0/xps_cpus

 

================

We also asked ALPS, the vendor of our 5G module. They think the xhci driver may be consuming i.MX8 memory until the OOM, and that some limit is needed to reduce memory consumption.

Is any additional patch required for the i.MX8, or do you have any suggestion? We suppose a patch like the one below will be necessary to fix the OOM issue.

Subject: [PATCH 1/2] xhci: make sure TRB is fully written before giving it to the controller
https://lore.kernel.org/all/20210115161907.2875631-2-mathias.nyman@linux.intel.com/

 

Thanks,

Sean

Bio_TICFSL
NXP TechSupport

Hello,

As previously stated, the failure is that GFP_ATOMIC memory cannot be allocated in this iperf use case. I also tried running this use case on an MEK board (limited to 2 GB of memory, like the customer board), but could not reproduce the issue.

This issue is related to the common MM subsystem code of the Linux kernel, and I have already shared the related debug suggestions.

regards

Bio_TICFSL
NXP TechSupport

Hello Sean,

The file you attached does not seem to be a text log file. It appears to be an HTML file: when opened in a web browser, it shows a Jira login page. Please share the log file again.

 

Regards

Bio_TICFSL
NXP TechSupport

Hello,

This issue is more likely related to memory.

1> What is the test result when the commands below are not applied?

echo 8 > /sys/class/net/usb0/queues/rx-0/rps_cpus
echo 4 > /sys/class/net/usb0/queues/tx-0/xps_cpus
echo 2 > /proc/irq/52/smp_affinity # IMX8-WU 271 Edge 5b110000.usb3

2> Please confirm whether their CMA pool is almost half the size of total memory. Try reducing the CMA pool size and check whether the issue reproduces less often.

3> Run the kernel memory leak detector on this case.

4> Could the customer try to reproduce this issue on an 8QXP MEK board, with or without their 5G module? For example, without the 5G module, connected directly to a PC over a USB network?

5> On their board, is there an Ethernet port? If so, run iperf only on the Ethernet port. What is the result?

 

Regards

Bio_TICFSL
NXP TechSupport

Hi,

Did you get the test results below from the customer?

3> Run the kernel memory leak detector on this case.

4> Could you try to reproduce this issue on an 8QXP MEK board, with or without the 5G module? For example, without the 5G module, connected directly to a PC over a USB network?

5> On your board, is there an Ethernet port? If so, run iperf only on the Ethernet port. What is the result?

 

The purpose of the reduced CMA size test is to check whether the issue reproduces less often; it does not mean the issue will go away.

As previously suggested, you should debug the function __alloc_pages_nodemask to find out why the issue occurs: is it caused by too many allocations from the USB code, or from the network stack?

If there is no memory leak, I suggest deciding, based on the findings from debugging __alloc_pages_nodemask, whether to optimize the network stack's memory usage or some other part.

At present we cannot say the issue is related to the xhci driver. Can you prove it? I do not think you have given enough information on this.

I do not think this link will help: https://lore.kernel.org/all/20210115161907.2875631-2-mathias.nyman@linux.intel.com/

 

Regards

Sean_Lin_Askey
Contributor I

Dear Support,

 

3> Run the kernel memory leak detector on this case.

Sean:
When the OOM occurs, the console always gets stuck, so I could not dump information from /sys/kernel/debug/kmemleak after the OOM.
However, I captured a log from a lower-throughput test. Please check the attachment.

4> Could you try to reproduce this issue on an 8QXP MEK board, with or without the 5G module? For example, without the 5G module, connected directly to a PC over a USB network?

Sean:
I don't have an 8QXP MEK board. I need to contact another department to reproduce this issue.

5> On your board, is there an Ethernet port? If so, run iperf only on the Ethernet port. What is the result?

Sean:
I've tried 3 different interfaces on our board. The issue only happens on usb3.

IMX8-WU 271 Edge 5b110000.usb3 // for 5G module
=> 5G network: rate drops to 0 Mbps and OOM
=> 4G network: rate drops to 0 Mbps but no OOM
IMX8-WU 267 Edge 5b0d0000.usb // USB port CN2302 (USB dongle)
=> OK
IMX8-WU 262 Edge 5b050000.ethernet // Ethernet port
=> OK

The purpose of the reduced CMA size test is to check whether the issue reproduces less often; it does not mean the issue will go away.

Sean:
The issue reproduces at the same rate when the CMA pool size is reduced or CMA is disabled.


As previously suggested, you should debug the function __alloc_pages_nodemask to find out why the issue occurs: is it caused by too many allocations from the USB code, or from the network stack?
If there is no memory leak, I suggest deciding, based on the findings from debugging __alloc_pages_nodemask, whether to optimize the network stack's memory usage or some other part.
At present we cannot say the issue is related to the xhci driver. Can you prove it? I do not think you have given enough information on this.
I do not think this link will help: https://lore.kernel.org/all/20210115161907.2875631-2-mathias.nyman@linux.intel.com/

Sean:
I am not an expert on the Linux kernel, but in the call trace log I see that __alloc_pages_nodemask is reached while handling the xhci IRQ every time.
The call trace is the same every time the OOM happens.
Does this information prove that xhci causes this issue, or do you have any suggestions for debugging __alloc_pages_nodemask?

[ 187.104601] Call trace:
[ 187.107048] [<ffff000008089ed8>] dump_backtrace+0x0/0x3c8
[ 187.112452] [<ffff00000808a2b4>] show_stack+0x14/0x20
[ 187.117507] [<ffff000008da6940>] dump_stack+0x9c/0xbc
[ 187.122562] [<ffff000008196458>] warn_alloc+0xe8/0x180
[ 187.127706] [<ffff000008197040>] __alloc_pages_nodemask+0xae8/0xb38
[ 187.133980] [<ffff0000081971e0>] page_frag_alloc+0x150/0x170
[ 187.139642] [<ffff000008b7a918>] __netdev_alloc_skb+0xb0/0x138
[ 187.145479] [<ffff00000885d9cc>] rx_submit+0x44/0x260
[ 187.150535] [<ffff00000885de6c>] rx_complete+0x1fc/0x228
[ 187.155854] [<ffff00000886d608>] __usb_hcd_giveback_urb+0x60/0xf0
[ 187.161953] [<ffff00000886d850>] usb_hcd_giveback_urb+0xe8/0x120
[ 187.167964] [<ffff0000088cc6c4>] xhci_giveback_urb_in_irq.isra.25+0x84/0xb0
[ 187.174931] [<ffff0000088cc8c8>] xhci_td_cleanup+0xd0/0x118
[ 187.180507] [<ffff0000088d04f8>] finish_td.isra.46+0xe8/0x120
[ 187.186258] [<ffff0000088d09e0>] xhci_irq+0x4b0/0x1528
[ 187.191401] [<ffff00000886d324>] usb_hcd_irq+0x2c/0x48
[ 187.196545] [<ffff0000088ae070>] cdns3_host_irq+0x28/0x40
[ 187.201948] [<ffff0000088a8ad8>] cdns3_irq+0xa0/0xa8
[ 187.206918] [<ffff00000811ca74>] __handle_irq_event_percpu+0x5c/0x148
[ 187.213364] [<ffff00000811cb7c>] handle_irq_event_percpu+0x1c/0x58
[ 187.219549] [<ffff00000811cc00>] handle_irq_event+0x48/0x78
[ 187.225127] [<ffff000008120a00>] handle_fasteoi_irq+0xa8/0x180
[ 187.230964] [<ffff00000811bb94>] generic_handle_irq+0x24/0x38
[ 187.236716] [<ffff00000811c214>] __handle_domain_irq+0x5c/0xb8
[ 187.242553] [<ffff000008081960>] gic_handle_irq+0x78/0x174

Many thanks,

Sean

Bio_TICFSL
NXP TechSupport

Hi, 

My understanding: the i.MX8QXP is the USB host and iperf client, the 5G module is the USB device and iperf host (not sure), and there are some special network stack changes to make the iperf result reach 1.1 Gbps. Is that right?

As the customer has no MEK board, I suggest connecting the board to a Linux PC and testing under the conditions above (I suggest using NCM; if they are using ECM, please share how ECM is used). If the issue reproduces, share the detailed steps. Or could two of the customer's i.MX8QXP boards be connected to each other?

I have started trying to reproduce this on an i.MX8QXP MEK but have not reproduced it yet. I need the customer to provide the reproduction steps.

The call stack contains some xhci functions, but that does not mean the issue is caused by the xhci driver.

The call stack below shows: in xhci IRQ context, a URB finishes its transfer and is returned to the device driver (drivers/net/usb/usbnet.c) through its rx_complete callback. The driver then enters the network stack to allocate memory, and that allocation fails.

warn_alloc+0xe8/0x180
__alloc_pages_nodemask+0xae8/0xb38

page_frag_alloc+0x150/0x170
__netdev_alloc_skb+0xb0/0x138   // this code is from the network stack

rx_submit+0x44/0x260            // these two functions are from drivers/net/usb/usbnet.c
rx_complete+0x1fc/0x228

__usb_hcd_giveback_urb+0x60/0xf0
usb_hcd_giveback_urb+0xe8/0x120
xhci_giveback_urb_in_irq.isra.25+0x84/0xb0
xhci_td_cleanup+0xd0/0x118
finish_td.isra.46+0xe8/0x120
xhci_irq+0x4b0/0x1528
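
The same layering can be read out of a raw dmesg trace mechanically. The sketch below labels each frame by the subsystem described above; the names `FRAME`, `LAYERS`, `layer_of` and `annotate` are made up for illustration, and the prefix table is a hand-written assumption covering only the functions seen in this thread.

```python
import re

# Matches "function+0xoffset/0xsize" in a kernel call trace line.
FRAME = re.compile(r"([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Hand-written prefix table for the frames in this thread (an assumption).
LAYERS = [
    (("warn_alloc", "__alloc_pages", "page_frag_alloc"), "page allocator"),
    (("__netdev_alloc_skb",), "network stack"),
    (("rx_submit", "rx_complete"), "usbnet"),
    (("usb_hcd_", "__usb_hcd_"), "USB core"),
    (("xhci_", "finish_td"), "xhci"),
]

TRACE = """\
[ 187.127706] [<ffff000008197040>] __alloc_pages_nodemask+0xae8/0xb38
[ 187.133980] [<ffff0000081971e0>] page_frag_alloc+0x150/0x170
[ 187.139642] [<ffff000008b7a918>] __netdev_alloc_skb+0xb0/0x138
[ 187.145479] [<ffff00000885d9cc>] rx_submit+0x44/0x260
[ 187.161953] [<ffff00000886d850>] usb_hcd_giveback_urb+0xe8/0x120
[ 187.167964] [<ffff0000088cc6c4>] xhci_giveback_urb_in_irq.isra.25+0x84/0xb0
"""


def layer_of(func):
    """Return the layer label for a function name, or "other"."""
    for prefixes, layer in LAYERS:
        if func.startswith(prefixes):
            return layer
    return "other"


def annotate(trace):
    """Return (function, layer) pairs in the order they appear in the trace."""
    return [(m.group(1), layer_of(m.group(1))) for m in FRAME.finditer(trace)]


if __name__ == "__main__":
    for func, layer in annotate(TRACE):
        print(f"{layer:15s} {func}")
```

Read from the bottom up, the frames go from xhci to the USB core, then usbnet, then the network stack, and only then into the page allocator, which is where the failure is reported.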

 

The log says "page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)". Note that it is a GFP_ATOMIC allocation of only one page, so the customer should add debug logging to the function __alloc_pages_slowpath to find out why the allocation fails.

Regards

Sean_Lin_Askey
Contributor I

Dear Support,

I added some logging in __alloc_pages_slowpath and found that the failure comes from this part, which jumps to nopage and then calls warn_alloc.

/* Caller is not willing to reclaim, we can't balance anything */
if (!can_direct_reclaim)
	goto nopage;

The kernel cannot directly reclaim pages for this allocation when it runs out of pages.
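
This branch is expected for an atomic request: the mode value in the log does not carry the direct-reclaim flag, so the allocator may wake kswapd but cannot wait for pages to be freed. The sketch below decodes the mode from the log. The bit values are quoted from memory of include/linux/gfp.h in 4.14 and are an assumption to check against your own tree, since they change between kernel versions; the names `GFP_BITS_4_14`, `decode_gfp` and `can_direct_reclaim` are made up for illustration.

```python
# Assumed 4.14 bit values; verify against include/linux/gfp.h in your tree.
GFP_BITS_4_14 = {
    0x20: "__GFP_HIGH",
    0x80000: "__GFP_ATOMIC",
    0x400000: "__GFP_DIRECT_RECLAIM",
    0x1000000: "__GFP_KSWAPD_RECLAIM",
}


def decode_gfp(mode):
    """Return (known flag names, leftover bits) for a gfp mode value."""
    names = [name for bit, name in GFP_BITS_4_14.items() if mode & bit]
    unknown = mode & ~sum(GFP_BITS_4_14)
    return names, unknown


def can_direct_reclaim(mode):
    """Mirror of the check quoted above: is the direct-reclaim bit set?"""
    return bool(mode & 0x400000)


if __name__ == "__main__":
    mode = 0x1080020  # from "mode:0x1080020(GFP_ATOMIC)" in the log
    names, unknown = decode_gfp(mode)
    print("flags:", " | ".join(names), "unknown bits:", hex(unknown))
    print("can_direct_reclaim:", can_direct_reclaim(mode))
```

So reaching `goto nopage` says only that the request was atomic and no page was available at that moment; it does not by itself explain why free non-CMA memory ran out.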

 

Thanks,

Sean
