fec skb page allocation failure

brendanpeter
Contributor III

I have i.MX6 Solo and i.MX6 DualLite boards running the 4.1.15 kernel built with Yocto 2.0.1. After about 2 days of uptime I see a series of skb page allocation failures from the FEC driver. A Google search turns up some issues from a few years ago that look similar:

Re: [Question] page allocation failure — Linux Memory Management 

ENGR00277698 net:fec: avoid kernel dump for skb page allocation fail (20a0a6b2) · Commits · ARM / qm... 

but nothing more recent. Are there any more recent driver changes that have addressed this issue?

Here's a sample page allocation failure:

2017-03-05T16:35:55.617405-08:00 kernel: swapper/0: page allocation failure: order:0, mode:0x20
2017-03-05T16:35:55.624786-08:00 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O    4.1.15-1.1.1+gd5d7c02 #1
2017-03-05T16:35:55.624853-08:00 kernel: Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
2017-03-05T16:35:55.624872-08:00 kernel: [<80015b24>] (unwind_backtrace) from [<800126bc>] (show_stack+0x10/0x14)
2017-03-05T16:35:55.624886-08:00 kernel: [<800126bc>] (show_stack) from [<804de6e0>] (dump_stack+0x84/0xc4)
2017-03-05T16:35:55.624900-08:00 kernel: [<804de6e0>] (dump_stack) from [<800abf8c>] (warn_alloc_failed+0xe4/0x120)
2017-03-05T16:35:55.624925-08:00 kernel: [<800abf8c>] (warn_alloc_failed) from [<800ae7e8>] (__alloc_pages_nodemask+0x504/0x8bc)
2017-03-05T16:35:55.624942-08:00 kernel: [<800ae7e8>] (__alloc_pages_nodemask) from [<803bd390>] (__alloc_page_frag+0x13c/0x15c)
2017-03-05T16:35:55.624957-08:00 kernel: [<803bd390>] (__alloc_page_frag) from [<803c2bbc>] (__alloc_rx_skb+0x58/0xe4)
2017-03-05T16:35:55.624970-08:00 kernel: [<803c2bbc>] (__alloc_rx_skb) from [<803c2c64>] (__netdev_alloc_skb+0x1c/0x44)
2017-03-05T16:35:55.624984-08:00 kernel: [<803c2c64>] (__netdev_alloc_skb) from [<802b7388>] (fec_enet_rx_napi+0x4e8/0xc88)
2017-03-05T16:35:55.624997-08:00 kernel: [<802b7388>] (fec_enet_rx_napi) from [<803cd6d4>] (net_rx_action+0x1d8/0x2d4)
2017-03-05T16:35:55.625011-08:00 kernel: [<803cd6d4>] (net_rx_action) from [<8002e5d4>] (__do_softirq+0x120/0x238)
2017-03-05T16:35:55.625024-08:00 kernel: [<8002e5d4>] (__do_softirq) from [<8002e9b4>] (irq_exit+0xc0/0xfc)
2017-03-05T16:35:55.625042-08:00 kernel: [<8002e9b4>] (irq_exit) from [<80062be8>] (__handle_domain_irq+0x80/0xec)
2017-03-05T16:35:55.625059-08:00 kernel: [<80062be8>] (__handle_domain_irq) from [<800093c0>] (gic_handle_irq+0x24/0x5c)
2017-03-05T16:35:55.625072-08:00 kernel: [<800093c0>] (gic_handle_irq) from [<800131c0>] (__irq_svc+0x40/0x74)
2017-03-05T16:35:55.625085-08:00 kernel: Exception stack(0x80691f18 to 0x80691f60)
2017-03-05T16:35:55.625098-08:00 kernel: 1f00:                                                       80691f60 fffffff7
2017-03-05T16:35:55.625111-08:00 kernel: 1f20: f9433aad 00009922 9fb21e90 00000000 f8fbb1eb 00009922 f9433aad 00009922
2017-03-05T16:35:55.625123-08:00 kernel: 1f40: 00000001 00000000 00000017 80691f60 a6ae4b18 8031375c 800f0013 ffffffff
2017-03-05T16:35:55.625136-08:00 kernel: [<800131c0>] (__irq_svc) from [<8031375c>] (cpuidle_enter_state+0xd8/0x20c)
2017-03-05T16:35:55.625156-08:00 kernel: [<8031375c>] (cpuidle_enter_state) from [<8005ab64>] (cpu_startup_entry+0x1fc/0x320)
2017-03-05T16:35:55.625176-08:00 kernel: [<8005ab64>] (cpu_startup_entry) from [<80652c68>] (start_kernel+0x398/0x3a4)
2017-03-05T16:35:55.625190-08:00 kernel: Mem-Info:
2017-03-05T16:35:55.625204-08:00 kernel: active_anon:9598 inactive_anon:51 isolated_anon:0
2017-03-05T16:35:55.625217-08:00 kernel: active_file:713 inactive_file:719 isolated_file:0
2017-03-05T16:35:55.625231-08:00 kernel: unevictable:0 dirty:0 writeback:0 unstable:0
2017-03-05T16:35:55.625244-08:00 kernel: slab_reclaimable:324 slab_unreclaimable:28370
2017-03-05T16:35:55.625256-08:00 kernel: mapped:1418 shmem:175 pagetables:395 bounce:0
2017-03-05T16:35:55.625273-08:00 kernel: free:39329 free_pcp:87 free_cma:38164
2017-03-05T16:35:55.625303-08:00 kernel: Normal free:157316kB min:2860kB low:3572kB high:4288kB active_anon:38392kB inactive_anon:204kB active_file:2852kB inactive_file:2876kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:524288kB managed:511812kB mlocked:0kB dirty:0kB writeback:0kB mapped:5672kB shmem:700kB slab_reclaimable:1296kB slab_unreclaimable:113480kB kernel_stack:848kB pagetables:1580kB unstable:0kB bounce:0kB free_pcp:348kB local_pcp:100kB free_cma:152656kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
2017-03-05T16:35:55.625332-08:00 kernel: lowmem_reserve[]: 0 0
2017-03-05T16:35:55.625355-08:00 kernel: Normal: 744*4kB (EMRC) 171*8kB (UMRC) 51*16kB (UMRC) 18*32kB (UC) 3*64kB (C) 5*128kB (C) 1*256kB (C) 0*512kB 1*1024kB (C) 1*2048kB (C) 0*4096kB 0*8192kB 1*16384kB (C) 4*32768kB (C) = 157352kB
2017-03-05T16:35:55.625374-08:00 kernel: 1631 total pagecache pages
2017-03-05T16:35:55.625392-08:00 kernel: 0 pages in swap cache
2017-03-05T16:35:55.625409-08:00 kernel: Swap cache stats: add 0, delete 0, find 0/0
2017-03-05T16:35:55.625427-08:00 kernel: Free swap  = 0kB
2017-03-05T16:35:55.625452-08:00 kernel: Total swap = 0kB
2017-03-05T16:35:55.625473-08:00 kernel: 131072 pages RAM
2017-03-05T16:35:55.625490-08:00 kernel: 0 pages HighMem/MovableOnly
2017-03-05T16:35:55.625509-08:00 kernel: 4294888495 pages reserved
2017-03-05T16:35:55.625526-08:00 kernel: 81920 pages cma reserved
david_wolfe
NXP Employee

The root cause of a failure to allocate SKBs must be a lack of free pages in free_page_list. Please raise min_free_kbytes, which sets the page reclaim threshold. The default on my system is 30 MiB. I recommend setting it to about 60 MiB to start. This will trigger kswapd to free up memory earlier and, hopefully, avoid the failure to allocate memory.


echo 60000 > /proc/sys/vm/min_free_kbytes
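
A note on units, with a small sketch. min_free_kbytes is expressed in kB (1024-byte units), so the 60000 written above is slightly under 60 MiB; an exact 60 MiB would be 61440. The Python helpers below are illustrative only: the function names are hypothetical and not part of any kernel tooling. They read the current value so it can be restored later, and write a new one. Writing the real /proc path needs root, and the change does not persist across reboots.

```python
from pathlib import Path

# Real sysctl file on Linux; the helpers take a path so they can be tried on a copy.
MIN_FREE_PATH = Path("/proc/sys/vm/min_free_kbytes")


def mib_to_kb(mib: int) -> int:
    """Convert MiB to the kB (1024-byte) units used by min_free_kbytes."""
    return mib * 1024


def read_min_free_kb(path: Path = MIN_FREE_PATH) -> int:
    """Return the current min_free_kbytes value as an integer."""
    return int(path.read_text().strip())


def write_min_free_kb(kb: int, path: Path = MIN_FREE_PATH) -> int:
    """Write a new value and return the previous one (root needed for /proc)."""
    previous = read_min_free_kb(path)
    path.write_text(f"{kb}\n")
    return previous
```

Run as root, `previous = write_min_free_kb(mib_to_kb(60))` would raise the threshold to 61440 kB and keep the old value for comparison or rollback.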


Conversely, I set my own min_free_kbytes to ridiculously low levels and have not yet reproduced the issue.

Update: Setting min_free_kbytes to 500 reproduces the netdev_alloc_skb() failure within about an hour, lending confidence to this hypothesis.


--
David
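
One detail in the Mem-Info dump from the original post can be checked against this hypothesis. The failing allocation is order:0 with mode:0x20, an atomic allocation made from softirq context, and the sketch below assumes (as is generally the case for unmovable kernel allocations) that such requests are not served from CMA pageblocks. Under that assumption, the useful figure is free minus free_cma rather than free alone. The Python below does that subtraction for the Normal zone line; the helper names are made up for illustration and the input is an abbreviated copy of the logged line.

```python
import re

# Abbreviated copy of the "Normal" zone line from the posted Mem-Info dump.
ZONE_LINE = (
    "Normal free:157316kB min:2860kB low:3572kB high:4288kB "
    "free_pcp:348kB local_pcp:100kB free_cma:152656kB"
)


def parse_zone_line(line: str) -> dict:
    """Collect the 'name:NNNkB' fields of a zone line as {name: kB}."""
    return {name: int(kb) for name, kb in re.findall(r"\b(\w+):(\d+)kB", line)}


def non_cma_free_kb(fields: dict) -> int:
    """Free memory outside CMA (assumption: atomic skb pages cannot use CMA)."""
    return fields["free"] - fields.get("free_cma", 0)


fields = parse_zone_line(ZONE_LINE)
print(f"non-CMA free: {non_cma_free_kb(fields)} kB, "
      f"min watermark: {fields['min']} kB")
# prints: non-CMA free: 4660 kB, min watermark: 2860 kB
```

On the posted numbers this leaves 4660 kB free outside CMA against a min watermark of 2860 kB, so the zone was much closer to its reserve than the 157316 kB headline figure suggests.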

david_wolfe
NXP Employee

Brendan,


The author of commit 20a0a6b2 (on branch imx_3.10.17_1.0.2_ga) admits no root cause was found and the change just masks the problem by preventing the kernel dump. The SKB page allocation still occasionally fails regardless.


It is difficult to trace specific changes in the FEC driver from 3.10.17 to 4.1.15 because there is no direct patchset. From inspecting the 4.1.15 fec_main.c, it appears that the FEC driver was significantly reorganized and that commit 20a0a6b2 was not carried forward.

I'm having difficulty reproducing the kernel dump in 4.1.15_2.0.1 on an i.MX 6DL SabreSD board. I'm mounting an NFSv3 file system and reading files. I set mem=500M on the kernel command line as suggested in the URLs you provided. But I've been running this for less than a day. Have you (or anyone? anyone?) discovered a faster way to reproduce?


You could apply the equivalent of 20a0a6b2 in 4.1.15 around the netdev_alloc_skb() call in fec_enet_rx_queue(), where much of the fec_enet_rx() functionality was moved. But first I would ask: do you observe any other ill effects besides messages in the kernel log?

--
David

brendanpeter
Contributor III

Thanks for the reply. I have not attempted to port the 20a0a6b2 changes since, as you state, the change just masks the problem.

I've found it is not 100% reproducible, and I'm still working on a quicker method of reproducing it. I have found that when this occurs, Linux has become fairly unresponsive and simple operations take multiple seconds to complete. I have not been able to characterize it fully since it can be difficult to reproduce.

If no one else has experienced this problem then it may be something specific to my configuration.
