imx8qmmek + OpenCL linux kernel panics

lavusedu
Contributor II
Hello, I've been doing research with OpenCL on the imx8qmmek board and I keep running into Linux kernel panics when I run intensive work through OpenCL.

My setup is not standard, but I can't trace why the panics are happening, and I was wondering whether this issue might already be known.

I'm building the hardknott BitBake image, which gives me the 5.10.35+git0+ef3f2cfc60-r0 kernel image. My build steps are:

repo init -u https://source.codeaurora.org/external/imx/imx-manifest -b imx-linux-hardknott -m imx-5.10.35-2.0.0.xml && repo sync
DISTRO=fsl-imx-wayland MACHINE=imx8qmmek source imx-setup-release.sh -b build
# add my layer
bitbake imx-image-multimedia

My layer, added in between, does the following:
* Builds the standard opencl-headers, opencl-icd-loader and pocl
* Updates imx-gpu-viv so that it no longer provides opencl-headers and opencl-icd-loader, so the standard ones are used
* Renames libOpenCL.so from imx-gpu-viv to libVivanteOpenCL.so, patches its soname, and uses the 3.0.0 version of the binary blob shipped by NXP instead of 1.2.0. The reason is that the 3.0.0 version is ICD loadable while 1.2.0 is not (the OpenCL features say otherwise, but the 3.0.0 blob actually loads and does the work). I understand that this is unusual, but I needed ICD loadability for my research
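For readers who want to reproduce the rename and ICD registration step by hand, here is a minimal sketch. It is not the actual layer: `register_icd` and `rename_and_patch` are hypothetical helper names, `vivante.icd` is an assumed file name, and the vendors directory is an assumption (the Khronos ICD loader reads `/etc/OpenCL/vendors/*.icd` by default on Linux). The soname change relies on `patchelf --set-soname`, so `patchelf` must be installed for that part.

```shell
#!/bin/sh
# Sketch only: helper names, file names and paths are assumptions,
# not taken from the poster's layer.

# Copy the vendor library under a new name and make its soname match,
# so it no longer claims to be libOpenCL.so. Requires patchelf.
rename_and_patch() {
    src="$1"
    dst="$2"
    cp "$src" "$dst"
    patchelf --set-soname "$(basename "$dst")" "$dst"
}

# Register the renamed library with the ICD loader. An .icd file is a
# plain text file whose single line names (or gives the path of) the
# vendor library.
register_icd() {
    vendors_dir="$1"   # assumed: /etc/OpenCL/vendors on the target
    lib_name="$2"      # assumed: libVivanteOpenCL.so
    mkdir -p "$vendors_dir"
    printf '%s\n' "$lib_name" > "$vendors_dir/vivante.icd"
}

# Demo in a scratch directory so nothing on the host is modified.
demo_dir=$(mktemp -d)
register_icd "$demo_dir/vendors" libVivanteOpenCL.so
cat "$demo_dir/vendors/vivante.icd"
```

On the target, the equivalent would be something like `rename_and_patch /usr/lib/libOpenCL.so /usr/lib/libVivanteOpenCL.so` followed by `register_icd /etc/OpenCL/vendors libVivanteOpenCL.so`, with the paths adjusted to the actual image layout.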

So far this works OK for me. My OpenCL kernels work with both pocl on the CPUs and imx-gpu-viv on the GPUs. However, after running intensive work through OpenCL for some time (around an hour), the Linux kernel panics. The longer each kernel instance runs, the higher the chance of a panic: if I run smaller kernel instances more times the chance is lower, but the panic still happens.

The panics occur in unrelated places, which suggests memory corruption. So far I have logs from three of them:

The first one is:
 
[ 8363.441576] audit: type=1006 audit(1635255001.440:6): pid=5876 uid=0 old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=5 res=1
[ 8836.490505] Unable to handle kernel paging request at virtual address 00000000003cee5b
[ 8836.498424] Mem abort info:
[ 8836.501214] ESR = 0x96000004
[ 8836.504264] EC = 0x25: DABT (current EL), IL = 32 bits
[ 8836.509580] SET = 0, FnV = 0
[ 8836.512628] EA = 0, S1PTW = 0
[ 8836.515766] Data abort info:
[ 8836.518641] ISV = 0, ISS = 0x00000004
[ 8836.522470] CM = 0, WnR = 0
[ 8836.525436] user pgtable: 4k pages, 48-bit VAs, pgdp=000000089ad9a000
[ 8836.531880] [00000000003cee5b] pgd=0000000000000000, p4d=0000000000000000
[ 8836.538676] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 8836.544249] Modules linked in: fsl_jr_uio caam_jr caamkeyblob_desc caamhash_desc caamalg_desc crypto_engine rng_core authenc libdes crct10dif_ce mxc_jpeg_encdec imx8_media_dev(C) caam error fuse
[ 8836.561586] CPU: 5 PID: 6209 Comm: mandelbrot2_mul Tainted: G C 5.10.35-lts-5.10.y+gef3f2cfc6010 #1
[ 8836.571938] Hardware name: Freescale i.MX8QM MEK (DT)
[ 8836.576989] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
[ 8836.583004] pc : update_load_avg+0x24/0x410
[ 8836.587187] lr : task_tick_fair+0x7c/0x310
[ 8836.591284] sp : ffff800011ce3d30
[ 8836.594594] x29: ffff800011ce3d30 x28: ffff0008ff451140
[ 8836.599910] x27: ffff0008ff451180 x26: ffff80001011dae0
[ 8836.605227] x25: 0000000000000000 x24: ffff0008ff454980
[ 8836.610543] x23: ffff800011ad0470 x22: 0000000000000001
[ 8836.615852] x21: ffff000814f51e00 x20: ffff0008142aab80
[ 8836.621169] x19: 00000000003ced2b x18: 0000000000000000
[ 8836.626486] x17: 0000000000000000 x16: 0000000000000000
[ 8836.631802] x15: 0000000000000000 x14: 0000000000000000
[ 8836.637119] x13: 0000000000000000 x12: 00000000000000c9
[ 8836.642436] x11: 0000000000000004 x10: 0000000000000000
[ 8836.647753] x9 : ffff0008ff455480 x8 : ffff0008ff454980
[ 8836.653072] x7 : 00000000000000c9 x6 : 0000000000000000
[ 8836.658386] x5 : 000000000000b747 x4 : 0000000000000000
[ 8836.663703] x3 : 0000000000000000 x2 : 0000000000000001
[ 8836.669020] x1 : ffff0008142aab80 x0 : 00000000003ced2b
[ 8836.674339] Call trace:
[ 8836.676788] update_load_avg+0x24/0x410
[ 8836.680619] task_tick_fair+0x7c/0x310
[ 8836.684366] scheduler_tick+0xa0/0x134
[ 8836.688120] update_process_times+0x8c/0xa0
[ 8836.692307] tick_sched_handle+0x34/0x60
[ 8836.696229] tick_sched_timer+0x4c/0xa4
[ 8836.700062] __hrtimer_run_queues+0x140/0x1e0
[ 8836.704413] hrtimer_interrupt+0xe8/0x2c0
[ 8836.708423] arch_timer_handler_phys+0x38/0x50
[ 8836.712871] handle_percpu_devid_irq+0x84/0x150
[ 8836.717404] __handle_domain_irq+0x7c/0xe0
[ 8836.721511] gic_handle_irq+0xc0/0x140
[ 8836.725258] el0_irq_naked+0x4c/0x54
[ 8836.728840] Code: aa0103f4 a9025bf5 2a0203f6 a90363f7 (f9409800)
[ 8836.734940] ---[ end trace 8161f122673b4249 ]---
[ 8836.739557] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 8836.746435] SMP: stopping secondary CPUs
[ 8836.750358] Kernel Offset: disabled
[ 8836.753848] CPU features: 0x0240022,2100600c
[ 8836.758121] Memory Limit: none
[ 8836.761175] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
 


The second one:
 
[ 3413.211112] Internal error: SP/PC alignment exception: 8a000000 [#1] PREEMPT SMP
[ 3413.218515] Modules linked in: fsl_jr_uio caam_jr caamkeyblob_desc caamhash_desc caamalg_desc crypto_engine rng_core authenc libdes crct10dif_ce mxc_jpeg_encdec imx8_media_dev(C) caam error fuse
[ 3413.235848] CPU: 4 PID: 6033 Comm: mandelbrot2_mul Tainted: G C 5.10.35-lts-5.10.y+gef3f2cfc6010 #1
[ 3413.246200] Hardware name: Freescale i.MX8QM MEK (DT)
[ 3413.251251] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
[ 3413.257262] pc : 0x1
[ 3413.259448] lr : 0x1
[ 3413.261629] sp : ffff800011cdbe20
[ 3413.264939] x29: ffff000811098000 x28: ffff0008ff43a140
[ 3413.270254] x27: ffff0008ff43a180 x26: ffff80001011dae0
[ 3413.275563] x25: 0000000000000002 x24: 0000000000000001
[ 3413.280880] x23: ffff000811098000 x22: 0000000000000004
[ 3413.286195] x21: ffff800011ab8a40 x20: 0000031ab56f42a2
[ 3413.291505] x19: ffff80001ec2beb0 x18: 0000000000000000
[ 3413.296822] x17: 0000000000000000 x16: 0000000000000000
[ 3413.302139] x15: 0000000000000000 x14: 0000000000000000
[ 3413.307458] x13: 0000000000000000 x12: 0000000000000040
[ 3413.312772] x11: ffff0008120fd918 x10: ffff0008120fd91a
[ 3413.318089] x9 : ffff800011b41710 x8 : ffff800011b42000
[ 3413.323405] x7 : ffff8000118ad000 x6 : 000000041e074e84
[ 3413.328722] x5 : 00ffffffffffffff x4 : 0000000000000016
[ 3413.334039] x3 : ffffffffffffffff x2 : 0000000000000000
[ 3413.339356] x1 : 3f242cc4a7829d00 x0 : 0000000000000000
[ 3413.344673] Call trace:
[ 3413.347113] 0x1
[ 3413.348951] Code: bad PC value
[ 3413.352013] ---[ end trace b3274eb17ce33fc7 ]---
[ 3413.356630] Kernel panic - not syncing: SP/PC alignment exception: Fatal exception in interrupt
[ 3413.365332] SMP: stopping secondary CPUs
[ 3414.369255] SMP: failed to stop secondary CPUs 0,4
[ 3414.374049] Kernel Offset: disabled
[ 3414.377539] CPU features: 0x0240022,2100600c
[ 3414.381806] Memory Limit: none
[ 3414.384865] ---[ end Kernel panic - not syncing: SP/PC alignment exception: Fatal exception in interrupt ]---
 


The third one:
 
[ 7401.959259] Unable to handle kernel execute from non-executable memory at virtual address ffff0008159fa000
[ 7401.968921] Mem abort info:
[ 7401.971709] ESR = 0x8600000f
[ 7401.974761] EC = 0x21: IABT (current EL), IL = 32 bits
[ 7401.980075] SET = 0, FnV = 0
[ 7401.983125] EA = 0, S1PTW = 0
[ 7401.986264] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000081bd3000
[ 7401.992967] [ffff0008159fa000] pgd=000000097fff8003, p4d=000000097fff8003, pud=000000097fc89003, pmd=000000097fc5c003, pte=00680008959fa707
[ 7402.005498] Internal error: Oops: 8600000f [#1] PREEMPT SMP
[ 7402.011073] Modules linked in: fsl_jr_uio caam_jr caamkeyblob_desc caamhash_desc caamalg_desc crypto_engine rng_core authenc libdes crct10dif_ce mxc_jpeg_encdec imx8_media_dev(C) caam error fuse
[ 7402.028408] CPU: 5 PID: 11108 Comm: mandelbrot2_mul Tainted: G C 5.10.35-lts-5.10.y+gef3f2cfc6010 #1
[ 7402.038847] Hardware name: Freescale i.MX8QM MEK (DT)
[ 7402.043898] pstate: 20000085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
[ 7402.049911] pc : 0xffff0008159fa000
[ 7402.053397] lr : 0xffff0008159fa000
[ 7402.056881] sp : ffff800011ce3dd0
[ 7402.060191] x29: ffff00081086e480 x28: ffff0008ff451140
[ 7402.065507] x27: ffff0008ff451180 x26: ffff80001011dae0
[ 7402.070825] x25: 0000000000000000 x24: ffff8000118ac980
[ 7402.076141] x23: ffff0008ff454980 x22: ffff8000100b9cc0
[ 7402.081458] x21: ffff800011ce3df0 x20: ffff8000100bd934
[ 7402.086775] x19: 0000000000000000 x18: 0000000000000000
[ 7402.092092] x17: 0000000000000000 x16: 0000000000000000
[ 7402.097408] x15: 0000a9c73d98f53e x14: 00000000000001c4
[ 7402.102723] x13: 0000000000000001 x12: 0000000000000004
[ 7402.108033] x11: 0000000000000001 x10: 00000000000001c4
[ 7402.113350] x9 : 00000000006cd246 x8 : ffff0008ff454a80
[ 7402.118667] x7 : ffff800011ab8a40 x6 : ffff8000118ac980
[ 7402.123984] x5 : 0000000000000400 x4 : 0000000000000403
[ 7402.129301] x3 : 0000000000000002 x2 : ffff0008100ec000
[ 7402.134617] x1 : ffff0008ff454980 x0 : ffff8000118ac980
[ 7402.139934] Call trace:
[ 7402.142376] 0xffff0008159fa000
[ 7402.145518] Code: 00000000 00000000 00000000 00000000 (00100000)
[ 7402.151618] ---[ end trace afa4dc373e0fd1cb ]---
[ 7402.156235] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 7402.163113] SMP: stopping secondary CPUs
[ 7402.167036] Kernel Offset: disabled
[ 7402.170526] CPU features: 0x0240022,2100600c
[ 7402.174797] Memory Limit: none
[ 7402.177854] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
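To compare the three logs, the exception syndrome values they print (96000004, 8a000000 and 8600000f) can be decoded by hand. The following is a minimal sketch, with `decode_esr` as a hypothetical helper name; it only applies the AArch64 ESR_EL1 field layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0) and does not interpret the ISS further.

```shell
#!/bin/sh
# Sketch: split an AArch64 ESR_EL1 value into its top-level fields.
# decode_esr is a hypothetical helper, not part of any existing tool.
decode_esr() {
    esr=$(( $1 ))
    ec=$(( (esr >> 26) & 0x3f ))     # exception class, bits 31:26
    il=$(( (esr >> 25) & 0x1 ))      # instruction length bit, bit 25
    iss=$(( esr & 0x1ffffff ))       # instruction specific syndrome, bits 24:0
    printf 'EC=0x%02x IL=%d ISS=0x%x\n' "$ec" "$il" "$iss"
}

decode_esr 0x96000004   # first log  -> EC=0x25 IL=1 ISS=0x4
decode_esr 0x8a000000   # second log -> EC=0x22 IL=1 ISS=0x0
decode_esr 0x8600000f   # third log  -> EC=0x21 IL=1 ISS=0xf
```

EC 0x25 is a data abort taken without a change in exception level, 0x21 is an instruction abort taken without a change in exception level, and 0x22 is a PC alignment fault. These agree with the "DABT (current EL)", "IABT (current EL)" and "SP/PC alignment exception" lines in the logs, so the three panics are three different fault types rather than one repeated fault.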
 


Has this issue happened to anyone before? Are there any recommended things to try to debug this?

Thanks in advance,
Edward
1 Solution
Bio_TICFSL
NXP TechSupport

Hello,

 

There have been some OpenCL fixes in the GPU driver since 5.10.35.

I don't know whether this one is related to your case:

MGS-4022 [#imx-1070] fix kernel panic with opencl test_buffers

the user memory will add the padding pages to meet hardware alignment,
need set non-contiguous flag to avoid contigous mapping in GPU MMU.

Signed-off-by: Xianzhong <xianzhong.li@nxp.com>

 

This fix is in 5.10.52-2.1.0. Can you try this GPU version to see whether it fixes your issue?
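A minimal sketch of how the build from the original post could be pointed at the newer release. It assumes the manifest for that release sits on the same imx-linux-hardknott branch and follows the same imx-<version>.xml naming as the 5.10.35 manifest; `print_sync_cmds` is a hypothetical helper that only prints the commands, so they can be checked before running them in the source tree.

```shell
#!/bin/sh
# Sketch: print (do not run) the repo commands for a given BSP release.
# Assumptions: same manifest repository and branch as in the original
# post, and manifests named imx-<version>.xml.
print_sync_cmds() {
    version="$1"
    printf 'repo init -u %s -b %s -m imx-%s.xml\n' \
        "https://source.codeaurora.org/external/imx/imx-manifest" \
        "imx-linux-hardknott" \
        "$version"
    printf 'repo sync\n'
}

print_sync_cmds 5.10.52-2.1.0
```

After rebuilding and flashing, `uname -r` on the board shows which kernel release is actually running.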

 

Regards

 

lavusedu
Contributor II


Hello,

It took some time, but I just got around to trying this (or rather, an even newer version, 5.10.72-2.2.0), and it appears to have solved my problem.

Thank you for your help.
