IPCF bug: NULL pointer dereference in the multi-instance scenario

jun13_chen · ‎07-04-2024

We found that the IPCF driver on the A core has a bug in the multi-instance scenario. The latest IPCF version you released also has this bug.
Please see the following log:

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
ipc-shm-sample: IPCF_Driver_Version:v6.5
Mem abort info:
ipc-shm-sample: IPCF_Driver_Version:v6.5
ESR = 0x0000000096000005
ipc-shm-sample: IPCF_Driver_Version:v6.5
EC = 0x25: DABT (current EL), IL = 32 bits
ipc-shm-sample: IPCF_Driver_Version:v6.5
SET = 0, FnV = 0
ipc-shm-sample: IPCF_Driver_Version:v6.5
EA = 0, S1PTW = 0
ipc-shm-sample: IPCF_Driver_Version:v6.5
FSC = 0x05: level 1 translation fault
ipc-shm-sample: IPCF_Driver_Version:v6.5
Data abort info:
ISV = 0, ISS = 0x00000005
CM = 0, WnR = 0
user pgtable: 4k pages, 39-bit VAs, pgdp=0000000081c7a000
[0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
Internal error: Oops: 96000005 [#1] PREEMPT SMP
printk: console [ttyLF0]: printing thread stopped
Modules linked in: ipc_shm_sample(O) ipc_shm_dev(O) pfeng_slave(O)
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 5.15.119 #69
Hardware name: NXP S32G3XXX-EVB3 (DT)
pstate: a0000005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : ipc_queue_pop+0x24/0x84 [ipc_shm_dev]
lr : 0xffffffe00291a108
sp : ffffffc008003db0
x29: ffffffc008003db0 x28: ffffffe00291fd28 x27: 00000000ffffff97
x26: 0000000000000000 x25: 0000000000000001 x24: 00000000000008d0
x23: ffffffe00291f340 x22: 0000000000000001 x21: ffffffe00291fd38
x20: 0000000000000040 x19: ffffffe00291fd38 x18: fffffffe007ef540
x17: ffffffa0783f8000 x16: ffffffc008000000 x15: 0000000000000000
x14: 0000000000000000 x13: 000000000000013e x12: 0000000000000001
x11: 0000000000000040 x10: ffffffa0783f8000 x9 : 0000000000000000
x8 : ffffffc008980038 x7 : 0000000000000000 x6 : 0000000000000000
x5 : ffffffc008900038 x4 : 0000000000000080 x3 : 0000000000000000
x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffffc008003e50
Call trace:
ipc_queue_pop+0x24/0x84 [ipc_shm_dev]
ipc_queue_init+0xb8/0x200 [ipc_shm_dev]
tasklet_action_common.constprop.0+0x13c/0x14c
tasklet_action+0x24/0x30
_stext+0x11c/0x288
__irq_exit_rcu+0xa4/0xdc
irq_exit+0xc/0x1c
handle_domain_irq+0x60/0x90
gic_handle_irq+0x50/0x120
call_on_irq_stack+0x20/0x3c
do_interrupt_handler+0x4c/0x54
el1_interrupt+0x2c/0x70
el1h_64_irq_handler+0x14/0x20
el1h_64_irq+0x74/0x78
arch_cpu_idle+0x14/0x20
do_idle+0xc0/0x144
cpu_startup_entry+0x24/0x50
rest_init+0xdc/0xf0
arch_call_rest_init+0xc/0x14
start_kernel+0x500/0x53c
__primary_switched+0xbc/0xc4
Code: aa0003f3 540002e0 aa0103e0 a9408663 (b9400022)
---[ end trace 31e2f3acfa5788b1 ]---
Kernel panic - not syncing:

Please help us check whether this is a real bug, and whether our proposed fix is reasonable or there is a better way.
Please also evaluate whether the M-core IPCF has the same problem and, if so, what the fix plan is.

The IPCF version we use is SW32G_IPCF_4.8.0_D2212, and the M core uses the same version.
According to the call stack and registers, queue->pop_ring in ipc_queue_pop is a NULL pointer. The queue being popped is mchan->bd_queue.
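
As a cross-check of this reading, the faulting word shown in parentheses on the "Code:" line (b9400022) can be decoded by hand. The sketch below assumes the standard A64 encoding of LDR Wt, [Xn, #imm] (unsigned offset form); the struct and function names are made up for illustration and are not part of IPCF.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of an A64 "LDR Wt, [Xn, #imm]" (unsigned offset) word. */
struct a64_ldr_uimm {
	int is_ldr_w;     /* 1 if the word matches this instruction form */
	unsigned rt;      /* destination register number (Wt) */
	unsigned rn;      /* base register number (Xn, 31 means SP) */
	unsigned offset;  /* byte offset, imm12 scaled by 4 */
};

/* Hypothetical helper: decode one 32-bit instruction word. */
static struct a64_ldr_uimm decode_ldr_w_uimm(uint32_t insn)
{
	struct a64_ldr_uimm d = {0, 0, 0, 0};

	/* Bits [31:22] must be 1011100101 for the 32-bit LDR form. */
	if ((insn & 0xFFC00000u) != 0xB9400000u)
		return d;

	d.is_ldr_w = 1;
	d.rt = insn & 0x1Fu;
	d.rn = (insn >> 5) & 0x1Fu;
	d.offset = ((insn >> 10) & 0xFFFu) * 4u;
	return d;
}

/* Self-check against the faulting word from the oops. */
static void check_oops_word(void)
{
	struct a64_ldr_uimm d = decode_ldr_w_uimm(0xb9400022u);

	/* Expect: ldr w2, [x1] */
	assert(d.is_ldr_w && d.rt == 2 && d.rn == 1 && d.offset == 0);
}
```

If this decoding is right, the faulting instruction is ldr w2, [x1], and the register dump shows x1 = 0, which matches the fault at virtual address 0.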

Combining the code with the observed behavior, the root cause is:
1. At the end of ipc_shm_init_instance, instance 0 enables its own interrupt.
2. Instance 1 then sets the IPC_SHM_INSTANCE_ENABLED flag in ipc_os_init.
3. After that, ipc_shm_channel_init initializes mchan->bd_queue of instance 1.
4. An interrupt for instance 0 arrives between step 2 and step 3.
5. Because the flag of instance 1 is already set, the interrupt handler also walks mchan->bd_queue of instance 1.
Since this data structure is not yet initialized, a NULL pointer is dereferenced.
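
The ordering problem in the five steps above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: every name prefixed with model_ is hypothetical, the "interrupt" is simulated by a plain function call placed in the window between step 2 and step 3, and the handler counts the NULL queues it would have dereferenced instead of actually crashing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NUM_INSTANCES 2

struct model_queue { int unused; };

struct model_instance {
	bool enabled;                  /* models IPC_SHM_INSTANCE_ENABLED */
	bool irq_on;                   /* models ipc_hw_irq_enable() having run */
	struct model_queue *bd_queue;  /* models mchan->bd_queue */
};

static struct model_queue model_queues[MODEL_NUM_INSTANCES];

/* Handler model: visits every enabled instance (step 5) and counts
 * the queues that are still NULL, i.e. would-be NULL dereferences. */
static int model_rx_handler(const struct model_instance *inst)
{
	int hazards = 0;

	for (size_t i = 0; i < MODEL_NUM_INSTANCES; i++) {
		if (inst[i].enabled && inst[i].bd_queue == NULL)
			hazards++;
	}
	return hazards;
}

/* An interrupt can only be delivered once some instance enabled it. */
static int model_irq_fire(const struct model_instance *inst)
{
	for (size_t i = 0; i < MODEL_NUM_INSTANCES; i++) {
		if (inst[i].irq_on)
			return model_rx_handler(inst);
	}
	return 0;
}

/* Runs the init sequence and returns the number of hazards observed. */
static int model_init(struct model_instance *inst, bool defer_irq_enable)
{
	int hazards = 0;

	assert(inst != NULL);
	for (size_t i = 0; i < MODEL_NUM_INSTANCES; i++) {
		inst[i].enabled = true;               /* step 2 */
		hazards += model_irq_fire(inst);      /* step 4: IRQ in the window */
		inst[i].bd_queue = &model_queues[i];  /* step 3 */
		if (!defer_irq_enable)
			inst[i].irq_on = true;        /* step 1: original order */
	}
	if (defer_irq_enable) {
		for (size_t i = 0; i < MODEL_NUM_INSTANCES; i++)
			inst[i].irq_on = true;        /* enable only at the end */
	}
	hazards += model_irq_fire(inst);              /* after init: safe */
	return hazards;
}
```

In this model, model_init(inst, false) reports one hazard (instance 1 is visited while its queue is still NULL), while model_init(inst, true) reports none.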

We have tried a fix and verified that it works:
move the interrupt enable from step 1 to after the initialization of all instances is completed.


diff --git a/ipc-shm110460k/ipc-shm.c b/ipc-shm110460k/ipc-shm.c
index 7716f16..3e4fc0f 100644
--- a/ipc-shm110460k/ipc-shm.c
+++ b/ipc-shm110460k/ipc-shm.c
@@ -669,9 +669,6 @@ static int ipc_shm_init_instance(uint8_t instance,
 		remote_chan_shm += chan_size;
 	}
-	/* enable interrupt notifications */
-	ipc_hw_irq_enable(instance);
-
 	ipc_shm_priv_data[instance].global->state = IPC_SHM_STATE_READY;
 	shm_dbg("ipc shm initialized\n");
@@ -771,6 +768,14 @@ int ipc_shm_init(const struct ipc_shm_instances_cfg *cfg)
 		if (err != 0)
 			return err;
 	}
+
+	for (i = 0; i < cfg->num_instances; i++) {
+		/*
+		 * enable irq only after all instances are initialized
+		 */
+		ipc_hw_irq_enable(i);
+	}
+
 	return 0;
 }
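
As one possible alternative to the patch above (not verified on hardware, and not taken from the IPCF sources), the handler could skip any instance that has not yet published itself as ready. The sketch below uses C11 atomics in user space to show the idea; all guard_ names are hypothetical, and in kernel code the equivalent would presumably be smp_store_release() and smp_load_acquire() rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct guard_queue { int pending; };

struct guard_instance {
	struct guard_queue *bd_queue;  /* models mchan->bd_queue */
	atomic_bool ready;             /* set last, after the queue exists */
};

/* Init side: set up the queue first, then publish readiness with
 * release ordering so the queue pointer is visible before the flag. */
static void guard_publish(struct guard_instance *inst, struct guard_queue *q)
{
	assert(inst != NULL && q != NULL);
	inst->bd_queue = q;
	atomic_store_explicit(&inst->ready, true, memory_order_release);
}

/* Handler side: only touch instances whose ready flag is observed. */
static int guard_rx(struct guard_instance *inst, size_t n)
{
	int serviced = 0;

	for (size_t i = 0; i < n; i++) {
		if (!atomic_load_explicit(&inst[i].ready, memory_order_acquire))
			continue;  /* still initializing: skip, do not dereference */
		serviced += inst[i].bd_queue->pending;
	}
	return serviced;
}
```

The trade-off is that notifications arriving for an instance that is not yet ready are simply skipped in this sketch, so deferring the interrupt enable as in the patch may still be the simpler change.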


chenyin_h · ‎07-05-2024

Hello, @jun13_chen

Thanks for the post and sorry for the issues.

Would you mind briefly describing a test method that reproduces the issue? I will test it on a local board and, if it reproduces, discuss it with our internal experts.

Thanks in advance.

Best Regards

Chenyin

jun13_chen · ‎07-07-2024

Configure IPCF with two instances, with any number of channels (one or more) in the two instances.
1. First initialize the entire IPCF SRAM area to all 0s.
2. The A core and the M core call ipc_shm_init at the same time.
3. The M core may finish initializing first; then, while the A core is still initializing, the M core continuously sends data on a channel in instance 0.
In addition, we suspect that the M core has the same problem:
in step 3, if the A core is initialized first, then during the M-core initialization the A core continuously sends data on a channel in instance 0.

The above steps reproduce the problem.