
iMX6Q PCIe stuck / hung kernel or HW (SoC)

Question asked by Primoz Fiser on Dec 1, 2016
Latest reply on Feb 21, 2017 by Primoz Fiser

Hello all,

 

We are using PCIe for communication between two iMX6Q SoMs. One iMX6Q runs Linux (fslc 4.1.15) and the other runs bare-metal code based on the Freescale SDK. The Linux side acts as the RC and the bare-metal side implements the EP.

 

We have developed drivers for both sides, and communication works as expected until the bare-metal EP side is reset or power cycled.

After an EP reset, any access to BAR memory (for example, a read() syscall on the device driver, which reads BAR memory) causes the RC (Linux) side to hang. The kernel/SoC is completely stuck and the only option is power cycling!

 

We can prevent the kernel / HW from hanging by performing this sequence whenever we want to reset the EP:

echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
sleep 5
# (reset/power cycle EP device)
sleep 5
echo 1 > /sys/class/pci_bus/0000\:01/rescan

 

However, we don't always know when the EP is going to be reset, since no such information is sent to the Linux side. We therefore need a better way of detecting this condition and preventing the kernel / HW from hanging.

 

We have also noticed that by reading the RC "Command and Status" register we can tell whether accessing BAR memory is safe or not:

*48.11.2 Command and Status Register (PCIE_RC_Command)*

Address: 1FF_C000h base + 4h offset = 1FF_C004h

Value after fresh reboot:

> [root@host ~]# ./memtool -32 0x01FFC004 1
> Reading 0x1 count starting at address 0x01FFC004
>
> 0x01FFC004:  00100547
>
> [root@host ~]#

Value after EP reset / power cycle:

> [root@host ~]# ./memtool -32 0x01FFC004 1
> Reading 0x1 count starting at address 0x01FFC004
>
> 0x01FFC004:  00100000
>
> [root@host ~]#

Value after running the above remove/rescan commands:

> [root@host ~]# echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
> [root@host ~]# echo 1 > /sys/class/pci_bus/0000\:01/rescan
> [root@host ~]# ./memtool -32 0x01FFC004 1
> Reading 0x1 count starting at address 0x01FFC004
>
> 0x01FFC004:  00100006
>
> [root@host ~]#

Short explanation:

 

*If you read "0x01FFC004" and the returned value is "0x00100000", do
not access the PCI EP device (the kernel hangs).*

 

When the returned value is "0x00100547" or "0x00100006", you can safely
access the PCI device without a kernel hang.
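
For reference, each of the three readings above splits into a Status half (upper 16 bits) and a Command half (lower 16 bits). The following is only a userspace sketch for decoding the Command half. The bit positions are the standard ones from the PCI specification, while the macro names and decode_command() are hypothetical helpers made up for illustration, not part of our driver.

```c
#include <stdint.h>
#include <stdio.h>

/* Standard PCI Command register bits (low 16 bits of config offset 0x04).
 * The macro names are made up for this sketch. */
#define CMD_IO_SPACE     0x0001u
#define CMD_MEM_SPACE    0x0002u
#define CMD_BUS_MASTER   0x0004u
#define CMD_PARITY_RESP  0x0040u
#define CMD_SERR_ENABLE  0x0100u
#define CMD_INTX_DISABLE 0x0400u

/* Write a space-separated list of the set Command bits into buf. */
static void decode_command(uint32_t reg, char *buf, size_t len)
{
    static const struct { uint16_t bit; const char *name; } bits[] = {
        { CMD_IO_SPACE,     "IO"       },
        { CMD_MEM_SPACE,    "MEM"      },
        { CMD_BUS_MASTER,   "MASTER"   },
        { CMD_PARITY_RESP,  "PARITY"   },
        { CMD_SERR_ENABLE,  "SERR"     },
        { CMD_INTX_DISABLE, "INTX_DIS" },
    };
    uint16_t cmd = (uint16_t)(reg & 0xFFFFu); /* drop the Status half */
    size_t used = 0;
    size_t i;

    if (len == 0)
        return;
    buf[0] = '\0';

    for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
        int n;

        if (!(cmd & bits[i].bit))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     used ? " " : "", bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* output truncated, stop here */
        used += (size_t)n;
    }
}
```

With a 64-byte buffer, 0x00100547 decodes to "IO MEM MASTER PARITY SERR INTX_DIS", 0x00100006 to "MEM MASTER", and 0x00100000 to an empty string. In other words, in the hanging state every Command bit of the RC reads back as cleared, while the Status half (0x0010, Capabilities List) is the same in all three readings.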

 

Example output:

> [root@host ~]# ./memtool -32 0x01FFC004 1
> Reading 0x1 count starting at address 0x01FFC004
>
> 0x01FFC004:  00100006
> [root@host ~]# dd if=/dev/imx6ep bs=1 count=1
> 1+0 records in
> 1+0 records out
> 1 byte copied, 0.03237 s, 0.0 kB/s
> [root@host ~]#
>
>     (RESET EP HERE)
>
> [root@host ~]# ./memtool -32 0x01FFC004 1
> Reading 0x1 count starting at address 0x01FFC004
>
> 0x01FFC004:  00100000
>
> [root@host ~]# dd if=/dev/imx6ep bs=1 count=1
>
>       (STUCK KERNEL / HW HERE)

 

We therefore developed a Linux driver function that checks the value of this register, and we call it before every access to BAR memory:

#include <linux/pci.h>

/* Check the PCI RC Command and Status register on iMX6Q.
 * Returns 0 if BAR access looks safe, -EIO otherwise. */
static int imx6ep_pci_rc_status(struct pci_dev *pdev)
{
        u32 val;

        /* Read the RC Command and Status register (0x01FFC004) through
         * config space: device 0 on the parent bus, offset PCI_COMMAND (0x04) */
        if (pci_bus_read_config_dword(pdev->bus->parent, 0, PCI_COMMAND, &val))
                return -EIO;

        //dbg("RC status value: 0x%08X\n", val);

        /* A value of 0x00100000 was observed after EP reset: access would hang */
        if (val == 0x00100000)
                return -EIO;

        return 0;
}
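
One possible variation is to test individual Command bits instead of comparing against a single magic value. This is only a sketch and rests on an assumption: in the three readings quoted earlier, both "safe" values have Memory Space and Bus Master set and the "hanging" value has them cleared, but nothing in the reference manual confirms that these bits are the actual cause. The macro names and rc_command_looks_enabled() are hypothetical; in kernel code the equivalent constants are PCI_COMMAND_MEMORY and PCI_COMMAND_MASTER.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard PCI Command register bits; names are made up for this sketch. */
#define RC_CMD_MEMORY 0x0002u /* Memory Space Enable */
#define RC_CMD_MASTER 0x0004u /* Bus Master Enable */

/* Return true if both bits are set in the dword read from config
 * offset 0x04 of the RC (assumption: this matches the "safe" state). */
static bool rc_command_looks_enabled(uint32_t reg)
{
    const uint32_t need = RC_CMD_MEMORY | RC_CMD_MASTER;

    return (reg & need) == need;
}
```

For the values we observed, this returns true for 0x00100547 and 0x00100006 and false for 0x00100000. It still has to be called before every access, so it does not remove the race between the check and the access itself.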

 

But I guess calling this function everywhere is a really inefficient and dirty workaround?

 

Do you have any idea what actually causes the SoC / kernel to hang and how to prevent it?

 

We also saw: https://community.nxp.com/thread/304284#316162

where Charles Powe had a similar problem with the iMX6Q hanging, but that thread is quite old and we would like to know whether anything has been done to resolve this matter.

 

Is this problem related to ERR005184 or ERR005723?

 

Thanks for any suggestions and solutions to our problem!

 

Primoz
