MCF5445x Linux, interrupts cause programs to crash

KyleFromBullex · ‎10-02-2012

Hello everyone,

I am developing a custom driver in Linux for the MCF54450 processor. This driver uses a PIT timer to trigger an interrupt roughly 2,000 times per second. However, this seems to be causing problems for userspace programs, specifically programs that use the execve() system call. Every once in a while, a program will crash while attempting to call execve(). It seems to be related to the driver and its interrupt, because removing the driver causes the bug to stop. The bug can be demonstrated by installing the driver, which starts the interrupt, and then running a program which simply uses execve() to call itself, resulting in an infinite loop of process creation. I've tried slowing down the interrupt, but the crash still occurs.
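For anyone trying to reproduce this, the self-exec test described above might look roughly like the sketch below. The thread does not include the real test program, so the function names and the iteration counter are assumptions made for illustration; the original loops forever, while this version stops after a fixed number of execs so it can be run safely.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

extern char **environ;

/* Parse the "iterations remaining" argument. Returns -1 if the string is
 * missing or is not a small non-negative decimal number. */
int parse_remaining(const char *arg)
{
    char *end;
    long n;

    if (arg == NULL || *arg == '\0')
        return -1;
    n = strtol(arg, &end, 10);
    if (*end != '\0' || n < 0 || n > 1000000)
        return -1;
    return (int)n;
}

/* Replace the current process image with `path`, passing remaining - 1 as
 * argv[1]. Returns 0 when no iterations are left, or -1 if execve() failed.
 * On a successful execve() this function never returns. */
int reexec_self(const char *path, int remaining)
{
    char count[16];
    char *argv[3];

    assert(path != NULL);
    if (remaining <= 0)
        return 0;

    snprintf(count, sizeof count, "%d", remaining - 1);
    argv[0] = (char *)path;
    argv[1] = count;
    argv[2] = NULL;

    execve(path, argv, environ);
    perror("execve");           /* reached only when execve() fails */
    return -1;
}
```

A `main` would call `reexec_self(argv[0], parse_remaining(argv[1]))`, with `argv[0]` being a path that execve() can resolve. Each exec goes through the dynamic loader again, which is the point of the test.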

I seem to remember something like this happening before, with programs mysteriously crashing, but it never happened as frequently as it does with this new driver. Has anyone encountered a similar issue? I'm running kernel 2.6.38 with Freescale patches and some custom in-house patches for bug fixes and a GPIO driver. Thanks in advance.

Kyle

TomE · ‎10-03-2012

What is the crash? Can you run the program under a debugger? Does it leave a stack-dump file you can inspect? Can you turn more debugging on to find out what the crash is?

execve() is pretty complicated and can return an error code. Is the caller checking for an error return from this function (and then printing that and errno)?
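A minimal sketch of the check being asked about, assuming a hypothetical helper called `try_exec`: execve() only returns when it fails, so any code after the call is the error path, and errno should be saved before anything else can overwrite it.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

extern char **environ;

/* Try to exec `path` with no extra arguments. On success this never
 * returns. On failure it logs the return value and errno to stderr and
 * returns the saved errno value. */
int try_exec(const char *path)
{
    char *argv[] = { (char *)path, NULL };
    int rc, saved;

    assert(path != NULL);
    rc = execve(path, argv, environ);
    saved = errno;              /* save before stdio can change it */
    fprintf(stderr, "execve(%s) returned %d, errno=%d (%s)\n",
            path, rc, saved, strerror(saved));
    return saved;
}
```

If the process dies with a signal before this message is printed, the failure is not an ordinary execve() error return.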

It is likely the ISR is trashing some registers (or other resources) and isn't saving and restoring the environment properly. Is the ISR completely self-contained, or is it calling other functions that might not be interrupt-safe?

Tom

KyleFromBullex · ‎10-04-2012

I ran the program with core dumps enabled, and I got a core dump file, but I can't get GDB to read it. It says, "GDB can't read core files on this machine." I'm using the CodeSourcery x86 GDB for m68k. I've tried compiling GDB for m68k, but haven't been able to get it to compile.

The test program is set to print out the return code from execve(), but it segfaults before this ever occurs. With regard to the ISR, I've tried running an empty interrupt routine, and it still causes the same problems. I'm currently trying to do a binary search on the kernel's execve() routines using printk() calls, thinking that maybe the ISR is interrupting a specific block of code.

Any ideas on how to more efficiently debug the kernel, or get GDB to read the core dump? Thanks.

Kyle

TomE · ‎10-04-2012

> I can't get GDB to read it.

Are you running gdb on the HOST or the TARGET? If you have a gdb built to run on the Target it should work. If you have that you can run your program under GDB (but probably not if it spends its time launching other programs that crash).

If you're trying to run gdb on the host, then it has to be the version compiled for that target. For instance the one I run is called "m68k-elf-gdb". Your toolchain should provide something similar.

What sort of File System are you running? It might be that something in the Flash Filesystem or Flash driver doesn't like getting interrupted.

Do you have a temporary filesystem - a "tmpfs" that is mounted in RAM? Can you have your program run from and launch programs from that filesystem to try to eliminate the Flash? How about from an external USB stick?

> Any ideas on how to more efficiently debug the kernel,

Possibly KGDB (see the Wikipedia article on KGDB).

Tom

KyleFromBullex · ‎10-05-2012

Alright, I've used some manual debugging methods to determine that it isn't the execve() system call itself that's causing the crash. I was also able to compile GDB for the target and read a sample of 30 core dumps. There seem to actually be two different crashes, but both take place in /lib/ld.so.1, the dynamic loader, and both of them result from accesses to the .dynamic section of an ELF. One of the crashes happens when the loader attempts to access its own .dynamic section, and the other happens when it attempts to access that of /lib/libc.so.6. In both cases, the guilty instruction attempted to add either %d0 or %d1 to (%a0+4).

In short, the crash is happening inside the dynamic loader, which explains why calls to execve() tend to cause the crash, since the dynamic loader gets run immediately after this system call. I'll be researching this some more and examining the other registers shortly.
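One plausible reading of the faulting instruction, offered as an assumption rather than something the thread confirms: a dynamic loader typically adds the load address to the pointer-valued entries of `.dynamic` early in startup, and in a 32-bit ELF the value field sits 4 bytes into each 8-byte entry, which matches an add to (%a0+4). The sketch below is a simplified illustration using the standard `<elf.h>` types; `relocate_dynamic` and its choice of tags are hypothetical and are not glibc's code.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Walk a DT_NULL-terminated .dynamic array and add the load bias to a few
 * address-valued entries. Returns the number of entries adjusted. */
int relocate_dynamic(Elf32_Dyn *dyn, Elf32_Addr load_bias)
{
    int adjusted = 0;

    assert(dyn != NULL);
    for (; dyn->d_tag != DT_NULL; dyn++) {
        switch (dyn->d_tag) {
        case DT_PLTGOT:
        case DT_HASH:
        case DT_STRTAB:
        case DT_SYMTAB:
        case DT_RELA:
        case DT_JMPREL:
            /* d_un is at offset 4 of the entry, so this is a
             * read-modify-write of the word at (entry + 4). */
            dyn->d_un.d_ptr += load_bias;
            adjusted++;
            break;
        default:
            break;              /* sizes and flags are not addresses */
        }
    }
    return adjusted;
}
```

If this reading is right, the crashing access is a write into a freshly mapped page of the shared object, which would fit a fault-handling problem better than a bug in execve() itself.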

To answer your questions, we are running an ext2 filesystem on a MicroSD card over an SPI bus. I'll try running it from RAM and see if that helps. Thanks for all your help, Tom.

Edit: I tried running the whole system from an initrd (RAM disk), with no SD card. It still crashed, so I don't think it's related to SD card I/O.

Kyle

TomE · ‎10-07-2012

Check the Spurious Interrupt Errata for that chip and see if you can prove the kernel code is doing the right thing.

I'm using an MCF5329, and in this CORE3 chip the stack pointers don't work properly. The workarounds are "don't use the User Stack Pointer", or special code has to be added before every RTE instruction.

To run a real OS you really want separate stack pointers for user and kernel code.

The V4 Core doesn't have this problem, but it might be possible there's some leftover code in the Kernel to handle this sort of stack pointer bug in other Coldfire chips. Make sure it is using both stack pointers if you can.

You might be getting some sort of stack corruption or overflow. Can you make the kernel and user stacks bigger and see if the problem goes away?

Is the driver (that is causing the crash) doing anything silly like allocating a large data structure or array on the stack? Is it allocating any kernel memory? Is it doing things it should be getting a lock for first? Can you run the interrupt at a higher or lower IPL (try running it at IPL6 and IPL1 if you can) and see if it changes. Lower IPLs might stop it from interrupting something sensitive and higher IPLs might stop it from getting interrupted.

Tom

KyleFromBullex · ‎10-08-2012

I checked the spurious interrupt errata for the 5445x, but I haven't seen any spurious interrupts while I've been developing with this processor, so I don't think this is the problem. I also tried making the user stack much bigger, but the problem didn't go away. I tried running the interrupt at levels 0, 1, 6, and 7, and in all cases the test program still crashed.

The driver doesn't allocate any memory; all of its data is stored in its own .data and .bss sections. The interrupt does nothing more than check the status of 11 GPIO pins, put the status in memory, and exit.

I've briefly looked over the low-level interrupt and context-switching code, and it seems to be making use of both stack pointers. I'll look into it further. It seems most likely to me that the low-level interrupt code is trashing the registers.

Kyle

KyleFromBullex · ‎10-12-2012

Update:

I received a kernel patch from Freescale which fixed the issue. The problem was related to memory management. Thanks for your help, Tom. Below is the patch for anyone who wants it.

diff -Nurp linux-2.6.38/arch/m68k/coldfire/common/entry.S linux-2.6.38-f1/arch/m68k/coldfire/common/entry.S
--- linux-2.6.38/arch/m68k/coldfire/common/entry.S 2012-02-17 12:36:12.357418000 +0800
+++ linux-2.6.38-f1/arch/m68k/coldfire/common/entry.S 2012-02-16 11:12:19.065416140 +0800
@@ -43,7 +43,6 @@
 * TIF_SYSCALL_TRACE 15
 * TIF_MEMDIE 16 (never checked here)
 */
-
 .bss
 sw_ksp:
@@ -68,7 +67,11 @@ ENTRY(buserr)
 #ifdef CONFIG_COLDFIRE_FOO
 movew #0x2700,%sr /* lock interrupts */
 #endif
+ movew #0x2700,%sr
 SAVE_ALL_INT
+ move.w 54(%sp),%d3
+ ori.l #0x2000,%d3
+ move.w %d3,%sr
 #ifdef CONFIG_VDSO
 jsr check_vdso_atomic_cmpxchg_32
 #endif
diff -Nurp linux-2.6.38/arch/m68k/coldfire/common/ints.c linux-2.6.38-f1/arch/m68k/coldfire/common/ints.c
--- linux-2.6.38/arch/m68k/coldfire/common/ints.c 2012-02-17 12:36:12.149418000 +0800
+++ linux-2.6.38-f1/arch/m68k/coldfire/common/ints.c 2012-02-01 16:18:35.345417997 +0800
@@ -384,6 +384,7 @@ void m547x_8x_irq_enable(unsigned int ir
 if ((irq > 0) && (irq < 8)) {
 /* enable eport */
 MCF_EPPAR &= ~(3 << (irq*2));
+ MCF_EPPAR |= (2 << (irq*2)); /* Edge */
 /* level */
 MCF_EPDDR &= ~(1 << irq);
 /* input */
diff -Nurp linux-2.6.38/arch/m68k/include/asm/cf_548x_cacheflush.h linux-2.6.38-f1/arch/m68k/include/asm/cf_548x_cacheflush.h
--- linux-2.6.38/arch/m68k/include/asm/cf_548x_cacheflush.h 2012-02-17 12:36:12.709418001 +0800
+++ linux-2.6.38-f1/arch/m68k/include/asm/cf_548x_cacheflush.h 2012-02-17 13:27:10.089418011 +0800
@@ -286,6 +286,7 @@ static inline void copy_from_user_page(s
 struct page *page, unsigned long vaddr,
 void *dst, void *src, int len)
 {
+ flush_dcache();
 memcpy(dst, src, len);
 }
diff -Nurp linux-2.6.38/arch/m68k/include/asm/entry_mm.h linux-2.6.38-f1/arch/m68k/include/asm/entry_mm.h
--- linux-2.6.38/arch/m68k/include/asm/entry_mm.h 2011-03-15 09:20:32.000000000 +0800
+++ linux-2.6.38-f1/arch/m68k/include/asm/entry_mm.h 2012-02-16 11:19:08.645427138 +0800
@@ -65,6 +65,8 @@ LFLUSH_I_AND_D = 0x00000808
 * that the stack frame is NOT for syscall
 */
 .macro save_all_int
+ movel MMUSR,%sp@-
+ movel MMUAR,%sp@-
 clrl %sp@- | stk_adj
 pea -1:w | orig d0
 movel %d0,%sp@- | d0
@@ -72,6 +74,8 @@ LFLUSH_I_AND_D = 0x00000808
 .endm
 .macro save_all_sys
+ movel MMUSR,%sp@-
+ movel MMUAR,%sp@-
 clrl %sp@- | stk_adj
 movel %d0,%sp@- | orig d0
 movel %d0,%sp@- | d0
@@ -83,6 +87,7 @@ LFLUSH_I_AND_D = 0x00000808
 movel %sp@+,%d0
 addql #4,%sp | orig d0
 addl %sp@+,%sp | stk adj
+ addql #8,%sp
 rte
 .endm
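The thread does not explain the constants in the entry.S hunk, so the following is a reading under stated assumptions, not Freescale's explanation. If the register save area follows the usual m68k layout of eleven longs (44 bytes), and the ColdFire hardware frame starts with a format/vector word followed by the status register, then the two extra longs pushed by the patched `save_all_int` put the saved SR at 44 + 8 + 2 = 54 bytes from the stack pointer, which matches `54(%sp)`. The struct and helper names below are hypothetical and exist only to check that arithmetic and to decode `0x2700` and `0x2000`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stack contents after the patched save_all_int, lowest address
 * first. Illustrative only; this is not the kernel's own struct. */
struct patched_frame {
    uint32_t d1, d2, d3, d4, d5;
    uint32_t a0, a1, a2;
    uint32_t d0;
    uint32_t orig_d0;
    uint32_t stkadj;
    uint32_t mmuar;             /* pushed second, so lower address */
    uint32_t mmusr;             /* pushed first by the patch */
    uint16_t format_vector;     /* hardware exception frame starts here */
    uint16_t sr;
    uint32_t pc;
};

/* Compile-time check of the 54-byte offset under the assumed layout. */
_Static_assert(offsetof(struct patched_frame, sr) == 54,
               "saved SR should sit 54 bytes above the stack pointer");

#define SR_SUPERVISOR 0x2000u   /* S bit */
#define SR_IPL_MASK   0x0700u   /* interrupt priority mask, bits 10..8 */

/* Build an SR value from a supervisor flag and an interrupt mask level. */
uint16_t sr_make(unsigned supervisor, unsigned ipl)
{
    assert(ipl <= 7);
    return (uint16_t)((supervisor ? SR_SUPERVISOR : 0u) | (ipl << 8));
}

/* Interrupt mask level (0..7) encoded in an SR value. */
unsigned sr_ipl(uint16_t sr)
{
    return (sr & SR_IPL_MASK) >> 8;
}

/* Value the patch writes to %sr after SAVE_ALL_INT: the interrupted
 * context's SR with the supervisor bit forced on. */
uint16_t sr_after_save(uint16_t saved_sr)
{
    return (uint16_t)(saved_sr | SR_SUPERVISOR);
}
```

Read this way, the hunk masks all interrupts (`0x2700` is supervisor mode with mask level 7) while the registers are being saved, then drops back to the interrupted context's mask level while staying in supervisor mode.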

juliengrossholt · ‎11-22-2013

I can confirm this bug: I also had it on my 2.6.25 kernel when a userspace program was using the system() call to start another program. Thank you for your post with this patch; it has fixed my issue.

In case it helps anybody else, here is the patch adapted from the previous one for my uClinux 2.6.25 kernel:

diff --git a/arch/m68k/coldfire/entry.S b/arch/m68k/coldfire/entry.S
index 1aaca64..7d997aa 100755
--- a/arch/m68k/coldfire/entry.S
+++ b/arch/m68k/coldfire/entry.S
@@ -82,7 +82,12 @@ ENTRY(buserr)
 #ifdef CONFIG_COLDFIRE_FOO
 movew #0x2700,%sr /* lock interrupts */
 #endif
+ movew #0x2700,%sr
 SAVE_ALL_INT
+ move.w 54(%sp),%d3
+ ori.l #0x2000,%d3
+ move.w %d3,%sr
+
 #ifdef CONFIG_COLDFIRE_FOO
 movew PT_SR(%sp),%d3 /* get original %sr */
 oril #0x2000,%d3 /* set supervisor mode in it */
diff --git a/arch/m68k/coldfire/ints.c b/arch/m68k/coldfire/ints.c
index 461b96a..bb7f84c 100755
--- a/arch/m68k/coldfire/ints.c
+++ b/arch/m68k/coldfire/ints.c
@@ -485,16 +485,15 @@ void m547x_8x_irq_enable(unsigned int irq)
 irq -= 64;
 /* JKM -- re-add EPORT later */
-#if 0
 /* check for eport */
 if ((irq > 0) && (irq < 8)) {
 /* enable eport */
 MCF_EPPAR &= ~(3 << (irq*2)); /* level */
+ MCF_EPPAR |= (2 << (irq*2)); /* Edge */
 //MCF_EPORT_EPDDR &= ~(1 << irq); /* input */
 MCF_EPDDR &= ~(1 << irq); /* input */
 MCF_EPIER |= 1 << irq; /* irq enabled */
 }
-#endif
 if (irq < 32) {
 /* *grumble* don't set low bit of IMRL */
diff --git a/include/asm-m68k/cf_548x_cacheflush.h b/include/asm-m68k/cf_548x_cacheflush.h
index 9a529e8..32bf37d 100755
--- a/include/asm-m68k/cf_548x_cacheflush.h
+++ b/include/asm-m68k/cf_548x_cacheflush.h
@@ -241,6 +241,7 @@ static inline void copy_to_user_page(struct vm_area_struct *vma,
 struct page *page, unsigned long vaddr,
 void *dst, void *src, int len)
 {
+ flush_dcache();
 memcpy(dst, src, len);
 flush_icache_user_page(vma, page, vaddr, len);
 }
diff --git a/include/asm-m68k/entry.h b/include/asm-m68k/entry.h
index f8f6b18..7346f6b 100755
--- a/include/asm-m68k/entry.h
+++ b/include/asm-m68k/entry.h
@@ -71,6 +71,8 @@ PT_DTRACE_BIT = 2
 * that the stack frame is NOT for syscall
 */
 .macro save_all_int
+ movel MMUSR,%sp@-
+ movel MMUAR,%sp@-
 clrl %sp@- | stk_adj
 pea -1:w | orig d0
 movel %d0,%sp@- | d0
@@ -78,6 +80,8 @@ PT_DTRACE_BIT = 2
 .endm
 .macro save_all_sys
+ movel MMUSR,%sp@-
+ movel MMUAR,%sp@-
 clrl %sp@- | stk_adj
 movel %d0,%sp@- | orig d0
 movel %d0,%sp@- | d0
@@ -89,6 +93,7 @@ PT_DTRACE_BIT = 2
 movel %sp@+,%d0
 addql #4,%sp | orig d0
 addl %sp@+,%sp | stk adj
+ addql #8,%sp
 rte
 .endm
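The ints.c hunk in both patches clears a two-bit field in EPPAR and then writes the value 2 into it. The sketch below shows that read-modify-write on a plain variable. The helper name and the enum are hypothetical, and the meaning attached to each encoding (0 level, 1 rising, 2 falling, 3 both) is an assumption based on the ColdFire EPORT documentation; the patch itself only labels the new value "Edge".

```c
#include <assert.h>
#include <stdint.h>

/* Assumed two-bit trigger encodings for one EPORT pin. */
enum eport_trigger {
    EPORT_LEVEL   = 0,
    EPORT_RISING  = 1,
    EPORT_FALLING = 2,
    EPORT_BOTH    = 3
};

/* Return `eppar` with the two-bit field for `irq` replaced by `mode`.
 * Mirrors the patch: clear with ~(3 << (irq*2)), then OR in the mode. */
uint16_t eppar_set_trigger(uint16_t eppar, unsigned irq,
                           enum eport_trigger mode)
{
    unsigned shift;

    assert(irq >= 1 && irq <= 7);   /* same range the driver checks */
    shift = irq * 2;
    eppar &= (uint16_t)~(3u << shift);
    eppar |= (uint16_t)((unsigned)mode << shift);
    return eppar;
}
```

Clearing the field first matters: OR-ing 2 into a field that already held 1 would leave 3 rather than 2.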
