M5474 Seg Fault Issues

kevin_curtis · ‎08-22-2011

Hi,

we have an M5474 ColdFire based system with a synchronous serial port connected via the Flex Bus. We are using the LTIB toolset from ltib-cflinux-20100919 to build our rootfs. We have a number of custom processes running on the board. Some of them make system() calls to execute commands, for example

system("ps -eaf > /tmp/process_list.log");

This system command is executed on a regular basis. We started noticing stability problems where processes that made these calls would terminate with a segfault.

After a lot of debugging time we are sure that the problem is related to the interrupts to the ColdFire processor generated by the sync serial port that we are using. The system will run fine while there are no sync serial interrupts.

We have narrowed this down further by writing a simple program that repeatedly calls the above system command every second or so, and a very simple driver that claims the interrupt. The interrupt is now being driven from an external clock source at a rate of about 1 every 16ms. The ISR makes an access to one of the sync serial port registers and then clears the interrupt at the ColdFire level.

This very simple test setup is enough to cause a segfault in the test application, or one of the child processes forked to execute the ps command.
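A minimal sketch of what such a test application might look like is given here. It is not the original test program; the function names run_and_classify() and stress_system() are made up for illustration. It repeats the system() call and decodes the wait status, so a command killed by a signal can be told apart from one that merely returned a non-zero exit code.

```c
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run one shell command via system() and decode the wait status.
 * Returns the exit code (0..255) on a normal exit, the negated signal
 * number if the shell was killed by a signal, and -1000 if system()
 * itself failed or the status could not be classified.
 */
int run_and_classify(const char *cmd)
{
    int status = system(cmd);

    if (status == -1)
        return -1000;
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return -1000;
}

/*
 * Repeat the command up to `iterations` times, sleeping `delay_s`
 * seconds between runs. Returns 0 if every run succeeded, otherwise
 * the first non-zero result from run_and_classify().
 */
int stress_system(const char *cmd, int iterations, unsigned delay_s)
{
    for (int i = 0; i < iterations; i++) {
        int rc = run_and_classify(cmd);

        if (rc != 0)
            return rc;
        if (delay_s)
            sleep(delay_s);
    }
    return 0;
}
```

Calling stress_system("ps -eaf > /tmp/process_list.log", 600, 1) would run the command roughly once a second for ten minutes. Note that system() reports the status of the shell it starts, so if ps itself is killed by SIGSEGV the shell will typically exit with a code above 128 (commonly 139) rather than being reported as signalled.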

We have also noted that if we don't make any access in the ISR to the sync port registers then there isn't an issue.

We have seen references in the forum about calls to vfork() being an issue, but we believe that in this case the Kernel is using fork().

We think it is an issue with register access from within the ISR. But at the interrupt rates that we need to support, we don't think that deferring the interrupt processing to a bottom half is a viable solution.

Has anyone else had this issue?

Is there a known resolution that I have missed?

Thanks and Regards

Kevin

kevin_curtis · ‎08-23-2011

We have done some further testing and we now think that any flex bus access in an ISR will cause the segfault problem.

We have replaced accesses to the sync serial device with accesses to NOR flash instead and still get the same issue.

We are currently working on reproducing the issue on the Freescale evaluation board that we purchased in order to make our product selection.

TomE · ‎08-24-2011

Is that running Linux or uClinux? In other words, is the MMU active? That looks like it could complicate finding out what is wrong.

I'm familiar with the MPC PPC CPUs, and on those the MMU is disabled when the CPU takes an interrupt. This means that your interrupts can run in a "flat unmapped memory space" unless you turn the MMU back on.

With the MCF54 series CPUs it looks like the MMU is active all the time. That means that the I/O registers had better be permanently mapped in all circumstances (interrupting user-space code and interrupting kernel code).

I'd suggest you get the "ColdFire CF4e Core User's Manual" and read through the MMU chapter. I would hope that Linux is set up for your hardware with an ACR set up to map all of the peripheral registers so the MMU shouldn't get involved in these accesses.

Check all the other interrupt handlers in the system to see if there's any "magic" needed before accessing hardware registers. You may need to register some mapping for I/O areas before using them.

Maybe you could add some extra code to the Seg Fault handler to get it to print the CPU registers, a series of stack frames and any access error registers (so you can see the faulting program counter and access address). Decoding the stack frame should help to show why it is failing.
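As a rough user-space illustration of that idea (the suggestion above is about the kernel's own fault handling, which this does not touch), a process can install its own SIGSEGV handler with SA_SIGINFO and read the faulting access address from si_addr. This is only a sketch under assumptions: probe_fault() and on_segv() are hypothetical names, the bad access is made deliberately in a forked child so the caller survives, and the saved CPU registers in the ucontext argument are not decoded because their layout is architecture specific.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1; /* write end of the pipe used by the handler */

/* SIGSEGV handler: uses only async-signal-safe calls (write, _exit). */
static void on_segv(int sig, siginfo_t *info, void *ucontext)
{
    uintptr_t addr = (uintptr_t)info->si_addr; /* faulting access address */
    ssize_t n;

    (void)sig;
    (void)ucontext; /* saved registers live here; layout is per-architecture */
    n = write(report_fd, &addr, sizeof addr);
    (void)n;
    _exit(1);
}

/*
 * Fork a child, write to `bad` in it, and return the fault address the
 * kernel reported to the child's handler. Returns 0 if nothing was
 * reported (for example because the access did not fault).
 */
uintptr_t probe_fault(volatile int *bad)
{
    int fds[2];
    uintptr_t addr = 0;
    pid_t pid;

    if (pipe(fds) != 0)
        return 0;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return 0;
    }
    if (pid == 0) {
        struct sigaction sa;

        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_segv;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        close(fds[0]);
        report_fd = fds[1];
        if (sigaction(SIGSEGV, &sa, NULL) == 0)
            *bad = 1; /* deliberate bad access */
        _exit(0);
    }
    close(fds[1]);
    if (read(fds[0], &addr, sizeof addr) != (ssize_t)sizeof addr)
        addr = 0;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return addr;
}
```

In a real test application the handler would be installed once at start-up and would log the address (and, with architecture-specific code, the program counter) before exiting. Whether the address reported this way matches what the BDM debugger shows on the M5474 has not been checked here.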

Is it possible the ISR is trashing the stack somehow? Maybe the ISR is using more Kernel stack than is being reserved for it.

Tom

kevin_curtis · ‎08-24-2011

Hi Tom.

Thank you for your suggestions. We will certainly follow them up.

Yes, we are using Linux, with kernel version 2.6.29 from the LTIB package.

We are also using a remote gdb using the ColdFire USB BDM. So we can break at the point where the exception has been trapped. The offending address signalled by the MMU is always the same, but as yet we can't make any sense of it. The process that faults is our test process or one of its child processes (sh or ps).

Our minimal driver ISR has two integers of local data, and just does the flex bus access.

The driver does an ioremap_nocache() call to get the base of the I/O space that we do the I/O reads and writes from.

I'll let you know what else we find.

Many thanks.

Kevin

kevin_curtis · ‎09-01-2011

Hi,

we have been working to see if we can reproduce this issue on the Freescale evaluation board. The details are as follows:

Hardware: LogicPD M5474LITEKIT
Software: LTIB-CFLINUX-20100919, 2.6.29 kernel
Standard M5474LITE build except (i) network addresses changed, (ii) pciutils removed, (iii) PCI bus drivers removed.

A "signal generator" is sourcing a ~1us interrupt pulse once every 16ms. The interrupt is connected to the unused PCI INTA# on the PCI connector which in turn is connected directly to /IRQ7 on the 5474.

A dummy driver handles the interrupt, acknowledges it and schedules a bottom half. The bottom half then does a read from the NOR flash.

At the same time a test program is periodically executing result = system("/bin/ps -A > /usr/logs/deamonstat.txt");

The Segfault occurs within 10 mins.

The profile of the Segfault on the evaluation board is slightly different to that in our target hardware. On the evaluation board the segfault occurs in the swapper process, whereas on our target system the segfault always occurs in the test application or one of the processes that it creates to run the "system()" command.

This is still under investigation.

TomE · ‎09-01-2011

Debugging is meant to be easy under Linux. That's the "implied promise" that somehow gets forgotten on embedded systems.

Traditionally, after a segfault you enter the command "gdb core" and then ask gdb "where". Seeing the stackdump and calling tree can sometimes show the problem, or at least show the calling path that triggers the problem.

Except gdb isn't usually built for embedded systems, and then a lot of things conspire to prevent core files from being dropped into the file system. If you do get these working you don't have the sources on the embedded system, so an on-host gdb will have a few problems. That's easy, you just "objdump -S" the binaries and go from there, or "objdump -S vmlinux | less" if the segfault was in the kernel.

The way around that is to build "gdbserver" for the target and then connect to that from a gdb built to run on the build machine.

If you can get a consistent crash (like you can on your board) then run your program under gdb/gdbserver. Otherwise try to get it to dump a core. This may require configuration of any or all of the Kernel, Shell (BusyBox usually - it has to be built with CONFIG_FEATURE_INIT_COREDUMPS), Libraries and then you might have to play with stuff in /proc or ulimit.

Refer to here for some details:

http://www.open-mesh.org/wiki/batmand/Coredump
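For the ulimit part, a test application can also raise its own core-file limit at start-up instead of relying on the shell that launched it. A minimal sketch, assuming a POSIX system with getrlimit()/setrlimit(); the function name enable_core_dumps() is made up for illustration, and the kernel and BusyBox settings mentioned above still have to allow core files.

```c
#include <sys/resource.h>

/*
 * Raise the soft core-file size limit to the hard limit, which is the
 * same limit that the shell's `ulimit -c` adjusts.
 * Returns 0 on success, -1 on error.
 */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max; /* an unprivileged process cannot exceed the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

The limit is inherited across fork() and exec, so calling this before system() also applies to the sh and ps children. If the hard limit is already 0, the call succeeds but no core file will be written.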

I'm currently adding printk() calls to various bits of the kernel and slab memory management to try and track down intermittent crashes.

Tom (A Random Poster)

angelo_d · ‎09-03-2011

hello kevin,

i would suggest some hints:

1) exclude linux and your hardware completely: write a simple C program and test it on the development board. See if the interrupt still causes issues. i don't know that chip but you can probably debug with some leds, or with bdm.

2) if you still get strange behavior, check if there is some errata pdf out there. i had an issue with MCF5307 and was going crazy trying to understand why i was getting an exception returning from an interrupt. The issue was an error in the mask, reported in the errata. This is probably not your case but better check.

3) Since you get issues on both sets of hardware, is the device connected to the serial port reliable?

4) I use uClinux with the suggested toolchain to compile it. it runs on mcf5307 and is very stable, maybe you can try it.

regards,

angelo
