strange hard fault

lpcware · ‎06-15-2016

Content originally posted in LPCWare by dragilla on Tue Jan 24 18:22:58 MST 2012
hello,
I just got a strange hard fault from my device:
call stack:


TCL Debug [C/C++ MCU Application]
MCU GDB Debugger (1/25/12 2:17 AM) (Suspended)
Thread [1] (Suspended: Signal 'SIGSTOP' received. Description: Stopped (signal).)
7 HardFault_Handler() cr_startup_lpc176x.c:356 0x000000f6
6 <signal handler called>()  0xfffffff1
5 main_loop() main.c:687 0x00006f4c
4 RIT_IRQHandler() RITtimer.c:45 0x00000250
3 <signal handler called>()  0xfffffff9
2 c_entry_main() main.c:381 0x00005c9e
1 main() main.c:1185 0x00008578
arm-none-eabi-gdb (1/25/12 2:16 AM)
/home/luke/Desktop/moje/work/workspace/TCL/Debug/TCL.axf (1/25/12 2:17 AM)

Now the code that caused the fault (line 687 in main.c)

// adjust pulses_mod_* for speed
pulses_mod_front = fabs(ceil(1.0 / (gps_data.speed * bolts_multiplier_front)));
pulses_mod_rear = fabs(ceil(1.0 / (gps_data.speed * bolts_multiplier_rear)));

if(pulses_mod_front > PULSES_MAX_SIZE_M1) pulses_mod_front = PULSES_MAX_SIZE_M1;
if(pulses_mod_rear > PULSES_MAX_SIZE_M1) pulses_mod_rear = PULSES_MAX_SIZE_M1;

// read wheels
pulses_front_current = FIO_ByteReadValue(1, 2);
pulses_rear_current = pulses_front_current & 0x0f;
pulses_front_current = pulses_front_current & 0xf0;
pulses_front_current = pulses_front_current >> 4;

// calculate the change since last read (front)
index = i % pulses_mod_front;
pulses_front[index] = pulses_front_current - pulses_front_previous; // <-- HARD FAULT HERE!

Now I can give you the values of the variables and it seems impossible to get this error, as:
1) gps_data.speed = 4.6148641819514561e-08
2) bolts_multiplier_front = 0.01

So pulses_mod_front should be positive, but it reads "-2128056287" instead.
This leads to a huge value of "index", which is out of bounds for "pulses_front".

What's going on here? I don't use dynamic memory anywhere in my program, so what can be wrong here? Is the board broken?

I'm on lpcxpresso1769 btw.

regards,
--
Luke

anup_gandra · ‎10-31-2022

I'm facing a similar issue (a hard fault together with what looks like wrong memory accesses or corrupted memory) on an MKV56 controller.

Was a solution or root cause ever found for this issue, by any chance?

javiervallori · ‎07-02-2021

Hi,

I know this is an old post, but I'm posting my solution anyway; maybe it is useful for somebody.

I've had similar problems in a project I am working on. In my case, I get the HardFault when a CAN IRQ starts within a RIT IRQ.

I was working on a C++ project, and the CAN IRQ callback function was in a .cpp file. I forgot to include the compiler directive:

#if defined(__cplusplus)
extern "C" {
#endif /* __cplusplus */

callback_function_declaration()

#if defined(__cplusplus)
}
#endif /* __cplusplus */

This is obviously a mistake, because the fsl_mcan driver is written in C, and functions are linked differently in C than in C++ (C++ mangles function names).

Unluckily, the call works with a single IRQ level, which left me a little lost. But if the function call happens within a second (nested) IRQ level, I guess something gets corrupted on the IRQ stack, and then you get the HardFault.

Once I added the extern "C" directive, everything worked well.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by dragilla on Sun Jan 29 07:47:46 MST 2012
Ok. I think I sorted out the hard fault - there was an error in the UartReceive function that could result in memory being overwritten. I no longer get SIGSTOPs or hard faults.

But I still get an error which is strange to me... but I think I will describe it in a new thread.

edit: the new thread is here

lpcware · ‎06-15-2016

Content originally posted in LPCWare by dragilla on Sat Jan 28 10:11:35 MST 2012
SIGSTOP again... so it was real. No useful stacktrace information available.


TCL Debug [C/C++ MCU Application]
MCU GDB Debugger (1/28/12 4:58 PM) (Suspended)
Thread [1] (Suspended: Signal 'SIGSTOP' received. Description: Stopped (signal).)
2 <symbol is not available> 0x1fff0080
1 <symbol is not available> 0xffffffff
arm-none-eabi-gdb (1/28/12 4:58 PM)
/home/luke/Desktop/moje/work/workspace/TCL/Debug/TCL.axf (1/28/12 4:58 PM)

Disassembly at 0xffffffff - Unable to retrieve disassembly data from backend.
Disassembly at 0x1fff0080 - below

[...]
1fff0074:                    movs r0, r0
1fff0076:                    movs r0, r0
1fff0078:                    movs r0, r0
1fff007a:                    movs r0, r0
1fff007c:                    movs r0, r0
1fff007e:                    movs r0, r0
1fff0080:                    ldr.w r4, [pc, #24]     ; 0x1fff009c
1fff0084:                    ldr.w r5, [pc, #16]     ; 0x1fff0098
1fff0088:                    ldr r6, [r4, #0]
1fff008a:                    and.w r6, r5, r6
1fff008e:                    str r6, [r4, #0]
1fff0090:                    ldr.w pc, [pc]  ; 0x1fff0094
1fff0094:                    lsls r1, r0, #8
1fff0096:                    subs r7, r7, #7
1fff0098:                    itttt <und>
1fff009a:                     ; <UNDEFINED> instruction: 0xffffc3c0
1fff009e:                    and<und> r7, r1
1fff00a0:                    mov<und> r0, r0
1fff00a2:                    mov<und> r0, r0
1fff00a4:                    movs r0, r0
1fff00a6:                    movs r0, r0
1fff00a8:                    movs r0, r0
1fff00aa:                    movs r0, r0
1fff00ac:                    movs r0, r0
[...]

I can't make anything out of this... Is this a hard fault anyway? I only see 'suspended' in the debug console...

I will try this maybe (Extending the hard fault handler): http://support.code-red-tech.com/CodeRedWiki/DebugHardFault

Where is the 'Core Register view'? Also I see no 'VECTPC'
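
For readers who cannot reach the linked page: the usual way to extend a hard fault handler on a Cortex-M3 is to look at the eight words the core pushes on the stack at exception entry. The sketch below shows only that decoding step. It is not taken from the linked wiki; fault_frame_t and decode_fault_frame are hypothetical names, and the small assembly stub a real HardFault_Handler needs in order to pass the right stack pointer (MSP or PSP) is left out.

```c
#include <stdint.h>

/* Registers the Cortex-M3 pushes on exception entry, lowest address first. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* link register of the interrupted code */
    uint32_t pc;    /* address of the instruction that was interrupted or faulted */
    uint32_t xpsr;
} fault_frame_t;

/* Hypothetical helper: copy the stacked words into a struct that is easy
 * to inspect in the debugger's variables view. */
static fault_frame_t decode_fault_frame(const uint32_t *stacked_sp)
{
    fault_frame_t frame;

    frame.r0   = stacked_sp[0];
    frame.r1   = stacked_sp[1];
    frame.r2   = stacked_sp[2];
    frame.r3   = stacked_sp[3];
    frame.r12  = stacked_sp[4];
    frame.lr   = stacked_sp[5];
    frame.pc   = stacked_sp[6];
    frame.xpsr = stacked_sp[7];
    return frame;
}
```

A breakpoint placed after the call lets you read frame.pc and look it up in the disassembly or map file, even when the debugger cannot unwind the call stack on its own.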

lpcware · ‎06-15-2016

Content originally posted in LPCWare by dragilla on Sat Jan 28 08:54:08 MST 2012
I think I've isolated the problem. My software has now run for over 1.5 hours with no faults, so I hope I know where the problem is. Knowing where is not yet knowing what, though...

As I wrote before, I have two interrupts, UART0 and RIT. The priority of the UART0 interrupt is higher. The RIT ISR calls a function that is most probably interrupted by the UART0 interrupt, which is OK theoretically...
The place where it could be causing trouble is in my main_loop (called from the RIT ISR), where I read the contents of the UART0 buffer (which is filled by the UART0 ISR). So if the UART0 ISR kicks in while the RIT ISR is in the middle of reading the buffer, something happens...

This is a theory, because now I turned on both the interrupts and only commented out reading of the buffer in the RIT ISR and it works for almost 2 hours now without any problems.

I have no idea what could be happening inside to mess with the memory...
Again, theoretically, if one function writes to the memory that the other reads, nothing bad should happen... I use the UART interrupt example from the cmsis1.3 library with minor changes. In the RIT ISR (main_loop) I call UARTReceive, and in the UART0 ISR UART_IntReceive is called.

Any suggestions as what can be happening and how to fix it?

I had an idea that giving those 2 interrupts the same priority might help, but then, since the RIT ISR takes a lot of time, I might miss some data from UART0...
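
A common way to share a receive buffer between a high-priority ISR that writes and a lower-priority context that reads is a single-producer, single-consumer ring buffer in which each index has exactly one writer. The sketch below is not the code from the cmsis1.3 UART example; ring_buffer_t, rb_put and rb_get are hypothetical names, and it assumes a single core where aligned 32-bit loads and stores are atomic, as on the Cortex-M3.

```c
#include <stddef.h>
#include <stdint.h>

#define RB_SIZE 64u   /* hypothetical size; must be a power of two */

typedef struct {
    volatile uint32_t head;          /* written only by the producer (UART ISR) */
    volatile uint32_t tail;          /* written only by the consumer (reader)   */
    volatile uint8_t  data[RB_SIZE];
} ring_buffer_t;

/* Producer side: returns 1 if the byte was stored, 0 if the buffer is full. */
static int rb_put(ring_buffer_t *rb, uint8_t byte)
{
    uint32_t head = rb->head;
    uint32_t next = (head + 1u) & (RB_SIZE - 1u);

    if (next == rb->tail) {
        return 0;                    /* full: drop the byte, never overwrite */
    }
    rb->data[head] = byte;
    rb->head = next;                 /* publish only after the data is stored */
    return 1;
}

/* Consumer side: copies at most max_len bytes and returns how many were copied. */
static size_t rb_get(ring_buffer_t *rb, uint8_t *dst, size_t max_len)
{
    size_t   count = 0;
    uint32_t tail = rb->tail;

    while (count < max_len && tail != rb->head) {
        dst[count++] = rb->data[tail];
        tail = (tail + 1u) & (RB_SIZE - 1u);
    }
    rb->tail = tail;                 /* release the slots in one store */
    return count;
}
```

Because the reader never modifies head and the writer never modifies tail, the UART ISR can preempt rb_get at any instruction without corrupting the indices, and the max_len argument keeps the copy inside the destination buffer.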

lpcware · ‎06-15-2016

Content originally posted in LPCWare by dragilla on Sat Jan 28 05:31:02 MST 2012
This is getting really odd. I disabled the UART interrupt for a test, so only one interrupt was active. After an hour or so I looked at the development environment and saw that SIGSTOP had been received. It's as if I had suspended the run, but I hadn't touched anything. There was also no disassembly or stack trace information. When I resumed the program, the application restarted...

Edit:
Let's pretend the above didn't happen - some external factors (namely my kids) might have influenced the process - sorry about that :)

It seems that disabling the UART interrupt stops the problem from happening. But I don't understand why. The UART ISR is extremely short. It has the highest priority, so, as I understand it, it can start whenever it needs to, including in the middle of main_loop. What I don't understand is how it would break anything. It only writes some data to one place in memory: a buffer.

I must misunderstand some basic concept here or I'm missing a detail somewhere...

lpcware · ‎06-15-2016

Content originally posted in LPCWare by dragilla on Sat Jan 28 04:07:48 MST 2012
Well, I still can't see how that index could grow so much. I do check that it is not bigger than the array size. I basically do i % size_of_array_minus_1... (actually I see now that the _minus_1 part is not necessary...)
And I don't see how my memory would exceed the limits of lpc as I don't use dynamic memory. Everything is statically defined at the beginning... no extra variables in if's or functions.

I'm not sure you want me to bother you with all my code; it's pretty big now. The RIT ISR is indeed big. I'm going to switch to FreeRTOS eventually, but for now I want to understand interrupts and priorities exactly, so I can use them properly.
I can paste fragments here that I think are most important. If you want/need to see some part please tell me and I will put it here.

For now I think it's important to say that I have two ISRs right now: one RIT ISR (the fragment is posted below) and one UART receive ISR.
The priorities of interrupts I defined as:

    // interrupt priorities
    #define RITINT_PRIORITY 20
    #define UART0INT_PRIORITY 15

[...]

    NVIC_SetPriority(UART0_IRQn, UART0INT_PRIORITY);//((0x01<<3)|0x01));
    NVIC_EnableIRQ(UART0_IRQn);

[...]

    NVIC_SetPriority(RIT_IRQn, RITINT_PRIORITY);
    NVIC_EnableIRQ(RIT_IRQn);

So the uart has higher priority.
Besides all that I have while(1) loop in the main function - which I think should be executed in free time between interrupts. It does virtually nothing right now - bunch of if's and one flag setting for the main_loop interrupt.
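
To double-check which of the two interrupts can preempt the other, it helps to see how a priority number ends up in the NVIC. The sketch below assumes the LPC17xx implements 5 priority bits (levels 0 to 31, where a lower number means higher urgency) and that the priority grouping is left at its default, so every bit counts towards preemption. nvic_priority_byte and can_preempt are hypothetical helpers for illustration, not CMSIS functions.

```c
#include <stdint.h>

/* Assumption: 5 implemented priority bits, stored in the top bits of an 8-bit field. */
#define PRIO_BITS_ASSUMED 5u

/* Value that ends up in the interrupt's 8-bit priority field. */
static uint8_t nvic_priority_byte(uint32_t priority)
{
    return (uint8_t)((priority << (8u - PRIO_BITS_ASSUMED)) & 0xFFu);
}

/* Returns 1 if an interrupt with prio_a may preempt a running handler with prio_b. */
static int can_preempt(uint32_t prio_a, uint32_t prio_b)
{
    /* Lower numeric value wins; equal priorities never preempt each other. */
    return nvic_priority_byte(prio_a) < nvic_priority_byte(prio_b);
}
```

Under these assumptions priority 15 is stored as 0x78 and priority 20 as 0xA0, so the UART0 handler can interrupt the RIT handler but not the other way round.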

It would be really great if someone could help me get through this.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Rob65 on Sat Jan 28 01:42:07 MST 2012
It's a bit hard without seeing all the code, but there are a few things that prompt me to answer this.

It seems like you are prepared for interrupt overruns - this means that there could be a lot of code in your main_loop() - which may not be a good thing to do.
Looking at the code section from your main_loop() I see at least fabs(), ceil() and some (floating point ?) divisions and multiplications.

These are all software functions that take some time to execute.
Is it really necessary to do all of this inside an interrupt routine - and are you sure you do need this to be floating point math? Most of the time I am able to solve my floating point problems using fixed point 64 bits integer math.

Ah ... just located the "HARD FAULT HERE" in your original post.
Which of course is logical with a large index.
When calculating indexes, always make sure that your index is within range before using it. Although this does not fix your problem with the strange math result, it does prevent the hard fault.

If you single step through the assembly code you will see how the index and the base address of the array are used to calculate the final address that will be used.

Let's say you have an array of uint32_t that starts at address 0x01000040.
Now if you use an index of 12000, the final address being read or written is:
0x01000040 + (4 * 12000) = 0x01000040 + 0xbb80 = 0x0100bbc0.
Since this is not within memory available to the lpc1769, this will result in a hard fault interrupt.

Worse ... if the computed address does land in valid memory, you will get an address that is not within your array but somewhere else in memory, overwriting another variable that is in use.

Rob
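
The address arithmetic and the range check described above can be sketched as follows. element_address and store_checked are hypothetical helper names, and the element type int32_t is an assumption, since the declaration of pulses_front is not shown in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Address the CPU will access for array[index], given the array's base address. */
static uint32_t element_address(uint32_t base, uint32_t index, uint32_t elem_size)
{
    return base + index * elem_size;
}

/* Store value at array[index] only if the index is valid.
 * Returns 1 on success, 0 if the index was rejected. */
static int store_checked(int32_t *array, size_t length, int32_t index, int32_t value)
{
    if (index < 0 || (size_t)index >= length) {
        return 0;   /* out of range: report it instead of writing */
    }
    array[index] = value;
    return 1;
}
```

element_address(0x01000040, 12000, 4) gives 0x0100bbc0, matching the calculation above, and store_checked rejects an index such as -2128056287 instead of writing through it.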

lpcware · ‎06-15-2016

Content originally posted in LPCWare by dragilla on Sat Jan 28 01:07:50 MST 2012
After some checks I still don't know what's causing the problem.
I'm using a 'not typical' interrupt handling in my program, but I don't understand how would that hurt the memory...
Please help me understand.

My RIT_IRQHandler() looks like this:
[code]
void RIT_IRQHandler (void)
{
    LPC_RIT->RICTRL |= (0x1<<0);    /* clear interrupt flag */
    //rit_timer_counter++;

    if(main_loop_working) return;   /* skip this tick if main_loop() is still running */
    main_loop_working = 1;
    main_loop();
    main_loop_working = 0;
    return;
}
[/code]
The code causing the hard fault is in the main_loop();
Also, the error happens randomly, as is typical with memory problems. Sometimes the application can work for an hour without any errors. Sometimes it crashes after a few minutes.
I also noticed that sometimes there is no hard fault, but the code gets stuck in the I2C code, waiting for a response from my IMU board.

If you need any more clarification please ask.

PS: I'm pretty sure that main_loop finishes before the next rit isr, as it only takes about 2ms to run it and rit timer isr is called every 10ms. But what would happen if it didn't finish on time?
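
On the closing question: on a Cortex-M core a handler cannot be preempted by an interrupt of the same priority, and that includes itself. If main_loop() overran, the next RIT interrupt would stay pending and run right after the handler returns, and further ticks arriving in the meantime would be merged into that single pending one, so the main_loop_working guard would never be seen as set inside the handler. One way to make overruns visible is to only count ticks in the ISR and do the work in the background loop. This is a minimal sketch under that assumption; every name in it is hypothetical, and clearing the RIT interrupt flag is left out.

```c
#include <stdint.h>

static volatile uint32_t ticks_seen;   /* written only by the timer ISR       */
static uint32_t ticks_handled;         /* written only by the background loop */
static uint32_t work_runs;             /* how often the example work has run  */

/* Stand-in for the body of RIT_IRQHandler: just count the tick. */
static void timer_tick_isr(void)
{
    ticks_seen++;
}

/* Stand-in for main_loop(). */
static void example_work(void)
{
    work_runs++;
}

/* Called from while(1): runs the work once if any tick is pending.
 * Returns the number of ticks consumed; a value above 1 means an overrun. */
static uint32_t service_ticks(void (*work)(void))
{
    uint32_t pending = ticks_seen - ticks_handled;   /* safe across wrap-around */

    if (pending == 0u) {
        return 0u;
    }
    work();
    ticks_handled += pending;
    return pending;
}
```

Each counter has a single writer, so no interrupt masking is needed. The trade-off is that the work now runs at background priority and can be delayed by every interrupt, not only by UART0.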

lpcware · ‎06-15-2016

Content originally posted in LPCWare by dragilla on Tue Jan 24 19:11:35 MST 2012
Ok. I think it's an interrupt issue. Never mind the question.