Hard Fault on branch instruction to _sched_execute_scheduler_internal for MQXLite

1,407 Views
samkreuze
Contributor II

We're using a MKE06Z128VLK4 to control industrial equipment. Our code was built using Kinetis Design Studio (3.2.0) with various Processor Expert (3.0.2.b151214) components and MQXLite (v1.1.1). 

The project has been going well for about a year and a half now with only a few solvable issues with KDS and PE; seems we've found our first issue with MQXLite. Recently we've noticed a hard fault when calling the lwevent_wait_ticks() function from our code. We've caught the hard fault a few times now using the hard fault handler and debugger.

The program counter points to the _sched_execute_scheduler_internal function each time. We're able to trace through the stack and find where our code calls the lwevent_wait_ticks() function. Each time, it's a different call to lwevent_wait_ticks() from a different task. Here is the stack of the task where the fault most recently happened. I included the assembly from the objdump at the relevant locations in flash:

0x200013A8 0001218D
0x200013AC 0000E779
0x200013B0 200004E4
0x200013B4 20001088
0x200013B8 00000000
0x200013BC 00000001
0x200013C0 00000000
0x200013C4 20000058
0x200013C8 04692CEE
0x200013CC 046940B1
0x200013D0 00000000
0x200013D4 FFFFFFFF
0x200013D8 00000001
0x200013DC 00000000

0x200013E0 20001238  <-- stack pointer
0x200013E4 00000000
0x200013E8 0000042D
0x200013EC 0000052D
0x200013F0 200004E4
0x200013F4 0000E779 <--faulting instruction: e774: f003 fcce bl 12114 <_sched_execute_scheduler_internal>
0x200013F8 0000E77A
0x200013FC 00000000
0x20001400 00000001
0x20001404 00000001
0x20001408 1FFFF83C
0x2000140C 200004E4
0x20001410 20001238
0x20001414 2000145C
0x20001418 1FFFF844
0x2000141C 0000E7EF <-- e7ea: f7ff ff79 bl e6e0 <_time_delay_internal>
0x20001420 00000000
0x20001424 00000000
0x20001428 200004E4
0x2000142C 1FFFF83C
0x20001430 20001238
0x20001434 0000C68F <-- c68a: f002 f87d bl e788 <_time_delay_for>
0x20001438 2000145C
0x2000143C 00001770
0x20001440 00000000
0x20001444 200004E4
0x20001448 00001770
0x2000144C 0000C743 <-- c73e: f7ff ff43 bl c5c8 <_lwevent_wait_internal>
0x20001450 00000000
0x20001454 00000001
0x20001458 00000000
0x2000145C 00001770
0x20001460 00000000
0x20001464 00000000
0x20001468 00000000
0x2000146C 00000000
0x20001470 00000000
0x20001474 1FFFF83C
0x20001478 1FFFFB9C
0x2000147C 1FFFF860
0x20001480 00000000
0x20001484 0000B4E1 <-- our code now b4dc: f001 f900 bl c6e0 <_lwevent_wait_ticks>
0x20001488 0000006A

0x2000148C 00000000
0x20001490 00000000
0x20001494 00000000
0x20001498 00000000
0x2000149C 0000B78F <-- b78a: f7ff fe99 bl b4c0 <RunSpeedTask>
0x200014A0 00000000
0x200014A4 0000E201
0x200014A8 00000000
0x200014AC 00000000
0x200014B0 00000000
0x200014B4 00000000
0x200014B8 00000000
0x200014BC 00000000

The link register is 0xfffffffd which indicates we're using the PSP. The PSP is set to 0x200013e0. 
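
For readers following along, the EXC_RETURN reading above can be sketched in C. This is a minimal sketch, assuming the three EXC_RETURN encodings defined for ARMv6-M cores without an FPU (the MKE06Z is a Cortex-M0+ part); the type and function names are hypothetical and not part of MQXLite or Processor Expert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* EXC_RETURN encodings on ARMv6-M (no FPU). */
#define EXC_RETURN_HANDLER_MSP 0xFFFFFFF1u /* return to Handler mode, MSP */
#define EXC_RETURN_THREAD_MSP  0xFFFFFFF9u /* return to Thread mode, MSP  */
#define EXC_RETURN_THREAD_PSP  0xFFFFFFFDu /* return to Thread mode, PSP  */

/* Hypothetical helper type, for illustration only. */
typedef struct {
    bool valid;       /* value is one of the three encodings above */
    bool thread_mode; /* exception was taken from Thread mode */
    bool uses_psp;    /* the exception frame was pushed on the process stack */
} exc_return_info_t;

/* Decode the LR value seen inside a fault handler. */
static exc_return_info_t decode_exc_return(uint32_t lr)
{
    exc_return_info_t info = { false, false, false };

    switch (lr) {
    case EXC_RETURN_HANDLER_MSP:
        info.valid = true;
        break;
    case EXC_RETURN_THREAD_MSP:
        info.valid = true;
        info.thread_mode = true;
        break;
    case EXC_RETURN_THREAD_PSP:
        info.valid = true;
        info.thread_mode = true;
        info.uses_psp = true;
        break;
    default:
        break; /* not an EXC_RETURN value this sketch knows about */
    }
    return info;
}

/* Self-check against the LR value quoted in the post. */
static void check_posted_lr(void)
{
    exc_return_info_t info = decode_exc_return(0xFFFFFFFDu);
    assert(info.valid && info.thread_mode && info.uses_psp);
}
```

With LR = 0xFFFFFFFD the fault was taken from a task (Thread mode), so the stacked registers are found at the PSP rather than the MSP.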

The MMAR, BFAR, and PSR are all 0. The DFSR is 2, indicating we're at a breakpoint (which we are).
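
One way to double-check the dump is to overlay the basic Cortex-M exception frame on the eight words at the PSP. The sketch below assumes that the words at 0x200013E0 through 0x200013FC are that frame (R0, R1, R2, R3, R12, LR, PC, xPSR, in that order); the struct and helper names are hypothetical. Under that assumption the stacked LR is the word at 0x200013F4 and the stacked PC is the word at 0x200013F8, and the stacked xPSR at 0x200013FC reads 0, which matches the PSR value reported above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Basic exception frame pushed by the core on exception entry (no FPU). */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;   /* return address of the interrupted function call */
    uint32_t pc;   /* address the core was executing when the fault hit */
    uint32_t xpsr; /* bit 24 is the Thumb (T) bit */
} exc_frame_t;

#define XPSR_T_BIT (1u << 24)

/* The eight words at 0x200013E0..0x200013FC, copied from the post. */
static const uint32_t posted_words[8] = {
    0x20001238u, 0x00000000u, 0x0000042Du, 0x0000052Du,
    0x200004E4u, 0x0000E779u, 0x0000E77Au, 0x00000000u
};

/* Build a frame from raw stack words. On target one would instead point
 * an exc_frame_t at the PSP value captured in the fault handler. */
static exc_frame_t frame_from_words(const uint32_t *w)
{
    assert(w != NULL);
    exc_frame_t f = { w[0], w[1], w[2], w[3], w[4], w[5], w[6], w[7] };
    return f;
}

/* Cortex-M cores execute Thumb code only, so T should normally be set. */
static bool frame_thumb_bit_set(const exc_frame_t *f)
{
    return (f->xpsr & XPSR_T_BIT) != 0u;
}
```

If the frame really is laid out this way, a stacked xPSR with the T bit clear is worth a closer look, since a Cortex-M0+ faults when it tries to execute with T = 0. That would point toward a corrupted saved context rather than toward the bl instruction itself, but this is only one possible reading of the dump.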

There is plenty of stack left on all tasks and the interrupt stack.

This fault happens at random times when running the equipment (it doesn't happen when it's sitting idle). The scheduler obviously runs many times before this fault with no issues, and so do our tasks that use lwevent_wait_ticks().

We noticed that the MKE06 page doesn't link to any of these tools anymore, so we're not sure what to make of that. At this point in the project, changing RTOSes is not really an option.

Is this a known issue with the MQXLite scheduler?

1 Solution
995 Views
samkreuze
Contributor II

I found an issue with how I was initializing a lightweight message queue. 

#define INPUT_STATE_NUM_MESSAGES NUM_INPUTS + 1
//Size of message struct rounded up to the next 32bit word
#define INPUT_STATE_MSG_SIZE (sizeof(inputState_t) % 4 == 0 ? sizeof(inputState_t) / 4 : \
sizeof(inputState_t) / 4 + 1)

STATIC uint32_t m_inputStateMSGQueue[sizeof(LWMSGQ_STRUCT) / sizeof(uint32_t)
+ (INPUT_STATE_NUM_MESSAGES * INPUT_STATE_MSG_SIZE)];

result = _lwmsgq_init((void *) m_inputStateMSGQueue, INPUT_STATE_NUM_MESSAGES, INPUT_STATE_MSG_SIZE);

The macro INPUT_STATE_NUM_MESSAGES was defined without parentheses, so when I declare m_inputStateMSGQueue, the size of the queue storage is determined by NUM_INPUTS + 1 * INPUT_STATE_MSG_SIZE instead of (NUM_INPUTS + 1) * INPUT_STATE_MSG_SIZE. The array is therefore smaller than the queue that _lwmsgq_init() is told to manage.
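
The precedence problem is easy to reproduce on a host compiler. The sketch below uses made-up values (NUM_INPUTS = 8 and a message size of 3 words are assumptions for illustration, not the project's real numbers) and hypothetical macro names to show how much storage each form of the macro reserves.

```c
#include <assert.h> /* static_assert (C11) */
#include <stddef.h>

#define NUM_INPUTS     8 /* hypothetical value */
#define MSG_SIZE_WORDS 3 /* hypothetical message size in 32-bit words */

/* As in the post: no parentheses around the expansion. */
#define NUM_MESSAGES_BAD  NUM_INPUTS + 1
/* Fixed: the whole expression is parenthesized. */
#define NUM_MESSAGES_GOOD (NUM_INPUTS + 1)

/* Expands to 8 + 1 * 3 = 11 words, because * binds tighter than +. */
static size_t payload_words_bad(void)
{
    return NUM_MESSAGES_BAD * MSG_SIZE_WORDS;
}

/* Expands to (8 + 1) * 3 = 27 words, which is what the queue needs. */
static size_t payload_words_good(void)
{
    return NUM_MESSAGES_GOOD * MSG_SIZE_WORDS;
}

/* A compile-time check like this catches the mismatch during the build. */
static_assert(NUM_MESSAGES_GOOD * MSG_SIZE_WORDS == (NUM_INPUTS + 1) * MSG_SIZE_WORDS,
              "queue storage must hold every message");
```

Note that when the unparenthesized macro is passed on its own as a function argument, as in the _lwmsgq_init() call, it still evaluates to NUM_INPUTS + 1. With these example numbers the queue would be told it has 9 slots of 3 words (27 words) while only 11 words of payload were reserved, so a full queue would write past the end of the array into whatever the linker placed next.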

To see if this was actually the issue, I changed INPUT_STATE_NUM_MESSAGES * INPUT_STATE_MSG_SIZE to 0. I was expecting the lockup to occur as soon as I used the queue, but I have yet to see it.

So I guess I'm left to assume that this is the issue and hope that the lockup does not happen again. 


5 Replies
995 Views
danielchen
NXP TechSupport

Thank you for your update.

Regards

Daniel

0 Kudos
995 Views
samkreuze
Contributor II

I've been testing my commits to determine whether this issue has been there all along and no one had noticed, or whether it was something that I introduced. It's looking like the latter. Unfortunately, it takes a whole day of testing to determine if the issue is present in a single commit, so it has been a long process. I think I have narrowed it down to a change I implemented a while back with a lightweight message queue.

I'm using a queue for a task to handle changing outputs. Previously the code was:

error = _lwmsgq_send((void *) m_OutputCmdQueue, (uint32_t *) cmdMem, LWMSGQ_SEND_BLOCK_ON_FULL);
/* Check the error condition */
if (error != MQX_OK)
{
    LOG(LOG_CRIT_MSG, MSG_SEND, error, "Error sending lwmessage\r\n");
    return 1;
}

I had concerns about the queue filling up, so I changed the code to detect when this happens. It now sends without blocking first, checks whether the return value says the queue is full, and only then logs a warning and retries with blocking:

error = _lwmsgq_send((void *) m_OutputCmdQueue, (uint32_t *) cmdMem, 0);
if (error == LWMSGQ_FULL)
{
    /* If the queue is full, log a warning and try again but block until there is room */
    LOG(LOG_WARNING_MSG, MSG_SEND, error, "Output queue full\r\n");
    error = _lwmsgq_send((void *) m_OutputCmdQueue, (uint32_t *) cmdMem, LWMSGQ_SEND_BLOCK_ON_FULL);
}

if (error != MQX_OK)
{
    LOG(LOG_CRIT_MSG, MSG_SEND, error, "Error sending lwmessage\r\n");
    return 1;
}

This code is executed many times before the hard fault happens. I'm not sure why this would cause the hard fault. Is there anything wrong with this implementation?

I have alternatives if this is indeed the problem but I want to be sure this is actually the problem.

995 Views
samkreuze
Contributor II

Looks like this did not actually solve the hard fault, but it appears to make it happen less often. We're still trying to determine if this hard fault is originating from our code or from the RTOS or perhaps even the silicon.

0 Kudos
995 Views
danielchen
NXP TechSupport

Hi Sam:

I copied your code into my project and ran it on a FrdmK64 board for several hours, but did not see this issue. I have some suggestions.

1. Check for task/interrupt stack overflow. Although you have already checked this, I still suggest you double-check.

2. The other consideration is whether the MQX kernel data is being corrupted, which is difficult to locate. One possible way is to disable one task and see if the fault still happens. If the issue is not seen while a task is disabled (not executing), analyze the code in that task. If you have N tasks, disable only one task each time and check whether the issue is still seen.

3. Since this issue happens only when the equipment is running, I would suggest you identify the task that responds to that equipment and disable that task first.
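
For suggestion 1, a common generic technique is stack painting: fill each stack with a known pattern before the task starts, then later count how many words still hold the pattern. The sketch below is a host-runnable illustration of that idea, assuming a descending stack as on Cortex-M; the fill pattern and function names are hypothetical and are not MQXLite APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN 0xDEADBEEFu /* hypothetical fill value */

/* Fill a stack buffer with the pattern. This must run before the task
 * that owns the stack starts, never on a stack that is in use. */
static void stack_paint(uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL_PATTERN;
    }
}

/* Count the words that were never overwritten. The stack grows downward,
 * so the untouched region starts at the lowest address (index 0). */
static size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    size_t unused = 0;
    while (unused < words && stack[unused] == STACK_FILL_PATTERN) {
        unused++;
    }
    return unused;
}
```

If stack_unused_words() ever returns 0 for a task or interrupt stack, that stack has been used down to its last word and may have overflowed into neighbouring memory. A task that happens to write the pattern value itself can make the count slightly optimistic, so treat the result as an estimate.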

Regards

Daniel

0 Kudos