(Originally titled "RT1050 on SDK 2.6.1 - lwIP and MQTT cause sporadic LWIP_ASSERT" -- see reply below)
I'm currently developing an Ethernet-based application on the RT1050 on a custom board. I have the PHY working, lwIP runs, UDP and TCP work fine, and I have the lwIP MQTT client connecting to both a public test server and to a custom broker running on a PC. All seems well...
... except when I let the client run for a while and come back to find that the client has locked up because lwip hit an assert. It's pretty random... I've seen it happen within five minutes, but the most recent test ran for about 70 minutes before it bugged out. And it's not asserting in the same place. The first time, it hit this assert in tcp_out.c, line 344, tcp_write_checks() called from tcp_receive():
LWIP_ASSERT("tcp_write: no pbufs on queue => both queues empty",
pcb->unacked == NULL && pcb->unsent == NULL);
The second one asserted in tcp_in.c, line 1111, tcp_free_acked_segments() called from tcp_receive():
LWIP_ASSERT("pcb->snd_queuelen >= pbuf_clen(next->p)", (pcb->snd_queuelen >= clen));
I'm running with data cache disabled globally as a workaround for multiple SDK issues (Ethernet, MMC, USB), so data caching shouldn't be an issue. My stacks are big and juicy:
No Handle Base Curr HiWByte State Name
17 800910dc 29 29 732 Blocked Tmr Svc
05 20207fb0 18 18 1388 Blocked led_cpu
06 20208618 18 18 1396 Suspended led_act
10 2020c9c8 14 14 2676 Blocked touch
09 2020bd70 13 13 4788 Blocked tcpip_thread
08 2020a778 12 12 4132 Blocked eth_mgr
07 202092c8 9 9 2068 Blocked gui
14 20217a88 8 8 7844 Blocked mqtt_01
15 2021e980 8 8 7860 Blocked mqtt_02
13 20210ac8 7 7 4900 Blocked mqtt_mgr
12 2020f290 6 6 4196 Suspended login_svr
11 2020de30 5 5 3656 Blocked disc_svr
01 202039e0 4 4 2812 Suspended debug_mgr
03 20206538 3 3 4716 Suspended test
04 20207948 2 2 4140 Running backgrnd
02 20204e00 1 1 4956 Ready system
16 80091064 0 0 760 Ready IDLE
OS heap: 135552/262144 bytes remaining
System heap: 13208/16384 bytes remaining
Anyone else run into similar issues? Any idea where I should start digging?
As an aside, I believe I fixed a small but critical issue with the MQTT client, which would trigger whenever a message greater than 128 bytes in size would be received. The issue is in line 683 of lwip/src/apps/mqtt/mqtt.c:
LWIP_ASSERT("client->msg_idx < MQTT_VAR_HEADER_BUFFER_LEN", client->msg_idx < MQTT_VAR_HEADER_BUFFER_LEN);
The assert will fail if client->msg_idx is greater than or equal to MQTT_VAR_HEADER_BUFFER_LEN, which by default is 128. But looking further down in the code, the condition of msg_idx being less than or equal to is tested for, which leads me to believe that the above line should actually read:
LWIP_ASSERT("client->msg_idx <= MQTT_VAR_HEADER_BUFFER_LEN", client->msg_idx <= MQTT_VAR_HEADER_BUFFER_LEN);
This seems to be the intended behavior, and it works in my code; leaving it as "less than" and then sending a message greater than (128 - topic length) in size causes an immediate assert.
解決済! 解決策の投稿を見る。
I think the kids today call this a "self-own". I did some digging, and now I'll lay out the case for why the SDK should have its own implementation of LWIP_ASSERT_CORE_LOCKED().
My issue is that I was not calling LOCK_TCPIP_CORE() before calling the lwIP MQTT API functions. I didn't immediately realize that core locking was required for this API; for instance, the lwip sockets API doesn't require explicit locking, as it's all handled inside the API. Because my target was lightly loaded on the network, the target would run fine for some period of time, exchanging messages with the MQTT server, but would eventually get itself out of sync. (Edit: after 24 hours of testing, this was definitely the issue.)
What's useful is that lwIP has a protection against just this very scenario: the macro LWIP_ASSERT_CORE_LOCKED(). This macro is called in just about every location where the lwIP code needs exclusive access to the TCP/IP core functions, including the MQTT API. Example:
void
mqtt_disconnect(mqtt_client_t *client)
{
LWIP_ASSERT_CORE_LOCKED();
LWIP_ASSERT("mqtt_disconnect: client != NULL", client);
/* If connection in not already closed */
if (client->conn_state != TCP_DISCONNECTED) {
/* Set conn_state before calling mqtt_close to prevent callback from being called */
client->conn_state = TCP_DISCONNECTED;
mqtt_close(client, (mqtt_connection_status_t)0);
}
}
The macro is documented and defined by default in lwip/src/include/lwip/opt.h:
/**
* Macro/function to check whether lwIP's threading/locking
* requirements are satisfied during current function call.
* This macro usually calls a function that is implemented in the OS-dependent
* sys layer and performs the following checks:
* - Not in ISR (this should be checked for NO_SYS==1, too!)
* - If @ref LWIP_TCPIP_CORE_LOCKING = 1: TCPIP core lock is held
* - If @ref LWIP_TCPIP_CORE_LOCKING = 0: function is called from TCPIP thread
* @see @ref multithreading
*/
#if !defined LWIP_ASSERT_CORE_LOCKED || defined __DOXYGEN__
#define LWIP_ASSERT_CORE_LOCKED()
#endif
As you can see, if it's not otherwise defined, the macro does nothing. If it had been defined, I'm sure I would have noticed my issue a lot sooner. Thankfully, it's not very difficult to implement it. First, we need a function that will assert if the core is not locked. I put this code in lwip/port/sys_arch.c near the top:
#if LWIP_TCPIP_CORE_LOCKING
extern sys_mutex_t lock_tcpip_core;
void lwip_assert_core_locked(void)
{
/* If the mutex hasn't been initialized yet, then give it a pass. */
if (NULL == lock_tcpip_core)
return;
/* If we're inside an interrupt, then there's no way we can hold the mutex,
* so give it a pass. */
if (xPortIsInsideInterrupt())
return;
/* Ensure that the mutex is currently taken (locked). */
LWIP_ASSERT("TCPIP core is locked", (0 == uxSemaphoreGetCount(lock_tcpip_core)));
}
#endif // LWIP_TCPIP_CORE_LOCKING
Then I added this block of definitions to lwipopts.h:
/* Define options for core locking. */
#define LWIP_MPU_COMPATIBLE 0
#define LWIP_TCPIP_CORE_LOCKING 1
#define LWIP_TCPIP_CORE_LOCKING_INPUT 0
#if LWIP_TCPIP_CORE_LOCKING
#ifdef __cplusplus
extern "C" {
#endif
/* Will assert if the TCP/IP core mutex is initialized and not locked.
* Defined in sys_arch.c. */
void lwip_assert_core_locked(void);
#define LWIP_ASSERT_CORE_LOCKED() do { lwip_assert_core_locked(); } while (0)
#ifdef __cplusplus
}
#endif
#endif // LWIP_TCPIP_CORE_LOCKING
Then I added the appropriate core lock/unlock calls to my code. Example:
err_t MqttClient::Subscribe(const char * topic, MqttQos qos) {
/* Send the subscribe request. */
LOCK_TCPIP_CORE();
err_t err = mqtt_sub_unsub(client_, topic, ToIntegral(qos), CallbackRequest, this, 1);
UNLOCK_TCPIP_CORE();
[...]
}
And that's pretty much it. If you're not sure whether you need to lock the core manually before an API call, just take a peek inside the body of the function. If you see LWIP_ASSERT_CORE_LOCKED(), then you are required to lock the TCP/IP core before calling it, otherwise you'll probably encounter issues like I did.
I'd like to suggest to the NXP SDK team that the SDK, by default, provide an implementation of LWIP_ASSERT_CORE_LOCKED() for example and SDK projects. My implementation is written strictly for (NO_SYS == 0) and a FreeRTOS environment, so writing a more general implementation will be a bit more complex. But it would probably help keep a number of projects on the rails.
Hopefully this is helpful to some of you out there developing your own lwIP projects.
David R.
I think the kids today call this a "self-own". I did some digging, and now I'll lay out the case for why the SDK should have its own implementation of LWIP_ASSERT_CORE_LOCKED().
My issue is that I was not calling LOCK_TCPIP_CORE() before calling the lwIP MQTT API functions. I didn't immediately realize that core locking was required for this API; for instance, the lwip sockets API doesn't require explicit locking, as it's all handled inside the API. Because my target was lightly loaded on the network, the target would run fine for some period of time, exchanging messages with the MQTT server, but would eventually get itself out of sync. (Edit: after 24 hours of testing, this was definitely the issue.)
What's useful is that lwIP has a protection against just this very scenario: the macro LWIP_ASSERT_CORE_LOCKED(). This macro is called in just about every location where the lwIP code needs exclusive access to the TCP/IP core functions, including the MQTT API. Example:
void
mqtt_disconnect(mqtt_client_t *client)
{
LWIP_ASSERT_CORE_LOCKED();
LWIP_ASSERT("mqtt_disconnect: client != NULL", client);
/* If connection in not already closed */
if (client->conn_state != TCP_DISCONNECTED) {
/* Set conn_state before calling mqtt_close to prevent callback from being called */
client->conn_state = TCP_DISCONNECTED;
mqtt_close(client, (mqtt_connection_status_t)0);
}
}
The macro is documented and defined by default in lwip/src/include/lwip/opt.h:
/**
* Macro/function to check whether lwIP's threading/locking
* requirements are satisfied during current function call.
* This macro usually calls a function that is implemented in the OS-dependent
* sys layer and performs the following checks:
* - Not in ISR (this should be checked for NO_SYS==1, too!)
* - If @ref LWIP_TCPIP_CORE_LOCKING = 1: TCPIP core lock is held
* - If @ref LWIP_TCPIP_CORE_LOCKING = 0: function is called from TCPIP thread
* @see @ref multithreading
*/
#if !defined LWIP_ASSERT_CORE_LOCKED || defined __DOXYGEN__
#define LWIP_ASSERT_CORE_LOCKED()
#endif
As you can see, if it's not otherwise defined, the macro does nothing. If it had been defined, I'm sure I would have noticed my issue a lot sooner. Thankfully, it's not very difficult to implement it. First, we need a function that will assert if the core is not locked. I put this code in lwip/port/sys_arch.c near the top:
#if LWIP_TCPIP_CORE_LOCKING
extern sys_mutex_t lock_tcpip_core;
void lwip_assert_core_locked(void)
{
/* If the mutex hasn't been initialized yet, then give it a pass. */
if (NULL == lock_tcpip_core)
return;
/* If we're inside an interrupt, then there's no way we can hold the mutex,
* so give it a pass. */
if (xPortIsInsideInterrupt())
return;
/* Ensure that the mutex is currently taken (locked). */
LWIP_ASSERT("TCPIP core is locked", (0 == uxSemaphoreGetCount(lock_tcpip_core)));
}
#endif // LWIP_TCPIP_CORE_LOCKING
Then I added this block of definitions to lwipopts.h:
/* Define options for core locking. */
#define LWIP_MPU_COMPATIBLE 0
#define LWIP_TCPIP_CORE_LOCKING 1
#define LWIP_TCPIP_CORE_LOCKING_INPUT 0
#if LWIP_TCPIP_CORE_LOCKING
#ifdef __cplusplus
extern "C" {
#endif
/* Will assert if the TCP/IP core mutex is initialized and not locked.
* Defined in sys_arch.c. */
void lwip_assert_core_locked(void);
#define LWIP_ASSERT_CORE_LOCKED() do { lwip_assert_core_locked(); } while (0)
#ifdef __cplusplus
}
#endif
#endif // LWIP_TCPIP_CORE_LOCKING
Then I added the appropriate core lock/unlock calls to my code. Example:
err_t MqttClient::Subscribe(const char * topic, MqttQos qos) {
/* Send the subscribe request. */
LOCK_TCPIP_CORE();
err_t err = mqtt_sub_unsub(client_, topic, ToIntegral(qos), CallbackRequest, this, 1);
UNLOCK_TCPIP_CORE();
[...]
}
And that's pretty much it. If you're not sure whether you need to lock the core manually before an API call, just take a peek inside the body of the function. If you see LWIP_ASSERT_CORE_LOCKED(), then you are required to lock the TCP/IP core before calling it, otherwise you'll probably encounter issues like I did.
I'd like to suggest to the NXP SDK team that the SDK, by default, provide an implementation of LWIP_ASSERT_CORE_LOCKED() for example and SDK projects. My implementation is written strictly for (NO_SYS == 0) and a FreeRTOS environment, so writing a more general implementation will be a bit more complex. But it would probably help keep a number of projects on the rails.
Hopefully this is helpful to some of you out there developing your own lwIP projects.
David R.
Thank you David, I will report this issue to software team.
Regards
Daniel