MQX TCP bug - Lockup on more than one simultaneous open.

取消
显示结果 
显示  仅  | 搜索替代 
您的意思是: 

MQX TCP bug - Lockup on more than one simultaneous open.

4,124 次查看
chrissolomon
Contributor III

Hi,

I have been working on this problem for about a week, and thought I would share my progress.

(Using MQX 4.0.1, but have checked and MQX 4.0.2.2 does not seem to have a fix.)

 

Problem:

I am working on a product with a 3G modem and a K70 based processor. It connects to the internet using PPP over USBOTG, and runs FTP and telnet servers, uses DNS, and sends data to a remote server (over TCP-IP).

We've noticed that if we cycle connection to the cellular network (simulating a customer moving in and out of coverage) we get reliable repeatable lockups (watchdog resets) .

 

Investigation:

I have tracked through RTCS and I have identified what I believe to be the problem:

To open a socket for an outgoing connection a call to socket_connect is made, which in turn calls the connect macro, which calls SOCK_STREAM_connect.

SOCK_STREAM_connect Has a local variable 'parms', which is populated and passed by reference (using RTCSCMD_issue) to TCP_Process_open.

TCP_Process_open stores this pointer in a linked list of pending open requests.

Note - this is a pointer to a local variable, which will go out of scope.

 

Each time a socket is opened, the local variable has the same address, so when the next pointer is stored it creates a loop in the linked list of pending open requests, which leads to an infinite loop in TCP_Return_open.

 

The same problem occurs when opening a socket to listen:

listen -> SOCK_STREAM_listen -> TCP_Process_open

 

And TCP_Process_accept also looks like it suffers from the same vulnerability.

 

Proposed Solution:

Allocate memory in TCP_Process_open and TCP_Process_accept, and copy the data from the passed pointer to the newly allocated variable, then store that in the linked list.

Memory can be freed in the TCP_Return_open function after the call to RTCSCMD_complete.

I've attached a file with my fix - I've done a few hours bench testing, and it seems to solve the problem.

 

Questions:

Are there any problems with the propose solution?

Alternative solutions?

Original Attachment has been moved to: tcp.c.zip

标记 (5)
0 项奖励
回复
11 回复数

2,822 次查看
eukeniurkiaga
Contributor II

Hi,

We are working with MQX 4.2.0.2 and we are suffering the same problem.

is there any official patch to solve this problem? I can not find any, and Chris described perfectly more than one year ago.

Regards,

0 项奖励
回复

2,822 次查看
eukeniurkiaga
Contributor II

Hi, I have bee investigating a little more and I have detected that the problem was resolved in MQX RTOS 4.2.0.1 patch (MQX-5591). However all the changes done in tcp.c in 4.2.0.1 patch , have been undone in MQX RTOS 4.2.0.2 patch. Why? Has it been an errror?

Regards,

0 项奖励
回复

2,823 次查看
franciscolovera
Contributor I

Do you have a copy of the patch you applied?

0 项奖励
回复

2,823 次查看
danielchen
NXP TechSupport
NXP TechSupport

please see the attached file

0 项奖励
回复

2,823 次查看
danielchen
NXP TechSupport
NXP TechSupport

Thank you very much for your feedback. I think the mqx 4.2.0.2 patch should include mqx 4.2.0.1 patch, there maybe a mistake here. You can manually merge the bug fix to your project.

 I am sorry that there will be no new release for MQX 4.2 any more.

MQX v5 is a continuation of MQX 4.2 with commercial license.

Please check below link for more details

MQX™ v5 Software Solutions|NXP 

Regards

Daniel

0 项奖励
回复

2,823 次查看
Martin_
NXP Employee
NXP Employee

There is a problem that when an active open request (connect()) changes state from SYN_SENT to SYN_RECEIVED and the TCB is closed on error (perhaps due to send timeout) the connect request is not removed from the OPENS list. So the TCB closes, but the request stays in the OPENS list, TCP_PARM in next connect() has the same address (as it would be when TCP_PARM is on stack) we have the problem. The same issue would occur if application request closes the connecting socket (connect() in SYN_RECEIVED state).

What can help - make sure the TCB, that is being deallocated, is not left in the OPENS list - lines 62-95:

/*FUNCTION*-------------------------------------------------------------

*

* Function Name   : TCP_Free_TCB

* Returned Values : None.

* Comments        :

*

*  Deallocate TCB; must only be done when TCB has been closed and released.

*

*END*-----------------------------------------------------------------*/

void TCP_Free_TCB

   (

      TCB_STRUCT_PTR       tcb,     /* IN/OUT - TCP context */

      TCP_CFG_STRUCT_PTR   tcp_cfg  /* IN/OUT - TCP layer data */

   )

{ /* Body */

   TCB_STRUCT_PTR    tmp;

  

   if(NULL==tcb) return;

   /*

   ** Free everything which wasn't freed in TCP_Close_TCB() ...

   */

   TCP_Truncate_receive_chunks(tcb, tcp_cfg, tcb->rcvlen); /* delete all chunk nodes */

   if ( tcb->rcvringbuf != NULL ) {

      _mem_free(tcb->rcvringbuf);

      tcb->rcvringbuf = NULL;

   } /* Endif */

   /*

   ** Deallocate TCB itself

   */

   if ( tcb != NULL ) {

      if ( tcb == tcp_cfg->TCBhead ) {

         tcp_cfg->TCBhead = tcb->next;

      } else {

         tmp = tcp_cfg->TCBhead;

         while ( tmp != NULL && tmp->next != tcb ) {

            tmp = tmp->next;

         } /* Endwhile */

         if ( tmp != NULL ) {

            tmp->next = tcb->next;

         } /* Endif */

      } /* Endif */

   } /* Endif */

   if (tcb->SOCKET) {

      ((SOCKET_STRUCT_PTR)tcb->SOCKET)->TCB_PTR = NULL;

   } /* Endif */

   tcb->VALID = TCB_INVALID_ID;

   tcp_cfg->CONN_COUNT--;

#if RTCSCFG_TCP_MAX_HALF_OPEN

    TCB_HALF_OPEN_DROP

#endif

  

   _mem_free(tcb);

  

   /* double check OPENS list.

    * if the tcb is in OPENS list now, just return with error,

    * so that application is aware.

    * normally it should not happen, as accept()/connect() request

    * if removed from OPENS list during ESTABLISHED and FINWAIT_1 states.

    */

    {

       TCP_PARM_PTR  req_ptr;     /* upper layer request */

       TCP_PARM_PTR  prev_ptr;    /* previous upper layer request */

      

       prev_ptr = NULL;

       req_ptr = tcp_cfg->OPENS;

       while (req_ptr != NULL &&

              tcb != req_ptr->TCB_PTR)

       {

          prev_ptr = req_ptr;

          req_ptr = req_ptr->NEXT;

       }

      

       if (req_ptr != NULL)

       {

          if (prev_ptr == NULL)

          {

             tcp_cfg->OPENS = req_ptr->NEXT;

          }

          else

          {

             prev_ptr->NEXT = req_ptr->NEXT;

          } /* Endif */

          req_ptr->NEXT = NULL;

          req_ptr->TCB_PTR = tcb;

          RTCSCMD_complete(req_ptr, RTCSERR_TCP_CONN_RLSD);

       }

    }

} /* Endbody */

0 项奖励
回复

2,823 次查看
Carlos_Musich
NXP Employee
NXP Employee

Hi Chris,

I will try to reproduce this issue and report it to MQX development team.

If you have a simple porject where I can see this behavior it would be very helpful.

Regards,

Carlos

0 项奖励
回复

2,823 次查看
chrissolomon
Contributor III

Hi Carlos,

we have had the originally proposed fix in production from a month or so after my original post.

It works for us.

Glad to see that this bug may finally be getting some attention.

Chris

2,823 次查看
bowe
Contributor III

Our solution...err...(poor) work-around was to mutex the connect function, so that only one tcp connect could be pending at any given time.  Of course, this has the side effect of only allowing one task at a time to perform a connect, which can delay other tasks significantly in the case of a connect timeout.  Not a good work-around, but it seems to work (and perhaps gives a better hint into the problem?).

-Bowe

0 项奖励
回复

2,823 次查看
dianatorres
NXP Employee
NXP Employee

Original Content posted by Manh Nguyen on Feb 4, 2016

Hello Chris,

 

Your solution can fix this bug but it caused a new bug which was described athttps://community.freescale.com/thread/383555

The function connect() always return RTCS_OK even if it fails

 

In RTCS_cmd_issue() function:

    Execute TCP_Process_open => ... => TCP_Return_open

    Function TCP_Return_open() will store error code to tcp_cfg->OPENS->ERROR field

    but due to tcp_cfg->OPENS is pointed to "open_rqst_ptr" therefore error code will be

    stored to "open_rqst_ptr" not "parms" and function RTCS_cmd_issue() always return RTCS_OK

 

 

uint32_t  SOCK_STREAM_connect

{

    TCP_PARM parms;

   

    // parms is passed to TCP_Process_open() through RTCS_cmd_issue()

    return RTCS_cmd_issue(&parms, TCP_Process_open);

}

 

 

void RTCS_cmd_issue(parms_ptr, ...)

{

    // Send message to execute TCP_Process_open, TCP_Return_open

    // and block task until connect success or timeout

    //...

    return parm_ptr->ERROR;

}

 

 

void TCP_Process_open(parms_ptr)

{

    TCP_PARM_PTR         open_rqst_ptr;

   

    // Allocate and copy parms to open_rqst_ptr

    // ...

    tcp_cfg->OPENS = open_rqst_ptr;

}

 

 

void TCP_Return_open(tcb, errnum, tcp_cfg)

{

    TCP_PARM_PTR reg_ptr = tcp_cfg->OPENS;

   

    // RTCSCMD_complete(req_ptr, errnum);

    if (errnum)

    {

        // due to tcp_cfg->OPENS is pointed to "open_rqst_ptr" therefore error code will be

        // stored to "open_rqst_ptr" not "parms" and parms->ERROR is RTCS_OK always

        reg_ptr->ERROR = errnum;

    }

}

 

 

I suggest a other solution: We will allocate memory in SOCK_STREAM_connect() instead of allocate in TCP_Process_open() and free it before return parm_ptr->ERROR. I thought this solution can resolve this issue without cause a new bug but I'm not sure because I can't reproduce this issue.

Can you have a look on my solution?

0 项奖励
回复

2,823 次查看
bowe
Contributor III

I would like to bump this thread.  We are seeing a similar issue in MQX for KSDK 1.1.  We are using client sockets in MQX to connect to a server and push data.  There are multiple tasks that could be performing these pushes, and thus multiple connections.  This occasionally locks up, and when we attach a debugger, we find that the tcp_cfg->OPENS list in TCP_Return_open() consists of two elements that have been linked into a loop (we have yet to see a time where there are more than two elements, but I suspect it could happen with two or more elements).  Therefore, it is in an infinite while loop.

-Bowe

0 项奖励
回复