Bug: Data "corruption" when using detached tasks for CGI or SSI

bowe · ‎05-15-2015

We are using MQX for Kinetis SDK 1.1. We think we have found a bug in the CGI handling. I have searched the forums a little, and haven’t really seen anything similar. In looking at the code for MQX for Kinetis SDK 1.2, it looks like this bug has not been fixed.

We finally saw this bug when we went to upload a relatively large file (28 KB) to a CGI endpoint, where the handling function was run in its own task (which I will refer to as the handling task, which is distinct form the CGI handler task). Essentially what was happening was the is the CGI handler was not waiting for the handling task to finish, and would strip the HTTP data out (assuming it was left over data that the handling task didn’t process) before the handling task could receive it. If the handling function was run in the CGI handler task, then it was a function call for the CGI handler instead of a task create, so it would wait for the handling function to finish. On a small file, the handling task could grab the data quickly before the CGI handler trashed it. So we only saw this issue on (relatively) larger files/data and when the handling function was run in its own task.

We have found a solution that seems to work. We created a function modeled after RTCS_task_create() that would create the task and wait for it to complete (unsurprisingly called RTCS_task_create_and_wait()). Instead of using a semaphore and waiting for the child (which is the handling task) to unblock the parent (the CGI handler task), we had the parent watch for the child to terminate. We could not find an MQX function to wait for a task to complete or terminate, so we mimicked that behavior by waiting for the child task descriptor to no longer exist or to list a different parent (in the case that a new task is spawned from a different parent task with the same task ID shortly after our child task terminates). We chose this method over the semaphore method because we felt it was more robust to the child task crashing, which we were concerned would probably cause the CGI handler task to deadlock in the semaphore based approach.

The following function prototype was added to rtcs25x.h:

extern uint32_t RTCS_task_create_and_wait
(
   char          *name,
   uint32_t           priority,
   uint32_t           stacksize,
   void (_CODE_PTR_  start)(void *, void *),
   void             *arg
);

And the following definition was added to rtcstask.c:

/*FUNCTION*-------------------------------------------------------------
*
* Function Name   : RTCS_task_create_and_wait
* Returned Values : uint32_t (error code)
* Comments        :
*     Create a task and wait for it to finish running.
*
*END*-----------------------------------------------------------------*/
uint32_t RTCS_task_create_and_wait
   (
      char          *name,
      uint32_t           priority,
      uint32_t           stacksize,
      void (_CODE_PTR_  start)(void *, void *),
      void             *arg
   )
{ /* Body */
   TASK_TEMPLATE_STRUCT    task_template;
   struct rtcs_task_state  task;
#if (RTCSCFG_ENABLE_ASSERT_PRINT==1) ||  (RTCSCFG_ENABLE_ASSERT==1)
  /* for TCP/IP task we bypass this check as for this one */
  /* priority = _RTCSTASK_priority */
  if(TRUE == _RTCS_initialized)
  {
    RTCS_ASSERT(priority>_RTCSTASK_priority);
  }
#endif
   RTCS_sem_init(&task.sem);
   task.start = start;
   task.arg   = arg;
   task.error = RTCS_OK;
   _mem_zero((unsigned char *)&task_template, sizeof(task_template));
   task_template.TASK_NAME          = name;
   task_template.TASK_PRIORITY      = priority;
   task_template.TASK_STACKSIZE     = stacksize;
   task_template.TASK_ADDRESS       = RTCS_task;
   task_template.CREATION_PARAMETER = (uint32_t)&task;
   _task_id myTID = _task_get_id();
   _task_id childTID = _task_create(0, 0, (uint32_t)&task_template);
   if (childTID == MQX_NULL_TASK_ID) {
      RTCS_sem_destroy(&task.sem);
      return RTCSERR_CREATE_FAILED;
   } /* Endif */
   // Wait until its out on its own
   RTCS_sem_wait(&task.sem);
   RTCS_sem_destroy(&task.sem);
    // Watch it die
    TD_STRUCT_PTR childTD = (TD_STRUCT_PTR) _task_get_td(childTID);
    while(childTD != NULL)
    {
        // Paternity Test:  If this task is no longer my child, abandon it
        if(myTID != childTD->PARENT)
        {
            break;
        }
        _sched_yield();
        childTD = (TD_STRUCT_PTR) _task_get_td(childTID);
    }
   return task.error;
} /* Endbody */

Then RTCS_task_create() was changed to RTCS_task_create_and_wait() in the function httpsrv_detach_script() in httpsrv_script.c:

/*
** Detach script processing to separate task
**
** IN:
**      HTTPSRV_SESSION_STRUCT* session - session structure pointer.
**      HTTPSRV_STRUCT *server - pointer to server structure (needed for session parameters).
**
** OUT:
**      none
**
** Return Value: 
**      none
*/
void httpsrv_detach_script(HTTPSRV_DET_TASK_PARAM* task_params)
{
    _mqx_uint  error;
    _mqx_uint  priority;
    error = _task_get_priority(MQX_NULL_TASK_ID, &priority);
    if (error != MQX_OK)
    {
        return;
    }
    error = RTCS_task_create_and_wait(HTTPSRV_DETACHED_SCRIPT_TASK_NAME, priority, task_params->stack_size, httpsrv_detached_task, (void *)task_params);
    if (error != MQX_OK)
    {
        return;
    }
}

Like I said, this seems to work for us, but is probably not the best or most eloquent solution. I am certainly open for suggestions on improvements.

-Bowe

bowe · ‎12-30-2015

This bug is still present in the RTCS of MQX for KSDK 1.3. The original solution we came up with is not nearly as simple as our current solution. Our current solution involves splitting out the cleanup code in httpsrv_process_cgi() into another function called httpsrv_cleanup_cgi(), (these are in httpsrv_script.c; of course a function prototype for httpsrv_cleanup_cgi() needs to be added to httpsrv_script.h). Here is what it looks like after splitting it:

/*
** Function for CGI request processing
**
** IN:
**      HTTPSRV_SESSION_STRUCT* session - session structure pointer.
**      HTTPSRV_STRUCT*         server - pointer to server structure (needed for session parameters).
**      char*                   cgi_name - name of cgi function.
**
** OUT:
**      none
**
** Return Value: 
**      none
*/
void httpsrv_process_cgi(HTTPSRV_STRUCT *server, HTTPSRV_SESSION_STRUCT *session, char* cgi_name)
{
    HTTPSRV_SCRIPT_MSG message;

    RTCS_ASSERT(session != NULL);
    RTCS_ASSERT(server != NULL);
    RTCS_ASSERT(cgi_name != NULL);

    message.session = session;
    message.type = HTTPSRV_CGI_CALLBACK;
    message.name = cgi_name;
    message.ses_tid = _task_get_id();
    _lwmsgq_send(server->script_msgq, (_mqx_max_type*) &message, LWMSGQ_SEND_BLOCK_ON_FULL);
    /* wait until CGI is processed */
    _task_block();

    return;
}

void httpsrv_cleanup_cgi(HTTPSRV_SESSION_STRUCT *session)
{
    /*
    ** There is some unread content from client after CGI finished. 
    ** It must be read and discarded if we have keep-alive enabled
    ** so it does not affect next request.
    */
    if (session->request.content_length)
    {
        char *tmp = NULL;

        tmp = _mem_alloc(HTTPSRV_TMP_BUFFER_SIZE);
        if (tmp != NULL)
        {
            uint32_t length = session->request.content_length;

            while(length)
            {
                int32_t retval;

                retval = httpsrv_read(session, tmp, HTTPSRV_TMP_BUFFER_SIZE);
                if ((retval == 0) || (retval == RTCS_ERROR))
                {
                    break;
                }
                length -= retval;
            }
            _mem_free(tmp);
            session->request.content_length = 0;
        } 
    }
    return;
}

And then what httpsrv_scipt.h looks like with the prototype added:

/**HEADER********************************************************************
* 
* Copyright (c) 2013 Freescale Semiconductor;
* All Rights Reserved                       
*
*************************************************************************** 
*
* THIS SOFTWARE IS PROVIDED BY FREESCALE "AS IS" AND ANY EXPRESSED OR 
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES 
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  
* IN NO EVENT SHALL FREESCALE OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, 
* INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, 
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 
* IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF 
* THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************
*
* $FileName: httpsrv_prv.h$
*
* Comments:
*
*   Header for HTTPSRV.
*
*END************************************************************************/

#ifndef HTTPSRV_SCRIPT_H_
#define HTTPSRV_SCRIPT_H_


#include "httpsrv_prv.h"

#ifdef __cplusplus
extern "C" {
#endif

uint32_t httpsrv_wait_for_cgi(HTTPSRV_SESSION_STRUCT *session);
void httpsrv_detach_script(HTTPSRV_DET_TASK_PARAM* task_params);
HTTPSRV_FN_CALLBACK httpsrv_find_callback(HTTPSRV_FN_LINK_STRUCT* table, char* name, uint32_t* stack_size);
void httpsrv_call_cgi(HTTPSRV_CGI_CALLBACK_FN function, HTTPSRV_SCRIPT_MSG* msg_ptr);
void httpsrv_call_ssi(HTTPSRV_SSI_CALLBACK_FN function, HTTPSRV_SCRIPT_MSG* msg_ptr);
void httpsrv_process_cgi(HTTPSRV_STRUCT *server, HTTPSRV_SESSION_STRUCT *session, char* cgi_name);
void httpsrv_cleanup_cgi(HTTPSRV_SESSION_STRUCT *session);
#ifdef __cplusplus
}
#endif

#endif

And then finally we need to call httpsrv_cleanup_cgi() at the appropriate times, which is in two places in httpsrv_http_process() in httpsrv_task.c. Basically right after httpsrv_wait_for_cgi() is called. Here is the relevant (modified) code in that function (lines 09 & 40 contain the httpsrv_cleanup_cgi() calls):

        case HTTPSRV_SES_END_REQ:
            if (!(session->flags & HTTPSRV_FLAG_IS_KEEP_ALIVE))
            {
                httpsrv_ses_set_state(session, HTTPSRV_SES_CLOSE);
            }
            else
            {
                httpsrv_wait_for_cgi(session);
                httpsrv_cleanup_cgi(session);
                /* Re-init session */
                httpsrv_ses_set_state(session, HTTPSRV_SES_WAIT_REQ);
                if (session->response.file)
                {
                    fclose(session->response.file);
                }
                _mem_zero(&session->response, sizeof(session->response));
                if (session->request.auth.user_id != NULL)
                {
                    _mem_free(session->request.auth.user_id);
                }
                session->request.auth.user_id = NULL;
                session->request.auth.password = NULL;
                session->time = RTCS_time_get();
                session->timeout = HTTPSRVCFG_KEEPALIVE_TO;
                session->flags = HTTPSRV_FLAG_IS_KEEP_ALIVE | HTTPSRV_FLAG_PROCESS_HEADER;
            }
            break;
        case HTTPSRV_SES_CLOSE:
            /*
             * If session did not time out; wait for lock. If waiting
             * for lock timed-out, check session timeout again.
             */
            if (time_interval < session->timeout)
            {
                if (httpsrv_wait_for_cgi(session) == MQX_LWSEM_WAIT_TIMEOUT)
                {
                    break;
                }
            }
            httpsrv_cleanup_cgi(session);
            httpsrv_ses_close(session);
            session->valid = 0;
            break;

In addition, there is a timeout for CGI tasks. We found that if we were transferring a large file (for instance a firmware update), the CGI task would time out and get killed. So we added a an additional_timeout field to the CGI callback table (this additional_timeout gets set to 0 for tasks that run fine with the default timeout). When the CGI callback is found in httpsrv_find_callback(), the additional_timeout is one of the parameters that is filled in and returned (similar to stack_size), or it is zeroed in the case that no callback is returned. Then in httpsrv_script_task() in httpsrv_task.c, after httpsrv_find_callback() is called, the additional_timeout found is added to message.session->timeout. I leave this up to the reader to decide how to implement, as I am not particularly confident this is the best way to solve the issue.

Last note, there does seem to be bug in SOCK_STREAM_recv(), which is in sock_stream.c. I am covering this in another post called: SOCK_STREAM_recv() bug drops data on receive timeout.

View solution in original post

bowe · ‎12-30-2015

This bug is still present in the RTCS of MQX for KSDK 1.3. The original solution we came up with is not nearly as simple as our current solution. Our current solution involves splitting out the cleanup code in httpsrv_process_cgi() into another function called httpsrv_cleanup_cgi(), (these are in httpsrv_script.c; of course a function prototype for httpsrv_cleanup_cgi() needs to be added to httpsrv_script.h). Here is what it looks like after splitting it:

/*
** Function for CGI request processing
**
** IN:
**      HTTPSRV_SESSION_STRUCT* session - session structure pointer.
**      HTTPSRV_STRUCT*         server - pointer to server structure (needed for session parameters).
**      char*                   cgi_name - name of cgi function.
**
** OUT:
**      none
**
** Return Value: 
**      none
*/
void httpsrv_process_cgi(HTTPSRV_STRUCT *server, HTTPSRV_SESSION_STRUCT *session, char* cgi_name)
{
    HTTPSRV_SCRIPT_MSG message;

    RTCS_ASSERT(session != NULL);
    RTCS_ASSERT(server != NULL);
    RTCS_ASSERT(cgi_name != NULL);

    message.session = session;
    message.type = HTTPSRV_CGI_CALLBACK;
    message.name = cgi_name;
    message.ses_tid = _task_get_id();
    _lwmsgq_send(server->script_msgq, (_mqx_max_type*) &message, LWMSGQ_SEND_BLOCK_ON_FULL);
    /* wait until CGI is processed */
    _task_block();

    return;
}

void httpsrv_cleanup_cgi(HTTPSRV_SESSION_STRUCT *session)
{
    /*
    ** There is some unread content from client after CGI finished. 
    ** It must be read and discarded if we have keep-alive enabled
    ** so it does not affect next request.
    */
    if (session->request.content_length)
    {
        char *tmp = NULL;

        tmp = _mem_alloc(HTTPSRV_TMP_BUFFER_SIZE);
        if (tmp != NULL)
        {
            uint32_t length = session->request.content_length;

            while(length)
            {
                int32_t retval;

                retval = httpsrv_read(session, tmp, HTTPSRV_TMP_BUFFER_SIZE);
                if ((retval == 0) || (retval == RTCS_ERROR))
                {
                    break;
                }
                length -= retval;
            }
            _mem_free(tmp);
            session->request.content_length = 0;
        } 
    }
    return;
}

And then what httpsrv_scipt.h looks like with the prototype added:

/**HEADER********************************************************************
* 
* Copyright (c) 2013 Freescale Semiconductor;
* All Rights Reserved                       
*
*************************************************************************** 
*
* THIS SOFTWARE IS PROVIDED BY FREESCALE "AS IS" AND ANY EXPRESSED OR 
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES 
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  
* IN NO EVENT SHALL FREESCALE OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, 
* INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, 
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 
* IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF 
* THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************
*
* $FileName: httpsrv_prv.h$
*
* Comments:
*
*   Header for HTTPSRV.
*
*END************************************************************************/

#ifndef HTTPSRV_SCRIPT_H_
#define HTTPSRV_SCRIPT_H_


#include "httpsrv_prv.h"

#ifdef __cplusplus
extern "C" {
#endif

uint32_t httpsrv_wait_for_cgi(HTTPSRV_SESSION_STRUCT *session);
void httpsrv_detach_script(HTTPSRV_DET_TASK_PARAM* task_params);
HTTPSRV_FN_CALLBACK httpsrv_find_callback(HTTPSRV_FN_LINK_STRUCT* table, char* name, uint32_t* stack_size);
void httpsrv_call_cgi(HTTPSRV_CGI_CALLBACK_FN function, HTTPSRV_SCRIPT_MSG* msg_ptr);
void httpsrv_call_ssi(HTTPSRV_SSI_CALLBACK_FN function, HTTPSRV_SCRIPT_MSG* msg_ptr);
void httpsrv_process_cgi(HTTPSRV_STRUCT *server, HTTPSRV_SESSION_STRUCT *session, char* cgi_name);
void httpsrv_cleanup_cgi(HTTPSRV_SESSION_STRUCT *session);
#ifdef __cplusplus
}
#endif

#endif

And then finally we need to call httpsrv_cleanup_cgi() at the appropriate times, which is in two places in httpsrv_http_process() in httpsrv_task.c. Basically right after httpsrv_wait_for_cgi() is called. Here is the relevant (modified) code in that function (lines 09 & 40 contain the httpsrv_cleanup_cgi() calls):

        case HTTPSRV_SES_END_REQ:
            if (!(session->flags & HTTPSRV_FLAG_IS_KEEP_ALIVE))
            {
                httpsrv_ses_set_state(session, HTTPSRV_SES_CLOSE);
            }
            else
            {
                httpsrv_wait_for_cgi(session);
                httpsrv_cleanup_cgi(session);
                /* Re-init session */
                httpsrv_ses_set_state(session, HTTPSRV_SES_WAIT_REQ);
                if (session->response.file)
                {
                    fclose(session->response.file);
                }
                _mem_zero(&session->response, sizeof(session->response));
                if (session->request.auth.user_id != NULL)
                {
                    _mem_free(session->request.auth.user_id);
                }
                session->request.auth.user_id = NULL;
                session->request.auth.password = NULL;
                session->time = RTCS_time_get();
                session->timeout = HTTPSRVCFG_KEEPALIVE_TO;
                session->flags = HTTPSRV_FLAG_IS_KEEP_ALIVE | HTTPSRV_FLAG_PROCESS_HEADER;
            }
            break;
        case HTTPSRV_SES_CLOSE:
            /*
             * If session did not time out; wait for lock. If waiting
             * for lock timed-out, check session timeout again.
             */
            if (time_interval < session->timeout)
            {
                if (httpsrv_wait_for_cgi(session) == MQX_LWSEM_WAIT_TIMEOUT)
                {
                    break;
                }
            }
            httpsrv_cleanup_cgi(session);
            httpsrv_ses_close(session);
            session->valid = 0;
            break;

In addition, there is a timeout for CGI tasks. We found that if we were transferring a large file (for instance a firmware update), the CGI task would time out and get killed. So we added a an additional_timeout field to the CGI callback table (this additional_timeout gets set to 0 for tasks that run fine with the default timeout). When the CGI callback is found in httpsrv_find_callback(), the additional_timeout is one of the parameters that is filled in and returned (similar to stack_size), or it is zeroed in the case that no callback is returned. Then in httpsrv_script_task() in httpsrv_task.c, after httpsrv_find_callback() is called, the additional_timeout found is added to message.session->timeout. I leave this up to the reader to decide how to implement, as I am not particularly confident this is the best way to solve the issue.

Last note, there does seem to be bug in SOCK_STREAM_recv(), which is in sock_stream.c. I am covering this in another post called: SOCK_STREAM_recv() bug drops data on receive timeout.

isaacavila · ‎06-08-2015

Hi Bowe,

First of all, I will ask you for your code to test this issue (including your solution too) and validate with development team, so they could solve it on next releases.

You comment about you do not prefer semaphore based solution due if child task crashes it will make CGI handler task to crash too, but, what happens if in your solution, your child task crashes? It will not terminate and the while loop will be running always so it will cause CGI handler task to crash too?

Another solution (maybe it could be better) could be to implement a semaphore (or mutex if you desire so) technique, and to avoid child task to get crashed, you can implement a watchdog timer task inside your child task, this way you ensure your child task will not get crashed or loop.

You can find more information about watchdog timer (in mqx) on MQX's reference manual located at: C:\Freescale\KSDK_1.2.0\doc\rtos\mqx, also, attached document explains more about it.

I hope this feedback can help you.

Regards,

Isaac Avila

bowe · ‎07-07-2015

I cannot really include my code, but I can try to write a snippet that exposes the issue. I suspect that just adding a 1 ms delay with _time_delay(1) in a detached CGI task will show the bug. I will reply back if/when I get code that exposes this.

Yes, given your explanation, it would seem that the solution I posted is not as robust against the child process crashing as I though. In my head, I was thinking that if the child task crashed, MQX would remove it from the task queue; however, I do not believe this is the case. In this case, the CGI handler would not crash, but it would never terminate. I suppose an additional check on the child task to make sure it is still running/viable is required.

The watchdog timer inside the child task is certainly an option. I do not like the idea of complicating the child task, and the break in abstraction it causes. I would much rather somehow provide this feature as a wrapper around the child.

-Bowe

Bug: Data "corruption" when using detached tasks for CGI or SSI

Bug: Data "corruption" when using detached tasks for CGI or SSI

Real-Time TCPIP Communications Suite