interniche

wkurzbauer · ‎06-25-2018

Hello,

Please can you tell me the most up-to-date version of Nichelite for ColdFire processors?

I run a DigiButler board (MFC52531) from Elektor magazine, from 2008.

As I have problems with the TCP/IP stack (heap problems), I would like to know whether a newer revision than mine (rev. 3.2) exists.

If so, could you please also give me a link to it?

Thank you

Werner Kurzbauer

wkurzbauer · ‎08-02-2018

Hi,

>Were they in the Demo Example HTTP Server or in the Nichelite code, or both? You can't expect that much from sample code written for an App Note.

The Freescale code has bugs and particularly it appears that it is in fact quick and dirty code...

The Nichelite code seems to be OK except for some minor things (but I do not know how much the code has been amended by Freescale)

for example in tcpin.c:

    #ifdef DOS_RST
   /* DOS_RST - Fix for "Denial of Service (DOS) using RST"
    * An intruder can send RST packet to break on existing TCP
    * connection. It means that if a RST is received in
    * ESTABLISHED state from an intruder, then the connection gets
    * closed between the original peers. To overcome this
    * vulnerability, it is suggested that we accept RST only when
    * the sequence numbers match. Else we send an ACK.
    */
   if ((tiflags & TH_RST) && (tp->t_state == TCPS_ESTABLISHED) &&
   /* original code from Nichelite v3.2 leads to a compile error:
    *    (ti->ti_seq != tp->rcv_nxt))
    */
        (ptcp->th_seq != tp->rcv_nxt))   /* <<<< assumed correction */
   {
      /* RST received in established state and sequence numbers
       * don't match.
       */
      tiflags &= ~TH_RST; /* clear reset flag */
      goto dropafterack;   /* send an ack and drop current packet */
   }
   #endif /* DOS_RST */

>I still prefer my fix, have you tried it?

Yes, in fact simply inserting "emg_web_rewind(length)" in the EWOULDBLOCK case, so that the chunk is resent, works fine, and up until now there have been no crashes. I agree that your fix is better than having a loop within the freescale_http_send function...

>???? The board you're using is 10 years old. The Nichelite stack dates from 2008 or 2009. How come you're having these problems NOW instead of THEN? Have you just started using this? Why not something more modern?

I have some hardware (relays etc.) on the board, and after I programmed the board in 2011 I used it for controlling the shutters. I used a timer to power on the board only during the day, but as the intruder attacks increased, the board got stuck even within one day. So I decided to find the problem, and it was not easy to reopen the project after 7 years... :smileywink:

But I hate to throw things away, and in fact it's fun. With the newer stuff (e.g. ESP-01 etc.) it is much more difficult to "dig into the guts" and find out how things work.....

cheers

Werner

TomE · ‎08-02-2018

> I agree that your fix is better than having a loop within the freescale_http_send function...

Which was also holding data in "scratch_ram_for_send" that could be used by other connections while your one is waiting. Except I suspect that this code only uses a single thread for all sockets, and your change would mean that one blocked connection would block all other current and future connections until it timed out.

> I used the board for controlling the shutters.

The perils of Home Automation. You probably need that thing (and any other "Internet of Things" devices you have there) behind a firewall that blocks all "attackers" for you.

> but I do not know how much the code has been amended by freescale

About this much:

 .._v3.2_MVDH_20110524_CW7.2/src/projects % find NicheLite -type f | xargs grep FSL | wc
    365    4240   41131
 .._v3.2_MVDH_20110524_CW7.2/src/projects % find example -type f | xargs grep FSL | wc 
    202    2257   23870

About 567 times. The fact that the changes were all commented with "//FSL" indicates they mightn't have been using a Version Control System. Or if they were, it wasn't one that made it easy to publish the "tree", like using "git" does.

They have been using git with their i.MX projects, and you can see the progress of the software release for their i.MX53 chips here:

http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/log/?h=imx_2.6.35_maintain

The tree starts with Linux 2.6.35.3 in August 2010. About 1200 patches were then applied on top of that (mainly adding and replacing drivers and CPU-related things) to release three versions in 2011. Then a "maintain" branch was created, and about 150 patches were made on that up to 2013. No more development or fixes since that date.

Fine, except the Multicast didn't work on Ethernet, CSPI didn't work properly, CAN had lots of problems, the CPU IOMUX Pins were programmed wrongly, the PWM glitches, the NAND controller has lots of problems and so on.

https://community.nxp.com/message/593611

https://community.nxp.com/message/522832

You're lucky. You've only got to find and fix bugs in 287 files with 75,000 lines of code, taking 3.4 Megs. The Linux tree has 37,000 files with 15 million lines taking 551 Megs!

Tom

wkurzbauer · ‎07-24-2018

Hi,

I just would like to report that the solution appears to work well. Up until now no heap problem...

Again thank you for your help!

The only problem I have now is that everything works fine as long as I am in my LAN, but if I access the webserver from outside, packets quite frequently get lost and the JavaScript file or the HTML file sent by the webserver (both are about 10 kB) does not arrive correctly.

I checked the received code and found that !within! the sent files, pieces of 1400 bytes (exactly MAX_BYTES_TO_SEND, i.e. one packet) are missing.

Could it be that the maximum segment lifetime (which is currently set to 2 in main.c:

   /* Heap memory saving trick - reduce the time a TCP socket
    * will linger in CLOSE_WAIT state. For systems with limited
    * heap space and a busy web server, this makes a big difference.
    */
   TCPTV_MSL = 2;    /* set low max seg lifetime default */

is too short? Otherwise I do not understand how parts of the code can be missing within a transferred file in a TCP connection which should guarantee a correct file transfer...

Do you have any possible explanation for how this can happen without any error being reported on either the client (browser) side or the server side?

Thank you and cheers

Werner

TomE · ‎07-24-2018

That means the TCP code is thoroughly broken. You should ditch it and replace it with something else. If it can't get this right, then I wouldn't trust it at all. It may be "originally broken" or you may have made some changes to it that caused this problem. So first I'd suggest checking every change you've made to the code from the original that you started with, and also check against that "updated" version.

The tricky bit is needing to understand everything about how TCP is meant to work and how all the code works in order to understand the impact of any changes.

I'd suggest trying to change over to a different TCP/IP stack. Have a look at this one:

https://community.nxp.com/thread/57196

The link at the top of that post (there's 9 full pages in the thread) doesn't work - so look here:

https://sourceforge.net/projects/fnet/

There's an active forum on this stack there too.

Your code is doing something really bad. The whole POINT of TCP (when compared to something like UDP) is that transmission errors are expected and must be handled. Put simply, every packet sent has a sequence number. Put properly, every BYTE sent has a sequence number and the packets have byte sequence ranges. When a packet goes missing somewhere, the missing range is sent again when the remote doesn't ack receipt.

Packets can go missing "in the wild" on the Internet, in intermediate routers or switches, or even in the Ethernet Driver code in the computer sending or receiving. TCP has to handle all of these.

The fact that you're getting a "valid data stream" with a full-sized Ethernet packet's worth of data going missing means that the part of the TCP stack itself that is meant to perform the retries is broken, and is probably retransmitting the wrong data.

I'm guessing that when the stack has to retransmit from "Sequence N" it wrongly skips forward in the data stream and sends the next lot.

Or it could be that for efficiency and working on small memory systems, the stack doesn't copy the data in order to retransmit it, but relies on the data buffer from the user code (the HTTP Server) remaining valid until it has been acknowledged. If it is doing that there has to be some form of "handshake" back to your HTTP stack to tell it when it can reuse the buffer. If that isn't working, your HTTP Server code might be overwriting the buffer with the next lot of data. So it could be the HTTP Server isn't using the TCP interface properly. I'd check that first. Maybe Nichelite requires a "blocking write" for TCP data transmission and you're using a non-blocking call or something?

Debugging this is HARD as you've got to have network errors in order to trigger this so you can debug it.

Debugging this is EASY as all you have to do is to add a small amount of debugging code to the low-level Ethernet driver (or between TCP and Ethernet) to corrupt every 10th packet. Just have it add one to a byte in the data stream so the checksum is now bad, or to make tracing the problem easier, for every TCP packet sent that is bigger than 1000 bytes (so it's in the middle of a data stream), write a string like "#################" at offset 100 or so. You'll be able to see that in a Wireshark dump. Then run Wireshark on the PC running the web browser, do the transfer, look for the "######" and then see what happens with the sequence numbers and data when the packet is retransmitted.
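A minimal sketch of such a fault injector, assuming you can hook a function into the transmit path just before the frame reaches the Ethernet driver. The function name, the constants and the hook point are all hypothetical; nothing here is Nichelite API.

```c
#include <stddef.h>
#include <string.h>

#define FAULT_EVERY_N   10     /* hypothetical: damage every 10th eligible frame */
#define FAULT_MIN_LEN   1000   /* only touch big frames (middle of a data stream) */
#define FAULT_OFFSET    100    /* where the marker is written */

static const char fault_marker[] = "#################";
static unsigned fault_counter;

/* Call on each outgoing frame after the TCP checksum has been computed.
 * Returns 1 if the frame was deliberately damaged, otherwise 0. */
int fault_inject(unsigned char *frame, size_t len)
{
    size_t marker_len = sizeof(fault_marker) - 1;

    if (len <= FAULT_MIN_LEN || len < FAULT_OFFSET + marker_len)
        return 0;                       /* too small, leave it alone */
    if (++fault_counter % FAULT_EVERY_N != 0)
        return 0;                       /* not this one */

    /* Overwrite payload without fixing the checksum, so the receiver
     * discards the segment and the sender has to retransmit it. */
    memcpy(frame + FAULT_OFFSET, fault_marker, marker_len);
    return 1;
}
```

Only frames longer than 1000 bytes are counted, so handshakes and bare ACKs pass through untouched and the damage always lands inside a file transfer.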

???? The board you're using is 10 years old. The Nichelite stack dates from 2008 or 2009. How come you're having these problems NOW instead of THEN? Have you just started using this? Why not something more modern?

Tom

TomE · ‎07-29-2018

> Or it could be that for efficiency and working on small memory systems, the stack doesn't

> copy the data in order to retransmit it, but relies on the data buffer from the user code ...

The Nichelite TCP stack allows both methods according to the documentation. There's a "copy this data" standard socket interface (m_send()) as well as a more complicated "manage the buffer" one (tcp_send()).

Are you using the web server source from here or the previous equivalent:

"7.2 REG_ABI 20110524/CF_Lite_v3.2_MVDH_20110524_CW7.2/src/projects/example/freescale_HTTP_Web_Server/"

if so, then it uses "m_send()" for everything.

BUT it looks like it sets the socket non-blocking:

freescale_http_server.c:  m_ioctl(so, SO_NONBLOCK, NULL); /* make socket non-blocking */

Actually it sets SOME socket as non-blocking. I'm assuming the socket being set above is the one used in the following function, but that's a big assumption that you should check.

That's fine (to set non-blocking), as long as the sending code checks for that, which it does:

I'm guessing this is the code you're using, please check and see if it is:

"7.2 REG_ABI 20110524/CF_Lite_v3.2_MVDH_20110524_CW7.2/src/projects/example/freescale_HTTP_Web_Server/freescale_http.c":

void freescale_http_send_file( int session )
{
    ...
    length = emg_web_read( session, scratch_ram_for_send, MAX_BYTES_TO_SEND );      
    }

    if( length )    //FSL As long as greater than zero bytes, read then send data to client
    {       
        ...
        if( bytes_sent >= 0)
        {
            error_delay = 0;    //FSL Reset the delay counter
        }
        else
        {
            if(( freescale_http_sessions[session].socket->error == EWOULDBLOCK ) ||
               ( freescale_http_sessions[session].socket->error == ENP_RESOURCE ))
               //FSL added test for ENP_RESOURCE system error
            {
                error_delay++;          //FSL Increment delay counter
                if( error_delay > 80 )  //FSL If delay counter expired, change http state to CLOSE
                {
                    freescale_http_sessions[session].state = EMG_HTTP_STATE_CLOSE;                  
                    printf( "\n\nStack failed due to blocking" );
                }
            }
            
            // If socket reported a error, or would block,
            // Give some extra time to sleep.
            tk_sleep( 10 );     //FSL The sleep lets transmit buffers and heap space free up (hopefully)
        }

I think I can see a problem there.

"emg_web_read()" reads data from the file AND ADVANCES THE FILE POINTER. It then tries to send it, and if sent, so far so good. It will then go and send the next lot. If the socket fills up (with a slower-than-local connection), then it returns "EWOULDBLOCK" and the caller has to SAVE that data and send it the next time.

I can't find any code in there that could do that. It doesn't save the data for next time and it doesn't reset the file pointer on the data it just read. It should do the equivalent of an "lseek()", which is "emg_web_rewind()". This function is used in the "token replace" code, but it should also be used on EWOULDBLOCK recovery.

Without that, when the socket blocks it will throw away one whole read, wait a while and then try again with the next lot. It will throw that away too if the data hasn't been acked.
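To make the read / send / rewind idea concrete, here is a small stand-alone sketch. The functions demo_read(), demo_rewind() and demo_send() are hypothetical stand-ins for emg_web_read(), emg_web_rewind() and m_send(); the real signatures may differ, and a real version would also have to cope with partial sends.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4   /* stand-in for MAX_BYTES_TO_SEND */

/* --- stand-ins for the file layer and the socket (assumptions) --- */
static const char *file_data = "ABCDEFGHIJ";
static size_t file_pos;
static int    block_next_send;   /* set to 1 to simulate one EWOULDBLOCK */
static char   wire[32];          /* what the "client" has received */
static size_t wire_len;

static size_t demo_read(char *buf, size_t max)
{
    size_t left = strlen(file_data) - file_pos;
    size_t n = left < max ? left : max;
    memcpy(buf, file_data + file_pos, n);
    file_pos += n;               /* the read advances the file pointer */
    return n;
}

static void demo_rewind(size_t n) { file_pos -= n; }

static int demo_send(const char *buf, size_t n)
{
    if (block_next_send) { block_next_send = 0; return -1; }
    memcpy(wire + wire_len, buf, n);
    wire_len += n;
    return (int)n;
}

/* One pass of the send state. Returns bytes sent, 0 at end of file,
 * or -1 if the socket would block (the chunk is put back for next time). */
int send_next_chunk(void)
{
    char buf[CHUNK];
    size_t n = demo_read(buf, sizeof buf);

    if (n == 0)
        return 0;
    if (demo_send(buf, n) < 0) {
        demo_rewind(n);          /* undo the read so nothing is skipped */
        return -1;
    }
    return (int)n;
}
```

Without the demo_rewind() call, the chunk read during the blocked pass would simply vanish from the stream, which is the "hole" seen in the received files.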

It looks like this code has only been tested on "fast local and perfect networks" and has never run on the "real internet" before. Congratulations for being the first person to try this code that way in about 12 years, or at least to notice that it doesn't work.

The good news in the above is that the Nichelite code seems to be fine. It is just the Freescale Demo code that has this problem.

Tom

wkurzbauer · ‎08-01-2018

Hi,

I was just about to reply to your previous post to report that I had located the problem in the http send function when I noticed your second post, where you came to the same conclusion: the problem lies in the demo code, in that the stack returns EWOULDBLOCK, which is not handled properly..... :smileyhappy:

After setting the http verbose level to 4 I noticed that one packet is sometimes "thrown away", so I simply implemented a loop which repeats sending the chunk until the m_send function no longer returns an error.

However, the problem lies in the timing. The tk_sleep(10) does not seem to have an impact, and therefore I added some delay by outputting some information via the serial connection (implementing a for-loop or using cticks did not work, but I did not investigate the task-switch functionality further):

........
       //FSL send the data to the client
       msg_flag = 0;
       error_delay = 0;
       do
           {
           bytes_sent = m_send( freescale_http_sessions[session].socket, scratch_ram_for_send, length );
#if HTTP_VERBOSE>4
           printf( "\nses %d, bytes %d ", session, bytes_sent );
#endif

           tk_yield( ); //FSL Checks to see if other task(s) need to run

           if( bytes_sent >= 0)
               {
               error_delay = 0;   //FSL Reset the delay counter
               }
           else
               {
#if HTTP_VERBOSE>4
               printf( "Socket error: %d\n",freescale_http_sessions[session].socket->error );
#endif

               if(( freescale_http_sessions[session].socket->error == EWOULDBLOCK ) ||
                  ( freescale_http_sessions[session].socket->error == ENP_RESOURCE )) //FSL added test for ENP_RESOURCE system error
                   {
                   error_delay++;           //FSL Increment delay counter
#if 1
                   if (!msg_flag)   // print the delay on the serial
                       {
                       printf( "socket delay:");
                       msg_flag = 1;
                       }
                   printf("%s",((freescale_http_sessions[session].socket->error == EWOULDBLOCK) ? "#" : "*"));
#endif
                   if( error_delay > 80)   //FSL If delay counter expired, change http state to CLOSE
                       {
                       freescale_http_sessions[session].state = EMG_HTTP_STATE_CLOSE;
                       printf( "Stack failed due to blocking\n" );
                       bytes_sent = 1;   //fake to exit the loop in case of closing
                       }
                   }
               // If socket reported an error, or would block,
               // Give some extra time to sleep.
               tk_sleep( 30 );       //FSL The sleep lets transmit buffers and heap space free up (hopefully)
               }
           }
       while (( bytes_sent < 0));
#if 1
       if (msg_flag)
           printf( "\n");
       msg_flag = 0;
#endif
       }       //end if (length)
   else   //length=0, no data to send so update session control
   {
.....

This solution works, but I have to investigate further the case where the EWOULDBLOCK error persists and the delay counter expires, as in extreme cases the above solution still leads to uncontrolled behaviour (illegal instructions etc.).

I also wonder how this code survived more than 10 years without any correction... :smileywink: But meanwhile I found so many bugs in the code (e.g. handling of too many sessions, keep-alive functionality etc.) that I doubt that Freescale took this project really seriously...

regards

Werner

TomE · ‎08-02-2018

> I also wonder how this code survived more than 10 years without any correction..

The code is worked on, an App Note is written and then people move onto the next product, project, app note. If the bugs aren't reported in the first few months I'd be very surprised if there's anyone assigned to work on it. The code dates from 2006. It was worked on up to 2009 by Freescale, and then by Marc up to 2011. And then by you.

> But meanwhile I found so many bugs in the code

Were they in the Demo Example HTTP Server or in the Nichelite code, or both? You can't expect that much from sample code written for an App Note.

I still prefer my fix, have you tried it?

I think all the "Timeout" stuff is there to try and close a connection that disappeared part way through. If the server has sent any data, then the TCP Timeout (and the signal back through the socket) should handle that. Unless you go non-blocking, and then you have to handle all this complexity yourself (as the code seems to be trying to do). Embedded code and the "OS" it is running under probably can't afford one-thread-per-connection as there may end up being a lot of them.

Tom

wkurzbauer · ‎07-10-2018

Hi,

I modified the code according to your suggestion (setting dropsocket = 1 in case of no callback).

However, this again led to a number of undeleted sockets, reducing the heap after each attack.

To my understanding the callback function freescale_http_cmdcb(int code, M_SOCK so, void * data) only closes and deletes the socket (by executing freescale_http_remove( M_SOCK so )) if it was a "VALID_EMG_SESSION".

If not - the callback function is called without closing the socket and the else-branch setting the variable "dropsocket" to 1 is not executed.

Therefore the socket is not deleted after jumping to the label "drop".

I enabled the debug variable netstat, and after calling the command "sockets" a huge number of sockets having no TCPCB were listed (one can see that the IP addresses come from various attackers...).

In Nichelite rev. 3.2, issuing the command "sockets" also deletes all sockets having no TCPCB:

   for (so = (M_SOCK)msoq.q_head; so; so = so->next)
   {
      ns_printf(pio,"%lx, %u.%u.%u.%u, %u->%u, ", /*(long)*/so,
         PUSH_IPADDR(so->fhost), (so->lport), (so->fport));

      ns_printf(pio,"0x%x, %u, %u", (unsigned)so->so_options,
         (unsigned)so->rcvdq.sb_cc, (unsigned)so->sendq.sb_cc);

      tp = so->tp;
      if(tp)
         ns_printf(pio, ", %ld, %ld, %s\n", tp->snd_una, tp->snd_nxt, tcpstates[tp->t_state]);
      else
      {
         ns_printf(pio, " (no TCPCB)\n");
         m_delsocket(so);  //FSL No TCPCB then delete socket (so)...might want to implement this capability elsewhere    <<<--
      }
   }
   return 0;

Thus, after issuing the command, my heap grew back to nearly its original size (except for one socket in FIN_WAIT_2 state, which seems to linger in the socket list forever without being deleted..).

The problem is when can I (automatically) delete all those sockets having no TCPCB without causing any further troubles?

cheers

Werner

TomE · ‎07-10-2018

> To my understanding the callback function freescale_http_cmdcb ...

Maybe it is wrong or wrongheaded. The TCP "closing" code (in its various flavours) has to signal back to the user that a "request to close" has happened or a close has completed. The socket interface to the "user code" (the http server) doesn't support asynchronous notification of this sort of thing, so the only "signal" is a return code to the "read()", "recv()", "send()" and other calls. The user code should then "close()" the socket.

The underlying way it has to work is documented here, and best shown in the diagram on page 23:

https://tools.ietf.org/html/rfc793

An easier to read version is here:

https://en.wikipedia.org/wiki/Transmission_Control_Protocol#/media/File:Tcp_state_diagram_fixed_new....

From the second of the above, if the other end terminates an "ESTABLISHED" connection by sending "FIN", the stack sends "ACK" and goes to "CLOSE WAIT". Somehow it signals this state to the user, which then performs the "CLOSE"; the stack sends a "FIN" and waits for an "ACK". Note it will stay there "forever" according to the diagram and the original protocol. In practice there has to be a time out here. If the User Code sends a "CLOSE" to "ESTABLISHED", then you have to end up in "TIME_WAIT". If you ask Google "how long is TCP TIME WAIT timeout" it'll tell you "4 minutes".

If you're in "FIN WAIT 2" then it means your user code issued "CLOSE", the stack sent FIN, received the ACK and from "FIN WAIT 2" is waiting for the other end to play nice and send its "FIN", to which this stack will send "ACK". The other end isn't playing nice. Its stack has signaled the event at its end, and is waiting for that user code to call "CLOSE", and it isn't. If it is an "attacker" there's no reason for it to play nice. There's another path not shown on that diagram. If your code has sent data or the FIN it will retry it until it gets the ACK or until there's a very long timeout, at which point it'll abort somehow. But it can legitimately stay in "FIN WAIT 2" "forever". People have added timers to get out of this jail. You'll have to check the Nichelite code to see if it has one of these. For reference, here's something Microsoft did:

https://docs.microsoft.com/en-us/windows-hardware/drivers/network/fin-wait-2-timer

Here's the hooks under Linux:

https://serverfault.com/questions/7689/how-do-i-get-rid-of-sockets-in-fin-wait1-state
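The close sequences described above can be written down as a small transition function. This is only a sketch of the RFC 793 diagram, with made-up enum and function names; it is not how Nichelite structures its code.

```c
typedef enum {
    ST_ESTABLISHED, ST_CLOSE_WAIT, ST_LAST_ACK, ST_FIN_WAIT_1,
    ST_FIN_WAIT_2, ST_CLOSING, ST_TIME_WAIT, ST_CLOSED
} tcp_state;

typedef enum {
    EV_APP_CLOSE,       /* local user code calls close() */
    EV_RCV_FIN,         /* the peer's FIN arrives (and is ACKed) */
    EV_RCV_ACK_OF_FIN,  /* the peer ACKs our FIN */
    EV_TIMEOUT          /* the 2*MSL timer in TIME_WAIT expires */
} tcp_event;

/* Returns the next state; an event with no transition leaves it unchanged. */
tcp_state tcp_close_step(tcp_state s, tcp_event e)
{
    switch (s) {
    case ST_ESTABLISHED:
        if (e == EV_APP_CLOSE) return ST_FIN_WAIT_1;   /* we send FIN */
        if (e == EV_RCV_FIN)   return ST_CLOSE_WAIT;   /* we send ACK */
        break;
    case ST_CLOSE_WAIT:
        if (e == EV_APP_CLOSE) return ST_LAST_ACK;     /* we send FIN */
        break;
    case ST_LAST_ACK:
        if (e == EV_RCV_ACK_OF_FIN) return ST_CLOSED;
        break;
    case ST_FIN_WAIT_1:
        if (e == EV_RCV_ACK_OF_FIN) return ST_FIN_WAIT_2;
        if (e == EV_RCV_FIN)        return ST_CLOSING;
        break;
    case ST_FIN_WAIT_2:
        if (e == EV_RCV_FIN) return ST_TIME_WAIT;
        break;              /* RFC 793 defines no timer for this state */
    case ST_CLOSING:
        if (e == EV_RCV_ACK_OF_FIN) return ST_TIME_WAIT;
        break;
    case ST_TIME_WAIT:
        if (e == EV_TIMEOUT) return ST_CLOSED;
        break;
    case ST_CLOSED:
        break;
    }
    return s;
}
```

Note that FIN_WAIT_2 ignores EV_TIMEOUT here. That is exactly the "jail": unless the stack adds its own timer, only the peer's FIN gets you out.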

> The problem is when can I (automatically) delete all those sockets having no TCPCB without causing any further troubles?

The problem is "how did they get like that?". There's something wrong with the stack or user code that is doing that.

> m_delsocket(so);  //FSL No TCPCB then delete socket (so)...might want to implement this capability elsewhere

When Freescale worked on this code they added at least 704 "FSL" comments as they went about changing the code. So you "might want to implement this capability elsewhere".

This comment might be useful:

//FSL **have newly created so socket have pointer to newly created tcpcb
//FSL Note: there is always a socket/tcpcb pair and they have pointer to each other

That may mean if there's a socket with "so->tp" being NULL then it is safe to delete it. As long as it isn't on a timer queue or something. Make sure the "socket delete" function stops all timers and removes it from all queues.
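One more thing to watch if you delete from inside a loop over the socket list: the "sockets" command code you quoted calls m_delsocket(so) and then advances with "so = so->next", which reads a node that may already have been freed. A sketch of a sweep that avoids this, using a made-up list type rather than the real M_SOCK structures:

```c
#include <stdlib.h>

struct demo_sock {
    struct demo_sock *next;
    void *tp;               /* NULL plays the role of "no TCPCB" */
};

/* Unlinks and frees every node whose tp is NULL.
 * Returns the number of nodes removed. */
int demo_purge(struct demo_sock **head)
{
    struct demo_sock **link = head;   /* points at the pointer to the current node */
    int removed = 0;

    while (*link) {
        struct demo_sock *so = *link;
        if (so->tp == NULL) {
            *link = so->next;         /* unlink while the node is still valid */
            free(so);                 /* never touch so after this */
            removed++;
        } else {
            link = &so->next;         /* keep the node, move on */
        }
    }
    return removed;
}
```

With the real stack the unlinking is presumably done inside m_delsocket(), so the equivalent there would be to copy so->next into a local variable before calling it.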

Tom

wkurzbauer · ‎07-11-2018

Hi,

After studying your hints and advice I finally decided to implement a socket purge function which is called by the task manager every xx seconds (I chose 20 min).

The function removes all sockets without a TCPCB and all sockets in FIN_WAIT_2. On the first call the waiting sockets are only marked (i.e. put in a list); they are finally deleted after the purge function has been called CNT_DOWN more times.

Up until now I was able to keep the heap.... :smileywink:

/* FUNCTION: purge_sockets()
*
* PARAM1: void * pio
*
* RETURNS: 0
*/
#define MAX_STUCK_SOCKETS 5
#define CNT_DOWN 2

typedef struct stuck_socket
   {
   M_SOCK socket;
   int timeout;
   } SOCKET_TIMEOUT_STRUCT;

SOCKET_TIMEOUT_STRUCT stuck_socket_list [MAX_STUCK_SOCKETS];

void purge_socket_list_init (void)
   {
   int i;
   for (i=0;i < MAX_STUCK_SOCKETS; i++ )
       {
       stuck_socket_list[i].timeout = 0;   //initialize list
       stuck_socket_list[i].socket = 0;
       }
   }

int purge_sockets (void * pio)
{
   M_SOCK so;
   M_SOCK next;
   struct tcpcb * tp;
   int i;
   int found;

   ns_printf(pio,"purging sockets...\n");
   if (msoq.q_len == 0)    //nothing to do
   {
      ns_printf(pio,"No TCP sockets...\n");
      return 0;
   }

   for (so = (M_SOCK)msoq.q_head; so; so = next)   // loop through the socket chain
       {
       next = so->next;   // save the link now, so may be deleted below
       tp = so->tp;
       if(tp)   // check TCPCB
           {
           if (tp->t_state == TCPS_FIN_WAIT_2)   //check wait state
               {
               found = 0;
               for (i = 0; i < MAX_STUCK_SOCKETS; i++)
                   {
                   if (stuck_socket_list[i].socket == so)
                       {
                       if (stuck_socket_list[i].timeout == 0)
                           {
                           // print first, so is invalid after m_delsocket()
                           ns_printf(pio,"Socket %lx, IP: %u.%u.%u.%u deleted (FIN_WAIT_2)\n",so,PUSH_IPADDR(so->fhost));
                           m_delsocket(so);
                           stuck_socket_list[i].socket = NULL;
                           }
                       found = 1;
                       break;
                       }
                   }
               if (!found) // not yet in the list -> put stuck socket in the list
                   {
                   for (i = 0; i < MAX_STUCK_SOCKETS; i++)   // find a free list entry
                       {
                       if (stuck_socket_list[i].socket == 0)
                           {
                           stuck_socket_list[i].socket = so;
                           stuck_socket_list[i].timeout = CNT_DOWN;
                           ns_printf(pio,"Socket %lx, IP: %u.%u.%u.%u pending (FIN_WAIT_2)\n",so,PUSH_IPADDR(so->fhost));
                           break;
                           }
                       }
                   // if the list is full the socket is picked up by a later call
                   }
               }
           }
       else
           {
           ns_printf(pio,"Socket %lx, IP: %u.%u.%u.%u deleted (no TCPCB)\n",so,PUSH_IPADDR(so->fhost));
           m_delsocket(so);
           }
       }

   for (i=0;i < MAX_STUCK_SOCKETS; i++ )
       {
       if (stuck_socket_list[i].timeout == 0)   //delete all orphaned timed out sockets from list
           stuck_socket_list[i].socket = 0;
       else
           stuck_socket_list[i].timeout--;       // decrement "waiting" list elements
       }

   return 0;
}

cheers

Werner

TomE · ‎07-11-2018

Congratulations on finding a fix for this.

Could I also suggest that your function count the number of those sockets that should be deleted, and trigger a purge when the number gets too high as well? If your device comes under "sustained attack", then it could run out of memory in a matter of seconds.
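Something along these lines, as a rough sketch. The struct, the threshold value and the function names are invented for illustration; in the real code the two flags would come from "so->tp == NULL" and "tp->t_state == TCPS_FIN_WAIT_2".

```c
#include <stddef.h>

#define PURGE_THRESHOLD 8    /* hypothetical: tune to the heap you can spare */

struct sock_view {
    struct sock_view *next;
    int has_tcpcb;           /* 0 = socket with no TCPCB */
    int in_fin_wait_2;       /* 1 = stuck waiting for the peer's FIN */
};

/* Count the sockets that the purge function would consider removing. */
int count_purge_candidates(const struct sock_view *head)
{
    int n = 0;
    for (; head; head = head->next)
        if (!head->has_tcpcb || head->in_fin_wait_2)
            n++;
    return n;
}

/* Returns 1 if a purge should run now: either the periodic interval has
 * elapsed, or too many dead sockets have piled up since the last purge. */
int purge_due(const struct sock_view *head, unsigned long now,
              unsigned long last_purge, unsigned long interval)
{
    if (now - last_purge >= interval)
        return 1;
    return count_purge_candidates(head) >= PURGE_THRESHOLD;
}
```

The check is cheap enough to call from the task loop far more often than every 20 minutes, since it only walks the list and frees nothing itself.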

Tom

wkurzbauer · ‎06-28-2018

Hi,

I enabled "HEAP_STATS" to get the heap information, but strangely enough, apart from the enabled functionality, every time a web page is loaded from my Freescale server the following message appears 3 times:

Access error -- PC = 0x00006F66

Error on operand write

Apparently the mcf5xxx_exception_handler (void *framep) in mcf5xxx.c is called...

The reported address is in the module TCPIN.C

in the function "tcp_xmit_timer(struct tcpcb * tp)" at address 0x00006F66:

;
; 1384:    TCPT_RANGESET(tp->t_rxtcur,
; 1385:        (short)(((tp->t_srtt >> 2) + tp->t_rttvar) >> 1),
; 1386:        TCPTV_MIN, TCPTV_REXMTMAX);
;
0x00006F4C 0x303C0080               move.w   #128,d0               ; '..'
0x00006F50 0x3F40000A               move.w   d0,10(a7)
0x00006F54 0x7002                   moveq    #2,d0
0x00006F56 0x3F400006               move.w   d0,6(a7)
0x00006F5A 0x202E0074               move.l   116(a6),d0
0x00006F5E 0xE480                   asr.l    #2,d0
0x00006F60 0xD0AE0078               add.l    120(a6),d0
0x00006F64 0xE280                   asr.l    #1,d0
0x00006F66 0x3F400002               move.w   d0,2(a7)        <<<<<<<<<<<<<<<<<<<<<<<<<<<<
0x00006F6A 0x4EB900005ECC           jsr      0x00005ecc            ; 0x00005ecc
0x00006F70 0x48C0                   ext.l    d0
0x00006F72 0x2D400020               move.l   d0,32(a6)
;
; 1387: }
;

I do not know why this exception is only shown when "HEAP_STATS" is enabled, and I do not know whether this is a compiler problem or a code problem. However, it might well be one step towards a solution of the problem...

May I ask whether you have any idea...

cheers

Werner

TomE · ‎06-29-2018

This is all standard debugging. Do you have a debug pod? Can you single-step? Can you put a breakpoint on that instruction and dump the stack and the registers to see what's going on?

You haven't said if you have turned NPDEBUG and/or MEMIO_DEBUG on. Have you?

So this only happens with "HEAP_STATS".

I don't have the Nichelite code easily viewable, but you should do what I would. Which is to look at the sources of the multiple layers of the memory allocator, and then see what CHANGES in that code when you turn the debugs on.

I remember that one of the debugs added "poison" or a "pattern" to freed blocks. That is to see if something is writing to a block after it has been freed. It is likely that your code is READING from a block after it has been freed, and is getting upset because it is reading the "poison" value. This doesn't happen without the debug flags as it is probably reading "Stale, but legitimate" data.
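To show what I mean by "poison", here is a toy fixed-block pool. It is not the Nichelite allocator, and the names and the 0xDEADBEEF pattern are just for illustration.

```c
#include <stdint.h>
#include <string.h>

#define POOL_BLOCKS 4
#define BLOCK_WORDS 8
#define POISON      0xDEADBEEFu   /* illustrative pattern */

static uint32_t pool[POOL_BLOCKS][BLOCK_WORDS];
static unsigned char in_use[POOL_BLOCKS];

/* Hand out the first free block, zeroed. Returns 0 if the pool is full. */
uint32_t *pool_alloc(void)
{
    for (int i = 0; i < POOL_BLOCKS; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            memset(pool[i], 0, sizeof pool[i]);
            return pool[i];
        }
    }
    return 0;
}

/* Free a block and fill it with the poison pattern. */
void pool_free(uint32_t *p)
{
    for (int i = 0; i < POOL_BLOCKS; i++) {
        if (p == pool[i]) {
            for (int w = 0; w < BLOCK_WORDS; w++)
                pool[i][w] = POISON;   /* stale readers now see garbage */
            in_use[i] = 0;
            return;
        }
    }
}

/* Debug check: does this value look like it came from a freed block? */
int looks_poisoned(uint32_t v) { return v == POISON; }
```

Without the poisoning step, code holding a stale pointer keeps reading the old values and appears to work, which would explain why the fault only shows up with the debug option enabled.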

The code is in "tcp_xmit_timer()". It is possible that the TCP connection creates a "Transmit Timer" and puts it on a "Timer Queue" to be activated. When the socket closes it is important/essential that that close cleanly removes all references to freed data, including anything it puts on a timer queue. Maybe it doesn't. You'll have to read the code and draw up a lot of flowcharts, or single-step (a lot) or add debug printing to find this.

The error you're getting is strange. The code is setting up to call a function, and is pushing four parameters onto the stack, as 128 -> 10(A7), 2 -> 6(A7) and the sum in the middle to 2(A7). There must be a fourth one you didn't include in the code you quoted. Anyway, if the first two worked, then I can't see how the third one would fail. They're all storing to addresses that are very close (on the stack). It isn't as if one can be a "wild pointer" different to the other ones.

See if you can make the Exception Handler tell you something more useful than what it is doing. Read through the "Exception Handling" chapter of the CPU manual. It would be REALLY useful if the handler could tell you WHICH exception happened. Just printing the "Format Vector Word" would help. If it doesn't do this you should be able to add code to make it do so.
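A sketch of decoding that word, assuming the usual ColdFire exception stack frame layout (format in bits 31-28, fault status split across bits 27-26 and 17-16, vector in bits 25-18, SR in bits 15-0). Please check the bit positions and the fault status codes against the Reference Manual for your part before trusting it. The struct and function names are made up.

```c
#include <stdint.h>

struct cf_frame_info {
    unsigned format;        /* bits 31-28 */
    unsigned fault_status;  /* FS[3:2] = bits 27-26, FS[1:0] = bits 17-16 */
    unsigned vector;        /* bits 25-18 */
    unsigned sr;            /* bits 15-0, status register at the exception */
};

/* Split the first longword of the exception stack frame into its fields. */
struct cf_frame_info cf_decode_frame_word(uint32_t w)
{
    struct cf_frame_info f;
    f.format       = (w >> 28) & 0xFu;
    f.vector       = (w >> 18) & 0xFFu;
    f.fault_status = (((w >> 26) & 0x3u) << 2) | ((w >> 16) & 0x3u);
    f.sr           = w & 0xFFFFu;
    return f;
}

/* Fault status codes as I understand them for access errors (assumption). */
const char *cf_fault_name(unsigned fs)
{
    switch (fs) {
    case 0x4: return "error on instruction fetch";
    case 0x8: return "error on operand write";
    case 0x9: return "attempted write to write-protected space";
    case 0xC: return "error on operand read";
    default:  return "other/reserved";
    }
}
```

In the handler you would pass the first longword that "framep" points at, and the PC is the longword after it.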

And do all the other things I suggested, like printing statistics and counters so you can see leaks as they happen.

(Edit)

> Anyway, if the first two worked, then I can't see how the third one would fail.

To work that out you have to read the "Coldfire Core" Chapter in the Reference Manual, specifically the "Access Error Exceptions" section, as that would seem to be what you're getting. It said "Access Error" and there are 10 different codings in the Exception Word, four of which are valid, and one matches "Error on Operand Write". For the CFV2 though, the section says:

The V2 ColdFire processor uses an imprecise reporting mechanism for access errors on operand
writes. Because the actual write cycle may be decoupled from the processor’s issuing of the
operation, the signaling of an access error appears to be decoupled from the instruction that
generated the write. Accordingly, the PC contained in the exception stack frame merely represents
the location in the program when the access error was signaled.

So the one thing I can say with some confidence is that the Write you have marked at "6F66" wasn't the one that failed.

Still, there were TWO previous writes to the stack (memory referenced by A7) and the CFv2 doesn't have a deep enough write pipeline for it to have blown up three or more writes earlier, so there must still be something wrong with the stack address.

Another problem is that it is very hard to get an "Access Error" on the Coldfire. All of memory is mapped, there's no Memory Management unit. You're (probably) not running with Supervisor and User mode, so there's no "memory you can't get to". It doesn't usually flag accesses to address spaces with no memory present either.

You need to drop into the debugger and see what A7 is. You also want to get it to give you a register dump at the time of the exception.

It is possible your system is configured with a small stack, and the extra printing that "HEAP_STATS" is doing is walking off the end of the stack, causing corruptions. This is another fault condition that CPUs with memory management catch for you, but simpler ones like the CFV2 let you get wrong without giving you any help at all.
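
One common way to check for this without memory management is to "paint" the stack with a known pattern at startup and later count how much of the pattern survives. A minimal sketch, assuming a stack that grows downward toward the base address; the pattern and function names are made up, and on real hardware the base and size would come from your linker file.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u   /* hypothetical fill pattern */

/* Fill the stack region with the pattern. Do this before the region
 * is in use (e.g. very early in startup code). */
void stack_paint(uint32_t *base, size_t words)
{
    size_t i;
    for (i = 0; i < words; i++)
        base[i] = STACK_FILL;
}

/* Count untouched words starting at the low end of the region.
 * A result of 0 means the stack has reached (or passed) its limit. */
size_t stack_unused_words(const uint32_t *base, size_t words)
{
    size_t n = 0;
    while (n < words && base[n] == STACK_FILL)
        n++;
    return n;
}
```

Print stack_unused_words() with and without "HEAP_STATS" enabled. If the number drops to zero or close to it with the extra printing on, the stack is too small.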

Tom

wkurzbauer · ‎07-06-2018

Hi Tom,

Thanks for your useful advice.

After some investigation I think I have located the problem.

First of all, the statistics variable in memio.c (mh_totfree) is updated incorrectly in the mem_free() function and underflows to a negative value after some time (the length of the header structure is not properly accounted for in the statistics when blocks that are not at the end of the heap are freed or allocated).

The heapstat monitor (mh_stats()) therefore reports false values.

However the heap as such seems to work properly.

My problem is the freescale HTTP server:

When my server is connected to the net it is sometimes "visited" by some unwanted guests (probably IP-scanners) and if this is the case the memory after the "attack" is not properly freed.

It seems that my server is able to send a response without getting an error from m_send (M_SOCK...) but then the session is not properly terminated and I assume the socket is not properly closed or something like that?

So the available heap gets smaller and smaller and after some time the system collapses....

Now I have to dive into the freescale server code to really find the problem (I again hope that the TCP/IP stack operates properly). As the attacks are not predictable this can take some time.....

May I ask if this problem is known to you?

cheers

Werner

TomE · ‎07-06-2018

> The heapstat monitor (mh_stats()) therefore reports false values.

Have you fixed it so you can trust it?

> My problem is the freescale HTTP server:

> When my server is connected to the net it is sometimes "visited" by some

> unwanted guests (probably IP-scanners) and if this is the case the memory

> after the "attack" is not properly freed.

Are you sure that's what is happening? I suspect there is no bug.

But what should you do if the server does have a bug that leads to a memory leak?

Guess what Apache does? The last time I looked at the code (10 years ago, so it might have changed), an Apache server spawns multiple threads forming a "pool". Each one is allocated to incoming client requests. After one of them has responded to about 100 requests, IT IS KILLED AND RESTARTED! That is because it is assumed that it will leak, and rather than actually fix the bugs it is easier to simply kill it and clean up any mess. Of course the servers are running on top of an OS that keeps a separate memory pool per process and keeps a list of all file handles and sockets. So everything is cleanly released when the "kill" happens.

This is a lot harder to do on embedded systems!

You should be able to tell from the memory statistics how big each "leak" is. Then compare that "leak size" with all the "request sizes" looking for a match. You could even log every allocation and free, logging the caller function address. That would let you know what code isn't freeing when it should.
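
A minimal sketch of that kind of log. It uses a caller name string instead of the caller's address (getting the return address is compiler-specific), and every name in it is made up rather than taken from the NicheLite sources.

```c
#include <stddef.h>

#define MAX_TRACKED 64

struct alloc_rec {
    const void *ptr;      /* NULL marks an unused slot */
    size_t      size;
    const char *caller;   /* call-site tag, e.g. __func__ */
};

static struct alloc_rec alloc_log[MAX_TRACKED];

/* Record an allocation. Returns 0 on success, -1 if the table is full. */
int log_alloc(const void *ptr, size_t size, const char *caller)
{
    size_t i;
    if (ptr == NULL)
        return -1;
    for (i = 0; i < MAX_TRACKED; i++) {
        if (alloc_log[i].ptr == NULL) {
            alloc_log[i].ptr = ptr;
            alloc_log[i].size = size;
            alloc_log[i].caller = caller;
            return 0;
        }
    }
    return -1;
}

/* Remove the record on free. Returns -1 if the pointer was never
 * logged, which points at a double free or a wild pointer. */
int log_free(const void *ptr)
{
    size_t i;
    if (ptr == NULL)
        return -1;
    for (i = 0; i < MAX_TRACKED; i++) {
        if (alloc_log[i].ptr == ptr) {
            alloc_log[i].ptr = NULL;
            return 0;
        }
    }
    return -1;
}

/* Total bytes still allocated; *count receives the number of blocks. */
size_t log_outstanding(size_t *count)
{
    size_t i, bytes = 0;
    *count = 0;
    for (i = 0; i < MAX_TRACKED; i++) {
        if (alloc_log[i].ptr != NULL) {
            bytes += alloc_log[i].size;
            (*count)++;
        }
    }
    return bytes;
}

/* Caller tag of the first outstanding block of the given size, or NULL. */
const char *log_find_size(size_t size)
{
    size_t i;
    for (i = 0; i < MAX_TRACKED; i++)
        if (alloc_log[i].ptr != NULL && alloc_log[i].size == size)
            return alloc_log[i].caller;
    return NULL;
}
```

Call log_alloc() and log_free() from the allocator wrappers. After an "attack", log_find_size() with the observed leak size names the code that allocated the block and never freed it.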

A standard "problem" (actually design characteristic) of TCP/IP is that a connection is expected to stay open FOR EVER unless deliberately closed. And it will stay open without traffic and without sending any data. This is not a bug.

There is a "Feature" of standard (large, on Linux and Windows) TCP/IP systems called "Keepalive". There's a conflict between selfish programmers who would like "keepalives" every few seconds and the "Old Net Gods" who rightly said "that doesn't SCALE" and so required the MINIMUM time for the keepalives to be 2 HOURS:

Keepalive - Wikipedia

So I suspect your "attacks" are performing HTTP Opens which are first causing TCP Opens, and then either aren't sending any data, or are sending a bit and then go off to attack something else without closing the connection.

The only thing that will cause the connection to drop is either a deliberate timeout in the server, or something that makes the server want to send some data on the connection. The retries on the send will eventually close the connection, but NOTHING ELSE WILL.

The proper way to monitor this is to add a command to the system (maybe even a web page) that lists or counts all the open connections. Linux machines have "netstat -s". Windows has "netstat". You want to write an equivalent for your system. Especially if you can add a column listing how long the socket has been open for. Then you can see if the number of open sockets is increasing without limit (and taking up memory without limit).
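
Something along these lines, as a sketch only. The connection table here is hypothetical; on the real system you would walk whatever list the stack or the HTTP server keeps for its sockets and record the tick count when each one is opened.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-connection record; not a NicheLite structure. */
struct conn_info {
    int           in_use;        /* non-zero while the connection is open */
    unsigned      remote_port;
    unsigned long opened_tick;   /* tick counter value when it was opened */
};

/* Age in seconds; unsigned subtraction stays correct across one counter wrap. */
static unsigned long conn_age_sec(const struct conn_info *c,
                                  unsigned long now_tick,
                                  unsigned long ticks_per_sec)
{
    return (now_tick - c->opened_tick) / ticks_per_sec;
}

/* Count open connections and report the age of the oldest one. */
size_t conn_summary(const struct conn_info *tab, size_t n,
                    unsigned long now_tick, unsigned long ticks_per_sec,
                    unsigned long *oldest_sec)
{
    size_t i, open = 0;
    *oldest_sec = 0;
    for (i = 0; i < n; i++) {
        if (tab[i].in_use) {
            unsigned long age = conn_age_sec(&tab[i], now_tick, ticks_per_sec);
            if (age > *oldest_sec)
                *oldest_sec = age;
            open++;
        }
    }
    return open;
}

/* Print one line per open connection, netstat style. */
void conn_print(const struct conn_info *tab, size_t n,
                unsigned long now_tick, unsigned long ticks_per_sec)
{
    size_t i;
    printf("slot  port   age(s)\n");
    for (i = 0; i < n; i++)
        if (tab[i].in_use)
            printf("%4lu  %5u  %lu\n", (unsigned long)i, tab[i].remote_port,
                   conn_age_sec(&tab[i], now_tick, ticks_per_sec));
}
```

If the count from conn_summary() keeps climbing and the oldest age keeps growing, you have idle connections piling up rather than a leak in the allocator.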

Maybe modern servers have timeouts to handle this problem. Maybe the Freescale one is so old (and meant as a demo rather than a bulletproof on-the-internet server) that it doesn't have any sort of timeout, or it does, but it isn't turned on.

Is this the one you're running, or is it another one? Where's the source?

"Freescale\NicheliteColdfireLite\7.2 REG_ABI 20110524.zip\7.2 REG_ABI 20110524\CF_Lite_v3.2_MVDH_20110524_CW7.2\src\projects\example\freescale_HTTP_Web_Server"

You could rely on KEEPALIVEs to close sockets. But you might run out of them or memory way before two hours is up.

Here's how to configure it in the 2009 version, with a recommendation on dropping the timeout from 2 hours to as low as a minute (for embedded systems). It even includes source code and documents a bug in that version that means it won't work properly with Windows, but it also gives a patch to fix that.

https://www.nxp.com/docs/en/application-note/AN10775.pdf

Here's a post from 2010 saying it doesn't time out. But the original poster may not have turned it on. There's a post in there from Marc on his fixes to Nichelite too. I've searched the Freescale Nichelite documentation and didn't get a match on "keepalive" (or even "keep" or "timeout").

Here's what happens if you search the code for keywords:

$ find . -type f | xargs grep KEEPALIVE
./NicheLite/Source/h/msock.h:#define SO_KEEPALIVE 0x0008 /* keep connections alive */
./NicheLite/Source/mtcp/tcp_timr.c: if ((((M_SOCK)(tp->t_inpcb))->so_options & SO_KEEPALIVE) &&

$ find . -type f | xargs grep TCPT_KEEP
./NicheLite/Source/h/mtcp.h: * The TCPT_KEEP timer is used to keep connections alive. If an
./NicheLite/Source/h/mtcp.h: * an ack segment in response from the peer. If, despite the TCPT_KEEP
./NicheLite/Source/h/mtcp.h:#define     TCPT_KEEP      2     /* keep alive */
./NicheLite/Source/mtcp/TCPAPI.C:   tp->t_timer[TCPT_KEEP] = TCPTV_KEEP_INIT;   /* initial connect keep alive */
./NicheLite/Source/mtcp/TCPIN.C:   tp->t_timer[TCPT_KEEP] = tcp_keepidle;
./NicheLite/Source/mtcp/TCPIN.C:         tp->t_timer[TCPT_KEEP] = TCPTV_PERSMAX;                //FSL was TCPTV_KEEP_INIT;
./NicheLite/Source/mtcp/tcp_timr.c:   case TCPT_KEEP:   //FSL case 2
./NicheLite/Source/mtcp/tcp_timr.c:         tp->t_timer[TCPT_KEEP] = (short)tcp_keepintvl;
./NicheLite/Source/mtcp/tcp_timr.c:         tp->t_timer[TCPT_KEEP] = (short)tcp_keepidle;

./NicheLite/Source/h/mtcp.h:#define TCPTV_KEEP_INIT (75*PR_SLOWHZ) /* initial connect keep alive */

./NicheLite/Source/h/mtcp.h:#define     PR_SLOWHZ   2 /* TCP ticks per second */
./NicheLite/Source/h/mtcp.h:#define TCPTV_SRTTDFLT    (3*PR_SLOWHZ)        /* assumed RTT if no info */
./NicheLite/Source/h/mtcp.h:#define TCPTV_PERSMIN     (5*PR_SLOWHZ)        /* retransmit persistance */
./NicheLite/Source/h/mtcp.h://#define TCPTV_PERSMAX     (60*PR_SLOWHZ)       /* maximum persist interval */
./NicheLite/Source/h/mtcp.h:#define TCPTV_PERSMAX     (10*PR_SLOWHZ)       /* maximum persist interval */      //FSL lowered
./NicheLite/Source/h/mtcp.h:#define TCPTV_KEEP_INIT   (75*PR_SLOWHZ)       /* initial connect keep alive */
./NicheLite/Source/h/mtcp.h:#define TCPTV_KEEP_IDLE   (120*60*PR_SLOWHZ)   /* dflt time before probing */

There's no demo code that sets that socket option anywhere. Maybe you should add it to the HTTP server where it opens its sockets. And you might like to change the "two hours" definition above for TCPTV_KEEP_IDLE to something quicker. Except there's a "FSL" comment in there that may have changed how this worked.
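
For reference, here is the arithmetic behind those defines, using only the values from the grep output above. The helper functions are mine. The 16-bit limit is an assumption based on the (short) casts in tcp_timr.c, so check the declaration of t_timer before relying on it.

```c
/* Values copied from the mtcp.h lines quoted above. */
#define PR_SLOWHZ        2                      /* TCP ticks per second */
#define TCPTV_KEEP_INIT  (75*PR_SLOWHZ)         /* initial connect keep alive */
#define TCPTV_KEEP_IDLE  (120*60*PR_SLOWHZ)     /* dflt time before probing */

/* Convert between seconds and slow-timer ticks. */
long secs_to_ticks(long secs)   { return secs * PR_SLOWHZ; }
long ticks_to_secs(long ticks)  { return ticks / PR_SLOWHZ; }

/* Assumption: the timers are stored as a signed 16-bit short, so a
 * tick value above 32767 would not survive the (short) cast. */
int fits_in_short_timer(long ticks)
{
    return ticks >= 0 && ticks <= 32767;
}
```

So TCPTV_KEEP_IDLE is 14400 ticks, which is 7200 seconds or two hours. A one-minute idle time would be secs_to_ticks(60), i.e. 120 ticks.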

The simplest way around all of these problems is to just reset the box periodically, or to have a monitor check memory and sockets and reset if it is about to run out. If it already crashes when it runs out of memory then you've already done that :-).

That's assuming it is a TCP problem. It may still be a bug as you've said. In which case it should be easy to find and fix.

Tom

wkurzbauer · ‎07-07-2018

Hi,

I think I have located the problem:

In TCPIN.C in the function tcp_rcv (PACKET pkt)

the following code causes the problem:

if (tiflags & TH_RST)
   {
      switch (tp->t_state)
      {

      case TCPS_SYN_RECEIVED:
         so->error = ECONNREFUSED;
         goto close;

      case TCPS_ESTABLISHED:
         TCP_MIB_INC(tcpEstabResets);     /* keep MIB stats */
      case TCPS_FIN_WAIT_1:
      case TCPS_FIN_WAIT_2:
      case TCPS_CLOSE_WAIT:
         so->error = ECONNRESET;
         close:
         tp->t_state = TCPS_CLOSED;
         TCP_STAT_INC(tcps_drops);
         m_tcpclose(tp);
         if (so->callback)
            so->callback(M_CLOSED, so, NULL);
         GOTO_DROP;      <<<<<<<<<<<<<<<<<<<<<<<----------------------------------------------

      case TCPS_CLOSING:
      case TCPS_LAST_ACK:
      case TCPS_TIME_WAIT:
         m_tcpclose(tp);
         GOTO_DROP;
      }
   }

on the marked location the code jumps to the label "drop":

drop:
tcp_pktfree(pkt);

   /* destroy temporarily created socket */
   if (dropsocket)
      m_delsocket(so);
   return SUCCESS;

It seems that if dropsocket is 0 (and this is apparently the case) the socket is not properly deleted. I assume that deleting it is in fact the goal at the above location!?

I changed the code so that, instead of jumping to the label, I included

tcp_pktfree(pkt);

m_delsocket(so);

return (0);

directly in the above case option.

I have to admit that I do not fully understand what happens but I hope this fixes the problem...

Maybe you know more...

cheers

Werner

TomE · ‎07-08-2018

That doesn't look right.

Looking to see what "dropsocket" is really for, at line 441:

      /*
       * Mark socket as temporary until we're
       * committed to keeping it.  The code at
       * ``drop'' and ``dropwithreset'' check the
       * flag dropsocket to see if the temporary
       * socket created here should be discarded.
       * We mark the socket as discardable until
       * we're committed to it below in TCPS_LISTEN.
       */

What that tells me is that the code is structured to get a socket, and then hand it off to the next level of code, making it ITS RESPONSIBILITY to dispose of it at the right time. By calling "m_delsocket()".

So search for everywhere else in the code where "m_delsocket()" is called, and you'll find it being called in m_close().

If you dispose of it in "tcp_rcv()" when something else is still using it, you'll risk bad memory corruption when the socket is freed the second time.

So something, possibly the HTTP Server, isn't following the rules, and may not be calling through to m_close() properly.

It is likely that some simple code was written for Linux or Windows, and rather than closing the socket properly under all circumstances (normal close and error cases) it just bails or aborts or something. On Linux and Windows, when a program does that, the OS cleans up after the program, so that's a messy, but OK, way to write code. There is no OS to do that here - the program has to follow the rules and shut down everything cleanly.

Looking again at the code you quoted:

if (tiflags & TH_RST)

That only happens when a socket is (or was) open and receives a RESET segment from the remote.

What is MEANT to happen is this triggers:

         if (so->callback)
            so->callback(M_CLOSED, so, NULL);
         GOTO_DROP;

The "callback" is meant to be there, and IT is meant to cleanly close the socket. If it is being called and it doesn't close the socket, then that's the bug. If there is no callback registered, and so there's nothing there to perform that operation, then the bug is that it got to here with no callback and nothing closing the socket.

I think a better change would be to change the above to catch this possible case:

         if (so->callback)
            so->callback(M_CLOSED, so, NULL);
         else
            dropsocket = 1;
         GOTO_DROP;

Also add some sort of debug printing code to the above added line (remembering to add braces), to see if that is actually happening. Then you can see if that fixes the problem.
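
To make the intent clear, here is a stand-alone model of that branch with the braces and the debug print in place. It is NOT the TCPIN.C code: the struct, M_CLOSED value and function names are stand-ins so that it can be compiled and tried on a PC.

```c
#include <stddef.h>
#include <stdio.h>

#define M_CLOSED 1   /* stand-in value; the real one comes from the NicheLite headers */

struct sock_model {
    void (*callback)(int event, struct sock_model *so, void *data);
    int closed_by_app;   /* set by the application's callback */
    int deleted;         /* set when the stack deletes the socket itself */
};

/* Example application callback that does its job and closes the socket. */
void app_callback(int event, struct sock_model *so, void *data)
{
    (void)data;
    if (event == M_CLOSED)
        so->closed_by_app = 1;
}

/* Stand-in for m_delsocket(). */
static void model_delsocket(struct sock_model *so)
{
    so->deleted = 1;
}

/* Models the RST branch followed by the code at the "drop" label.
 * Returns 1 if this path had to delete the socket itself. */
int model_rst_drop(struct sock_model *so)
{
    int dropsocket = 0;

    if (so->callback) {
        so->callback(M_CLOSED, so, NULL);
    } else {
        /* Nobody is registered to close the socket, so flag it. */
        printf("RST: no callback registered, dropping socket %p\n", (void *)so);
        dropsocket = 1;
    }

    /* drop: */
    if (dropsocket)
        model_delsocket(so);
    return dropsocket;
}
```

If the print never shows up on the real system, the callback IS registered and the leak is in what the callback does (or doesn't do), not in tcp_rcv().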

Tom

wkurzbauer · ‎06-27-2018

Hi,

Thanks for the responses. The digibutler board uses the MCF52231 and was originally supplied with CW 6.3, the freescale_HTTP_Web_Server by Eric Gregori and cpu definitions tailored to the board.

I used CW7.1 to compile my code and I hoped that this CW version can also be used with the rev.3.2 of nichelite as long as I use the MCF52231 (I do not know if rev.3.2 still supports MCF52231)?

I tried to install a CW7.2 version though...but it still refuses to install on my 64-bit machine :-(

@tom: I was a little sloppy concerning the functions reporting the error:

these are calloc1() in MEMIO.C and npalloc() in iutil.c

I will follow your suggestion to replace the memory allocator....

However, as I used the freescale_HTTP_Web_Server by Eric Gregori (still available at NXP) I hoped that such severe bugs would not occur. Also, in the forums the issue is only rarely discussed and in fact no solution is provided...

When I went through the code changes in rev.3.2 I noticed that FSL (I don't know for who the shortcut stands for) apparently modified and revised the freescale ("Gregori") version and not a "generic" interniche version. So maybe the bug is still not removed...

cheers

Werner

TomE · ‎06-27-2018

> these are calloc1() in MEMIO.C and npalloc() in iutil.c

ColdFire_Lite_M52259EVB/src/projects/NicheLite/Source/cf_specific/iutil.c

ColdFire_Lite_M52259EVB/src/projects/NicheLite/Source/misclib/MEMIO.C

Enable "NPDEBUG" in MEMIO.C and see if it finds anything if you haven't done that already. Also "MEMIO_DEBUG".

Also note that "npalloc():" is documented as "Wrappers for heap calls, with memory clearing and counters", so write some code to print those counters. The only current way is to enable "HEAP_STATS" which prints on every alloc and free. That may overload your system, or it may be the way to debug this.

The most likely problem is a memory leak. Something is allocating memory and not freeing it. On a system with a lot of free memory you either have to actually TEST code that uses malloc() and free() to make sure it isn't leaking, or you have to wait a long time until all the memory is used up. Or you can change the heap allocation function (mheap_init()) to return an artificially small heap to force these problems to happen sooner. It is likely nobody has done that with some or all parts of this code, so that's now your job.

Also enable "HEAP_STATS" and periodically (once a minute should be enough) call "mh_stats()". I would suggest you improve the debugging by adding code to print the "npalloc()" wrapper counters at the same time. That should tell you if you're suffering from a "memory leak" where some code is allocating a block and then never freeing it. You should also be able to see if a "leak" corresponds to a particular operation on the stack or the web server.
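
A minimal sketch of the kind of counters I mean, wrapped around the standard calloc()/free() so it can be tried on a PC. The names are made up; in the real code the counting would go into the existing npalloc() wrapper and its matching free function.

```c
#include <stdlib.h>

/* Hypothetical counters; not the ones in iutil.c. */
struct heap_counters {
    unsigned long allocs;     /* successful allocations */
    unsigned long frees;      /* blocks given back */
    unsigned long failures;   /* allocations that returned NULL */
};

static struct heap_counters hc;

/* Allocate cleared memory and count the result. */
void *counted_alloc(size_t size)
{
    void *p = calloc(1, size);
    if (p)
        hc.allocs++;
    else
        hc.failures++;
    return p;
}

/* Free a block and count it; NULL is ignored. */
void counted_free(void *p)
{
    if (p) {
        hc.frees++;
        free(p);
    }
}

/* Blocks currently outstanding. A value that only ever grows
 * while the system is idle between requests indicates a leak. */
unsigned long counted_outstanding(void)
{
    return hc.allocs - hc.frees;
}
```

Print counted_outstanding() once a minute next to the mh_stats() output. It should return to the same baseline after each web request has finished.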

Better still, why not generate a Web Page that gives all of the current memory statistics? Then you can monitor it in real time!

For these problems that don't happen all that often, it is likely the "leak" is in an error case, like a bad or aborted web page access, or a TCP socket that is aborted rather than properly closed. Windows has a habit of not closing TCP connections properly, which can cause embedded systems to run out of memory or sockets with all the connections in "Timeout" state.

> FSL (I don't know for who the shortcut stands for) apparently modified and revised the freescale

https://en.wikipedia.org/wiki/FSL

It's Freescale's New York Stock Exchange Ticker symbol too.

Tom

TomE · ‎06-26-2018

3.2 is the latest version on Freescale's web site, dated Feb 2009.

I have a version, still 3.2, but with "LIST OF CHANGES MADE BY MARC VANDENHENDE, 2011-05-24".

Here's a link to a post in this forum with a link to a copy on Dropbox, dated November 2013

https://community.nxp.com/message/101060?commentID=101060#comment-101060

Not surprisingly, it isn't on Dropbox any more.

So I've attached it to this post. I don't know if it fixes your problem. From a quick look at the "Changes" document I couldn't see anything that obviously matched.

You may have to debug it, but at least you'll be starting from something with fewer bugs.

Watch out for the big and horrible "CW 7.2 ABI change" which might affect this code.

If you search this forum for "nichelite" you'll find about 56 posts. Have you looked through them, and did you find anything there?

Tom