AN3470SW CW 6.3 memmove / tk_switch problem

DarrenSteadman · ‎10-17-2007

This is a bit of a strange one, and it has taken me a while to debug things this far. (I'm totally new to ColdFire.)

I've been using the demos from the AN3470SW app note to create an application that talks to some of our existing equipment over serial and passes the data along to a PC over Ethernet, so a basic serial-to-Ethernet converter. To get this working properly I had to modify the example slightly and change the size of some of the TCP buffers to get optimal throughput for our application.

I then added the HTTP server example to the code so I could work on a way of configuring the device.

This is when things went wrong. If I try to stream serial data through the device, the RTOS crashes because a stack guard word on the "Main" task is being overwritten.

I set a write watchpoint on the guard word and it stops in _tk_switch:

_tk_switch:
   move.w   #0x2700,SR       /* disable ints */

   move.l   4(A7),D0         /* save passed tk pointer in D0 */
   move.l   D2,-(A7)         /* push the non-volatile gp registers */
   move.l   D3,-(A7)
   move.l   D4,-(A7)
   move.l   D5,-(A7)
   move.l   D6,-(A7)
   move.l   D7,-(A7)

   move.l   A1,-(A7)         /* the watchpoint stops here */

The call stack shows that irq_handler is being called continuously and it is trying to print something to the console. I never actually see this output so I assume that the UART is not being processed because so much time is being spent in the IRQ.

I removed the printf statement to see what would happen and tried again. This time the application didn't crash, but our PC app stopped receiving data completely. Whenever I stopped execution on the ColdFire, it was nearly always in the interrupt handler.

I then put a breakpoint in the interrupt handler to see when it was being called. If I run the application but don't make a TCP connection to it, the interrupt handler never gets called. As soon as I make a connection, the interrupt handler starts being called.

The interrupt handler always gets called for the first time while the main program is in a memmove operation that is called from "arprcv":

   struct arp_hdr *   arphdr;
   struct arptabent *   tp;
   arphdr = (struct arp_hdr *)(pkt->nb_buff + ETHHDR_SIZE);

   {
      struct arp_wire * arwp = (struct arp_wire *)arphdr;
      MEMMOVE(&arphdr->ar_tpa, &arwp->data[AR_TPA], 4);   //Could potentially be this line
      MEMMOVE(arphdr->ar_tha, &arwp->data[AR_THA], 6);     //Debugger shows it has stopped here
      MEMMOVE(&arphdr->ar_spa, &arwp->data[AR_SPA], 4);
      MEMMOVE(arphdr->ar_sha, &arwp->data[AR_SHA], 6);
   }

The memory addresses of the variables being used in memmove always seem to be the same as well.

The interesting thing is the "Value" (in debugger) of the dest and src pointer are the same.

Is something going wrong in "arprcv" that is causing a problem with the memmove that could then trigger an interrupt?

Is there any code I could put in the interrupt to try and find out the source of it?

Has anyone else had the same kind of problem?

Thanks for your time

Darren

DarrenSteadman · ‎10-17-2007

I've looked at the code more closely and the section below that was causing the problem is doing something funny.

arphdr = (struct arp_hdr *)(pkt->nb_buff + ETHHDR_SIZE);

   {
      struct arp_wire * arwp = (struct arp_wire *)arphdr;
      MEMMOVE(&arphdr->ar_tpa, &arwp->data[AR_TPA], 4);
      MEMMOVE(arphdr->ar_tha, &arwp->data[AR_THA], 6);
      MEMMOVE(&arphdr->ar_spa, &arwp->data[AR_SPA], 4);
      MEMMOVE(arphdr->ar_sha, &arwp->data[AR_SHA], 6);
   }

From what I can see it is essentially copying data over the top of itself. If I watch the values inside arphdr in the debugger they never actually get changed by the memmove. If I comment out that entire block then everything seems to run perfectly fine.

Could someone tell me why this is there in the first place? I assume it must be doing something. I don't want to remove it permanently if it is actually required for some reason.

Message Edited by Darren Steadman on 2007-10-17 04:17 PM

mccPaul · ‎10-18-2007

I suspect that the only reason for this code being there is to make sure that the data received from the FEC is in the correct places in the arp_hdr struct. As this code is supposed to work on several platforms and with several compilers, you can't just assume that the data off the wire will map directly to the struct.
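To illustrate the point above, here is a minimal host-side sketch. The struct and offset names (demo_arp_hdr, DEMO_AR_TPA and so on) are stand-ins I made up for this example, not the NicheLite definitions; only the wire offsets come from the standard Ethernet/IPv4 ARP packet layout. On a compiler that pads the struct, the four moves shift fields to higher offsets; on a compiler whose struct layout already matches the wire, source and destination are identical and the moves change nothing, which would match what Darren sees in the debugger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets of the address fields in an ARP packet on the wire.
 * Hypothetical names; the values follow the Ethernet/IPv4 ARP layout. */
#define DEMO_AR_SHA 8u
#define DEMO_AR_SPA 14u
#define DEMO_AR_THA 18u
#define DEMO_AR_TPA 24u

struct demo_arp_wire { uint8_t data[28]; };

struct demo_arp_hdr {            /* stand-in for the stack's arp_hdr */
    uint16_t ar_hd;
    uint16_t ar_pro;
    uint8_t  ar_hln;
    uint8_t  ar_pln;
    uint16_t ar_op;
    uint8_t  ar_sha[6];
    uint32_t ar_spa;             /* some compilers pad before this field */
    uint8_t  ar_tha[6];
    uint32_t ar_tpa;
};

/* One buffer viewed both ways, big enough and aligned for either view. */
union demo_arp_buf {
    struct demo_arp_hdr  hdr;
    struct demo_arp_wire wire;
    uint8_t              raw[64];
};

/* The same four moves as in arprcv, last field first, so a field that
 * moves to a higher offset never overwrites wire data not yet copied. */
void demo_arp_unpack(union demo_arp_buf *b)
{
    assert(b != NULL);
    memmove(&b->hdr.ar_tpa, &b->wire.data[DEMO_AR_TPA], 4);
    memmove(b->hdr.ar_tha,  &b->wire.data[DEMO_AR_THA], 6);
    memmove(&b->hdr.ar_spa, &b->wire.data[DEMO_AR_SPA], 4);
    memmove(b->hdr.ar_sha,  &b->wire.data[DEMO_AR_SHA], 6);
}

/* Returns 1 when the struct layout equals the wire layout, in which
 * case every move above has identical source and destination. */
int demo_layout_matches_wire(void)
{
    return offsetof(struct demo_arp_hdr, ar_sha) == DEMO_AR_SHA
        && offsetof(struct demo_arp_hdr, ar_spa) == DEMO_AR_SPA
        && offsetof(struct demo_arp_hdr, ar_tha) == DEMO_AR_THA
        && offsetof(struct demo_arp_hdr, ar_tpa) == DEMO_AR_TPA;
}
```

If demo_layout_matches_wire() returns 1 for your compiler settings, the block is harmless but redundant on that build. It is still worth keeping for builds where the layouts differ.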

As to why you get a problem with the task stack guardword, each task has a separate stack. The stack is small and so you have to be careful not to cause a stack overflow. This can happen with code that is too recursive, or if you have code that uses many or large automatic variables. Automatic variables are the local variables that you define in a function. They are created on the stack.

Do you have any large arrays that are created as automatic variables, rather than allocating them on the heap? Or do you have any recursive code or loops that are likely to be creating large numbers of automatic variables?
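One generic way to see how close a task gets to its limit is to paint the stack with a known pattern before the task starts and later count how many words are still untouched. The sketch below only shows the idea on a plain array. The function names and the fill value are my own, and I am not claiming the NicheLite port provides anything like this. It assumes the stack grows downwards, as it does on ColdFire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary fill pattern, chosen for this example only. */
#define DEMO_STACK_FILL 0x5A5A5A5Au

/* Fill the whole stack area with the pattern before the task first runs.
 * 'lowest' is the lowest address of the stack area. */
void demo_stack_paint(uint32_t *lowest, size_t words)
{
    assert(lowest != NULL);
    for (size_t i = 0; i < words; i++) {
        lowest[i] = DEMO_STACK_FILL;
    }
}

/* Count the words, starting from the lowest address, that still hold the
 * pattern. With a downward-growing stack this is the headroom that has
 * never been used. A result of 0 means the stack has been filled. */
size_t demo_stack_unused_words(const uint32_t *lowest, size_t words)
{
    size_t n = 0;

    assert(lowest != NULL);
    while (n < words && lowest[n] == DEMO_STACK_FILL) {
        n++;
    }
    return n;
}
```

Calling demo_stack_unused_words() from the console or a low-priority task after a stress run gives a rough high-water mark. It can undercount usage if the task happens to write the fill value itself, so treat it as a hint rather than proof.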

The fact that your code fails in the NicheLite code is likely to be a side effect of this. There are some problems in the NicheLite code, but the ARP code is exercised a huge amount, so it's unlikely to have any serious problems. If you want to see a bug, look at memcmp in stdlib.c.

Cheers,

Paul.

Edit: spelling

Message Edited by mccp on 2007-10-18 11:47 AM

DarrenSteadman · ‎10-18-2007

After debugging the system further I found that the guardword overflow is because the default interrupt handler gets called continuously and it uses printf to display something on the console. The problem is the interrupt gets called so often that the UART never has a chance to flush its buffer and send the data. This is when the overflow happens.

The interrupt only starts to be called when the memmove has been called a reasonable number of times. What could be happening in the memmove that could be setting off the interrupt?

If I remove the printf statement from the interrupt handler the crash never happens, but the program slows to a crawl because it is spending all of its time in the interrupt handler.

My main question is, what is happening in the memmove that causes the interrupt to be called, what is the interrupt and why is it being called so often?

Is there any code I could put into the interrupt handler to determine what is causing it?

mccPaul · ‎10-18-2007

If the problem occurs only after the memmove, then it is possible that the memmove is causing an error exception. Because the exception handler doesn't remove the cause of the exception, it will recur continuously.

What do you mean by "default interrupt handler"?

If you break in an interrupt handler, you can look at the exception frame on the stack to find out what caused the exception. The first 32-bit word on the stack holds the exception format, fault status, vector and status register, and the second 32-bit word is the return address. See chapter 11 of the ColdFire Programmer's Reference Manual.
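As a sketch of how that first word breaks down, the function below splits it into its fields. The bit positions are as I read the exception stack frame description in the manual (format in bits 31-28, vector in bits 25-18, fault status split across bits 27-26 and 17-16, SR in bits 15-0), so please check them against your copy. The struct and function names are made up for this example.

```c
#include <stdint.h>

/* Fields of the first longword of a ColdFire exception stack frame.
 * Hypothetical helper type, not part of any Freescale header. */
struct demo_cf_frame {
    unsigned format;   /* bits 31-28 */
    unsigned fault;    /* FS[3:0]: FS[3:2] from bits 27-26, FS[1:0] from bits 17-16 */
    unsigned vector;   /* bits 25-18 */
    unsigned sr;       /* bits 15-0, status register at the time of the exception */
};

/* Decode the format/vector word read from the top of the exception frame. */
struct demo_cf_frame demo_cf_decode(uint32_t word)
{
    struct demo_cf_frame f;

    f.format = (word >> 28) & 0xFu;
    f.fault  = (((word >> 26) & 0x3u) << 2) | ((word >> 16) & 0x3u);
    f.vector = (word >> 18) & 0xFFu;
    f.sr     = word & 0xFFFFu;
    return f;
}
```

For example, a word of 0x41342000 decodes to format 4, vector 77 and SR 0x2000. Low vector numbers such as 2 (access error) or 3 (address error) would point at a fault rather than a peripheral interrupt, while vectors from 64 upwards are the user interrupt sources.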

Paul.

DarrenSteadman · ‎10-19-2007

By default interrupt handler I mean the following function in int_handlers.c:

__interrupt__
void irq_handler (void)
{
   /*
    * This is the catch all interrupt handler for all user defined
    * interrupts. To create specific handlers, create a new interrupt
    * handler and change vectors.s to point to the new handler.
    */
   printf("irq_handler\n");
}

If you look at the interrupt vector table, this function is used for all the interrupt vectors that don't have their own handler. Unfortunately there are over 30 vectors that use this function, so I'm looking for a way of finding out which vector caused the interrupt.

DarrenSteadman · ‎10-19-2007

I've been looking into the problem further and changed some buffer sizes by a few bytes. This made the problem occur in a completely different place, but it is the same general problem.

At some point an interrupt is being activated that uses irq_handler and it gets called continuously.

Any ideas what it could be?

DarrenSteadman · ‎10-19-2007

I'll have a look at the exception stuff you recommended.

Thanks for all the help.

DarrenSteadman · ‎10-19-2007

Sorry if I'm getting annoying here. Is there any chance you could tell me how to see the exception frame in CodeWarrior?

I have the standard stack window up. When my breakpoint hits I get a stack output, but it looks like it's showing the calls in the main application and then just the irq_handler function. It doesn't give me any information about what called irq_handler. I assume that's what the exception frame you mentioned will show.

I can't find any info in the CodeWarrior help docs on how to get that information. This is only the second embedded system I've had to work on; the first was ARM based and we had to write everything from scratch.

DarrenSteadman · ‎10-19-2007

OK, I found the exception stack frame definition in the documentation. I found the PC and SR registers in CodeWarrior, but the 16 bits that contain the vector that was called do not seem to be listed anywhere.

Which section of the registers view should it be under?

mccPaul · ‎10-19-2007

Hi

Do you mean SP? SR is the status register, you want the stack pointer. I don't use CodeWarrior, I use GCC so I can't help with CW specifics. Examine the memory that SP is pointing to, to see the stack.

When you look at the stack, bear in mind that the compiler will have added its own frame. That means that although you break on the first C instruction in the ISR, the compiler may have added instructions that have added stuff to the stack since the exception module created the exception stack frame.

In my GCC, the compiler reserves 20 bytes of the stack to push some registers on - you may need to look at a disassembly view to find out exactly what CW is doing.

Paul.

Kremer · ‎10-19-2007

This looks like a RAM problem, as Paul said a few posts ago, since it happens right after a TCP connection is attempted. I think the NicheLite port uses a standard stack size of 2048 bytes per task. So the more tasks you have, the less RAM you'll have for heap memory, and this heap memory is the RAM used for the TCP lilbuffs and bigbuffs.

So when you say the problem happens when you try to connect to the board, you probably don't have enough heap memory for a bigbuff allocation, and that's when some task's stack space is violated and the exception is generated. I think this shouldn't happen, but maybe the port has some bugs in that part.

Anyway, nobody is perfect. Eric did an excellent job on this, and I'm using the same port as you. I had similar problems, which I solved by looking carefully at the task stacks and reserving enough heap memory for my applications.

Take a look at it.

Hope it helps

Regards

DarrenSteadman · ‎10-19-2007

I'm beginning to think my problems are elsewhere. I'm going to go back to the drawing board and see if I can work out what's going wrong. Thanks for all the help guys, it's good to know there are some people out there with the knowledge if everything goes wrong.

TrevorCurry_eu · ‎10-23-2007

Hi,

I have just found this thread and am very interested in the final resolution as my application and problems are very similar.

As part of my debugging I have also come across the irq_handler being called unexpectedly and never got to the bottom of it. As mentioned here, the actual printf output never appears on the console.

My current problem is that the "Main" task just stops being awoken from the fec receive handler - even though the external client is still sending.

I have also seen conditions where the console is still running and the 'socket' command indicates many sockets that are permanently in CLOSE_WAIT.

I am currently working on the assumption that I have a memory allocation problem and am trying to fine tune the BIGBUFSIZ and NUMBIGBUF settings. This is time consuming since it can take an hour before everything stops.

I will look at my use of local variables to see if there is anything to be gained there.

Any further suggestions would be most welcome.

Many thanks for the help so far,
Trevor

TrevorCurry_eu · ‎10-24-2007

Some further input on my problem:

Having reduced the amount of local (stack) variables I use and changed the buffer sizes, the whole nature of my application has changed.

I no longer see the lock up and the irq_handler calls also seem to have gone away. This all leads me to think the basic problem is a memory overflow not being trapped. Checking mh_totfree and mh_minfree now indicates plenty of free memory.

My persisting CLOSE_WAIT states were due to me not handling an ESHUTDOWN error returned from m_recv() - whoops!

Trevor

DarrenSteadman · ‎11-21-2007

Hi guys,

I'm still having strange problems with the irq_handler. Basically it looks like all my problems before were based around RAM in one way or another.

My problem now is I've been adding a fair bit of code to my project and once again the irq_handler problem has come up.

I was wondering if there is any way of reliably tracking down what is calling the IRQ. From what I can see in the interrupt controller registers, all vectors that use irq_handler are masked off, so the interrupt controllers cannot call the irq_handler function, which would indicate that something else is calling it. Does anyone have any ideas as to how I would go about finding out what is making the jump into the handler?

I've checked my heap when the handler starts getting called and I have about 3k of free RAM. The lowest the free allocation goes is 3k, and there are no reports of any failed allocations.

Could this be a buffer overflow or something along those lines? If so is there a way to get the default exception handlers to tell me there was an invalid address write or something like that?

At the moment it's a bit like trying to find a needle in a haystack. Unfortunately I've got no idea where to start to track down what's going wrong.

mccPaul · ‎11-21-2007

Hi

If something (not the interrupt controller) is calling your ISR, then a simple breakpoint in the ISR will allow you to trace the call stack and to see where the call originated. However, it seems very unlikely that this is the case because the ISR will terminate with an RTE not an RTS so if there was some spurious way for your code to execute the ISR it would almost certainly fail as soon as the RTE was executed.

If the interrupt controller is responsible for the call to the ISR (most likely), then all you need to do is set a breakpoint at the start of the ISR and look at the exception frame on the stack. This will tell you which exception was raised to cause the ISR to be executed.

Do you have a file called mcf5xxx.c with a function mcf5xxx_exception_handler() in it?

This is called by an assembly stub

asm_exception_handler

to decode exceptions that are caused by an error.

If you have your error exception vectors pointing to irq_handler then any illegal address access, etc. will call irq_handler. If you are right and all of your user definable exceptions are masked, then this may be what is happening.

Check that your vector table is correct and unmodified at run time. Also, if you can find the exception frame on the stack when you have reached a breakpoint in irq_handler, then this will _definitely_ tell you which exception happened.

Edit:

If you do have the code for mcf5xxx_exception_handler() and asm_exception_handler, you could point _all_ the vectors to asm_exception_handler, and break in mcf5xxx_exception_handler to have a look at the exception frame. The assembly stub just points %a1 to the exception frame so that it can be used as a C argument.

Paul.

Message Edited by mccp on 2007-11-21 10:33 AM

DarrenSteadman · ‎11-21-2007

Yes I have mcf5xxx.c with mcf5xxx_exception_handler() in it.

asm_exception_handler is pointing to the correct place.

If I put a breakpoint in irq_handler the call stack doesn't make any sense: it will be inside a function that is just a normal function. If I then continue execution, the next time irq_handler gets called the call stack will be different, and if I do it again it will be different once more. Sometimes, if I stop the board and restart the program from scratch, the call stack will be completely different from the first time I ran it.

Which register is the exception vector in? A1-7? PC? SR?

Do you have a link to a document that could tell me what the value of the register means?

TrevorCurry_eu · ‎11-21-2007

Hi,

Sorry that I have not been feeding back my experiences to this thread...

I found that the interrupt causing the problem was from uart 2, despite this being masked! I then got an indication from Freescale that there was a problem here:

> > With regard to the spurious UART2 interrupt, Freescale believe they
> > have solved a similar issue before. ICR registers of particular
> > interrupt sources should be always programmed with unique and
> > non-overlapping level & priority.

When I checked the uart ICR settings, I found the ColdFire Lite stack had both uart 0 (ICR13) and uart 1 (ICR14) set to level 2. I have set ICR13 to level 1 and have not seen the problem since.
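A quick way to look for this kind of clash is to compare the level and priority fields of every ICR that is in use. The sketch below works on a plain copy of the ICR values rather than on the hardware registers. It assumes the usual ColdFire V2 layout of level in bits 5-3 and priority in bits 2-0, and it treats level 0 as "not in use"; both are assumptions to verify against the reference manual for your part. The function name is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed ICR layout: bits 5-3 interrupt level, bits 2-0 priority. */
#define DEMO_ICR_LEVEL(icr)  (((icr) >> 3) & 0x7u)
#define DEMO_ICR_LVLPRI(icr) ((icr) & 0x3Fu)

/* Scan a copy of the ICR values, indexed by interrupt source number.
 * Returns the index of the first source whose level and priority both
 * match an earlier source, or -1 if every source in use is unique. */
int demo_icr_find_conflict(const uint8_t *icr, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (DEMO_ICR_LEVEL(icr[i]) == 0u) {
            continue;                 /* level 0: treated as not in use */
        }
        for (size_t j = 0; j < i; j++) {
            if (DEMO_ICR_LVLPRI(icr[i]) == DEMO_ICR_LVLPRI(icr[j])) {
                return (int)i;
            }
        }
    }
    return -1;
}
```

With ICR13 and ICR14 both holding level 2 and the same priority, this reports source 14. After changing ICR13 to level 1 it reports no conflict.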

I hope that helps...

Cheers,
Trevor

DarrenSteadman · ‎11-21-2007

I'll give that a try and see what happens.

As for changing the define in the vectors file, it didn't actually do anything. I tried commenting out irq_handler and the linker complained, so it was still using it.

I'll try the priority change, as I am using two UARTs. If that doesn't work, I'll try removing irq_handler, redirecting the vectors to asm_exception_handler, and seeing what I get.

TrevorCurry_eu · ‎11-21-2007

I isolated my problem to the uart 2 interrupt by putting my own dummy irq handler in the uart 2 vector - and the code broke on that!
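The same trick can be extended to several suspect vectors at once by generating one small stub per vector that just records its own number. The sketch below is plain C so the idea can be tried on a PC. All names are hypothetical, and on the target each stub would also need the compiler's interrupt qualifier (__interrupt__ in the CodeWarrior project) and its own entry in vectors.s. The vector numbers 77 to 79 are only examples.

```c
/* Last vector seen and a hit count per vector, for inspection in the
 * debugger or from the console. Hypothetical names for this sketch. */
volatile unsigned demo_last_vector;
volatile unsigned long demo_hits[256];

/* Record one hit. No printf here, so the stub stays short. */
void demo_record(unsigned vector)
{
    demo_last_vector = vector & 0xFFu;
    demo_hits[vector & 0xFFu]++;
}

/* Generate one stub per vector under suspicion. */
#define DEMO_STUB(n) void demo_irq_stub_##n(void) { demo_record(n##u); }

DEMO_STUB(77)
DEMO_STUB(78)
DEMO_STUB(79)
```

A breakpoint or watchpoint on demo_last_vector then shows which stub fired. Keeping printf out of the stubs avoids the stack and UART problems described earlier in the thread.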

Good luck,
Trevor
