K64 - How to debug an ENET issue

JHinkle · ‎11-11-2016

I'm running a test on my K64 with the FreeRTOS TCP stack - I wrote the K64 driver.

I have the K64 receiving UDP packets -- 16 packet bursts every 50 milliseconds.

It works great for 15 to 20 minutes and then the ENET just stops receiving packets -- it's dead.

I've checked all the receive buffers and they're all empty and available for use.

The only error I saw (looking in the EIR register) was a Recv Babbling error.

The physical is still blinking so it appears it is still working.

I checked register RCR and the GRS flag is set. (Graceful Receive Stop).

I must be getting old because I read the K64 documentation on Graceful Receive Stop and it says software or hardware can set it -- thereby stopping all reception. -- Set when MAC is in STOP mode or hardware Freeze mode.

I have debug enabled in the ENET so I suspect this flag gets set when I pause execution as an indication that the ENET has also paused.
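The register reasoning above can be written down as a small decoder that works on a snapshot of EIR, RCR and ECR. This is only a sketch: the bit positions in the macros are assumptions taken from a reading of the K64 reference manual and should be checked against the manual or the device header, and `enet_rx_state_t` and `decode_enet_rx()` are hypothetical names, not part of any driver mentioned in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMED bit positions (verify against the K64 reference manual). */
#define EIR_BABR    (1u << 30) /* babbling receive error */
#define EIR_RXF     (1u << 25) /* receive frame interrupt pending */
#define RCR_GRS     (1u << 31) /* graceful receive stopped (read-only status) */
#define ECR_ETHEREN (1u << 1)  /* MAC enabled */
#define ECR_DBGEN   (1u << 4)  /* MAC freezes while the core is in debug mode */

/* Hypothetical summary of the receive-side state. */
typedef struct {
    bool mac_enabled;
    bool babbling_rx;
    bool rx_frame_pending;
    bool rx_stopped;
    bool stop_explained_by_debug_halt;
} enet_rx_state_t;

/* Decode a register snapshot. core_halted says whether the snapshot was
 * taken while the debugger had the core paused. */
static void decode_enet_rx(uint32_t eir, uint32_t rcr, uint32_t ecr,
                           bool core_halted, enet_rx_state_t *out)
{
    assert(out != NULL);
    out->mac_enabled      = (ecr & ECR_ETHEREN) != 0;
    out->babbling_rx      = (eir & EIR_BABR) != 0;
    out->rx_frame_pending = (eir & EIR_RXF) != 0;
    out->rx_stopped       = (rcr & RCR_GRS) != 0;
    /* GRS seen with DBGEN set and the core halted may simply reflect the
     * hardware freeze caused by the debugger, not the original fault. */
    out->stop_explained_by_debug_halt =
        out->rx_stopped && (ecr & ECR_DBGEN) != 0 && core_halted;
}
```

If `stop_explained_by_debug_halt` comes out true, the GRS flag says little about the failure itself. Logging the same registers from running code, without halting the core, would be one way to tell the two cases apart.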

If THIS is my issue -- please let me know.

Other than that -- the ENET behaves as if it is no longer receiving MII data.

How does one debug to see if there is still MII data coming into the ENET?

How does one debug to make sure the physical has not stopped working and THAT is the issue?

Any insights or comments - please share.

Joe

JHinkle · ‎11-12-2016

Solved it.

Pulled my head out of my ass ... and increased the number of receive ring descriptors. Made sure there were enough to handle a full burst.

Works like a charm.

Early testing set it too low. Forgot what the setting was until now.

Joe

JHinkle · ‎11-12-2016

Mark:

I had set the number of ring descriptors at 20 during early development.

Running a test of 16 UDP messages in tight succession resulted in failure after a random duration.

The app is spec'd to handle 96 messages within 3 millisec -- that grouping every 50 millisec.

As soon as I had the message generator start to transmit a 96 message burst -- the ENET halted during the second burst. My receive counter showed 149 messages processed before it halted.

I wanted to guarantee that a grouping of 96 within 3 millisec would always be serviced -- I raised the receive descriptor count to 100 and the failures stopped.
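The sizing arithmetic behind that choice can be sketched as below. It takes the worst case, where nothing is drained until the whole burst has arrived. The helper name `rx_descriptors_for_burst()` and the `spare` margin are hypothetical. The 746-byte frame length is an assumption: 700 bytes of UDP payload plus 8 (UDP), 20 (IPv4 without options), 14 (Ethernet header) and 4 (FCS stored in the buffer), with no VLAN tag. The 1536-byte "full size" buffer is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: receive descriptors needed to absorb one burst when
 * no descriptor is recycled before the burst ends (worst case). */
static uint32_t rx_descriptors_for_burst(uint32_t frames_per_burst,
                                         uint32_t frame_bytes,
                                         uint32_t buffer_bytes,
                                         uint32_t spare)
{
    assert(buffer_bytes > 0u);
    /* A frame longer than one buffer spans several descriptors: round up. */
    uint32_t per_frame = (frame_bytes + buffer_bytes - 1u) / buffer_bytes;
    if (per_frame == 0u) {
        per_frame = 1u; /* even an empty frame occupies one descriptor */
    }
    return frames_per_burst * per_frame + spare;
}
```

Under those assumptions, `rx_descriptors_for_burst(96, 746, 1536, 4)` gives 100, matching the count chosen here. With chained 256-byte buffers the same burst would need `rx_descriptors_for_burst(96, 746, 256, 4)`, which is 292, because each frame then spans three buffers.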

I tested it for 8 hours today and it ran like a charm.

The processor is running around 70% during the 50 millisec period to fully process the messages.

Timing-wise -- I have to receive the majority of the burst within a very short period of time and then spend the rest of the burst gap processing them.

Makes perfect sense once I understood the issue.

Thanks for your help.

Joe

mjbcswitzerland · ‎11-12-2016

Joe

I tend to use about 8 rx buffers (full size), or in environments that have large amounts of short broadcasts I activate the define USE_MULTIPLE_BUFFERS, which chains 256 byte buffers instead (this increases the number of buffers for short frame reception by a factor of about 6).

96 sounds like a huge number of buffers (and space). In the worst case, when there is no buffer space for further receptions, the EMAC will drop a frame and increment the dropped frame counter so that the fact can be monitored. With 8 buffers it is already very difficult to get overruns, even with streaming UDP data at high speeds, but it will depend on the efficiency of the stack being used and RTOS performance.

Running out of Rx buffers wouldn't normally cause a stall in operation so I would recommend reducing to a fairly small number for tests since increasing the number even more might just be reducing the chance of whatever was causing problems happening and so not really be a fully reliable solution. Eg. Try having also a small storm of (short length) broadcast frames at the same time as your basic operation and see whether it survives, since real networks do have these sometimes and you need to be able to at least continue operating, even if it causes some dropped frames.

Regards

Mark

P.S: The MIB counters are enabled by default so can be read in the debugger at any time. They should be reset once when starting using something like

MIBC = (MIB_DISABLE | MIB_CLEAR); // ensure MIB is disabled while resetting and command clear to reset all counters in the maintenance block RAM
MIBC = 0;
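One way to use the counters for the question "is anything still reaching the MAC?" is to compare two snapshots taken a short time apart while the sender is known to be active. The sketch below is hypothetical throughout: `rx_stats_t`, its two fields and `classify_rx()` are illustration names, and on real hardware the fields would have to be filled from whichever ENET statistics registers count good and dropped receive frames.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of two receive statistics counters. */
typedef struct {
    uint32_t rx_ok;      /* frames received without error */
    uint32_t rx_dropped; /* frames the MAC dropped */
} rx_stats_t;

typedef enum {
    RX_NO_TRAFFIC_SEEN, /* neither counter moved */
    RX_DROPPING_ONLY,   /* only the dropped counter moved */
    RX_RECEIVING        /* good frames are still being counted */
} rx_verdict_t;

/* Compare two snapshots taken while the sender is transmitting. */
static rx_verdict_t classify_rx(const rx_stats_t *before,
                                const rx_stats_t *after)
{
    assert(before != NULL && after != NULL);
    /* Unsigned subtraction stays correct across a counter wraparound. */
    uint32_t ok      = after->rx_ok - before->rx_ok;
    uint32_t dropped = after->rx_dropped - before->rx_dropped;

    if (ok != 0u) {
        return RX_RECEIVING;
    }
    if (dropped != 0u) {
        return RX_DROPPING_ONLY;
    }
    return RX_NO_TRAFFIC_SEEN;
}
```

A verdict of `RX_NO_TRAFFIC_SEEN` would point toward the PHY, the MII link or a stopped receiver. `RX_DROPPING_ONLY` would suggest frames arrive but find no usable buffer, and `RX_RECEIVING` with a silent application would point at the driver or interrupt path.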

JHinkle · ‎11-11-2016

One other note:

I am receiving the 16 packets - multicast.

JHinkle · ‎11-11-2016

My test just failed again.

ENET is dead and I don't have the foggiest idea of what steps to take to debug it.

I've had this ENET driver working full time for months without issue. I just started stressing it by sending 16 UDP packets (700 bytes each) at a 50 millisec rate (that's 16 packets tightly grouped - sent every 50msec).

I blink an LED every time a packet arrives so I can monitor things. I have 96 receive buffers -- all are empty, which says the receive task has not died ... the Rcv IRQ has just stopped firing and there is no interrupt flag that says one came in and is waiting ... so I conclude something has halted the ability to receive packets.

ENET or physical?

I can scope the physical but I'm using MII mode so it's a wide bus -- I have not read up on how it works to understand what the signals mean.

any ideas?

Thanks.

Joe

JHinkle · ‎11-11-2016

Mark:

I had implemented my own receive/error/transmit counters so I did not pay attention to the internal RMON registers.

Mine show all zeros.

How do you enable them? I've searched the documentation and nothing suggests how to enable them.

Thanks.

Joe

JHinkle · ‎11-11-2016

When I first started writing the driver, I wrote bit-bang code to talk directly to the KSZ. I forget exactly where, but I found a condition where I was attempting to drive a bit at the same time the KSZ was -- per the spec ... as I said -- I can't remember the details.

I was single stepping thru the code to verify the design .. and after several debugging passes -- the K64 output pin fried.

I was pissed at myself for not catching it and was pissed at the amount of time it was taking, ... so ... I decided to shelve the whole thing to a later date.

Ironically, -- I was just thinking about that interface again the other day ... so I will take a look at yours and see how close I was to your implementation.

Thanks.

Joe

JHinkle · ‎11-11-2016

Thanks Mark:

I have checked the MIB counters -- I'll check again when it fails again --- it's been running for 30 minutes now with no issue.

I have a link check taking place every 5 secs in my idle task -- link is always up.

I'm using a Micrel - KSZ8863MLL - 3 port switch as my physical. Right now I'm simply using the MII interface to talk to it (very much reduced register set) but that is always up and running also.

It's failed twice today --- now watch -- it will run for days before it fails again. I hate these types of issues.

Joe

mjbcswitzerland · ‎11-11-2016

Hi Joe

I have used the KSZ8863MLL in several Kinetis and Coldfire based products - in the file that I attached there is an SMI driver for it so that it can be fully controlled - writes are in fact MDIO compatible but reads need to be bit-banged.

The tail-tagging mode is very interesting since it allows two Ethernet ports to be created from a single Ethernet controller, as well as switch operation, which suits multi-homed stack applications. The routine that sets/resets tail-tagging mode is fnSetTailTagMode().

Regards

Mark

mjbcswitzerland · ‎11-11-2016

Joe

Check the MIB counters in the EMAC to see whether they are counting frame reception when you have the problem since it may tell you that it is dropping frames due to a certain reason.

Verify that link state changes are being correctly handled by the ENET driver, including resetting and restarting the EMAC when it takes place.

I have worked on a number of products using K64 and don't know of any stalls in reception - I just know that the errata E2647 workaround is recommended (needed) for transmission, although theoretically this is no longer an errata in this chip.

Attached is the ENET driver from the uTasker project as reference.

Regards

Mark