See answers inline:
>Under identical test conditions in your shop or at 100 different customer locations (receiving different data patterns)?
A: I only have one unit in the shop that does this "consistently": sometimes every 20 minutes, sometimes not for a week. Most are out in the wild at customer locations.
>Check all the voltages. Check for noise. Add lots more bypass caps to a failing unit. Check for crystal/oscillator stability.
A: I did check the supply: 3.3 V with a DMM. But you make me think I should scope it, since it could be sag/spike related.
>Check for undershoots/overshoots on the external pins (between the CPU and the PHY).
A: I did put a scope on the fec_RXDV pin on the MII interface, to see if the PHY was stopped. The pulses look good on that one, and they never stopped.
>Run hot and cold.
A: did this, couldn't find a correlation.
>Run at low and high supply voltages.
A: this would be a challenge, I'll look into it.
>Change the clock speed (if you can).
A: I'll check what our PLL is set to. The xtal is 25 MHz and the CPU runs at 50 MHz.
>Play with the Arbiter settings.
A: I'll look into this, not sure what/where it is.
>but the only thing that should stop the receive interrupts, is either there's no data arriving (something went wrong with the PHY or the pin programming) or the receiver is "stuck" on a full descriptor and is waiting for you to read it. The latter can happen if the ring is full, or if it is just out of sync.
A: I don't think it's the PHY, since the fec_RXDV pin keeps pulsing. And as you verified, the BDs are all marked as empty.
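For reference, here is a minimal sketch of how the "ring is full or out of sync" case could be checked from a debugger hook or diagnostic routine. It assumes the usual ColdFire FEC receive descriptor layout (16-bit status, 16-bit length, 32-bit buffer pointer) with E = 0x8000 and W = 0x2000. The type and function names (`rx_bd_t`, `rx_ring_count_empty`, `rx_ring_wrap_ok`) are made up for illustration and are not from my driver.

```c
#include <stdint.h>
#include <stddef.h>

/* Receive buffer descriptor, layout assumed from the ColdFire FEC. */
typedef struct {
    uint16_t status;  /* control/status flags */
    uint16_t length;  /* length of the received frame */
    uint32_t buffer;  /* address of the data buffer */
} rx_bd_t;

#define RX_BD_E 0x8000u  /* Empty: descriptor is owned by the FEC */
#define RX_BD_W 0x2000u  /* Wrap: last descriptor in the ring */

/* Count the descriptors the FEC still owns (E bit set). */
size_t rx_ring_count_empty(const rx_bd_t *ring, size_t n)
{
    size_t empty = 0;
    for (size_t i = 0; i < n; i++) {
        if (ring[i].status & RX_BD_E) {
            empty++;
        }
    }
    return empty;
}

/* Return 1 if the W bit is set on the last descriptor and nowhere else. */
int rx_ring_wrap_ok(const rx_bd_t *ring, size_t n)
{
    if (n == 0) {
        return 0;
    }
    for (size_t i = 0; i + 1 < n; i++) {
        if (ring[i].status & RX_BD_W) {
            return 0;  /* stray wrap bit would shorten the ring */
        }
    }
    return (ring[n - 1].status & RX_BD_W) != 0;
}
```

If the empty count equals the ring size and the wrap check passes while interrupts have stopped, the ring itself is neither full nor truncated, which would point away from the "stuck on a full descriptor" explanation.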
>I still think your PHY has locked up or you've got a hardware problem external to the chip.
A: The PHY isn't completely locked up, since the system is sending ARP requests, and we are getting fec_RXDV pulses for new frames.
>Are you monitoring for Link Status through the MII?
A: no
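A rough sketch of what polling link status over MII could look like, in case it helps anyone following along. It relies on the standard IEEE 802.3 clause 22 register map (BMSR is register 1, link status is bit 2 and latches low, so it is read twice). The `mdio_read_fn` callback and the `sim_*` stand-ins are hypothetical: on the target the callback would wrap whatever MMFR access the driver already has.

```c
#include <stdint.h>

#define PHY_REG_BMSR     1u       /* Basic Mode Status Register */
#define BMSR_LINK_STATUS 0x0004u  /* bit 2: link is up (latches low) */

/* Hypothetical callback that performs one MDIO register read. */
typedef uint16_t (*mdio_read_fn)(uint8_t phy_addr, uint8_t reg);

/* Return 1 if the PHY currently reports link up, 0 otherwise. */
int phy_link_is_up(mdio_read_fn mdio_read, uint8_t phy_addr)
{
    /* The first read returns (and clears) any latched link-down event. */
    (void)mdio_read(phy_addr, PHY_REG_BMSR);
    /* The second read reflects the current link state. */
    return (mdio_read(phy_addr, PHY_REG_BMSR) & BMSR_LINK_STATUS) != 0;
}

/* Host-side stand-in for real hardware: simulates a latched link-down
 * followed by the live register value. Illustrative values only. */
static uint16_t sim_bmsr_latched = 0x7809u;  /* link bit clear */
static uint16_t sim_bmsr_live    = 0x780Du;  /* link bit set */

uint16_t sim_mdio_read(uint8_t phy_addr, uint8_t reg)
{
    (void)phy_addr;
    if (reg != PHY_REG_BMSR) {
        return 0xFFFFu;
    }
    uint16_t value = sim_bmsr_latched;
    sim_bmsr_latched = sim_bmsr_live;  /* latch clears after being read */
    return value;
}
```

Calling `phy_link_is_up(sim_mdio_read, 0)` with the simulated values reports link up even though the first read still shows the stale link-down, which is the reason for the double read.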
>Do you have a Link LED connected to the PHY?
A: Yes, the link LED is on, and the activity LED blinks with every TX frame the device sends. It is not designed to blink on RX packets.
>You've got MSCR set to 0x0000000A which means MSCR[MII_SPEED] is "5".
>That's the recommended value for a System Clock of 25MHz. Is that what you're running at? If you're running at 66 or 80MHz you should change that divider.
A: The CPU runs at 50 MHz. I'll look at the MSCR setting to see what is more appropriate.
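As a sanity check on the divider, here is a small sketch of the arithmetic. It assumes the ColdFire-style relationship MDC = fsys / (2 * MII_SPEED), the 2.5 MHz MDC limit from IEEE 802.3, and MII_SPEED sitting in MSCR bits 6:1 (consistent with 0x0000000A decoding to "5" above). The function names are made up for illustration, and the formula should be confirmed against the reference manual for the specific part.

```c
#include <stdint.h>

#define MDC_MAX_HZ 2500000u  /* IEEE 802.3 upper limit for MDC */

/* Smallest MII_SPEED for which fsys / (2 * MII_SPEED) <= 2.5 MHz
 * (ceiling division, so the limit is never exceeded). */
uint32_t fec_mii_speed(uint32_t fsys_hz)
{
    return (fsys_hz + 2u * MDC_MAX_HZ - 1u) / (2u * MDC_MAX_HZ);
}

/* MII_SPEED occupies MSCR bits 6:1, so shift the field left by one. */
uint32_t fec_mscr_value(uint32_t fsys_hz)
{
    return (fec_mii_speed(fsys_hz) & 0x3Fu) << 1;
}
```

Under those assumptions, 25 MHz gives MII_SPEED 5 (MSCR 0x0A), while 50 MHz gives MII_SPEED 10 (MSCR 0x14). With the current 0x0A at 50 MHz, MDC would run at about 5 MHz, above the 2.5 MHz limit, though that only affects the management interface and not the data path.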
>Have you found a way to recover from this?
A: No. It requires a chip reset (restarting the debugger works too), which does a hard reset on the PHY.
>Does disabling and enabling the FEC fix it (ECR[ETHER_EN])? How about resetting and reprogramming (ECR[RESET])?
>I see you have a "foo" variable in the code that looks like you're using it to force reset and reprogramming. What have you found using that?
A: Nope, and nope. Yeah, the foo variable is there to force a re-init of the FEC and PHY, but to no avail.
>Do you have a reset control to the PHY?
A: Yes, thanks for making me look. :) I'll try playing with this in addition to re-initializing the FEC and PHY as above.
>Can you try and test it with Internal Loopback? Can you switch it to Loopback when it has failed?
A: I hadn't thought of this, I'll try and see what happens.
I have some homework to do. It'll probably be a day or so to check all this out. However, if you see anything in my current answers that points to something, please let me know.
Thanks,
Aaron