mcf52259 FEC RXF ISR stops sometimes

alager · ‎03-11-2015

We have a project where, after some unknown event, the FEC stops triggering the RXF interrupt. It could be minutes or days before the event happens. Wireshark captures haven't shown anything suspicious. Tx is still working; in fact, our clue that there is a problem is that the device refreshes its ARP table on its 20-minute boundary but never sees the response.

Things I've checked when this occurs:

EBERR is 0

RXF is 0

RXB is usually 1, but that mask is not set

ETHER_EN is 1

RDAR is 1

Also, not every device seems to do this. Out of 100 devices probably 5-10 exhibit this behavior.

Any help in how to troubleshoot this would be great.

Thanks,

Aaron

TomE · ‎03-11-2015

Compare your interrupt handling code with that here:

http://lxr.free-electrons.com/source/drivers/net/ethernet/freescale/fec_main.c#L1535

int_events = readl(fep->hwp + FEC_IEVENT);
writel(int_events, fep->hwp + FEC_IEVENT);
fec_enet_collect_events(fep, int_events);

You should be reading the event register (FEC_IEVENT in that driver, EIR in the ColdFire manual), writing the exact value you read back into it to clear those bits, and THEN checking the event bits in the value you read to see what you should be doing (like scanning through the Receive BDs). That's the safest and most efficient way to handle the events without losing any.

Are you sure you're not taking the interrupt, scanning the (one or more ready) received buffer descriptors and THEN clearing the interrupt? That's guaranteed to lose them. You can also get multiple receives and only one interrupt, so you have to read "all that are there" and not just one.
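The ordering described here (read the event register, acknowledge exactly what was read, then drain every ready descriptor) can be sketched as below. This is a host-side model rather than driver code: `fake_eir`, `eir_ack`, `rx_ring` and `rx_service` are hypothetical names, the event register is modeled as a plain variable with write-1-to-clear behaviour, and the bit values (RXF = 0x02000000, RxBD Empty = 0x8000) are assumptions based on this thread and the usual FEC descriptor layout.

```c
#include <stdint.h>

#define EIR_RXF      0x02000000u  /* receive-frame event, value used in this thread */
#define BD_EMPTY     0x8000u      /* RxBD status: descriptor is owned by the FEC */
#define RX_RING_LEN  8u           /* ring size chosen for this model */

typedef struct {
    uint16_t status;   /* flags: Empty, Wrap, ... */
    uint16_t length;   /* bytes received */
    uint32_t buffer;   /* buffer address */
} rx_bd_t;

/* Hypothetical stand-ins for the hardware so the sketch runs on a host. */
static uint32_t fake_eir;              /* models the event register */
static rx_bd_t  rx_ring[RX_RING_LEN];  /* models the receive ring */
static unsigned rx_next;               /* next descriptor to inspect */

/* Writing a 1 to an event bit clears it (write-1-to-clear). */
static void eir_ack(uint32_t bits) { fake_eir &= ~bits; }

/* Returns the number of frames handed on to the stack. */
unsigned rx_service(void)
{
    uint32_t events = fake_eir;  /* 1. read what is pending */
    unsigned handled = 0;

    eir_ack(events);             /* 2. acknowledge exactly those bits, first */

    if (events & EIR_RXF) {      /* 3. only now look at the ring */
        /* One interrupt may cover several frames: drain until Empty. */
        while (handled < RX_RING_LEN &&
               !(rx_ring[rx_next].status & BD_EMPTY)) {
            /* ... pass rx_ring[rx_next] to the network stack here ... */
            rx_ring[rx_next].status |= BD_EMPTY;  /* give it back, keep Wrap */
            rx_next = (rx_next + 1u) % RX_RING_LEN;
            handled++;
        }
    }
    return handled;
}
```

Because the acknowledge happens before the ring is scanned, a frame that arrives during the scan sets the event bit again and causes another interrupt instead of being lost. On real hardware the driver would also write RDAR after returning descriptors.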

Tom

alager · ‎03-17-2015

Tom,

Thanks for the reply. Our receive interrupt is "stand alone". Each interrupt has its own vector, so we don't need to figure out why we are in the interrupt. So (as can be seen in the code snippet below) we immediately clear the EIR bit for this interrupt.

__declspec(interrupt) static void fecISR_RXF ( void )
{
    devCtrl->eir = 0x02000000; // clear the bit
    // stuff a message to send to the handler task
    ...
}

Attached is a screenshot showing the FEC registers and the buffer descriptors. From what I can understand, the buffers are pretty much all free. No dropped frames etc.

As a side note, we are sending frames just fine. The device is sending an ARP request for the gateway repeatedly, which is responding each time. But we don't see the answer.

Aaron

TomE · ‎03-17-2015

> Out of 100 devices probably 5-10 exhibit this behavior.

Under identical test conditions in your shop or at 100 different customer locations (receiving different data patterns)?

The usual then. Check all the voltages. Check for noise. Add lots more bypass caps to a failing unit. Check for crystal/oscillator stability. Check for undershoots/overshoots on the external pins (between the CPU and the PHY). Run hot and cold. Run at low and high supply voltages. Change the clock speed (if you can). Play with the Arbiter settings.

Hit it with "sudo ping -f -l 20 -s 100" from a Linux box with small and large packet sizes and see if you can get it to fail more often.

> From what I can understand, the buffers are pretty much all free.

I'll analyse that next, but only two things should stop the receive interrupts: either there's no data arriving (something went wrong with the PHY or the pin programming), or the receiver is "stuck" on a full descriptor and is waiting for you to read it. The latter can happen if the ring is full, or if it is just out of sync.

First I want to check the EMRBR and RCR[MAX_FL] against the programmed buffer sizes.

EMRBR = 000005F0 - Buffer Size is 05F0 or 1520

RCR = 05EE0004 - MAX_FL is 05EE or 1518

ERDSR = 10040000

10040000 = 80000040 10040100 -

10040008 = 80000040 10040740 - 740 - 100 = 640 = 1600 bytes per buffer.

That's all fine. You're using 8 Receive Descriptors, all are empty and the last one has the "Wrap" bit set.

What about RDAR? It is 0x01000000, so the FEC thinks it has free descriptors.
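The register checks above are a handful of field extractions. A minimal sketch, assuming these field positions (RCR[MAX_FL] in bits 26-16, EMRBR[R_BUF_SIZE] in bits 10-4, and the RxBD Empty flag 0x8000 in the upper half of the first descriptor word), which should be confirmed against the reference manual. The function names are invented for illustration.

```c
#include <stdint.h>

/* RCR[MAX_FL]: maximum frame length, assumed to be bits 26-16. */
uint32_t rcr_max_fl(uint32_t rcr) { return (rcr >> 16) & 0x7FFu; }

/* EMRBR[R_BUF_SIZE]: assumed bits 10-4, so the low four bits are always zero. */
uint32_t emrbr_buf_size(uint32_t emrbr) { return emrbr & 0x7F0u; }

/* First 32-bit word of an RxBD: status in the top half, length below. */
int rxbd_is_empty(uint32_t word0) { return ((word0 >> 16) & 0x8000u) != 0; }
uint32_t rxbd_length(uint32_t word0) { return word0 & 0xFFFFu; }

/* For a driver that expects one frame per buffer: the largest frame must
   fit in the size the FEC is told about (EMRBR), and that size must fit
   in the memory actually allocated per buffer (stride). */
int rx_sizes_consistent(uint32_t rcr, uint32_t emrbr, uint32_t stride)
{
    return rcr_max_fl(rcr) <= emrbr_buf_size(emrbr) &&
           emrbr_buf_size(emrbr) <= stride;
}
```

With the values from the screenshot this gives 1518 <= 1520 <= 1600, and `rxbd_is_empty(0x80000040)` is true, which matches the conclusion that the sizes are fine and the descriptors are all free.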

I still think your PHY has locked up or you've got a hardware problem external to the chip. Are you monitoring for Link Status through the MII? Do you have a Link LED connected to the PHY?

You've got MSCR set to 0x0000000A which means MSCR[MII_SPEED] is "5". That's the recommended value for a System Clock of 25MHz. Is that what you're running at? If you're running at 66 or 80MHz you should change that divider.

Have you found a way to recover from this? Does disabling and enabling the FEC fix it (ECR[ETHER_EN])? How about resetting and reprogramming (ECR[RESET])? Do you have a reset control to the PHY? I see you have a "foo" variable in the code that looks like you're using it to force reset and reprogramming. What have you found using that?

Can you try and test it with Internal Loopback? Can you switch it to Loopback when it has failed?

Tom

alager · ‎03-18-2015

See answers inline:

>Under identical test conditions in your shop or at 100 different customer locations (receiving different data patterns)?

A: I only have one in the shop that will do this "consistently": sometimes every 20 minutes, sometimes not for a week. Most are out in the wild at customer locations.

>Check all the voltages. Check for noise. Add lots more bypass caps to a failing unit. Check for crystal/oscillator stability.

A: I did check the supply, 3.3 V with a DMM, but you make me think I should scope it; it could be sag/spike related.

>Check for undershoots/overshoots on the external pins (between the CPU and the PHY).

A: I did put a scope on the fec_RXDV pin on the MII interface, to see if the PHY was stopped. The pulses look good on that one, and they never stopped.

>Run hot and cold.

A: did this, couldn't find a correlation.

>Run at low and high supply voltages.

A: this would be a challenge, I'll look into it.

>Change the clock speed (if you can).

A: I'll look to see what our PLL is set to. The xtal is 25 MHz, the CPU runs at 50 MHz.

>Play with the Arbiter settings.

A: I'll look into this, not sure what/where it is.

>but the only thing that should stop the receive interrupts, is either there's no data arriving (something went wrong with the PHY or the pin programming) or the receiver is "stuck" on a full descriptor and is waiting for you to read it. The latter can happen if the ring is full, or if it is just out of sync.

A: I don't think it's the PHY due to the fec_RXDV pin continually sending pulses. And as you verified, the BD are all marked as empty.

>I still think your PHY has locked up or you've got a hardware problem external to the chip.

A: The PHY isn't completely locked up, since the system is sending ARP requests, and we are getting fec_RXDV pulses for new frames.

>Are you monitoring for Link Status through the MII?

A: no

>Do you have a Link LED connected to the PHY?

A: yes link is on, and the activity LED blinks with every TX frame the device sends. It is not designed to blink with RX packets.

>You've got MSCR set to 0x0000000A which means MSCR[MII_SPEED] is "5".

>That's the recommended value for a System Clock of 25MHz. Is that what you're running at? If you're running at 66 or 80MHz you should change that divider.

A: The CPU runs at 50 MHz. I'll look at the MSCR setting to see what is more appropriate.

>Have you found a way to recover from this?

A: No. it requires a chip reset (restarting the debugger works too), which does a hard reset on the PHY.

>Does disabling and enabling the FEC fix it (ECR[ETHER_EN])? How about resetting and reprogramming (ECR[RESET])?

>I see you have a "foo" variable in the code that looks like you're using it to force reset and reprogramming. What have you found using that?

A: nope, and nope. yeah, the foo variable is to force the re-init of the fec and phy, but to no avail.

>Do you have a reset control to the PHY?

A: yes, thanks for making me look. :-) I'll try playing with this in addition to re-initing the FEC and PHY as above.

>Can you try and test it with Internal Loopback? Can you switch it to Loopback when it has failed?

A: I hadn't thought of this, I'll try and see what happens.

I have some homework to do. It'll probably be a day or so to check all this out. However, if you see anything in my current answers that points to something, please let me know.

Thanks,

Aaron

TomE · ‎03-18-2015

> A: I don't think it's the PHY due to the fec_RXDV pin continually sending pulses.

There are 16 FEC-related pins on the CPU. See if you can see a difference (in level, frequency, slope) between "working" and "not working".

Do you have any code monitoring the LINK bit in the PHY over the MII? Just in case some of its internal bits flip when it goes wrong it might show something. You must have something monitoring the PHY (to get 10/100 MHz auto-negotiation information for matching the FEC to the PHY), but I noticed the MII interrupt wasn't enabled. Is it polling with a timer?

We have some very simple code monitoring the link. In case this is useful for someone else looking to do this, we periodically send this:

// start a fresh read
MCF_FEC_MMFR = MCF_FEC_MMFR_ST_01 | MCF_FEC_MMFR_OP_READ |
               MCF_FEC_MMFR_PA(phy_addr) | MCF_FEC_MMFR_RA(1) | MCF_FEC_MMFR_TA_10;

And the MII ISR does this:

void fec_mii_isr( void )
{
    MCF_FEC_EIR = MCF_FEC_EIR_MII;                 // acknowledge the MII event
    fec_link_stat = ( MCF_FEC_MMFR & 0x04 ) != 0;  // PHY register 1, bit 2: link status
}
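For readers without the `MCF_FEC_MMFR_*` header macros, the read frame written to MMFR can be built by hand. This is a sketch under the assumption that MMFR uses the standard MII management frame layout (ST in bits 31-30, OP in 29-28, PA in 27-23, RA in 22-18, TA in 17-16, DATA in 15-0); `mmfr_read_frame` and `mmfr_link_up` are hypothetical helper names, not part of any vendor header.

```c
#include <stdint.h>

/* Build an MII management read frame for the given PHY and register. */
uint32_t mmfr_read_frame(uint32_t phy_addr, uint32_t reg_addr)
{
    return (0x1u << 30)                 /* ST = 01, start of frame */
         | (0x2u << 28)                 /* OP = 10, read */
         | ((phy_addr & 0x1Fu) << 23)   /* PA, 5-bit PHY address */
         | ((reg_addr & 0x1Fu) << 18)   /* RA, 5-bit register address */
         | (0x2u << 16);                /* TA = 10, turnaround */
}

/* PHY register 1 is the IEEE 802.3 basic status register; bit 2 is link
   status. The read data comes back in the low 16 bits of MMFR. */
int mmfr_link_up(uint32_t mmfr) { return (mmfr & 0x0004u) != 0; }
```

Note that the link status bit in the basic status register latches low, so after a brief dropout the first read reports "down" even if the link has already recovered. Polling periodically, as described above, takes care of that.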

> > Have you found a way to recover from this?

> A: No. it requires a chip reset (restarting the debugger works too), which does a hard reset on the PHY.

So a reset of the FEC doesn't fix it. So what does a chip reset do that resetting the FEC doesn't? It reprograms the GPIO pins.

You should check to see if any of the GPIO registers that control assignment of the 16 FEC pins have flipped, been corrupted, or gotten stomped by a rogue pointer somehow.

"Table 2-1. Pin Functions by Primary and Alternate Purpose" shows the FEC pins are PTI and PTJ (selected by PTIPAR and PTJPAR) as well as being affected by PSRRH and PDSRH.

You have a powerful tool there. In the screen snapshot you provided the debugger helpfully highlighted the I/O registers that had changed since the last time it sampled them. That removes the drudgery of manually comparing or decoding them all. That didn't show anything interesting with the FEC registers, so I suggest you see if any *OTHER* I/O registers changed.

Does the register window only show changes in the ones showing in the window, in the ones where that part of the tree is "open" or does it sample and list ALL of them? It might show a change that is causing this.

You could also get the "foo" code to call the function that reprograms all the GPIO pins, and see if that fixes it.
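If the debugger can't watch every register, the same comparison can be done in firmware: capture the pin-assignment registers once after init and compare them periodically, or when the failure is detected. A minimal sketch, assuming byte-wide registers; `watch_add` and `watch_check` are hypothetical helpers, and the real register addresses (PTIPAR, PTJPAR and so on) would come from the device header.

```c
#include <stddef.h>
#include <stdint.h>

#define WATCH_MAX 8u

typedef struct {
    const volatile uint8_t *reg;  /* address of the register to watch */
    uint8_t expected;             /* value captured right after init */
} reg_watch_t;

static reg_watch_t watch[WATCH_MAX];
static size_t watch_count;

/* Record the current value of a register; returns 0 on success, -1 if full. */
int watch_add(const volatile uint8_t *reg)
{
    if (watch_count >= WATCH_MAX) return -1;
    watch[watch_count].reg = reg;
    watch[watch_count].expected = *reg;
    watch_count++;
    return 0;
}

/* Returns the index of the first register that changed, or -1 if none did. */
int watch_check(void)
{
    for (size_t i = 0; i < watch_count; i++) {
        if (*watch[i].reg != watch[i].expected) return (int)i;
    }
    return -1;
}
```

A non-negative result from `watch_check()` while receive is dead would point at pin programming. A result of -1 would suggest looking elsewhere.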

Tom

alager · ‎03-19-2015

Tom,

Thanks for your help. It's nice to bounce ideas off of fresh minds. I started probing the MII interface and it turns out that the RX_CLK was only 0.5 Vpp. It was a bad solder joint. Argh!

I think the fact that the problem would go away after a reset was leading me astray. Your input definitely got me thinking about the whole picture.

Now to figure out if we have a production issue with the rest of these things that don't keep talking Ethernet.

Aaron

TomE · ‎03-19-2015

> It was a bad solder joint.

I'm assuming you found a genuine dry or cracked joint, verified it visually, verified by bending the board (or flexing the joint) and then fixed by resoldering. My suggestion of "heat it, cool it" might have found that. I didn't suggest "bend it" because you said you had lots of other units with the "same fault".

> I think the fact that the problem would go away after a reset was leading me astray.

I can't see how a reset would fix a bad solder joint, unless you have a reset push-button on the board and you bent the board while resetting it. Or you reset it by physically unplugging it and mechanically stressed the board while doing that. But you said "reset from the debugger" fixed it, so that's not the case.

The bad joint doesn't explain all the other units. They must have different faults to the one with the bad joint. It would be a very strange hardware design or production problem that consistently creates one specific bad joint.

It might still be possible that you have some tracks shorted together. You might have some other pin on the CPU shorted to a FEC signal (maybe even RX_CLK), and when that pin is programmed as an output, there's a "fight" between the PHY pin and the CPU pin and the resulting signal exceeds the receiver threshold on some boards at some times and not on others. Throwing a bad joint into that mix would certainly show a bad signal level. Shorted tracks could be a design/production problem.

Tom

alager · ‎03-18-2015

>Do you have a Link LED connected to the PHY?

Testing with code that blinks for both RX and TX shows that there is no discernible difference in the LED behavior between normal and abnormal operation.

The link LED is steady on, and turns off when I unplug the cable.

alager · ‎03-18-2015

Loopback testing in the FEC seems to indicate that the issue isn't in the FEC.

1) when things are operating normally, I set the loop and prom bits, and I then get an interrupt for each frame I send.

2) set the bits back to normal, and watch operation until the failure appears

3) again set the prom and loop bits and I get an interrupt for each frame I send.

4) turn loop off, leave prom on, still no RX interrupts.
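The bit manipulation behind steps 1 to 4 can be sketched as below, assuming the usual RCR layout (LOOP in bit 0, DRT in bit 1, PROM in bit 3) and the requirement that DRT be clear while LOOP is set. The helper names are hypothetical, and the functions only compute the new register value; writing it to RCR is left to the caller.

```c
#include <stdint.h>

#define RCR_LOOP 0x00000001u  /* internal loopback */
#define RCR_DRT  0x00000002u  /* disable receive on transmit */
#define RCR_PROM 0x00000008u  /* promiscuous mode */

/* Value to write to RCR to loop transmitted frames back internally. */
uint32_t rcr_enter_loopback(uint32_t rcr)
{
    return (rcr | RCR_LOOP | RCR_PROM) & ~RCR_DRT;
}

/* Value to write to RCR to return to the wire, optionally staying promiscuous. */
uint32_t rcr_leave_loopback(uint32_t rcr, int keep_prom)
{
    rcr &= ~RCR_LOOP;
    if (!keep_prom) rcr &= ~RCR_PROM;
    return rcr;
}
```

Starting from the RCR value in the register dump (0x05EE0004), entering loopback gives 0x05EE000D. Since internal loopback does not use the external MII receive signals, getting RXF interrupts in the failed state while looped back, and none once LOOP is cleared, points at the path between the PHY and the FEC pins rather than at the FEC itself.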

alager · ‎03-18-2015

>You've got MSCR set to 0x0000000A which means MSCR[MII_SPEED] is "5".

>That's the recommended value for a System Clock of 25MHz. Is that what you're running at? If you're running at 66 or 80MHz you should change that divider.

So as I said, the CPU runs at 50 MHz. The driver code tried to write the value 0xA into the register, but since MII_SPEED is shifted up by one bit relative to bit 0 of MSCR, the value 5 was actually ending up in the field.

At 50 MHz, an MII_SPEED of 5 makes the MDC clock 5 MHz, which is out of spec. I've fixed the driver to write an actual 0xA into MII_SPEED, and am testing now.
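The arithmetic can be checked with a small sketch. It assumes MDC = system clock / (2 x MII_SPEED), a 2.5 MHz MDC limit, and that MII_SPEED occupies bits 6-1 of MSCR, which is consistent with the values discussed in this thread. The function names are made up for illustration.

```c
#include <stdint.h>

#define MDC_MAX_HZ 2500000u  /* MDC limit assumed here: 2.5 MHz */

/* Smallest MII_SPEED that keeps MDC = fsys / (2 * MII_SPEED) at or below the limit. */
uint32_t mii_speed_for(uint32_t fsys_hz)
{
    return (fsys_hz + 2u * MDC_MAX_HZ - 1u) / (2u * MDC_MAX_HZ);
}

/* Raw MSCR value: the field starts at bit 1, so shift the field value left by one. */
uint32_t mscr_for(uint32_t fsys_hz)
{
    return (mii_speed_for(fsys_hz) & 0x3Fu) << 1;
}

/* MDC frequency produced by a raw MSCR value (0 means MDC is switched off). */
uint32_t mdc_hz(uint32_t fsys_hz, uint32_t mscr)
{
    uint32_t speed = (mscr >> 1) & 0x3Fu;
    return speed ? fsys_hz / (2u * speed) : 0u;
}
```

For a 50 MHz system clock, `mscr_for(50000000)` is 0x14 (MII_SPEED = 0xA, MDC = 2.5 MHz), while the raw value 0x0A seen in the register dump gives `mdc_hz(50000000, 0x0A)` = 5 MHz.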

Aaron

alager · ‎03-18-2015

Well the MII_SPEED wasn't THE problem, but it was A problem. Still messing up.

EmbEng · ‎09-02-2021

@alager were you able to solve the issue mentioned in the subject? I am having a similar issue too.

TomE · ‎09-08-2021

Alager last posted over 6 years ago. Who would still be working on developing products based on the same 12 year old (2009) chips 6 years later? I guess you for one, and me for two :-).

Please describe your system and your symptoms. Make a list of all of the suggestions that were investigated in this thread, check them on your system, and then list all of them in a post to this thread. Your best bet is to check all of those things. You'll probably solve it by doing that.

There's no "one cause" to these problems, no one single magic bullet.

There's the "Search" field at the upper right where you can look for other people with the same problem. That's probably how you found this thread. I did that back in 2017, so you might want to check all of these. It is worth checking Coldfire and i.MX solutions as they pretty much have the same Ethernet controller. There's one there that definitely found and fixed a (common) programming problem.

https://community.nxp.com/thread/457044

Extra references to other i.MX problems:

i.MX6 FEC stops generating receive interrupts
https://community.nxp.com/thread/322882

FEC on imx6q stops receiving packets
https://community.nxp.com/thread/384970

imx6d fec buffer descriptor update incoherent
https://community.nxp.com/thread/435323

FEC ethernet packetloss
https://community.nxp.com/thread/316594

https://community.nxp.com/message/936589

But here's another thing that could be causing this: the Interrupt Priority Programming. All of the MCF52xx chips have this problem (it was fixed in the MCF53xx ones). If you get this wrong you can LOSE interrupts when two happen at the same time.

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Coldfire-V2-Kernel-ISR-invoked-even-while...

There are three ways to make these bad things happen with interrupts:

1 - Use IPL7. So just don't. Ever. Unless you really need a non-maskable interrupt to fire ONCE.

2 - Interrupts with the same Level and Priority. You can't have duplicates as it confuses the interrupt controller.

3 - Leaving CPU interrupts enabled when you change the interrupt enables in the interrupt controller or anywhere else. Make sure you force the CPU to IPL7 around all of these, and that explains why (1) above is a bad idea too.

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/Spurious-interrupt-with-declspec-interrup...
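Points 1 and 2 can be checked mechanically at start-up. A minimal sketch, assuming each ICR byte holds the level in bits 5-3 and the priority in bits 2-0, and treating level 0 as "source not in use" (an assumption made for this example). `icr_table_check` is a hypothetical helper that takes a copy of the ICR values of all sources.

```c
#include <stddef.h>
#include <stdint.h>

#define ICR_LEVEL(icr)    (((icr) >> 3) & 0x7u)
#define ICR_PRIORITY(icr) ((icr) & 0x7u)

typedef enum { ICR_OK = 0, ICR_USES_LEVEL7, ICR_DUPLICATE } icr_result_t;

/* Scans n ICR values and reports the first rule violation found. */
icr_result_t icr_table_check(const uint8_t *icr, size_t n)
{
    uint8_t seen[8] = {0};  /* seen[level]: bitmask of priorities already taken */

    for (size_t i = 0; i < n; i++) {
        unsigned level = ICR_LEVEL(icr[i]);
        unsigned prio  = ICR_PRIORITY(icr[i]);

        if (level == 0) continue;                 /* assumed unused, skip */
        if (level == 7) return ICR_USES_LEVEL7;   /* rule 1: no IPL7 */
        if (seen[level] & (1u << prio)) return ICR_DUPLICATE;  /* rule 2 */
        seen[level] |= (uint8_t)(1u << prio);
    }
    return ICR_OK;
}
```

Rule 3 (masking the CPU to IPL7 around changes to the interrupt enables) depends on the compiler and target, so it is not modeled here.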

Tom
