MCF5xxx ColdFire V2 FEC RX Interrupt Stops Firing After "Packet Storm"

jeremym
Contributor IV

We have a scenario where 12 Ethernet frames are received (or an attempt is made to receive them) in 37 microseconds by the MCF52259 FEC, which causes the RX interrupts to cease firing.

When in the failure state the transmit portion of the FEC works fine.  I have checked all the pin assignments and they look fine, as do the interrupt enables/masks for receiving.  I have tried putting the FEC into loopback mode, but when it transmits I see no receive interrupts.  This makes me think it's an issue with the FEC and not the PHY or external circuit.  If we reduce the "packet storm" to 4-8 packets in the same window the error does not seem to occur.

***Additional Info:

 - In the failure state the debugger shows that all our RX BDs are available, but the CONTROL field (in MQX) of each BD still has the MCF5XXX_FEC_BD_ETHER_RX_EMPTY bit set.  The FW is responsible for setting this bit, and the FEC is presumably responsible for clearing it when data is received.  Again it looks like the RX portion of the FEC is hung.

The only recovery I have found is to pin-reset the chip.  I am wondering:

1) other than the RX interrupt enable/mask is there any other way that the FEC could prevent the RX interrupt from firing?  Possibly lack of ring buffer space or something similar?

2) Any clues on resetting just the RX portion of the FEC to attempt a recovery?

thanks,

Jeremy

jeremym
Contributor IV

The MQX driver does nearly the same thing as your example; it does check for available RX buffer descriptors prior to the empty check, though.  One thing I noticed is that the MQX driver currently only enables the TXB and RXF interrupts.  The comment in the code states that TXF and TXB are co-generated, but that RXB is NOT co-generated with RXF.  The comment then goes on to state that RXB is purposefully NOT enabled, nor needed, because it results in HBERR interrupts firing.  According to the Ref. Manual, RXF indicates that a frame has been received and the last corresponding buffer descriptor has been updated.  RXB indicates that a receive buffer descriptor, not the last in the frame, has been updated.

I wonder if not enabling RXB is resulting in missing a critical buffer-descriptor servicing operation by the firmware.  Even though I didn't see any buffer descriptors left in a 'bad' state during the failure, I may be missing something more subtle.

I am going to experiment with putting RXB back in and handling HBERR should they fire.

*** UPDATE: Enabling the RXB interrupt and the HBERR interrupt seems to have resolved the issue.  The HBERR handler simply clears the interrupt.  I was able to 'survive' the packet flood and recover as expected.  MQX driver has the HBERR handler coded, and it remains a mystery as to why they did not leave the RXB enabled along with the HBERR.  Maybe there will be other side-effects, but for now we will move forward with this enhancement and do integration testing before we commit the change to the mainline.  The comment in the MQX code to the effect that RXB is not necessary appears to be true as long as you don't overflow/overrun your RX buffers.
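For anyone following along, the change amounts to setting two extra bits in the interrupt mask (EIMR) and clearing HBERR in its handler.  A minimal sketch of the before/after mask, assuming the usual Freescale FEC bit positions and MCF_FEC_* macro naming from the header files (verify against your own headers):

```c
#include <stdint.h>

/* FEC EIR/EIMR bit positions, per the MCF52259 Reference Manual FEC
 * chapter; the MCF_FEC_* names mirror the Freescale header macros. */
#define MCF_FEC_EIMR_HBERR 0x80000000UL
#define MCF_FEC_EIMR_TXF   0x08000000UL
#define MCF_FEC_EIMR_TXB   0x04000000UL
#define MCF_FEC_EIMR_RXF   0x02000000UL
#define MCF_FEC_EIMR_RXB   0x01000000UL

/* Stock MQX mask (TXB+RXF only) vs. the mask that survived the
 * packet flood (RXB and HBERR added). */
static uint32_t fec_eimr_mask(int enable_rxb_hberr)
{
    uint32_t mask = MCF_FEC_EIMR_TXB | MCF_FEC_EIMR_RXF;
    if (enable_rxb_hberr)
        mask |= MCF_FEC_EIMR_RXB | MCF_FEC_EIMR_HBERR;
    return mask;
}
```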

Tom - thanks for your help - your persistence on this issue was key.

TomE
Specialist II

Type "FEC receive interrupt" into the Search at top right and go through the previous reports.

This one was a bad joint, but there are a lot of suggestions for further investigation in there:

https://community.nxp.com/message/494380?commentID=494380#comment-494380

It sounds like you have an interrupt race condition. You have to lock out interrupts when running any code that accesses the FEC that isn't directly called from the interrupt. The FEC also has THIRTEEN different interrupt vectors. You have to be very careful in your interrupt controller setup to ensure that none of those interrupts can interrupt each other. That is made maddeningly difficult in CFV2 because all interrupts on the one controller have to have unique priorities and levels, and there's only 8 priorities in one level, and you have 13 interrupts, and 13 into 8 won't go...

Where did your FEC driver come from? Did you write it yourself? Is it from a commercial or otherwise OS?

MQX?

Compare your FEC driver with the closest Linux one. Any obvious difference might show what you're doing wrong.

http://elixir.free-electrons.com/linux/latest/source/drivers/net/ethernet/freescale

See how this one works. Download and then see if you can RUN the demo project. See if it has any Ethernet problems (unlikely).

http://www.utasker.com/kirin3.html

Tom

jeremym
Contributor IV

Hi Tom - 

Thanks for the info.  We are using the MQX driver (currently 3.6.x) - I have compared the driver to 3.8.x and don't see any differences.  The interrupt priorities are likely the issue; I will look into the possibility that we have a race condition.    Is there any errata associated with this, or a best practice recommendation for the V2?  We have not modified the FEC driver in any way.

thanks,

Jeremy

TomE
Specialist II

> 12 ethernet frames are received (or attempted to be received) in 37 microseconds

Is "12" a "magic number"? Does that correspond to anything in your code, like 12 buffers in the receive ring? Can you send longer and shorter bursts and see where the problem starts?

https://en.wikipedia.org/wiki/Ethernet_frame#Structure

12 in 37 microseconds is 3.08us/frame or 308 bits at 100MBit/s or 38 octets per frame. Since the minimum Ethernet frame length is 72 octets ignoring the IPG, I'm going to have to say there's something wrong with your measurements. The maximum rate for minimum size Ethernet packets at 100MHz is ((72+12) * 8)/100 = 6.72us, or 12 in 80.6us.
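The arithmetic above can be checked with a few lines of C (the function name is mine, just for illustration):

```c
/* Wire time in nanoseconds for one Ethernet frame at 100 Mbit/s.
 * frame_octets includes preamble/SFD + header + payload + FCS (so 72
 * octets for a minimum frame); the 12-octet inter-packet gap (IPG) is
 * added on top.  At 100 Mbit/s each bit takes 10 ns. */
static unsigned long frame_time_ns(unsigned long frame_octets)
{
    const unsigned long ipg_octets = 12;
    return (frame_octets + ipg_octets) * 8UL * 10UL;
}
```

A minimum frame comes out at 6720 ns (6.72 us), so 12 back-to-back minimum frames need 80640 ns, i.e. the quoted 80.6 us rather than 37 us.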

How many receive buffers do you have? Is MQX using simple code with huge buffers that wastes memory (so one Ethernet frame fits into one 1600-byte buffer) or is the code chaining smaller buffers together like mine does?

Try testing with longer and shorter Ethernet packets and see if that makes a difference. Longer ones allow more time for the code to handle the interrupts, That might tell you something.

The next question in diagnosing these sorts of problems is "did it ever work"? So did you break it or was it broken to start out with? Have you done something with the setup or configuration of the system that is making it fail? Consider changing the buffer sizes or numbers. Check the interrupt level and priority setup. Keep reading the "Note" in "16.3.6 Interrupt Control Registers (ICRnx)" until you fall off your chair. If that note doesn't scare you then "you don't understand the problem".

Does MQX allow multi-level interrupts or does it do some "single-level simplification" like disabling all interrupts (setting IPL7) on entry to every interrupt service routine? If it does that then some other interrupt in your system might be locking out the Ethernet interrupts for too long, causing an overrun. It should be able to recover from this, but you don't want it happening.

Is anyone reading this forum who runs MQX on a V2-series ColdFire with the same FEC as the MCF52259 and is using Ethernet? That sort of information would be very helpful.


Then there's the MQX forum, but a quick search there doesn't show anything.

Now the bad news. You're using MQX so you don't have to understand the details of this chip. But when it fails, the only way out may be to get a deep understanding of the CPU, interrupts and the FEC.

The best way to do that is to read the FEC chapter a few times, and then construct a flowchart of the MQX driver. Then do the same for the Linux and uTasker drivers. Look for differences.

The way the receive code must be written is as follows:

1 - Interrupt on EIR[RXF] or EIR[RXB] (usually RXF unless you're being very tricky).

2 - Write "EIR = RXF" to clear the request, FIRST.

3 - Loop through the receive ring, reading buffers until you find an empty one.

If the code writes to RXF after reading the rings it is doomed to fail. If it only reads one buffer per interrupt it is also doomed to fail.

I don't know how good the MQX drivers are. I do know that many drivers for other Freescale CPUs (i.MX53) are more like "sales demo quality" and need serious work before then can be put in a reliable product. I've had bad experiences with CAN (multiple faults), SPI (multiple faults), I2C (lockup), PWM (glitches badly), FEC (IPv6 Multicast doesn't work) and so on.

Tom

jeremym
Contributor IV

Hi Tom - thanks again for the input.  Here are some answers:

1) yes the driver works almost all the time - we have a new system behavior which produces these 'bursts' of packets - have had no issues with the driver over the last 7 years

2) the measurements of the packet timings are from a wireshark capture on the network.  Your comment resulted in us looking at the packet size; we use our own layer-2 protocol with nearly all the packets being under 72 octets in length.  We are wondering if that may be at the root cause of the issue with the FEC?  We are going to pad them out and investigate.

The rest of the issues you raise with the driver seem to be addressed properly in the MQX driver we are using.  I did try adjusting priorities and adding more RX buffers (RAM is very scarce) - and only adding RX buffers seems to alleviate the issue somewhat.  I am guessing that it allows us to keep up longer during the bursts.  We originally had 4 RX buffers and increased it by '1' to 5 total.  This almost completely alleviates the issue under our current system design.  

It's the adding of a single RX buffer which still makes me think it's some driver issue - or some fundamental problem with the FEC when we overrun it?  Will update this post when we try padding out the Ethernet frames to the minimum.

TomE
Specialist II

> we use our own layer-2 protocol with nearly all the packets being under 72 octets in length

Is that an accident, ignorance or do you have an "Evil Genius" at work there?

Are you pushing the performance of your network so much that you need these short packets?

To do that (use short packets) you may need to make your own Custom Silicon Controllers for all devices on the network using those packets. Including any Ethernet Switches.

The minimum Ethernet frame length is a *BIG* part of Ethernet. It is a required part of the lowest-level network protocol, and is assumed to be there by the designers of all of the controllers.

At the lowest and original level (the original 10MHz Ethernet) the minimum packet length is essential to be able to detect packet collisions over a maximum-length repeater-based network. Collisions within the first that-many bytes are considered "normal" and result in a Retry. And collisions after that time are a "Late Collision Error".

Any frame shorter than "64 bytes" (where you start measuring from has always confused me) should be detected as a "runt packet" and rejected:

21.5.8 FEC Frame Reception

After a collision window (64 bytes) of data is received and if address recognition has not rejected the
frame, the receive FIFO signals the frame is accepted and may be passed on to the DMA. If the frame is a
runt (due to collision) or is rejected by address recognition, the receive FIFO is notified to reject the frame.

I think that means that hardware is allowed to reject shorter packets "just because they're short", although they might have logic that only rejects them if they're short with a CRC or framing error. This chip doesn't have a flag for this error.

Also read this:

https://en.wikipedia.org/wiki/Ethernet_frame#Runt_frames

There's a video after the above link that takes 4 minutes to say this, and gets it wrong (about the speed of light), and the slides don't mention half-duplex or switches either.

100MHz Ethernet means the network can only be 1/10 as long as the 10MHz one - which was 2.5km. So you're not running half-duplex, so errors can't happen, and the distance from the switch to the unit is never more than 100m, so the minimum length isn't required for collision detection. So why might it cause problems?

As an aside, you've written that you don't have much RAM. That's because you only have 64k for the whole system in that chip. And you need the receive and transmit rings and the buffers in there. I'm using an MCF5234 that also has 64k with 32k reserved for the stack. 16k is reserved for the Receive Buffers. In that I have 64 Receive Buffers. That means I'm using 256-byte receive buffers and chaining them together. If you have overrun or latency problems then you might want to do the same thing (if you're not doing that already). Does the MQX driver support chained buffers or does it require 1600-byte receive buffers?

The reason for "256" is due to this note:

21.4.24 Receive Buffer Size Register (EMRBR)

To minimize bus utilization (descriptor fetches), it is recommended that EMRBR be greater than or equal
to 256 bytes.

They don't mean "minimize bus utilization" for your benefit. They mean "the microprogrammed machine which is doing all the Ethernet work runs out of steam if you ask it to try and get a new ring buffer more often than once per 256 data bytes (which at 100MHz is 20.48us)". You can always make hardware more efficient and faster, but it costs money to do that, and there's no reason to be faster than the specification allows.
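The buffer arithmetic behind the "64 Receive Buffers in 16k" scheme is worth sketching (function names are mine; MQX may or may not support chaining this way):

```c
/* With chained receive buffers of EMRBR bytes each, a frame of len
 * octets occupies ceil(len / EMRBR) descriptors and buffers, and a
 * fixed pool yields pool_bytes / EMRBR buffers in the ring. */
enum { EMRBR = 256 }; /* minimum recommended by the RM */

static unsigned buffers_per_frame(unsigned len)
{
    return (len + EMRBR - 1) / EMRBR;
}

static unsigned buffers_in_pool(unsigned pool_bytes)
{
    return pool_bytes / EMRBR;
}
```

So a 16k pool gives 64 ring entries, a minimum frame takes one of them, and a full 1518-byte frame chains across six - far better small-packet burst tolerance than four or five 1600-byte buffers for the same RAM.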

And since the specification allows the designer:

> The maximum rate for minimum size Ethernet packets at 100MHz is ((72+12) * 8)/100 = 6.72us

You can be pretty sure the hardware isn't going to be able to pick up new buffer descriptors (to receive new messages into) too much faster than that.

This chip also has to manage the memory-based ring buffers AND a data-FIFO with high and low water marks. The minimum packet size assumption may well be baked into that too.

There's another thing to watch out for too. When the CPU accesses RAM it may slow down the Ethernet controller. RAM is a shared resource.

Check "14.6.3 Bus Master Park Register (MPARK)" settings and also how you have the two halves of the SRAM assigned. If you can keep the FEC in one half and the CPU in the other half (especially its STACK) then that might make things better (Table 12-2. RAMBAR Field Descriptions (continued)).

You don't want to have the CPU accessing the FEC's registers any more than absolutely necessary. Sometimes CPU reading of those registers can slow a peripheral down. In one famous (old) case, the CPU polling of a common floppy controller would cause it to lock up and not be able to read the disk at all. This old characteristic is likely to be in new hardware too.

Tom

TomE
Specialist II

What hardware are you using to transmit those short packets? Which manufacturer's silicon lets you do that?

My assertion that "Ethernet packets have to be padded" had me worried. So I went trying to find some proper references. The "usual place" (Wikipedia) says that "Runts are shorter than the minimum size and are rejected" and that CSMA/CD requires the minimum length to detect collisions.

Ethernet - Wikipedia 

That's not good enough for me. It took hours to finally get "the right answer", but I'll leave that until later. Since we're all running 100MHz full-duplex Ethernet through Switches, nothing should be colliding any more, so that justification for the minimum packet length doesn't exist. So maybe it isn't true any more.

The original specifications for Ethernet is IEEE 802.3. I've got a bound copy of

IEEE 802.3 (Draft, 1984)

    4.4 Specific Implementations

      4.4.2 Allowable Implementations
      4.4.2.1 Parameterized Values
        The following table identifies the parameters
        that shall be used in the 10 MB/s implementation
        (type 10BASE5) of a CSMA/CD procedure.

           minFrameSize 512 bits (64 octets)

That (yes, 1984!) applies to 10BASE5 (the old "Yellow Hose Ethernet", but not specifically to anything else. But I have the printed 10BASE-T Supplement, and it says:

IEEE 802.3 Supplement Section 13 and 14 (10BASE-T) (1991)
     Revisions to 802.3 1990 edition
       4.4.2.1 Parameterized Values
         In first sentence, delete "(type 10BASE5)".

Instead of saying "the Parameters for this standard are..." or "They're the same as the 10BASE5 ones" it says "Delete this bit in the original so by omission it now applies to the new standard". IEEE Pokemoni - "Gotta Catch 'Em All" as they all reference each other!

I could spend a few days (and a thousand dollars or more) getting all of the updates to the standards and seeing which bits of other standards they delete, but that's no fun: See how many there are:

Ethernet physical layer - Wikipedia 

Instead I'll look at the Ethernet sources and see how the drivers enforce (or not) the minimum packet length there.

Linux Sources:

include/linux/if_ether.h

#define ETH_ALEN 6 /* Octets in one ethernet addr */
#define ETH_HLEN 14 /* Total octets in header. */
#define ETH_ZLEN 60 /* Min. octets in frame sans FCS */
#define ETH_DATA_LEN 1500 /* Max. octets in payload */
#define ETH_FRAME_LEN 1514 /* Max. octets in frame sans FCS */
#define ETH_FCS_LEN 4 /* Octets in the FCS */

net/core/pktgen.c: pkt_dev->min_pkt_size = ETH_ZLEN;

As for where the above definition is used, it gets VERY interesting. The drivers are all in "drivers/net/ethernet", and some of the drivers use the above definition, but others don't. Specifically I can find 171 references, some of which look like this:

3com/3c501.c: if (len < ETH_ZLEN)
3com/3c501.c: pad = ETH_ZLEN - len;
3com/typhoon.c: * it with zeros to ETH_ZLEN for us.
8390/axnet_cs.c: u8 packet[ETH_ZLEN];
8390/axnet_cs.c: send_length = max(length, ETH_ZLEN);
8390/axnet_cs.c: memset(packet, 0, ETH_ZLEN);
8390/lib8390.c: char buf[ETH_ZLEN];
8390/lib8390.c: if (skb->len < ETH_ZLEN) {
8390/lib8390.c: memset(buf, 0, ETH_ZLEN); /* more efficient t
8390/lib8390.c: send_length = ETH_ZLEN;
amd/7990.c: len = (skblen <= ETH_ZLEN) ? ETH_ZLEN : skblen;
amd/7990.c: if (skb->len < ETH_ZLEN)
amd/7990.c: memset((void *)&ib->tx_buf[entry][0], 0, ETH_ZLEN);
amd/a2065.c: if (skb_padto(skb, ETH_ZLEN))
amd/a2065.c: skblen = max_t(unsigned, skb->len, ETH_ZLEN);

The controllers that use the definition include "3com: 3, 8390: 7, amd: 28, atheros: 4, brocade: 1, fujitsu: 9, hp: 1, i825xx: 32". The ones that DON'T reference this include "adaptec, adi, aeroflex, alteon, apple, broadcom, cadence, calxeda, chelsio, cirrus, cisco, davicom, dec, dlink, freescale, fujitsu, hp, i825xx".
Checking one of the ones that does include this reveals the following comment and code:

drivers/net/ethernet/amd/lance.c:
    /* The old LANCE chips doesn't automatically pad buffers to min. size. */
    if (chip_table[lp->chip_version].flags & LANCE_MUST_PAD) {

"Automatically pad"??? Note none of the Freescale Ethernet drivers reference the limit. That's because:

MCF52259 Reference Manual, FEC Chapter:
21.5.7 FEC Frame Transmission
Transmit logic automatically pads short frames (if the TC bit in the transmit buffer descriptor
for the end of frame buffer is set).

You don't have to worry about padding in a FEC driver because it is an automatic function of the hardware. You can't send a short packet through a FEC. Well, you can - you just have to calculate the CRC yourself and clear the "TC" bit. But you're not meant to.

So if you're sending "short packets" on the wire, you're using some other manufacturer's Ethernet chip to do this. One that looks like it requires the Driver to do the padding. And you're not doing that.
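For a controller that, like the old LANCE, does not pad in hardware, the driver has to do what the Linux `skb_padto()` path does: zero-fill the frame out to ETH_ZLEN before queuing it. A sketch of that idea (`pad_frame` is my own name, not a Linux or MQX API):

```c
#include <string.h>
#include <stddef.h>

#define ETH_ZLEN 60 /* Min. octets in frame sans FCS (if_ether.h) */

/* Zero-pad a frame to the legal minimum.  buf must have room for at
 * least ETH_ZLEN octets.  Returns the length to hand to the MAC. */
static size_t pad_frame(unsigned char *buf, size_t len)
{
    if (len < ETH_ZLEN) {
        memset(buf + len, 0, ETH_ZLEN - len);
        len = ETH_ZLEN;
    }
    return len;
}
```

The zero padding is harmless to the receiver because the real payload length travels in the protocol headers, not in the wire length.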

> It took hours to finally get "the right answer", but I'll leave that until later. 

Which is now. If you're running full duplex, why do you still need to obey the minimum packet length? Because Ethernet is backwards-compatible and has never changed the standard to "orphan" previous generations. You can still plug an old (or small and cheap, and usually old) device that only supports 10BASE-T into a switch, and it will still work. The switch will receive the packet at 100MHz and transmit it unchanged out the other port at 10MHz. It expects the "minimum packet length for collision" to still work. You can ever bridge back to 10BASE2 and even a 2.5km long 10BASE5 network, and they still should work. That means the 100BASET (or even Gigabit) packet your hardware sends may end up as-is, same length and with the same CRC on a 1990 era coaxial Ethernet. And it still has to have that padding for the collision detection to work there. That's why it is in the standard, and why all current hardware is allowed to assume it is there.

You could argue that the chips should be able to handle short packets if you aren't bridging back to old networks, but the chips were neither designed NOR TESTED with short packets. Anything not tested doesn't work. Anything tested still doesn't work, but you're meant to try and fix it.

Tom

TomE
Specialist II

That being said (having the packets too short), that may not be the real problem.

The real problem may be that the FEC Driver code may not be able to recover from receive overruns properly.

If you make the packets longer there's more time available to process them. If you add more buffers then the hardware will be able to buffer the 5 packets without overflowing. Both of these may appear to "fix" the problem, but it may still be there waiting to come back. You only need to receive "buffers-plus-one" messages while the firmware is busy doing something else and not able to process the packets.

I would suggest that you test for this by adding a deliberate "stall for a few milliseconds with interrupts disabled" loop in the code and then use "sudo ping -f -l <preload> -c <count>" or equivalent to force an overrun with legal-length packets. The driver will either be able to recover from this (so the problem is really the short packets) or it won't (so there's a driver bug).

Tom

jeremym
Contributor IV

Thank you for the great research and info Tom; by freeing up some RAM and adding more RX buffers we are able to pretty much eliminate the lockup.  It's not really satisfactory in that it doesn't address the root issue, but I am unable to find any obvious faults in the driver.  Stopping the debugger during the failure seems to point to the FEC being truly 'locked' up on the RX side, since the BDs appear to be left in a state in which the FW/driver has done its thing and the FEC is failing to receive any longer (all BDs and associated buffers are free and available).  I really appreciate the time you spent helping on this.

TomE
Specialist II

So many unanswered questions. Mainly most of the ones I asked. Some are to try and help, others to find out what you're doing with the product in a general way.

Were you sending packets shorter than the legal minimum length? Are you STILL doing this? Did changing that fix anything?

Did you test with a "flood ping"?

Does this "lockup" correspond to a FEC Overrun (OV bit in an RX BD)?

You've made the queue deeper, which may mean that if more than one other device on your network sends those "back-to-back" packets then the problem will happen again.

Do you have a "recovery" strategy in place? Do you have a way of detecting the FEC Receiver going off line (like no inbound traffic for a period of time)? Do you have a way to get it working again other than resetting the CPU? Or are you resetting the CPU?

Have you tried dumping ALL the registers (you should be able to do that from the Debugger) when it is working and when it is locked up and look for any differences you haven't noticed otherwise?

I can't check the source of your driver as you have to get a license to MQX in order to read the sources. So I can only suggest what to look for in there.

Looking at our driver and the Linux one, and also reading the FEC chapter in the Reference Manual, I have a suggestion.


When it fails, what is the content of the RDAR register? Is the RDAR bit in that register set or clear? Can you write to the RDAR bit again - you can do this from the Debugger if you're using that to cause an overrun. Does that get it working again?

It is essential that the Driver write to the RDAR register after servicing the Receive Ring and freeing up Receive BDs. If it isn't doing that, the FEC will work until it gets an Overflow. Then it will stop polling the ring and waits until the RDAR register is written. That's the trigger for it to start reading the ring again. If that is missing or wrong in the MQX code (especially in the path in the code that detects an Overflow) then that may be the problem.
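That FEC/driver handshake can be modeled off-target. In the sketch below the hardware and driver sides are simulated with plain variables (all names are mine, not MQX's): the "FEC" clears its RDAR flag when it hits a non-empty descriptor, and reception only resumes once the driver frees descriptors and writes RDAR again - the step that must not be skipped in the overflow path.

```c
#include <stdint.h>

enum { NRX = 4, BD_EMPTY = 0x8000 };

struct sim_fec {
    uint16_t bd[NRX]; /* descriptor control words */
    int next;         /* FEC's ring index */
    int rdar;         /* receive-descriptor-active flag */
};

/* Hardware side: try to receive one frame into the next descriptor.
 * Returns 1 if the frame was stored, 0 if it was dropped. */
static int sim_rx_frame(struct sim_fec *f)
{
    if (!f->rdar)
        return 0;                /* stopped: waiting for an RDAR write */
    if (!(f->bd[f->next] & BD_EMPTY)) {
        f->rdar = 0;             /* ring full: stop polling the ring */
        return 0;
    }
    f->bd[f->next] &= ~BD_EMPTY; /* fill the buffer, mark it used */
    f->next = (f->next + 1) % NRX;
    return 1;
}

/* Driver side: free every filled descriptor, then re-arm RDAR.
 * Omitting the final RDAR write leaves the receiver stopped forever. */
static int sim_service_ring(struct sim_fec *f)
{
    int freed = 0, i;
    for (i = 0; i < NRX; i++)
        if (!(f->bd[i] & BD_EMPTY)) {
            f->bd[i] |= BD_EMPTY;
            freed++;
        }
    f->rdar = 1;
    return freed;
}
```

Running five frames into a four-entry ring stops the receiver; servicing the ring (with the RDAR write) brings it back, which is exactly the recovery that appears to be missing.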

Tom

jeremym
Contributor IV

I can try to answer some of the questions - 

1) We have many V2 and one V4 coldfire in the system - as your research shows, all pad packets to meet the 'minimum' requirement.  Wireshark can be deceiving at times with length and we were getting fooled by what was reported - our in-house packets are 'too short' but were being padded appropriately, so I don't think length of packet was the issue.

2) Ping floods don't produce the failure per se, but sending any packet(s) back-to-back in a 'flood' type fashion causes the lockup.  

3) RDAR bit is set - as noted previously, all debugger registers appear to be 'fine' - it simply acts as if interrupts are internally disabled; all the registers of the FEC seem to have appropriate values when captured in 'good' and 'failed' state.  The RX FEC simply seems to be locked.  Transmit FEC is OK.

4) For recovery we have a 'master' V4 CPU which monitors the many V2's in the system (up to 64).  We are able to detect the issue and take appropriate action.  The MQX driver is incorporated in the firmware in such a way that resetting just the FEC is more difficult than just letting our 'master' detect the issue and reset the V2 in question.

5) Time to market ultimately dictates how much time I can spend chasing this, which is why I greatly appreciate your help.  Our system is so much larger than this single issue - which has popped up only after seven years of working with this part, OS and firmware - that sometimes workarounds are all we can afford to implement.  We are tasking our V2's well beyond what they are scoped for, but have no other alternative at this time.  Increasing the RX buffers only pushes the issue off, but future products will utilize newer processors with much greater resources and better OS's (Linux), so this product probably represents the last push with the V2.  If we are able to limp this V2 codebase through this release we will not have to re-address it.

6) I don't see a correlation between the overrun (or discarded packets for that matter) and the lockup when it occurs.  This seems really strange to me because I would have suspected that we would get a predictable number of discarded packets, and/or overruns, prior to the lockup each time, as our entire system is quite deterministic during this particular failure mode.  I did do an experiment where I put a delay loop (a for loop with 1000 cycles) in the FEC RX ISR and it locks up almost immediately with high traffic.  All the FEC RX ISR does is clear the pending interrupt and attempt to move the data into the ring buffer(s), as long as the RX interrupt bits are set.  I am not sure what to make of it; this is as close to the HW as we can get, and by stalling the clearing/handling of the RX interrupts it essentially locks up hard.  My gut feeling when encountering this is that something is wrong with the FEC - I would think that it would simply get overrun (packets would be lost) but then continue to operate.  Maybe the driver is NOT taking an action which is required once this has occurred, but I am not seeing/reading any action it is supposed to take?

*** Attached is a diff of all FEC registers, good on left, failed on right.  There is no difference between any of the primary (non-stats) registers with the exception of the ETSDR address, which differs in the failed case because we have fewer RX buffers (to make the error occur more quickly).  In the good case we have 6 RX buffers, in the failed case we have 2 RX buffers.

In this failed case RX interrupts cease to fire - the unmasked RX/TX interrupts are equivalent in both the working and failed states and there are no interrupts (that we handle via the MQX driver) pending.   

I am more inclined to believe that this is either a HW FEC error on the RX side, or we are setting up one of the FEC registers in an inappropriate way, and having more RX buffers is masking the issue.

TomE
Specialist II

> RDAR bit is set 

It should only remain set for a very short time. That's a signal to the FEC to run through the Receive Descriptors. When it finishes that it clears the RDAR bit. It is either stuck running through the descriptor ring, not finding what it wants, or something has gone wrong to stop it from reading the ring at all.

EDIT: No, I got that wrong. RDAR remains set as a signal to the FEC that it can start using the free descriptors. As it receives messages it steps through the ring, expecting the next one to be empty. It only steps (and reads a ring entry) when it needs one for an Ethernet message. When it finds the next one isn't empty, then it is stuck. THEN it clears RDAR as a signal that it has stopped, and won't bother looking or polling until triggered when the driver frees up some descriptors and sets RDAR again.

> future products will utilize newer processors with much greater resources and

> better OS's (Linux) and so this product represents probably the last push with the V2.

The promise is that you can spend all your time writing useful "Application Code" to run on top of that OS. My experience is that you can spend half your time fighting the now very complicated development system (compilers, configuration, build environment) and the rest fighting bugs in the OS drivers, or problems caused by the delays and very non-real-time nature of Linux. We're using the i.MX53, and there were bad bugs in Ethernet, CAN, SPI, PWM, GPIO, porting and problems with the file system. The 20-second boot time was also a killer. Being able to update the software where the Kernel, Modules, root file-system, application code and startup scripts have to be updated on a live system is difficult.

> All the FEC RX ISR does is clear the pending interrupt and attempt to move the data into

> the ring buffer(s), as long as the RX interrupts bits are set

Interrupts on the CF2 have a bad problem. All interrupts have to be manually programmed by you (or maybe MQX makes sure this is right) with unique priorities and levels. Get this wrong and it can generate bad interrupts. There are also a lot of interrupts from the Ethernet chip and if they're on different levels they can interrupt each other.

The Motorola "write one to clear" system is wonderful, but confuses people who are used to other interrupt controller systems (ARM/Intel). The usual way to handle an interrupt is to READ the EIR, and then write the exact value read back into the EIR. Then you fall into the code to service the interrupts in the COPY that you read.

The code we use on an MCF5235 is simpler than that and has been reliable:

__attribute__((interrupt_handler))
void fec_rx_isr( void )
{
    mip_pbuf *p;

    /* Clear interrupt from EIR register immediately */
    MCF_FEC_EIR = ( MCF_FEC_EIR_RXB | MCF_FEC_EIR_RXF );

    while (fec_input_ready())
    {
        /* fec_input() allocates mip_pbufs. */
        if ((p = fec_input()) != NULL)
        {
            mip_list_insert(&f_sFecCb.fifo, mip_pbuf_list(p));
        }
    }
}

The "fec_input_ready()" function just looks at the EMPTY bit in the current descriptor. The "fec_input()" function unloads all the filled BDs and then sets RDAR.
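In case it helps anyone reading later, here is roughly what that pair of functions does, collapsed into one loop. The BD layout and the deliver_frame() hook are illustrative only (the real BDs are the two-word big-endian structures from the Reference Manual, and a real driver would hand off pbufs):

```c
#include <stdint.h>

#define RX_RING_SIZE 8
#define BD_RX_EMPTY  0x8000u  /* E bit: descriptor owned by the FEC */
#define BD_RX_WRAP   0x2000u  /* W bit: last descriptor in the ring */

typedef struct {
    uint16_t status;          /* written by the FEC when a buffer fills */
    uint16_t length;
    uint8_t  data[256];
} rx_bd;

static rx_bd    ring[RX_RING_SIZE];
static unsigned next_bd;      /* ring index the CPU will service next */
static unsigned frames_seen;  /* stand-in for real frame delivery */

static void deliver_frame(const uint8_t *buf, uint16_t len)
{
    (void)buf; (void)len;
    frames_seen++;            /* a real driver would queue a pbuf here */
}

void fec_ring_init(void)
{
    for (unsigned i = 0; i < RX_RING_SIZE; i++)
        ring[i].status = BD_RX_EMPTY;
    ring[RX_RING_SIZE - 1].status |= BD_RX_WRAP;
    next_bd = 0;
    frames_seen = 0;
}

/* Walk from the last-serviced BD until the next EMPTY one, then
 * (on real hardware) write RDAR to tell the FEC it has free BDs. */
void fec_rx_drain(void)
{
    while (!(ring[next_bd].status & BD_RX_EMPTY))
    {
        rx_bd *bd = &ring[next_bd];

        deliver_frame(bd->data, bd->length);

        /* Hand the BD back to the FEC: set EMPTY, preserve WRAP. */
        bd->status = (uint16_t)(BD_RX_EMPTY | (bd->status & BD_RX_WRAP));

        next_bd = (next_bd + 1) % RX_RING_SIZE;
    }
    /* MCF_FEC_RDAR = anything;  -- on real hardware, restart receive */
}
```

Note the terminating condition is the EMPTY bit in the descriptor itself, never the interrupt status bit.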

We only have the RXF and MII interrupts enabled.

EDIT: I've had a problem with another device that could lead to this problem. The registers are a "shared resource" between the hardware (the FEC in this case) and the main CPU. When there's only one "writer" of a particular register, everything is (mainly) fine. When there are two writers (the CPU and the FEC independently performing operations on the same register) you get trouble. That's why designs (not Motorola/FSL ones) that expect the CPU to read a register, clear one bit and write it back are dangerous: if the hardware gets in between the read and the write and sets a different bit, that bit is accidentally cleared.

The FEC doesn't have that hazard, but if you're writing a one to the RXF bit (in order to clear it) at the same time as a new frame has come in and the FEC is trying to set that bit, then which device is going to win that one? You might never get another interrupt. The "safe" thing to do is to clear the interrupt FIRST and then read the descriptors and clear them. If you clear them and THEN clear the interrupt bit, then that might cause this situation. Check to see what order the code is doing this in.

I assume you have the descriptors and the buffers statically set up, as distinct from rotating a pool of buffers from the descriptors to the user code. With these CPUs you don't have enough RAM for that. I also guess you're reading the data from the buffers into a separate ring-buffer, and then reading from that buffer in your code. That means each byte is being handled at least three times. That copy operation may be using too much bus bandwidth and may be blocking the FEC from reading the descriptors or writing further data into the buffers. You could make your data copying less efficient (put some delays or NOPs in your copy loop) and see if it makes a difference.

Tom

TomE
Specialist II

> All the FEC RX ISR does is clear the pending interrupt and attempt to move the

> data into the ring buffer(s), as long as the RX interrupts bits are set.

I hope it isn't doing that, as that could be causing your problems. It is also the wrong way to handle this chip.

With something like a UART that has a "Data in FIFO" bit then your code should keep reading the FIFO while that bit is set. The "RX Interrupt Status Bit" does NOT say "there's still data to read" and the FEC doesn't know what you've read from the ring yet. That bit is only set when a new message arrives.

The receive code should work like the example I gave. It should take the interrupt, clear it, and then walk through the ring from the last-serviced descriptor (which was empty the last time there was an interrupt) until the next empty one. Then it should set RDAR and exit the routine, waiting for the next interrupt.

If the FEC driver is "polling on the RX interrupt", then is that your code, or is that the way the MQX driver in their "RTCS" package does it?

Tom

jeremym
Contributor IV

The MQX driver does nearly the same thing as your example; it does check for available RX buffer-descriptors prior to the empty check, though.  One thing I noticed is that the MQX driver currently only enables the TXB and RXF interrupts.  The comment in the code states that TXF and TXB are co-generated, but that RXB is NOT co-generated with RXF.  But then the comment goes on to state that RXB is purposefully NOT enabled, nor needed, because it results in HBERR interrupts firing.  According to the Reference Manual, RXF indicates that a frame was received and the last corresponding buffer descriptor has been updated.  RXB indicates that a receive buffer descriptor, not the last one in the frame, has been updated.

I wonder if not enabling RXB is resulting in the firmware missing a critical buffer-descriptor servicing operation.  Even though I didn't see any buffer descriptors left in a 'bad' state during the failure, I may be missing something more subtle.

I am going to experiment with putting RXB back in and handling HBERR interrupts should they fire.

*** UPDATE: Enabling the RXB interrupt and the HBERR interrupt seems to have resolved the issue.  The HBERR handler simply clears the interrupt.  I was able to 'survive' the packet flood and recover as expected.  The MQX driver already has the HBERR handler coded, and it remains a mystery why they did not leave RXB enabled along with HBERR.  Maybe there will be other side-effects, but for now we will move forward with this enhancement and do integration testing before we commit the change to the mainline.  The comment in the MQX code to the effect that RXB is not necessary appears to be true as long as you don't overflow/overrun your RX buffers.
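For anyone applying the same change: it boils down to widening the enable mask and adding a clear-only HBERR handler. A sketch with the registers mocked as plain variables (the bit positions follow the MCF52259 Reference Manual; the exact MQX hook points are left out, so treat the function names as hypothetical):

```c
#include <stdint.h>

#define FEC_INT_HBERR (1u << 31)  /* heartbeat error event */
#define FEC_INT_RXF   (1u << 25)  /* frame received */
#define FEC_INT_RXB   (1u << 24)  /* buffer (not last in frame) updated */

static uint32_t eimr;  /* stand-in for MCF_FEC_EIMR (interrupt mask)   */
static uint32_t eir;   /* stand-in for MCF_FEC_EIR  (pending events)   */

/* Enable RXB alongside RXF, plus HBERR so the spurious heartbeat
 * interrupts that RXB provokes can be acknowledged and ignored. */
void fec_rx_enable_ints(void)
{
    eimr |= FEC_INT_RXF | FEC_INT_RXB | FEC_INT_HBERR;
}

/* The HBERR handler does nothing except acknowledge the event
 * (models writing a 1 to EIR[HBERR] on the real chip). */
void fec_hberr_isr(void)
{
    eir &= ~FEC_INT_HBERR;
}
```

The RXB handler itself can share the normal RX ring-drain path, since draining walks descriptors until the next EMPTY one regardless of which event woke it.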

Tom - thanks for your help - your persistence on this issue was key.

TomE
Specialist II

Congratulations.

It might be worth telling Embedded Access about this, and maybe asking for details on how this should be handled properly, as well as asking what the HBERR thing is about.

> The comment in the code states that TXF and TXB are co-generated,

> but that RXB is NOT co-generated with RXF.

That's useful. The Reference Manual doesn't mention that at all. RXF and RXB are only meant to matter if you're using small buffers that the FEC has to chain for large packets. Maybe when it overruns it legitimately can't set RXF as it hasn't "finished a frame".

I'd expect to see this in the latest Linux Driver code, but it still only enables RXF and not RXB.

http://elixir.free-electrons.com/linux/v4.12.9/source/drivers/net/ethernet/freescale/fec.h#L377

Here's a thread on the i.MX6 where the receive queue is getting "stuck". I haven't read it thoroughly enough to see if it matches your symptoms - I don't think it does, but it is interesting.

https://community.nxp.com/thread/322882

I've read the chapter again and again, and it doesn't tell me what happens when the FEC runs out of free receive buffer descriptors. There's no specific "overflow" status bit in a control register to indicate this. There's an "OV" bit in the Buffer Descriptor, but that only says the FIFO overran. The FEC has to have a free descriptor to write this bit into, and if it has run out of descriptors, what does it do? Set the OV bit in the PREVIOUS Descriptor? It can't do that.

I've read through the i.MX53 FEC chapter. It looks like it is a copy of the Coldfire ones, and doesn't clear this up. I've read the i.MX6DQ Gigabit FEC chapter. The chip is now extremely complicated, but there's nothing saying what happens when it runs out of descriptors, how it signals this error (if at all) and how to recover from it.

> But then the comment goes on and states that RXB is purposefully NOT enabled, nor needed,

> because it results in HBERR interrupts firing.

That doesn't make sense, as HBERR is only meant to be generated if TCR[HBC] is enabled, and that should only happen on a TRANSMIT and not a RECEIVE. The chip must have a strange bug if this is true. One that isn't mentioned in Revision 3 or Revision 4 of the Reference Manual, the Reference Manual Errata or in the Chip Errata.

Read the following for other 8-year-old problems in those manuals and the "Bear PIT" problem. Follow the last link for a laugh:

https://community.nxp.com/message/879461?commentID=879461#comment-879461

Not your CPU, but in this one the Revision 3 Manual has the correct information, but the Revision 4 one doesn't, and that was never fixed:

https://community.nxp.com/message/307125?commentID=307125#comment-307125

Here's where I found a manual had a corrupted table and missing pages.

https://community.nxp.com/message/59948

Tom

TomE
Specialist II

You're not the only one having problems with the FEC.

Over in the i.MX forum they're having very similar problems with the FEC locking up after it has run out of Receive BDs.

This thread ran from April 2014 to October 2016 with NO RESOLUTION.

i.MX6 FEC stops generating receive interrupts 

Here's one where setting RDAR gets it working again. That indicates either a bug in the driver, which may not be setting RDAR at the proper time, or a bug in the chip where a write to RDAR doesn't take effect for some reason.

FEC on imx6q stops receiving packets 

Here's one documenting "incoherent" (late) updating of the RX_EMPTY bit.

https://community.nxp.com/message/834646 

This one is about packet loss when transmitting. The customer had a simple case demonstrating the problem and it was escalated to Freescale support, but then it got no further.

FEC ethernet packetloss 

Tom
