MCF54455 FEC FIFO overflows occurring

benfleming · ‎10-05-2016

I have an MCF54455 and I'm trying to debug an issue with the Ethernet FEC. When receiving packets, a large number of them arrive in my driver with the OV bit set in the buffer descriptor. The closest issue I have found on this site is "How can we prevent a receive fifo overrun on an MCF5235 FEC", which may be the same underlying issue. However, I tried the suggestions there to no avail, so I suspect the root cause was never found. Here is my setup:

MCF54455 running at 100MHz bus speed, KSZ8041NL PHY, MII connection

The issue occurs at both 10 and 100 Mbit/s, in both half and full duplex.

Buffer Descriptors live in on chip SRAM, buffers are in SDRAM in a non-cached region

RAMBAR = 0x80000000 + 0x235 (oddly, I have tried both +0xE35 and +0x221 and not seen a significant difference)

I allocated 16 receive descriptors, 16-byte aligned, and assigned them buffers of 2048 bytes, also 16-byte aligned.

The Receive buffers flag fields get 0x8000 (E bit) set, and the final BD gets the W bit set.

MCF_FEC_RCR(ch) = (1518 << 0x10) | 0x24

MCF_FEC_EMRBR(ch) = 0x62<<4 //1568 to leave room in case FEC module overwrites end of packet

MCF_FEC_FRSR(ch) = 0x48<<2 // I've played with this as well, and it doesn't seem to affect how often the OV occurs. The manual says R_FSTART should be 0x48 or greater, it defaults to 0x40. I don't notice a difference in OV occurring, however.
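For anyone checking the arithmetic behind those three register values, here is a minimal sketch in C. The macro and function names are illustrative and not taken from any vendor header, and reading 0x24 as MII_MODE plus FCE is an assumption that should be checked against the RCR bit layout in the reference manual.

```c
#include <stdint.h>

/* Illustrative field helpers; names are hypothetical, not from a vendor header. */
#define RCR_MAX_FL(n)    ((uint32_t)(n) << 16)    /* maximum frame length field */
#define RCR_FCE          0x20u                    /* assumed: flow control enable */
#define RCR_MII_MODE     0x04u                    /* assumed: MII mode select */
#define EMRBR_BYTES(n)   ((uint32_t)(n) & ~0xFu)  /* buffer size, low 4 bits cleared */
#define FRSR_R_FSTART(n) ((uint32_t)(n) << 2)     /* R_FSTART sits above bits 1..0 */

/* (1518 << 0x10) | 0x24 from the post */
static uint32_t fec_rcr_value(void)
{
    return RCR_MAX_FL(1518) | RCR_FCE | RCR_MII_MODE;
}

/* 0x62 << 4 from the post, i.e. 1568 bytes per receive buffer */
static uint32_t fec_emrbr_value(void)
{
    return EMRBR_BYTES(1568);
}

/* 0x48 << 2 from the post */
static uint32_t fec_frsr_value(void)
{
    return FRSR_R_FSTART(0x48);
}
```

With these definitions the values work out to 0x05EE0024 for RCR, 0x620 (1568) for EMRBR and 0x120 for FRSR, which matches the expressions above.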

On receive, my ISR starts at the last BD it checked previously and counts forward until it finds a BD with the empty bit set (or has looped around). This count is saved and a task polls it; if it's non-zero, a receive function looks at the last BD previously read (a separate saved variable from the ISR's 'last read'). It checks the flags, and if no error bits are set the buffer is processed, the BD gets a new buffer and is marked empty. The problem is that, too often, the OV bit is set and the received packet contains some garbage. This occurs regardless of packet size, though I think larger packets overflow earlier as a percentage of bytes (a 64-byte packet with 32 bytes invalid versus a 1518-byte packet with 1262 bytes invalid).

I have tried increasing my receive buffer descriptor count up to 256, but looking at my receive processing, the number of buffers in use doesn't go much past 32, and in any case, the overflows occur throughout the list, so I'm reasonably sure that I'm not running out of buffers.

Does anyone have any idea what could be causing these overflows?

Thanks,

Ben

TomE · ‎10-06-2016

Yes, I have an idea.

You seem to be doing the same thing in your code as the author of the post you referenced.

Which is, when you get an interrupt you scan ALL of the BDs.

That sort of programming seems to cause this problem.

This CPU is meant to have hardware to give the FEC priority (for bus access) over the CPU so it doesn't get starved. But it doesn't seem to work.

You should check AGAIN that you've got everything set up to give the FEC the best chance. Deepest FIFOs. Highest priority. BDs in the half of the SRAM where they're meant to have priority (or the SRAM set up so the FEC has highest priority on all of it). Are you sure the CPU is accessing the SRAM through the "back door" and not over the bus? On some of these chips you can get this wrong. Make sure RAMBAR[V] is set. Make sure the Address Space Masks are zero (except C/I). Prove this is working by writing test code that measures the SRAM access speed.

But the "priority" stuff doesn't seem to work if the CPU is rapidly reading uncached memory or SRAM.

Instead of the "brute force and ignorance" coding of scanning all the BDs, you should start from the "next expected" one. You should then read "around the ring" until the first "unused" one and then stop.
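The "next expected" scan described above can be sketched roughly as follows. This is a host-side illustration, not driver code: the `rx_bd_t` layout, the `rx_ring_t` bookkeeping struct and the `rx_ring_poll` name are all hypothetical, and the W and OV bit values are assumed from the usual FEC receive descriptor layout (only E = 0x8000 is stated in this thread).

```c
#include <stdint.h>
#include <stddef.h>

#define BD_E  0x8000u  /* empty: descriptor is owned by the FEC */
#define BD_W  0x2000u  /* assumed value: wrap, last descriptor in the ring */
#define BD_OV 0x0002u  /* assumed value: receive FIFO overrun */

/* Hypothetical descriptor layout for illustration only. */
typedef struct {
    volatile uint16_t status;
    volatile uint16_t length;
    volatile uint8_t *data;
} rx_bd_t;

typedef struct {
    rx_bd_t *ring;   /* base of the descriptor ring */
    size_t   count;  /* number of descriptors in the ring */
    size_t   next;   /* index of the next descriptor expected to fill */
} rx_ring_t;

typedef void (*rx_handler_t)(const rx_bd_t *bd, void *ctx);

/* Walk forward from the next expected descriptor and stop at the first
 * one still marked empty. Returns how many descriptors were consumed. */
static size_t rx_ring_poll(rx_ring_t *r, rx_handler_t handle, void *ctx)
{
    size_t done = 0;

    while (done < r->count) {
        rx_bd_t *bd = &r->ring[r->next];
        uint16_t st = bd->status;      /* single read of the status word */

        if (st & BD_E)
            break;                     /* FEC still owns it: stop scanning */

        if (handle)
            handle(bd, ctx);           /* caller checks BD_OV and friends */

        /* Hand the descriptor back, preserving only the wrap bit. */
        bd->status = (uint16_t)((st & BD_W) | BD_E);

        if ((st & BD_W) || r->next + 1 == r->count)
            r->next = 0;
        else
            r->next++;
        done++;
    }
    return done;
}
```

In the common case this touches one filled descriptor plus the empty one that ends the scan, so the number of SRAM reads does not grow with the ring size. A real driver would also have to attach a fresh buffer and tell the FEC that empty descriptors are available again; both steps are left out here.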

> I have tried increasing my receive buffer descriptor count up to 256,

That should make the problem worse, as you're fighting the FEC a lot more. Did it?

Are you running full-duplex? Can you force half-duplex and see if the problem goes away?

Tom

benfleming · ‎10-10-2016

> Which is, when you get an interrupt you scan ALL of the BDs.

I wasn't clear in my first post - When I get an interrupt I start from the last BD I scanned, so the ISR usually touches at most 2 BDs (the one with data, and the next one, which is empty, terminating the scan). The location of the empty BD is then where it begins at the next interrupt.

> This CPU is meant to have hardware to give the FEC priority (for bus access) over the CPU so it doesn't get starved. But it doesn't seem to work.

According to the errata for this chip (SECF016), the BDE and V bits should not both be set as simultaneous access occasionally results in incorrect data. This is supposed to be fixed in the mask for the chip I have, however. I've tried it both ways (BDE only, and BDE + V) without an apparent difference in overflows.

> Deepest FIFOs.

I have varied R_FSTART between 0x70 and 0x20. This seemed to reduce overflow occurrences for my 10 second test case when R_FSTART was at 0x60, however it did not eliminate them. This seems strange, as the higher the number, the smaller the portion of the FIFO allocated for receive. The default is 0x40. The datasheet recommends 0x48 or higher.

> Highest priority. BDs in the half of the SRAM where they're meant to have priority (or the SRAM set up so the FEC has highest priority on all of it).

I have RAMBAR[PRIU,PRIL] set to [00] which gives priority to the SRAM backdoor for both banks. Again, I did not notice much difference when I flipped these to [11]. (With RAMBAR[V] bit set, of course).

> Are you sure the CPU is accessing the SRAM through the "back door" and not over the bus? On some of these chips you can get this wrong. Make sure RAMBAR[V] is set. Make sure the Address Space Masks are zero (except C/I). Prove this is working by writing test code that measures the SRAM access speed.

Yes. I get about a 30% difference between CPU vs. backdoor.

Why does C/I need to be set? The datasheet says "In most applications, the C/I bit is set" but it doesn't say why. It seems these are most useful for power management, which I'm not concerned with.

> But the "priority" stuff doesn't seem to work if the CPU is rapidly reading uncached memory or SRAM.

Isn't that the whole purpose of the prioritization for the arbiter?

> Instead of the "brute force and ignorance" coding of scanning all the BDs, you should start from the "next expected" one. You should then read "around the ring" until the first "unused" one and then stop.

> > I have tried increasing my receive buffer descriptor count up to 256,

> That should make the problem worse, as you're fighting the FEC a lot more. Did it?

See above - it's not brute force, it starts at the 'next expected' location. Increasing or decreasing the BD count didn't seem to make a difference either way.

> Are you running full-duplex? Can you force half-duplex and see if the problem goes away?

I've been autonegotiating to 100/Full, but to check I've forced other modes:

100/Full: Present

100/Half: Present

10/Half: Present

10/Full: Present

Placing descriptors in non-cached SDRAM, with buffers in cached SDRAM seems to make things much worse.

TomE · ‎10-10-2016

> Why does C/I need to be set?

The Reference Manual doesn't detail this at all. You need to read the original 1980's 68000 manuals where they told you how things work. When the CPU acks an interrupt it runs a "CPU Space Cycle". That runs a bus cycle which doesn't match the Supervisor/User or Read/Write bus indications. It matches the "C/I" cycle. The SRAM shouldn't try to participate in this but setting C/I is "to be sure to be sure".

> 10/Half: Present

Does "Present" mean that your problem is still present? You're running 10/Half and STILL getting FIFO overflows? That means you're running 10 to 20 times slower and it is still locking up or getting locked out?

You've got some other problem there.

At least if you can get it to fail at 10% "loading" you've got plenty of CPU time left over to start adding heavyweight diagnostics. I'd suggest adding some general-purpose "logging code", and start logging what your code is doing. To start with, log every FEC register access and every BD access. See if you get a pattern leading up to the FIFO overrun.
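One possible shape for that kind of general-purpose logging is a fixed-size in-memory trace ring that is cheap to write and can be dumped after an overrun is seen. The sketch below is only an illustration: the entry fields, the `trace_log` and `trace_dump` names and the depth of 256 are all hypothetical, and it does nothing to protect against an ISR and a task logging at the same time.

```c
#include <stdint.h>
#include <stddef.h>

#define TRACE_DEPTH 256u  /* power of two so the index can be masked */

/* Hypothetical record: what was touched, when, and with what value. */
typedef struct {
    uint32_t stamp;  /* caller-supplied timestamp, e.g. a free-running timer */
    uint16_t event;  /* caller-defined code: register read, BD write, ... */
    uint32_t addr;   /* register or descriptor address */
    uint32_t value;  /* value read or written */
} trace_entry_t;

static trace_entry_t trace_buf[TRACE_DEPTH];
static uint32_t trace_head;  /* total number of entries ever written */

/* Record one access, overwriting the oldest entry once the ring is full. */
static void trace_log(uint32_t stamp, uint16_t event,
                      uint32_t addr, uint32_t value)
{
    trace_entry_t *e = &trace_buf[trace_head & (TRACE_DEPTH - 1u)];

    e->stamp = stamp;
    e->event = event;
    e->addr  = addr;
    e->value = value;
    trace_head++;
}

/* Copy out up to max of the most recent entries, oldest first.
 * Returns the number of entries copied. */
static size_t trace_dump(trace_entry_t *out, size_t max)
{
    uint32_t avail = trace_head < TRACE_DEPTH ? trace_head : TRACE_DEPTH;
    size_t n = avail < max ? (size_t)avail : max;
    uint32_t start = trace_head - (uint32_t)n;

    for (size_t i = 0; i < n; i++)
        out[i] = trace_buf[(start + i) & (TRACE_DEPTH - 1u)];
    return n;
}
```

Calling `trace_log` next to every FEC register and BD access, then calling `trace_dump` when a descriptor comes back with OV set, gives the last few hundred accesses leading up to the overrun.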

> Placing descriptors in non-cached SDRAM, with buffers in cached SDRAM seems to make things much worse.

That's an interesting data point. How are you handling the cached buffers? Are you flushing and invalidating single cache lines or the whole cache? Does the FEC overflow correlate with your Cache operations (log them)? Maybe the cache flush ignores any priority settings (and locks out the FEC for too long). Can you put your buffers in uncached SDRAM? Can you change from full cache flushes (if you're doing them) to address-based line operations? Can you set the cache to write-through instead of write-back (if you haven't already)?

Tom

benfleming · ‎10-06-2016

A little more information on the nature of the corruption: I have been providing the buffer descriptors with buffers memset to 0xEF. When a BD arrives with the OV bit set, the buffer contains, for the first portion, valid data, followed by a single line (16 bytes) which contains 4 bytes of garbage, followed by 12 bytes of the previous line. The remainder of the buffer after that contains the 0xEF pattern.

For example, from a packet that contains 16 a's followed by 16 b's, etc:

32 32 32 32 32 32 32 32 32 32 33 33 33 33 33 33 2222222222333333 // last good line
51 4B 03 08 32 32 32 32 32 32 33 33 33 33 33 33 QK..222222333333 // corrupted line
EF EF EF EF EF EF EF EF EF EF EF EF EF EF EF EF ................ // remainder of buffer still has pattern

Occasionally, I will get a packet where the last line is duplicated with the first 8 bytes corrupted instead. I suppose all this tells me is that there is a 32-bit bus used....
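A check like the following could automate spotting that pattern in a buffer pre-filled with 0xEF. It is a rough sketch with hypothetical helper names, and it assumes the packet data itself does not end in 0xEF bytes, which would make the written length come out short.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define FILL_BYTE 0xEFu  /* pattern the buffers are memset to before use */
#define LINE_SIZE 16u    /* one burst line, as seen in the dump */

/* Number of bytes the FEC appears to have written: one past the last
 * byte that differs from the fill pattern. */
static size_t bytes_written(const uint8_t *buf, size_t len)
{
    while (len > 0 && buf[len - 1] == FILL_BYTE)
        len--;
    return len;
}

/* Returns 1 if the last written line repeats bytes 4..15 of the line
 * before it (the "4 bytes of garbage, then 12 bytes of the previous
 * line" signature), otherwise 0. */
static int tail_repeats_previous_line(const uint8_t *buf, size_t written)
{
    if (written < 2u * LINE_SIZE || (written % LINE_SIZE) != 0)
        return 0;

    const uint8_t *last = buf + written - LINE_SIZE;
    const uint8_t *prev = last - LINE_SIZE;

    return memcmp(last + 4, prev + 4, LINE_SIZE - 4u) == 0;
}
```

Run against the three lines shown above, `bytes_written` gives 32 and `tail_repeats_previous_line` reports a match, so the same test could be applied to every descriptor that comes back with OV set.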
