K60 Ethernet transmission problem

mjbcswitzerland
Specialist V

Hi All

 

I wonder whether anyone has experienced the following?

 

I am sending TCP data from an HTTP server, and the following effect occurs every time with one particular image:

1) The transmission takes place by sending TCP frames of the maximum Ethernet payload length (about 1400 bytes each)

 

2) TCP windowing is in operation, so Ethernet frames are sent in quick succession [that is, one frame is placed into a buffer descriptor and the TDAR register is written to start polling, then the next one is written and the TDAR is written again, etc. - see the sketch after this list].

 

3) The final frame in the test image is a shorter frame (about 700 bytes) since it is the last chunk of the data.

 

4) The final activity is therefore two TCP frames sent in quick succession (1400 bytes followed by 700 bytes) using the same technique as for the rest of the image, which otherwise works as expected.

 

5) The effect (presumably something to do with the last frame's length, or a timing issue, since the shorter last frame can be prepared slightly faster) is that the first, 1400 byte TCP frame is sent but the second (700 byte) frame is not.

This results in a TCP retransmission, after which (with a slight delay at the end) the image is successfully transmitted and displayed.

But, on closer observation, it turns out that the second frame is not lost: it simply doesn't get sent until a retransmission is made. When the retransmission occurs, the 'lost' frame is sent first, followed by the repeated frame that the software has just generated. This means that the second frame was waiting in the output buffer but didn't get sent; it was only 'released' by the following activity.

To prove this I set a breakpoint in the TCP code at the point where it 'wanted' to repeat the 'lost' frame, and simply wrote something to the TDAR register using the debugger; the "waiting" frame was indeed sent immediately!!

 

6) I am using the EMAC in its compatibility mode (compatible with the one in the Coldfires), where the same code doesn't have this problem.

 

7) If I disable the IP accelerator at the transmitter (checksum offloading disabled) the problem is no longer seen. This suggests that it is probably a timing issue, since the software then takes more time to prepare the following frame and possibly avoids a race condition. Alternatively it could be due to the store-and-forward operation, which must be enabled to use the accelerator - I have the FIFO watermark set to 64 bytes.
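
For reference, the technique in 2) looks roughly like the following - a minimal sketch of the legacy buffer-descriptor transmit path with hypothetical names (enet_tx_bd_t, tx_bd, tx_index, NUM_TX_BDS); only ENET->TDAR and ENET_TDAR_TDAR_MASK are the actual Kinetis header names, and the big-endian byte swapping that the real BDs need is omitted:

--------------------------------------------------------
#include <stdint.h>

#define NUM_TX_BDS 4
#define TX_BD_R    0x8000   /* ready - descriptor is owned by the EMAC */
#define TX_BD_L    0x0800   /* last buffer in the frame */
#define TX_BD_TC   0x0400   /* transmit CRC after the data */

/* Legacy (Coldfire-compatible) ENET transmit buffer descriptor */
typedef struct {
    uint16_t status;    /* R/L/TC control and status flags */
    uint16_t length;    /* frame length in bytes */
    uint8_t *buffer;    /* pointer to the frame data */
} enet_tx_bd_t;

static enet_tx_bd_t tx_bd[NUM_TX_BDS];  /* wrap (W) bit set on last BD at init */
static int tx_index = 0;

/* Queue one Ethernet frame and (re)start descriptor polling */
static void enet_send_frame(uint8_t *frame, uint16_t len)
{
    tx_bd[tx_index].buffer = frame;
    tx_bd[tx_index].length = len;
    tx_bd[tx_index].status |= (TX_BD_R | TX_BD_L | TX_BD_TC);
    tx_index = (tx_index + 1) % NUM_TX_BDS;
    ENET->TDAR = ENET_TDAR_TDAR_MASK;   /* tell the EMAC to poll the TX BDs */
}
--------------------------------------------------------

The failure mode is then: two quick calls like this (1400 bytes followed by 700 bytes), the first frame goes out, and the second stays queued with its R bit set until the next write to TDAR - which is why poking TDAR from the debugger releases it.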

 

My first conclusion is that it is possible for the Ethernet transmitter to stop polling its buffer descriptors even though the TDAR register has been written with a frame waiting to be sent. This only happens when a shorter frame follows a full-size frame (full-size frames following each other don't have the problem - another test image that is slightly larger terminates with two TCP frames of 1400 and 900 bytes and is served perfectly every time....!).

 

Has anyone experienced the same, or similar? Was a workaround found?

Note that I am using a 0M33Z mask (first version) but didn't see any errata about such a thing.

 

Regards

 

Mark

 


mjbcswitzerland
Specialist V

Hi All

 

To be sure that this problem was not related to the legacy mode, I updated the driver to work with the enhanced buffer descriptors.

 

The results were however identical.

 

Although everything basically works (apart from the occasional retransmission as previously detailed), I was surprised that the enhanced reception status indicates that every Ethernet frame is received with a MAC error and a PHY error. Is this 'normal'?

 

Regards

 

Mark

 

mjbcswitzerland
Specialist V

Hi All

 

I have managed some improvements but I have to admit that there were/are some strange things going on.

 

1) First I tried retriggering the buffer descriptor polling in the transmit interrupt routine (see the first sketch after this list). This solved the problem of the final short frame in a sequence being missed.

However it was not a total solution, since there were still some strange things happening.

 

2) What I noticed was that a certain page on the web server was not being read correctly (from the file system in internal FLASH). After some debugging I realised that it was read correctly when stepping with the debugger, but not when allowed to run at full speed. It was also clear from the debugger that the Flash content was sometimes being displayed with seemingly random values at certain locations.

The problem started at 284k in Flash - below that it was stable...

After looking through the Flash errata I disabled speculation (this didn't help) and then disabled the cache (see the second sketch after this list). With the cache disabled it suddenly looked much better (see errata e2644 and e2647, which are valid for my 512k chip).

 

3) Since things were suddenly working rather better, I found that I could disable workaround 1) and it still worked well (showing a relation between the errata and the ENET in some way).

 

4) But then came the next surprise. Yesterday I had disabled the use of memory-to-memory DMA because I thought it was possibly a reason for some of the strange effects, and after disabling it things did look to improve (so I put it on the debug list :smileywink: ).

When I re-enabled it together with the two Flash workarounds it did work, but was EXTREMELY slow... The DMA was being used to transfer web pages into the Ethernet transmit buffers, and it was taking almost 1s to complete a DMA transfer of 1400 bytes!!

If I re-enable the cache, the DMA is normally fast, but the other problems re-appear (of course). With the cache disabled the Ethernet works fairly well again, but the DMA slows things down to a snail's pace...

So I had to disable the memory-to-memory transfer via DMA to get back to the fairly good situation.

 

5) I say fairly good because it is still not perfect. The original problem with the TX frames being missed by the buffer descriptor polling has gone, but there are still some lost transmit frames (this can happen at almost any time - sometimes everything works fine and sometimes a TCP retransmission is needed because of it).

What I have been able to demonstrate is that each transmission loss is accompanied by a miscellaneous Ethernet interrupt signalling a "Late collision". This doesn't happen when serving single large images (I can refresh single images repeatedly without any problems - they are served very fast), but it does occur when complex web pages are served with multiple images and other content in parallel.

The late collision description suggests that this should only happen on a half-duplex link, whereas I am on a full-duplex link with both the PHY and the EMAC set to full-duplex operation.

Again, the same driver code and web server/content (on the same Ethernet link) doesn't experience this when run on a Coldfire, so I think there may still be something strange going on...
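
For reference, the retrigger workaround from 1) amounts to something like the following - a minimal sketch only, assuming the CMSIS-style handler name and the ENET register/mask names from the Kinetis headers:

--------------------------------------------------------
/* TX frame interrupt: after each completed frame, write TDAR again so that
   any descriptor still queued (e.g. the 'stuck' short frame) is released. */
void ENET_Transmit_IRQHandler(void)
{
    if (ENET->EIR & ENET_EIR_TXF_MASK) {
        ENET->EIR  = ENET_EIR_TXF_MASK;     /* clear the TXF flag */
        ENET->TDAR = ENET_TDAR_TDAR_MASK;   /* retrigger BD polling */
    }
}
--------------------------------------------------------

Writing TDAR when nothing is queued is harmless - the EMAC finds no ready descriptor and simply stops polling again.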
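
And the Flash side of 2) comes down to clearing the prefetch (speculation) and cache enables in the Flash memory controller - again only a sketch, assuming the FMC_PFB0CR field names from the MK60 headers (the errata texts give the authoritative bits):

--------------------------------------------------------
/* Disable flash speculation (prefetch) - tried first, didn't help here */
FMC->PFB0CR &= ~(FMC_PFB0CR_B0DPE_MASK | FMC_PFB0CR_B0IPE_MASK);

/* Disable the flash cache - this is what made the reads stable (e2644/e2647) */
FMC->PFB0CR &= ~(FMC_PFB0CR_B0DCE_MASK | FMC_PFB0CR_B0ICE_MASK);
--------------------------------------------------------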

 

Any similar experience or comments??

 

Regards

 

Mark

 

mjbcswitzerland
Specialist V

Hi All

 

As a final note, I can now say that 100M full-duplex operation is in fact very good. The late collisions were due to a mismatch in the settings (I had forgotten that I had disabled part of this during other investigations into changing speeds and duplex modes on the fly; once it was activated correctly, operation was OK).

 

There may be some issues in 10M mode, and I also had a difficulty with the MII management clock - which I may discuss in a new thread later.
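
(For anyone hitting the MII management clock before that thread appears: MDC is derived from the internal module clock via the MSCR register and must not exceed 2.5MHz - a sketch, assuming a hypothetical MODULE_CLOCK_HZ constant and the Kinetis ENET_MSCR_MII_SPEED macro:)

--------------------------------------------------------
/* MDC = module_clock / ((MII_SPEED + 1) * 2) and must be <= 2.5 MHz.
   With a 100 MHz module clock: 100M / (2 * 2.5M) - 1 = 19. */
uint32_t mii_speed = (MODULE_CLOCK_HZ + (2UL * 2500000UL) - 1) / (2UL * 2500000UL) - 1;
ENET->MSCR = ENET_MSCR_MII_SPEED(mii_speed);
--------------------------------------------------------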

 

Regards

 

Mark

 

cfgmgr
Contributor III

Mark,

I know this thread is pretty stale (a year!), but yes - I have experienced this too.

I am using UDP, not TCP, but I noticed that when sending two quick back-to-back packets, the second one (which is shorter than the first) doesn't get sent unless I retrigger the transmission in the TX ISR (via ENET->TDAR = ENET_TDAR_TDAR_MASK). I also noticed that inserting any small delay in the process (e.g. an sprintf() to a debug buffer) was enough to get the transmissions to succeed.

I too am using hardware acceleration, and I am also on the 0M33Z mask. Did you ever get to the bottom of this?

Thanks,

-bill

mjbcswitzerland
Specialist V

Hi Bill

In the meantime I have been working with 120MHz K70s and 150MHz K61s, but have always kept the workaround of retriggering the TDAR in a Tx interrupt routine. It has been possible to deactivate various other workarounds with the new devices, but I haven't yet tried removing this one. I will have a go at the next opportunity to see whether it is related to the revision or not (the interrupt is otherwise not actually needed, so it would be nice to be able to remove it).

Regards

Mark

cfgmgr
Contributor III

Hi Mark,

Thanks for the quick reply!  I guess I'll leave the workaround in as well until I can fix it properly.  Someday we'll get to the bottom of it!

Incidentally, I too am having issues with "late collisions", but because of a rather unconventional way we have our Ethernet modules daisy-chained together (TX of one device feeding the RX of the next, and so on in a big ring), my PHY chips always come up in half-duplex, no matter how they are programmed. This is another problem (possibly related) that I need to look into as well...

Cheers,

-bill

PaoloRenzo
Contributor V

Hi All:

Even later to the party :-)

As Mark pointed out, this is a race condition in the xmit portion; the fix below serialises the TDAR write with the completion of the previous frame:

--------------------------------------------------------
static err_t
low_level_output(struct netif *netif, struct pbuf *p)
{
  .... //Filling buffers and stuff like that

  /* only one task can be here. wait until pkt is sent, then go ahead */
  /* semaphore released inside isr */
  /* start expiring semaphore: no more than 3 ticks */
  /* no blocking code */
  xSemaphoreTake( xTxENETSemaphore, 3/*1/portTICK_RATE_MS*//*portMAX_DELAY*/ );

  /* Request xmit process to MAC-NET */
  enet->tdar = MACNET_TDAR_TDAR_MASK;

  return ERR_OK;
}
-------------------------
ISR_PREFIX
void vENETISRHandler( void )
{
unsigned long ulEvent;
portBASE_TYPE xHighPriorityTaskWoken = pdFALSE;
#if (MACNET_PORT==0)
  volatile macnet_t *enet = (macnet_t *)MACNET_BASE_PTR;
#else
  volatile macnet_t *enet = (macnet_t *)(MACNET_BASE_PTR+MAC_NET_OFFSET);
#endif

  /* Determine the cause of the interrupt. */
  ulEvent = enet->eir & enet->eimr;
  enet->eir = ulEvent;

  /* Tx process: only aware of a complete eth frame */
  if( /*( ulEvent & MACNET_EIR_TXB_MASK ) ||*/ ( ulEvent & MACNET_EIR_TXF_MASK ) )
  {
    /* xmit task completed, go for next one! */
    xSemaphoreGiveFromISR( xTxENETSemaphore, &xHighPriorityTaskWoken );
  }

  /* Rx process */
  {
    ...//more stuff
  }

  /* Error handling */
  {
    ...//more stuff
  }

  portEND_SWITCHING_ISR( xHighPriorityTaskWoken );
}
--------------------------------------------------------

After that fix I was no longer able to see packets getting stuck waiting for the next xmit. Taking the semaphore (which the ISR gives back on TXF) before writing TDAR means that each TDAR write is separated from the completion of the previous frame, so the quick back-to-back TDAR writes that trigger the race no longer occur.

Hope it helps someone, even if probably not you guys by now.
