We are using the MCF5372 fec.
How do we fix cache coherency between the DMA and the processor, which both access the descriptors in external memory?
There are a lot of different solutions. The one to use depends on your competence, the amount of coding effort you can afford and how much performance you need.
You should trade away as much performance as you can afford in order to get as simple a solution as possible.
Listing "the usual suspects" in the order "Slow and Easy" to "Fast and Very Difficult":
Note 1: Set the CACR to "uncached everything". That covers your external FEC buffer memory and the IO page. Then use ACR0 to cache-enable half of your external RAM, and ACR1 for any other cacheable memory you have (external FLASH, for instance). Alternatively, set the CACR to "cache everything", burn ACR0 cache-disabling the peripherals, and use the one remaining ACR1 to cache-disable a block of your external SDRAM. That leaves you nothing spare if your FlexBus has memory on it that must be cache-disabled (like FLASH you're programming), but you might be able to use ACR0 to cache-disable the peripherals and the whole FlexBus together if that suits.
Note 2: The Reference Manual, in section "5.3.6 Cache Coherency", explicitly says "Therefore, on-chip DMA channels should not access cached local memory locations." But it doesn't say how to arrange that. Your problem.
Note 3: You use the CPUSHL instruction in a loop to invalidate one cache line at a time. Unfortunately this chip doesn't have a "cache invalidate" instruction, so you have to push potentially useless data back to main memory in order to have a free line to read into. If you have dedicated separate rings of FEC Read and Write buffers, and you never write to the Read buffers, then the CPUSHLs still have to be called, but they execute really quickly as there should be no dirty data to push back.
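The loop in Note 3 is necessarily assembly on the target, but its structure can be modelled in C. The cache geometry (8 KB, 4-way, 16-byte lines) and the set/way encoding in the cpushl operand are assumptions borrowed from older V3-core flush examples, so verify them against the MCF5372 RM:

```c
#include <stdint.h>

/* C model of the CPUSHL flush loop.  On the target each iteration would
 * execute "cpushl dc,(a0)" with a0 encoding a cache set and way.  The
 * geometry (8 KB, 4-way, 16-byte lines => 128 sets) and the encoding
 * (set in bits 11:4, way in bits 1:0) are assumptions -- check the RM. */
#define CACHE_SIZE 8192u
#define LINE_SIZE  16u
#define WAYS       4u
#define SETS       (CACHE_SIZE / (LINE_SIZE * WAYS))

static unsigned flush_all_lines(void)
{
    unsigned pushed = 0;
    for (uint32_t way = 0; way < WAYS; way++) {
        for (uint32_t set = 0; set < SETS; set++) {
            uint32_t a0 = (set << 4) | way;  /* operand for cpushl dc,(a0) */
            /* On the target: asm volatile ("cpushl %%dc,(%0)" : : "a"(a0)); */
            (void)a0;
            pushed++;
        }
    }
    return pushed;  /* one push per cache line */
}
```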
I was going to suggest looking at the Linux code, but that driver leans on the kernel's cache-management support, so there's no standalone "sample code" to help you here:
Linux/drivers/net/ethernet/freescale/fec_main.c - Linux Cross Reference - Free Electrons
It might make sense to buy a commercial embedded Operating System where someone has already done all this work for you and debugged it. Debugging this sort of thing is going to be ridiculously difficult.
Here are some performance tricks.
If you're not running a multi-threaded system, and only have one stack, then put that stack in the SRAM. Code that uses the stack a lot runs a lot faster that way. If you are multi-threaded, find the thread bashing the stack the most and put its stack in SRAM.
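A minimal sketch of carving a stack out of the on-chip SRAM. The SRAM base (set by RAMBAR at startup) and both sizes are illustrative, not from any particular BSP; take the real values from your linker map:

```c
#include <stdint.h>

/* Illustrative layout only: SRAM base (wherever RAMBAR maps it) and sizes
 * are made up -- substitute the real values from your BSP / linker map. */
#define SRAM_BASE  0x80000000u
#define SRAM_SIZE  0x00008000u  /* assumed 32 KB on-chip SRAM */
#define STACK_SIZE 0x00001000u  /* 4 KB carved off the top for the stack */

_Static_assert(STACK_SIZE <= SRAM_SIZE, "stack larger than SRAM");

/* Full-descending stack: SP starts just past the top of SRAM. */
static uint32_t initial_sp(void)
{
    return SRAM_BASE + SRAM_SIZE;
}
/* In startup code, before main():  move.l #(SRAM_BASE+SRAM_SIZE),%sp  */
```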
There's a lot of benefit in optimising the memory copy. The library memcpy() may be fairly good or it may be utter rubbish; you have to find the sources (or just disassemble your code) to see which. The fastest memory copy on this chip is to use movem.l to move eight 32-bit chunks from SDRAM to the registers, then use movem.l to push that TO THE STACK IN SRAM. Repeat for at least 1 kB, then burst from SRAM back to SDRAM. That is a lot faster than copying from SDRAM to SDRAM directly. I got the raw memory copy speed up from 30 MB/s to 55 MB/s using this method. Details here:
https://community.nxp.com/message/60739?commentID=60739#comment-60739
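The movem.l bursts themselves are assembly, but the staging pattern the trick relies on can be sketched (and tested) in portable C, with memcpy standing in for the movem.l bursts. Here sram_stage is assumed to point at a 1 kB scratch area in the on-chip SRAM:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Host-testable model of the SDRAM -> SRAM -> SDRAM bounce copy described
 * above.  On the ColdFire the two inner copies are movem.l bursts (eight
 * registers in, eight out); plain memcpy stands in here so the staging
 * pattern itself can be verified.  The 1 kB staging size follows the
 * "repeat for at least 1k" advice in the text. */
#define STAGE_SIZE 1024u

static void bounce_copy(uint8_t *dst, const uint8_t *src, size_t len,
                        uint8_t *sram_stage /* STAGE_SIZE bytes in SRAM */)
{
    while (len) {
        size_t n = len < STAGE_SIZE ? len : STAGE_SIZE;
        memcpy(sram_stage, src, n);  /* burst read:  SDRAM -> SRAM */
        memcpy(dst, sram_stage, n);  /* burst write: SRAM  -> SDRAM */
        src += n;
        dst += n;
        len -= n;
    }
}
```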
You should benchmark writethrough versus writeback cache modes. The code I was working on does a lot of copying to video buffers, so writethrough means I don't have to worry about cache coherency there. Writethrough is also faster when doing a lot of memory copying, as it can keep the SDRAM page open; with writeback all your SDRAM writes are pretty much "random", and that hurts memory bandwidth. But if your stack is in SDRAM, writethrough makes all your stack operations a lot slower.
Tom