How to fix the cache coherency issue between DMA and the processor (MCF5372 FEC)



sambuharikrishn
Contributor I

We are using the MCF5372 FEC.

How do we fix cache coherency between the DMA engine and the processor when both access the descriptors in external memory?


TomE
Specialist II

There are a lot of different solutions. Which one to use depends on your competence, the amount of coding effort you can afford, and how much performance you need.

You should trade off as much performance as you can afford to get as simple a solution as you can get.

Listing "the usual suspects" in the order "Slow and Easy" to "Fast and Very Difficult":

  1. Turn the Data Cache off.
  2. NEVER put the Descriptors in External Memory. That's a very bad idea. The Descriptors should always be in internal uncached SRAM.
  3. Put (say) 16k of FEC Data Buffers in the SRAM. Copy the data from your external memory buffers to and from these buffers, or rewrite the code to only use these buffers (and be properly flow-controlled). Have the FEC only receive and transmit from these SRAM buffers.
  4. Put all of your FEC Data Buffers (and the Descriptors if you must, but these should REALLY be in the SRAM) into a block of your external memory that has the cache disabled. This chip doesn't have an MMU and only has three registers (CACR, ACR0, ACR1) to allow definition of cacheable and uncacheable memory regions, so this is tricky and can be impossible if you have too much memory out there [1].
  5. If you must have maximum performance, then you need to rewrite your FEC drivers to flush the cache after writing to transmit buffers (or use WRITETHROUGH mode for the cache), and to invalidate the cache before putting read buffers back into the descriptors [2][3].
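As a sketch of options 2 and 3 above, here's what placing the descriptors and data buffers in internal SRAM might look like. The section name `.sram`, the descriptor field names, and the 16-byte alignment are assumptions — match them against your linker script and the FEC chapter of the MCF5372 Reference Manual:

```c
#include <stdint.h>

/* Illustrative FEC buffer-descriptor layout (status/length/pointer is the
 * usual ColdFire FEC shape, but verify the real one in the RM). */
typedef struct {
    volatile uint16_t status;
    volatile uint16_t length;
    volatile uint32_t buffer;   /* physical address of the data buffer */
} fec_bd_t;

#define NUM_RX_BD    8
#define NUM_TX_BD    8
#define FEC_BUF_SIZE 1536       /* one full Ethernet frame, rounded up */

/* Put descriptors AND buffers in internal (uncached) SRAM via a dedicated
 * linker section -- ".sram" is an assumed name, match your linker script.
 * The 16-byte alignment is an assumed FEC requirement; check the RM. */
__attribute__((section(".sram"), aligned(16)))
static fec_bd_t rx_bd[NUM_RX_BD];

__attribute__((section(".sram"), aligned(16)))
static fec_bd_t tx_bd[NUM_TX_BD];

__attribute__((section(".sram"), aligned(16)))
static uint8_t rx_buf[NUM_RX_BD][FEC_BUF_SIZE];
```

With everything the FEC's DMA touches living in SRAM, the data cache never sees it and the coherency problem disappears; the price is copying frames between SRAM and your external-memory buffers.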

Note 1: Set the CACR to "uncached everything". That covers your external FEC buffer memory and the IO page. Use ACR0 to cache-enable half of your external RAM. Use ACR1 for any other cacheable memory you have (external FLASH for instance). Otherwise you can set CACR to "Cache Everything", burn ACR0 to cache-disable the peripherals, and then use the one remaining ACR1 to cache-disable a block of your external SDRAM. That means you've nothing left if your FlexBUS has memory on it that has to be cache-disabled (like FLASH you're programming), but you might be able to use ACR0 to cache-disable the peripherals and all of the FlexBUS if that suits.
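A sketch of building an ACRn value for Note 1. The bit layout here (bits 31-24 address base, bits 23-16 address mask in 16 MB units, bit 15 enable, a supervisor-mode field, cache-mode bits at 6-5) is my recollection of the common ColdFire ACR format — verify every field against the MCF5372 Reference Manual before using it:

```c
#include <stdint.h>

/* Assumed ColdFire ACR field encodings -- CHECK THESE against the RM. */
#define ACR_ENABLE     (1u << 15)  /* region enable */
#define ACR_SUPER_ANY  (1u << 13)  /* match user and supervisor accesses */
#define ACR_CM_WT      (0u << 5)   /* cacheable, writethrough (assumed) */
#define ACR_CM_OFF     (2u << 5)   /* cache-inhibited, precise (assumed) */

/* Build an ACRn value covering [base, base+size).
 * base and size must be multiples of 16 MB -- that's the ACR granularity,
 * and why you run out of regions so quickly on this chip. */
static uint32_t acr_value(uint32_t base, uint32_t size, uint32_t mode)
{
    uint32_t mask = ((size >> 24) - 1u) << 16;  /* size in 16 MB units, minus 1 */
    return (base & 0xFF000000u) | mask | ACR_ENABLE | ACR_SUPER_ANY | mode;
}
```

For example, `acr_value(0x40000000, 32u << 20, ACR_CM_WT)` would cache-enable the first 32 MB of SDRAM at an assumed base of 0x40000000, leaving the other ACR free for FLASH or for cache-disabling a buffer region.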

Note 2: The Reference Manual in section "5.3.6 Cache Coherency" explicitly says "Therefore, on-chip DMA channels should not access cached local memory locations,". But it doesn't say how to do that. Your problem.

Note 3: You use the CPUSHL instruction in a loop to invalidate a cache line at a time. Unfortunately this chip doesn't have a "cache invalidate" instruction, so you have to push potentially useless data back to main memory in order to have a free buffer to read into. If you have dedicated separate rings of FEC Read and Write buffers, and you never write to the Read buffers, then the CPUSHLs still have to be called, but they execute really quickly as there should be no data to push back.
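The Note 3 loop might look like this sketch. The 16-byte line size is an assumption, and so is addressing CPUSHL by the line's address — on some ColdFire parts CPUSHL is indexed by cache set and way instead, so check the MCF5372 Reference Manual before trusting the loop bounds:

```c
#include <stdint.h>
#include <stddef.h>

#if defined(__mcoldfire__)
#  define CPUSHL_DC(a) __asm__ volatile ("cpushl %%dc,(%0)" :: "a"(a) : "memory")
#else
#  define CPUSHL_DC(a) ((void)(a))   /* off-target stub so the sketch compiles anywhere */
#endif

#define DCACHE_LINE 16u   /* assumed data-cache line size -- check the RM */

/* Push every data-cache line overlapping [addr, addr+len) back to memory
 * (and invalidate it, depending on CACR settings). Returns the number of
 * lines touched, which is handy for sanity checks. */
static size_t dcache_push_range(uintptr_t addr, size_t len)
{
    uintptr_t line = addr & ~(uintptr_t)(DCACHE_LINE - 1u);
    uintptr_t end  = addr + len;
    size_t n = 0;
    for (; line < end; line += DCACHE_LINE, n++)
        CPUSHL_DC(line);
    return n;
}
```

You'd call this on a transmit buffer after filling it (so the FEC reads fresh data) and on a receive buffer before handing it back to the descriptor ring (so the CPU doesn't read stale cached data).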

I was going to suggest looking at the Linux code, but it only runs on CPUs that have OS support for handling the cache, and so there's no "sample code" to help you here:

Linux/drivers/net/ethernet/freescale/fec_main.c - Linux Cross Reference - Free Electrons 

It might make sense to buy a commercial embedded Operating System where someone has already done all this work for you and debugged it. Debugging this sort of thing is going to be ridiculously difficult.

Here are some performance tricks.

If you're not running a multi-threaded system and only have one stack, put that stack in the SRAM; code that uses the stack heavily runs a lot faster that way. If you are multi-threaded, find the thread bashing the stack the most and put its stack in SRAM.

There's a lot of benefit in optimising the memory copy. The library memcpy() may be fairly good or it may be utter rubbish. You have to find the sources (or just disassemble your code) to see how good or bad it is. The fastest SRAM memory copy on this chip is to use "movem.l" to move eight 32-bit longwords from SDRAM to the registers, then use movem.l to push them TO THE STACK IN SRAM. Repeat for at least 1k. Then burst from SRAM back to SDRAM. That is really a lot faster than copying from SDRAM to SDRAM. I got the raw memory copy speed up from 30 MB/s to 55 MB/s using this method. Details here:

https://community.nxp.com/message/60739?commentID=60739#comment-60739 
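The staging pattern (though not the movem.l inner loop itself, which is assembly) can be sketched in C, with `sram_bounce` standing in for a buffer your linker script places in internal SRAM:

```c
#include <string.h>
#include <stddef.h>

#define BOUNCE_SIZE 1024u  /* stage at least 1 KB per pass, per the advice above */

/* In a real build this lives in internal SRAM via the linker script;
 * here it's an ordinary static buffer for illustration. */
static unsigned char sram_bounce[BOUNCE_SIZE];

/* Copy SDRAM-to-SDRAM by staging through SRAM: all reads burst from the
 * source page, then all writes burst to the destination page, instead of
 * the two pages fighting over the open SDRAM row on every transfer. In
 * the optimised assembly version the two memcpy() calls become movem.l
 * loops moving eight registers at a time. */
static void sdram_copy_staged(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (len) {
        size_t chunk = len < BOUNCE_SIZE ? len : BOUNCE_SIZE;
        memcpy(sram_bounce, s, chunk);   /* SDRAM -> SRAM */
        memcpy(d, sram_bounce, chunk);   /* SRAM -> SDRAM */
        s += chunk;
        d += chunk;
        len -= chunk;
    }
}
```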

You should benchmark Writethrough versus Writeback cache modes. The code I was working on does a lot of copying to video buffers, and so Writethrough means I don't have to worry about cache coherency. It is faster to use Writethrough if doing a lot of memory copying, as it can keep the SDRAM page open. With Writeback all your SDRAM writes are pretty much "random" and that hurts memory bandwidth. But if you have your stack in SDRAM then writethrough makes all your stack operations a lot slower.


Tom
