eDMA oddity on MCF54415

filipdossche · ‎05-16-2013

Hi,

In an earlier discussion Tom Evans pointed me in the right direction to implement some functionality I need.

In essence I have coupled back a PWM signal to the DREQ1 pin function to trigger data transfers from the flexbus to an SRAM location.

It works to a degree, but there is something odd that I can't explain.

The PWM signal is a simple square wave with a frequency of 7.8125 MHz, the period is 128 nanoseconds.

The odd thing is that I do get the expected chip select signals on the FLEXBUS, but only 3 out of 4 times.

Furthermore: the 3 pulses I do get seem to "drift" away from each other.

This is how it is set up and a detailed description of what I see on my scope:

- PWM_B0 generates the square wave with a period of 128 nanoseconds.

- That signal is fed back to DREQ1. I have checked the signal with a scope and it is perfect, as expected the period is exactly 128 ns and very stable.

- CS1 is set up to activate when there is a read operation in the range 0x02000000...0x0200FFFF.

- The eDMA module is set up to always read at address 0x02000000 and write the result to a fixed address in the processor's SRAM.

- The total cycle time for the read operation on the FLEXBUS is 40 ns (4 basic 8 ns cycles + 1 wait state cycle of 8 ns).

- After the first positive edge on the PWM/DREQ1 signal I see the chip select signal appear after a delay of 24 ns.

- After the second positive edge on the PWM/DREQ1 signal the chip select delay becomes 40 ns.

- After the third positive edge on the PWM/DREQ1 signal the chip select delay becomes 56 ns.

- After the fourth positive edge on the PWM/DREQ1 signal the chip select signal does not appear.

Having shown this behaviour after four cycles of the PWM/DREQ1 signal everything repeats over and over again, regular as clockwork.

It's almost as if the total time taken for the eDMA transfer is 144 ns, which I find hard to believe. The read operation on the external bus takes only 40 ns, and the transfer to the SRAM on the processor's internal bus should be a lot faster than that, so the whole transfer should complete in far less than 128 ns.
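A rough way to see how a 144 ns transfer time would produce exactly this pattern is a small queueing model. This is only a sketch under assumptions: the 144 ns service time is the figure guessed at above, the 24 ns chip-select offset is the first measured delay, and MAX_LAG_NS is a hypothetical cut-off picked only so that a request arriving 48 ns late is lost while one arriving 32 ns late is still served. The real eDMA request logic may follow a different rule.

```python
PERIOD_NS = 128      # DREQ1 period from the PWM
SERVICE_NS = 144     # assumed minimum time the eDMA needs per transfer
CS_OFFSET_NS = 24    # measured delay from request to chip select when idle
MAX_LAG_NS = 40      # hypothetical: a request waiting longer than this is lost


def simulate(n_requests, period_ns=PERIOD_NS):
    """Return the chip-select delay (ns) per request, or None if it is missed."""
    free_at = 0          # time at which the eDMA can start the next transfer
    delays = []
    for k in range(n_requests):
        t_req = k * period_ns
        lag = max(0, free_at - t_req)   # how long this request has to wait
        if lag > MAX_LAG_NS:
            delays.append(None)         # request dropped, engine state unchanged
            continue
        free_at = t_req + lag + SERVICE_NS
        delays.append(lag + CS_OFFSET_NS)
    return delays


if __name__ == "__main__":
    print(simulate(8))
    print(simulate(4, period_ns=256))
```

With these assumptions, simulate(8) gives delays of 24, 40 and 56 ns followed by a missed request, and then the same four results again. simulate(4, period_ns=256) serves every request with a 24 ns delay.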

Any ideas/suggestions as to what might be going on would be very much welcome.

Message was edited by: Filip Dossche

I have done some further searching in the documentation and testing so I have found a few things.

1) It is not the positive edge on DREQ1 that triggers the DMA operation, it is the negative edge.

2) I have activated the DACK1 pin function to see exactly when the DMA transfer is confirmed.

3) It fails to generate a DMA transaction because the minimum transaction time appears to be 144 ns.

The result is: it can't follow the 128 ns rate and lags by 16 ns each time. The fourth time the DREQ1 goes high before DACK1 does => it ignores that DMA transfer.

4) I have halved the frequency so it requests DMA transfers at 256 ns intervals and sure enough then it works. Not a single DMA transfer is missed any more. I get each and every chip select signal.

5) The timing between the various events is:

Assertion of DREQ1 => Assertion of DACK1: 40 ns.

DACK1 duration : 8 ns

Assertion of DREQ1 => Assertion of CS1: 90 ns.

CS1 duration: 16 ns (as expected).

It seems I am stuck with a minimum duration of 144ns which raises a few questions:

- Why does it take so long ?

- What is it the processor doing in the mean time: is it still executing code or does the eDMA module stop everything during those 144 nanoseconds ?

- Is there any way of speeding things up ?

Next thing I will try is getting the DREQ1 signal's period as close as possible to 144 ns.

Message was edited by: Filip Dossche

144 ns does not produce missing chip selects any more but the response to DREQ1 jitters a bit every few seconds.

160 ns works well though: no missing chip select signals and a very stable response time to DREQ1.

That will work for me but I am still curious as to the reason for the 144ns minimum DMA duration.
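The three periods tried so far can be compared against the apparent 144 ns minimum with some simple arithmetic. The sketch below assumes that 144 ns really is a hard lower bound, which is only what the scope measurements suggest.

```python
MIN_TRANSFER_NS = 144  # apparent minimum eDMA transfer time (from measurements)


def request_budget(period_ns):
    """Return (request rate in MHz, slack in ns) for a given DREQ1 period."""
    rate_mhz = 1000.0 / period_ns
    slack_ns = period_ns - MIN_TRANSFER_NS
    return rate_mhz, slack_ns


if __name__ == "__main__":
    for period in (128, 144, 160):
        rate, slack = request_budget(period)
        print(f"{period} ns period: {rate:.4f} MHz, slack {slack:+d} ns")
```

A 128 ns period leaves a negative slack of 16 ns, 144 ns leaves none at all, and 160 ns leaves 16 ns of margin. That fits the missed transfers at 128 ns, the occasional jitter at 144 ns and the stable behaviour at 160 ns.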

TomE · ‎05-16-2013

> In an earlier discussion

On trying to perform these transfers using the CPU alone:

https://community.nxp.com/message/328081

Make sure you're reading V3 of the manual. V4 gets the eDMA documentation wrong.

Pick up a proper copy from here:

MCF54415 eDMA help

The eDMA is "microcoded". It has to execute internal instructions to perform the transfer.

19.5.1 eDMA Microarchitecture

When any channel is selected to execute, the contents of its TCD are read from local memory and loaded into the address path channel x registers for a normal start and into channel y registers for a preemption start. After the minor loop completes execution, the address path hardware writes the new values for the TCDn_{SADDR, DADDR, CITER} back to local memory.

All of the above takes time.

Are you programming it with Major Loops or Minor Loops? I assume you've got a single Major Loop set up to perform your line read. You can expect a very long delay at the end of a loop.

You could also increase the Minor Loop NBYTES value to perform multiple reads on a single request. You'll need to use TA for this to work.

Have you played with the Crossbar to give the eDMA priority AND to park the bus on the eDMA? Does that make any difference?

> square wave with a period of 128 nanoseconds. ... That signal is fed back to DREQ1

I don't like the sound of that. DREQ1 should only be asserted until DACK1 is driven and that should make DREQ1 go away. These signals should be interlocked in hardware (through a flip-flop). You could also add some logic in there so that TA is driven properly rather than using CS timing.

So then I'd suggest you program the DMA controller for 32-bit transfers on a 16-bit bus. The 128ns clock is divided by two to set a flip flop that drives DREQ1. The DMA controller will then perform two back-to-back 16-bit reads. The hardware will be ready for the first one, but you'll have to stretch the second one with TA. That should get you almost double the transfer speed. You might get lucky with your existing hardware, but I wouldn't bank on it.

You're running very close to the wind here. You've got no margins and no hardware FIFOs in there to handle the case where the eDMA cycle gets delayed. To work like this you'll have to ensure that nothing else can get in and run a bus cycle, delaying the eDMA access to the bus. If the CPU isn't locked into a loop where it will never get a cache miss, where it can never take an interrupt, where no other hardware (I'm looking at you, Ethernet!) can run a bus cycle ... you get the idea.

Tom

filipdossche · ‎05-17-2013

Hi again Tom,

As I mentioned in my previous reply I have thought of a radically different approach.

The idea is to connect both the Flexbus and the 24 bits of AD converter data to 3 external 32 kbyte SRAMS (about the cheapest types available)

The address lines for the SRAMs would be multiplexed and come either from a 12-bit counter chip (I only need +/- 2.5 kByte per SRAM) or from the latched FLEXBUS address lines; I would use the FB_ALE signal to latch them.

Other SRAM and ADC control lines would be generated from the PWM outputs, toward the SRAM I would foresee multiplexing with flexbus signals where necessary.

For the duration of the line scan, access to the SRAM's address, data and control lines comes from the PWM signals + extra counter/multiplexing logic. During that scan there is no Flexbus access and the processor just happily does whatever it is doing. All of the time-critical stuff is effectively done entirely independently of the microprocessor and handled by the automatically generated PWM signals + extra logic. Getting 100% precise and consistent write performance into the SRAM is trivial this way.

Once the scan is done, the multiplexing changes and the Flexbus can access everything. The end of the scan is determined by an interrupt tied to a counter value for another PWM output, which both generates the start pulse for the scanning sensor and serves as a counter of pixels scanned. Then I start a single DMA request (I guess in software) which has the required number of minor loops to transfer the data from the external SRAMs to the processor's DRAM memory. If there are delaying intermediate bus cycles or whatever, it is irrelevant. As long as the transfer is done fast enough, which I am sure will happen, that is all right by me.

I then process the data, package it, send it to the network client and start a new scan when appropriate.

I like it: no timing issues because those are perfectly handled by the excellent mcPWM module, more time for the processor to do the necessary data processing/networking stuff, and I use the Flexbus + eDMA module to maximum effect.

Thanks Tom, your advice was very very useful.

TomE · ‎05-17-2013

> I am not really looking for higher speed.

But you are looking for a big enough timing margin to guarantee the eDMA won't miss a read. Reading all the data in a single minor loop and using TA to generate "Wait States" would guarantee the timing. With the disadvantage that the eDMA would probably hog that bus for the whole of the scan line. That may or may not be a problem in your design. The Crossbar at least lets the CPU and other peripherals keep running while this is happening.

> 3 external 32 kbyte SRAMS

By using external SRAM you're building a FIFO (of sorts). But one with a lot of buses and multiplexers and tracks and so on. It may look easy on paper, but would take up a fair amount of board space.

FIFOs are available as components.

Cypress CY7C421, 512 words by 9 bits. Push the data in one side of the chip as it is read from the ADCs, and have the eDMA (or the CPU, frankly) read it out the other side.

http://www.cypress.com/?id=1962

The above are asynchronous FIFOs, probably easier to design with than their Synchronous ones, but check them as well.

Cypress also have multiport memories which would simplify your SRAM design, but be a lot more complex than using their FIFOs.

I'd suggest looking at the FIFOs. Just transfer the data from the ADCs into the FIFO and then trigger the eDMA to burst-read data from them. These FIFOs are 512 words deep by 9 bits (so you need three).

If your line is less than 512 pixels wide then just fill them and then burst-empty them in one go. If the lines are longer than that, the FIFOs have a half-full flag, so connect that to the eDMA DREQ and have it trigger a 256-word eDMA read. It will keep reading 256-word bursts automatically. Read the remaining data to the end of the line with a manually-triggered eDMA transfer of the appropriate length.
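The read plan described above can be sketched as a small calculation. This is only an illustration: plan_fifo_reads is a hypothetical helper, it assumes the half-full flag fires once per 256 words written, and it treats a line of exactly 512 words as still fitting in the FIFO.

```python
def plan_fifo_reads(line_words, fifo_depth=512):
    """Split one scan line into half-full-triggered bursts plus a final read.

    Returns (number_of_automatic_bursts, words_in_final_manual_read).
    """
    if line_words <= fifo_depth:
        # The whole line fits: fill the FIFO, then empty it in one go.
        return 0, line_words
    burst_words = fifo_depth // 2  # one eDMA burst per half-full event
    return divmod(line_words, burst_words)


if __name__ == "__main__":
    # 2500 words per line is an illustrative figure, not a measured one.
    print(plan_fifo_reads(2500))
    print(plan_fifo_reads(400))
```

For a 2500-word line this gives nine automatic 256-word bursts and a final manually triggered read of 196 words. A 400-word line needs no automatic bursts, just one read of 400 words.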

Or just have the CPU read the FIFOs by monitoring the Full/Empty flags or by triggered interrupts. That is simpler to get working.

> As to network traffic: it is a dedicated Ethernet link between

> server and client, the Ethernet peripheral is used but the link

> is free of any traffic except for the UDP packets I intend to send to the client.

I'd still say you can't totally guarantee there will never be an Ethernet packet on the wire when your scan line is being read. Linux and Windows systems transmit all sorts of junk all the time. That will trigger the Ethernet controller to read some data from SRAM, and that transfer will block and delay the eDMA cycle, changing the timing and corrupting your data. AND without there being any indication that the timing went out and the data got corrupted. You should probably add some hardware to detect timing violations so you know if and when it went wrong.

Some Ethernet hardware regularly polls its ring buffers to keep track of them. So they might be stealing cycles even when there's no data on the wire.

Likewise, how have you stopped the CPU from performing any bus cycles that could interfere with the eDMA?

But, the best solution is the one you're looking at, buffering in hardware. You only need to buffer for longer than the worst possible eDMA stall. You could get away with 16 BYTE memories if they were still available - look up the ancient 7489 to see what I mean (not appropriate for your use, but I've used them for interesting stuff in the past).
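How deep the buffer has to be follows from the sample period and the worst stall you expect. The sketch below is rough arithmetic only: the 2000 ns stall is a hypothetical figure (nothing in this thread measures the worst case), and the 160 ns period is the one currently in use.

```python
import math


def min_fifo_depth(worst_stall_ns, sample_period_ns):
    """Words that arrive while the eDMA is stalled, rounded up."""
    return math.ceil(worst_stall_ns / sample_period_ns)


def max_tolerated_stall_ns(depth_words, sample_period_ns):
    """Longest stall a buffer of the given depth can absorb."""
    return depth_words * sample_period_ns


if __name__ == "__main__":
    print(min_fifo_depth(2000, 160))        # hypothetical 2 us stall
    print(max_tolerated_stall_ns(16, 160))  # a 16-word buffer
```

Under these assumptions a 2 us stall needs 13 words of buffering, and a 16-word buffer covers stalls of up to 2560 ns at a 160 ns sample period.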

Tom

filipdossche · ‎05-21-2013

Hi Tom,

Excellent suggestion, I had a look at various FIFO memory chips and it will make things a lot simpler.

The FIFO chips cost a bit more than standard SRAM chips but the simpler design more than compensates for that.

Thanks

filipdossche · ‎05-17-2013

Hi Tom,

Thanks for your comments, here are a few things I can clarify:

1) I am definitely using V3 of the manual so no worries there.

2) Well: I saw section 19.5.1 but did not realize it would take much time but now that you mention it, of course it would.

3) You are correct: I just have a major loop with a single minor loop in it which gets activated at each request.

4) Minor loops could probably improve things but I am not really looking for higher speed.

Instead I need precise deterministic points in time when the transfer is going to occur.

The thing is that the clock frequency which now has a period of 160 ns not only drives AD conversion and eDMA transfer but it also drives a sensor module.

That module produces pixel data which stabilizes after precisely ¾ of that 160 ns period. That is the exact moment where I need to activate the external ADC.

The external ADC is pipelined and produces the value read six pixels ago for most of the 160 ns period after getting the flank to sample.

I'll check out major/minor loops + TA etc. and see what I may be able to do with them. If I can make the minor loop execute at a precisely determined interval (say 128 or 160 nanoseconds or so) I could probably use it.

5) No: I have not experimented with the crossbar and bus parking but I could give it a try just to see if it makes a difference.

6) I am now using a signal with a 160 ns period and I see an extremely stable DACK1 signal being produced at exactly the same moment in time so I can time the release of DREQ1 precisely. It seems to work because I don't see any jitter whatsoever on the DACK signal or the CS. I will check out the TA logic though.

7) I definitely need a 32 bit transfer. I've got 3 AD channels with 8 bit each and it is crucial that I read them at exactly the same moment.

The three AD channels produce pixel data at exactly the same moment and I have set up the AD converters to sample at the exact same moment as well.

I only use 24 of the 32 bits but to make sure I get them all at the same point in time I need to read those three 8 bit channels simultaneously, only a 32 bit transfer can guarantee that.

8) You are right: it is pretty close to the wind but before I started I checked it out and there is nothing else using the eDMA, not a single channel is activated.

As to network traffic: it is a dedicated Ethernet link between server and client, the Ethernet peripheral is used but the link is free of any traffic except for the UDP packets I intend to send to the client.

I am going to do that each and every time a full line of pixels has been scanned. At that point I stop the scanning, process the data, assemble the packets, send them and once they are gone I repeat the process. There will be incoming UDP packets for setup etc... but that will only happen before any scanning is activated.

As to something blocking or delaying the DMA transfer: if that occurs I have a serious problem because my pixel data is only there for 160/128 ns. If there is a delay I miss it altogether.
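Points 4) and 7) above, the six-pixel ADC pipeline and the three 8-bit channels read in one 32-bit transfer, can be sketched as a decoding step on the captured words. The byte lane order (channels in the lower 24 bits, first channel highest) is an assumption, and unpack_word and decode_line are hypothetical names used only for illustration.

```python
PIPELINE_DELAY = 6  # the ADC outputs the value sampled six pixels earlier


def unpack_word(word):
    """Split one 32-bit read into three 8-bit channel values (assumed lanes)."""
    return (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF


def decode_line(words):
    """Drop the reads taken while the pipeline was filling, unpack the rest."""
    return [unpack_word(w) for w in words[PIPELINE_DELAY:]]


if __name__ == "__main__":
    captured = [0] * PIPELINE_DELAY + [0x00112233, 0x00AABBCC]
    print(decode_line(captured))
```

With this layout, a line of N pixels needs N + 6 reads, and entry i of the decoded list belongs to pixel i.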

Thanks for your suggestions. Because of them I have managed to make something that actually appears to work.

If I can improve it I won't hesitate to do so as long as I can get precise, deterministic and reproducible timing to schedule voltage generation, AD conversion and data transfer.

For now, positioning the various PWM signals relative to each other and using the eDMA module seems to do the trick, but if I can't have that sort of timing I will need to think of some other method to achieve it.

Best regards,

Filip Dossche
