RGPIOs access from DMA

eukeniurkiaga · ‎03-23-2017

Hi,

I am working with the MCF54418. I need to set a GPIO when I detect the fifth falling edge on another signal. First I tried to write GPIO_PPDSDR from the DMA (using DTIM0 and DT0IN). It works, but it is slower than I need, so I decided to use the RGPIO. I then tried to write RGPIO_DATA (0x8C000002) from the DMA (DTIM0), but this does not work, and there is no feedback in the eDMA error status register. I have also tried writing the RGPIO_DATA register from an interrupt instead of the DMA; it works, but it is not as fast as I need (2us).

My question is: are the RGPIO registers accessible from the DMA? I have done many, many tests and it seems they are not accessible from the DMA, only from code or interrupts. Or do I need to change some other register?

thx in advance.

TomE · ‎03-23-2017

The Block Diagram shows the RGPIO "inside" the CPU block. That implies that DMA can't access it.

The documentation for the RGPIOBAR register states for the "V" bit that "Processor accesses of the RGPIO are enabled". The only other control bit determines if user-space access is allowed. That can be assumed to be Processor User Space access. So that seems to say "no".

Check the documentation for the SRAM. It is Dual Ported and has to have a dedicated "back door" to allow access from peripherals via a port on the Crossbar Switch. The RGPIO doesn't have this.

Look at the diagram of the Crossbar Switch in section "14.1 Overview". There are "Master" and "Slave" ports. Only Masters can access Slaves. The SRAM "backdoor" is a Slave port. The RGPIO device is on the Master side of that Crossbar, so it is impossible for any other devices to access it.

If you've ever timed the DMAC you'll probably find it is quite slow when compared with the CPU. So if you're finding it slow when accessing the GPIO registers, it is probably more to do with the DMAC than the GPIO access. Probably both.

Register access on these CPUs is very slow. Here are previous posts showing the V2 CPUs take 12 clocks and implying the V4 ones (your one) are worse than that:

https://community.nxp.com/message/62228#62228

https://community.nxp.com/message/307212#307212

https://community.freescale.com/message/42042#42042

http://permalink.gmane.org/gmane.comp.hardware.motorola.microcontrollers.coldfire/8056

And since gmane is not "permanent" any more, it is lucky that I copied the relevant text into two of the above posts.

Programming the DMAC is going to hit the same "slow register access problems" as getting to anything else.

> I have tried to change the RGPIO_DATA register from interrupt instead of DMA and it works but it is not so fast (2us) as i need.

This is a 250MHz CPU. If your interrupt service routine is taking longer than 500 CPU instructions to take the interrupt and write to one register then you've got something badly wrong there. The code is either stunningly inefficient, or the CPU is limping along at a fraction of its proper rate. So which one is it? Is the CPU running properly and at the right clock rate? Is the Cache enabled? What RAM are you using, internal or external? Where's the CPU Stack? It should be in the internal 64k Static RAM. That makes interrupts run faster.

You should be able to get from the interrupt to writing the pin in less than 50 instructions which should be able to execute in slightly more (due to cache loading) than 50 clocks, or 200 nanoseconds.
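The arithmetic behind those two figures can be checked with a small helper. This is only a sketch; the function names are made up for illustration, and it assumes roughly one instruction per clock, which is the same simplification used above.

```c
#include <stdint.h>

/* Number of CPU clocks that fit into a time window.
 * window_ns: length of the window in nanoseconds
 * cpu_mhz:   core clock in MHz (250 for the part discussed here) */
uint32_t clocks_in_window(uint32_t window_ns, uint32_t cpu_mhz)
{
    return (uint32_t)(((uint64_t)window_ns * cpu_mhz) / 1000u);
}

/* Time taken by a given number of CPU clocks, in nanoseconds. */
uint32_t ns_for_clocks(uint32_t clocks, uint32_t cpu_mhz)
{
    return (uint32_t)(((uint64_t)clocks * 1000u) / cpu_mhz);
}
```

With a 250MHz core, `clocks_in_window(2000, 250)` gives the 500-clock budget that a 2us latency represents, and `ns_for_clocks(50, 250)` gives the 200ns target.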

You should be using multi-level interrupts with the one you need to be fast at a higher CPU priority (say Level 6) than any of the other ones, and all of the rest of the system should be set up so that interrupt routines can interrupt other ones.

Are you running Linux on it? Give up any thought of "real time" if you are. Linux doesn't usually understand or support multi-level interrupts either.

If the "5th falling edge" is known to happen at a pretty constant time after the fourth edge, then you can take the interrupt on the 4th one (or the 3rd, or the rising edge before the 5th falling one) and then just sit in the interrupt routine, polling for that 5th falling edge. The signal should be connected to an RGPIO input pin so it can be polled quickly. You'll be able to write to the RGPIO pin in two or three clocks (at 250MHz) after detecting that edge. If you can't hack that latency, then make that a low-priority interrupt so more important ones can interrupt it. But you then have to guarantee all higher priority interrupts will complete in less than 2us.
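A rough sketch of that polling idea follows. The function name, the register pointers and the poll limit are all assumptions for illustration; on real hardware the pointers would refer to the RGPIO registers, and the exact register layout should be taken from the Reference Manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Busy-wait for a high-to-low transition on an input bit, then set an
 * output bit. Intended to be called from inside the early interrupt.
 * in_reg/out_reg are hypothetical pointers to the input and output
 * registers; max_polls bounds the wait so the ISR cannot hang forever.
 * Returns true if the edge was seen and the output was written. */
bool wait_falling_then_set(volatile const uint16_t *in_reg, uint16_t in_mask,
                           volatile uint16_t *out_reg, uint16_t out_mask,
                           uint32_t max_polls)
{
    /* Phase 1: wait until the input is high, so the next low is a real edge. */
    while ((*in_reg & in_mask) == 0u) {
        if (max_polls == 0u) {
            return false;
        }
        max_polls--;
    }

    /* Phase 2: wait for the high-to-low transition. */
    while ((*in_reg & in_mask) != 0u) {
        if (max_polls == 0u) {
            return false;
        }
        max_polls--;
    }

    /* Edge detected: drive the output bit. */
    *out_reg |= out_mask;
    return true;
}
```

The poll limit is there because the routine blocks every interrupt at the same or lower level while it spins, which is why the edge timing needs to be predictable for this approach to be usable.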

Tom

eukeniurkiaga · ‎03-29-2017

Thanks Tom.

Yes, it seems that the delay is due to MQX. As I have seen from AN4254.pdf:

"..a standard MQX interrupt is possible for applications where the additional delay of MQX interrupt processing (typically 2 μs depending on the CPU speed) is not critical..."

I will try with the Kernel ISR.

Thx. Your help is greatly appreciated.

TanT

eukeniurkiaga · ‎03-27-2017

Wow!! Thank you very much Tom!

Now I know the RGPIO cannot be accessed from the DMA. I think that should be written somewhere in the user manual.

Yes, we are copying code from NAND Flash to DDR2 RAM. The CPU stack is in the internal static RAM, the cache is enabled and I think the CPU is running properly at the right clock rate. I will investigate whether there is any problem with the interrupt priorities, although I set the interrupt priority to 6.

Just as a comment, I measured the 2us with an oscilloscope between the fifth falling edge and the RGPIO. And I set the RGPIO in the first instruction (MQX) of the DTIM0 interrupt routine. If I have understood correctly, in your opinion I should be able to set the RGPIO in approximately 200ns, so something must be wrong in my system. I will do more tests.

thx!

TomE · ‎03-27-2017

The delay may be due to MQX, in which case you should be searching the MQX forum.

Or just ask Google for "MQX interrupt latency" and find:

https://community.nxp.com/thread/107824

Which recommends "And then installing interrupt routine as kernel-ISR".

Google also found:

"Motor Control Under the Freescale MQX Operating System":

http://www.nxp.com/assets/documents/data/en/application-notes/AN4254.pdf

Tom

TomE · ‎03-27-2017

> And I set the RGPIO in the first instruction (MQX) of the DTIM0 interrupt routine.

OK, so you're running MQX. You'll have to find out how much it is getting in the way. You'll have to read up on it to see if the documentation tells you what it is doing. I can't help as I've never used it. Does MQX come with full sources? You may have to track through all the code manually to see where it might be going and what it is doing. It may be doing some complicated MMU context-switch operation.

Otherwise, you should do a few basic checks to see if the CPU is running like you think it is. Measure FB_CLK and make sure it is 62.5 MHz. Write some simple loops and then check the disassembly to see how many instructions there are in the loop. Then run the code and time it. Either make it long enough that you can measure it with a stopwatch, have it time itself off one of the counters (setting up a 32-bit timer to run at 1MHz is always useful), or flip an RGPIO bit over at the end of a loop and measure it with an oscilloscope. Watch out that a smart compiler will optimise a for-loop down to nothing if the loop doesn't actually do anything (have any output) or if it can compute the output. That's why you should read the assembly code and count instructions.
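As a minimal sketch of such a test loop (the names here are invented for illustration), writing to a `volatile` variable on every pass gives the loop an observable side effect, so the compiler has to keep it. The elapsed time still has to come from a hardware timer, a stopwatch or an oscilloscope; the second helper only turns that measurement into a per-iteration figure.

```c
#include <stdint.h>

/* Every store to a volatile object must actually be performed,
 * so the loop below cannot be optimised away. */
static volatile uint32_t sink;

/* Run a fixed number of loop iterations and return how many were executed. */
uint32_t spin(uint32_t iterations)
{
    uint32_t i;
    for (i = 0; i < iterations; i++) {
        sink = i;
    }
    return i;
}

/* Convert a measured elapsed time into nanoseconds per iteration.
 * elapsed_us would come from a 1MHz timer read before and after spin(). */
uint32_t ns_per_iteration(uint32_t elapsed_us, uint32_t iterations)
{
    if (iterations == 0u) {
        return 0u;
    }
    return (uint32_t)(((uint64_t)elapsed_us * 1000u) / iterations);
}
```

The disassembly of `spin()` still needs to be read to know how many instructions one iteration really contains before comparing the result with the 4ns clock period.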

The important thing is to work out how fast the CPU should be executing instructions at 250MHz (The CPU is Superscalar, so more than 250M unless you execute a divide), and to then MEASURE how long it is actually taking. If these don't match up, keep writing more tests until you get an explanation of what is really happening.

> The CPU stack is in the internal static RAM, the

If you're running multiple threads in MQX, are all the thread stacks in SRAM or might some of them be external?

You should be able to find the initial vector for the interrupt you're generating. That will probably call into some MQX Assembly code handling the interrupt and performing OS operations before eventually calling the interrupt routine that your code registered. It might be possible to hook or add some fast code to that assembly to do the dedicated RGPIO operation. That should speed the response up a lot.

Here's a fun test. Write a tight loop that simply turns a GPIO pin on and off via the GPIO. Then repeat using the RGPIO. The latter claims "zero wait" so in theory you should be able to get nearly 250MHz between edges. Then again, the Manual says "Package pin toggle rates typically 1.5–3.5x faster" than GPIO, but from my experience the latter should be a LOT slower than that. Here's someone who has done that on this chip and measured 16ns between RGPIO toggles, so you should be able to get the same:
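A sketch of what that test loop could look like is below. The function name is made up, and the register is passed in as a pointer so the same loop can be aimed at either the GPIO or the RGPIO data register; the real addresses and any dedicated toggle registers should be taken from the Reference Manual.

```c
#include <stdint.h>

/* Flip one pin repeatedly by XOR-ing its bit in a data register.
 * data_reg is a hypothetical pointer to the port's data register;
 * volatile forces a real read and write on every pass. */
void toggle_pin(volatile uint16_t *data_reg, uint16_t mask, uint32_t toggles)
{
    while (toggles > 0u) {
        *data_reg ^= mask;
        toggles--;
    }
}
```

Running it with a large count and measuring the time between edges on an oscilloscope gives the per-toggle cost for each port type, loop overhead included.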

https://community.nxp.com/message/327943

Here's some more "speed stuff":

https://community.nxp.com/message/307212?commentID=307212#comment-307212

How are you capturing "the 4th edge" using a DMA Timer? Are you clocking the DMA Timer with the external signal and counting up to a match value? That might result in the external clock being used to drive the internal state machine, and it may take multiple clocks before that machine works its way through the logic to raise the interrupt.

There are problems with the Manuals. You need BOTH Revisions 3 and 4. Rev 3 has the correct DMA controller chapter, which got broken in Rev 4 (and never fixed). The DMA Timer chapters apparently have problems too, according to this post (from nearly 5 years ago):

https://community.nxp.com/message/307125?commentID=307125#comment-307125

Tom

TomE · ‎03-26-2017

> Is the CPU running properly and at the right clock rate? Is the Cache enabled?

What are you storing the code in? If you're running from something (NOR Flash?) on the FlexBus, then the maximum speed that can run at is 100MHz according to the Reference Manual, or 62.5MHz according to the Data Sheet. I believe the latter, as you can only run it at the CPU Clock divided by 2 or 4, and 250MHz/4 is 62.5MHz. At a minimum of 4 clocks per read, that's a 15.6MHz data rate. If you don't have the instruction cache enabled, that's the speed the 250MHz CPU will be running at. If you do have the instruction cache enabled, it will still take that long (4 reads, or 256ns, or 64 CPU instruction times) just to load ONE cache line after a cache miss. Double those figures if you have 16-bit wide memory on the FlexBus.

The previous paragraph assumes 6ns Flash, which doesn't exist. Assuming 100ns Flash, you'll need 7 wait states, meaning 11 16ns clocks per read, or 176ns, or 5.7MHz or 44 CPU clocks. Quadruple that for a cache line read on a 32-bit bus, or octuple it on a 16 bit bus.
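The numbers in the two paragraphs above can be reproduced with a few lines of integer arithmetic. This is a sketch with invented function names; it assumes a 16ns FlexBus clock (62.5MHz), a 4ns CPU clock (250MHz), and four reads per cache line on a 32-bit bus or eight on a 16-bit bus, as described above.

```c
#include <stdint.h>

/* Time for one FlexBus read: the base clocks per read plus wait states,
 * each lasting one bus clock period. */
uint32_t flexbus_read_ns(uint32_t bus_clock_ns, uint32_t base_clocks,
                         uint32_t wait_states)
{
    return bus_clock_ns * (base_clocks + wait_states);
}

/* Time to refill one cache line, given how many bus reads it takes. */
uint32_t line_fill_ns(uint32_t bus_clock_ns, uint32_t base_clocks,
                      uint32_t wait_states, uint32_t reads_per_line)
{
    return flexbus_read_ns(bus_clock_ns, base_clocks, wait_states)
           * reads_per_line;
}

/* Express a duration as a number of CPU clock periods. */
uint32_t ns_to_cpu_clocks(uint32_t duration_ns, uint32_t cpu_clock_ns)
{
    return duration_ns / cpu_clock_ns;
}
```

For example, `flexbus_read_ns(16, 4, 7)` gives the 176ns read time for 100ns Flash, which `ns_to_cpu_clocks(176, 4)` converts to 44 CPU clocks, and `line_fill_ns(16, 4, 7, 4)` gives 704ns for a cache line on a 32-bit bus.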

Generally the fastest way to run these CPUs is to copy all the code from Flash into DDR2 RAM and have that cached. DDR2 can support faster cache line reloads than anything other than having the code small enough to run from the 64k SRAM. Then you have to worry about the data read and write speed, but turning the data cache on is more complicated because you need to handle all the DMA peripherals.

Tom