Hi,
I am looking for example code to understand how to use DMA with the eTPU UARTs on the MCF5233 MPU. There are a couple of things that I don't understand about how to make use of the logically OR'd DMA requests on the MCF5233 processor. I am hoping that example code would help me set up multiple eTPU UARTs to Tx and Rx without any issues.
Any help / explanation is appreciated.
Thank you
Indula
Hi Mike, @Hui_Ma
I was able to get the eTPU UART DMA working (sort of). I have set up one eTPU line (22) to Tx using DMA, and it works until it doesn't.
My application has interrupts enabled for SCI UART 0 and for other eTPU lines that are used for different functions (e.g. PPA, GPIO, etc.).
For some reason, at random times (but it always happens eventually), the DMA causes what seems to be a hard fault and everything hangs.
Do I need to make sure my DMA (channel 3) source buffer is located in SRAM? (I believe it should be there, given that I had to enable back-door access to SRAM for the DMA module in RAMBAR.)
I am using the default Internal Bus Arbitration settings, where priority is changed in round-robin fashion. Does this need to be fixed to the initial setting, where DMA has a higher priority than the CPU?
Is there anything else you can think of that could make the DMA cause a crash, apart from what I mentioned above?
My initialization is pretty simple,
void DMAinit(void)
{
    MCF_SCM_RAMBAR = 0x30000200U;
    MCF_SCM_GPACR = MCF_SCM_GPACR_ACCESS_CTRL(0xC);
    MCF_SCM_PACR1 = MCF_SCM_PACR1_ACCESS_CTRL0(6U);
    MCF_DMA_DSR(3U) = MCF_DMA_DSR_DONE; /* reset DMA status register */
    MCF_SCM_DMAREQC = MCF_SCM_DMAREQC_DMAREQC_EXT(8U); /* bit DMAREQC_EXT[3] controls the eTPU requests */
    (void)enable_usr_isr(MP2128_DMA3_INT_VECTOR, (uint16_t)0U, MP2128_DMA3_INT_SOURCE, MP2128_DMA3_INT_LEVEL, MP2128_DMA3_INT_PRIORITY, (VFNPTR)(mp2128_DMA3_int));
    MCF_DMA_DCR(3U) = (0
        | MCF_DMA_DCR_INT /* interrupt on completion of transfer */
        //| MCF_DMA_DCR_EEXT /* enable external request. We are setting this later to start the DMA */
        | MCF_DMA_DCR_D_REQ /* disable EEXT after the transfer */
        | MCF_DMA_DCR_CS /* single RW per request */
        //| MCF_DMA_DCR_SSIZE(1) /* size of source */
        //| MCF_DMA_DCR_DSIZE(1) /* size of destination */
        | MCF_DMA_DCR_SINC /* DMA source increment */
        //| MCF_DMA_DCR_SMOD(MCF_DMA_DCR_SMOD_16)
    );
}
Another function is called at a given rate (100 Hz in my testing) to send a 6-byte message through the eTPU UART.
I set,
MCF_DMA_SAR(3U) = (uint32_t)(&testBuff[0U]);
MCF_DMA_DAR(3U) = (uint32_t)(pba + (((FS_ETPU_UART_TX_RX_DATA_OFFSET - 1U) >> 2U)));
MCF_DMA_BCR(3U) = (uint32_t)sizeof(testBuff);
MCF_DMA_DCR(3U) |= MCF_DMA_DCR_EEXT; /* This is my control to start and stop the block transfers */
Finally, ISR for DMA channel 3 clears the DONE flag in DSR(3U) as follows.
MCF_DMA_DSR(3U) |= MCF_DMA_DSR_DONE; /* clear transfer complete */
I don't understand what is causing the CPU to crash at random times.
Before the crash, I am getting messages correctly at the defined rate.
One other thing that I noticed is that the crash happens at the beginning of a transfer (or it could be at the end),
i.e., my com log stops at the last character of testBuff.
Any help is appreciated.
Please feel free to forward this to, or tag, any of your colleagues who you think might have better insight.
Thank you
Regards
Indula
Concerning some of your notes, you should probably try to keep all of the DMA data in SRAM if you can. It is faster and likely to cause fewer problems. You can also make sure you don't have the Data Cache enabled for that area, and then you don't have to worry about flushing the cache. You should always enable the "back door", but it only needs to be used if the ADDRESSES you are using for the DMA are in the SRAM. The addresses used determine the memory, and that gives the requirement for the back door for external master access to the SRAM.
I've played with bus arbitration. It makes very little difference in practice. It is very hard to measure any differences. It shouldn't matter.
Make sure BME is set in the CCR. If your DMA controller (or the CPU) tries to address memory that doesn't exist, it will hang forever unless the "Bus Monitor" is enabled to terminate these cycles.
I have a few suggestions, but note I haven't read through your initialisation to see if I can spot any errors. I'd suggest you read the DMA and Interrupt chapters a few times to check your logic.
First, don't poll any DMA registers. Leave it alone until it interrupts.
Second (disobeying the first rule :-), poll the "DSR3" register to see if it is reporting any errors. If doing that makes things worse, then follow the first rule. Check the DSRs before and after transfers. Check ALL the DMA registers for expected content.
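As a rough illustration of the DSR check suggested above, here is a minimal sketch that decodes a sampled status byte. The bit masks are an assumption based on the usual ColdFire DMA status register layout and should be verified against the MCF523x reference manual; `dsr_has_error` and `dsr_report` are hypothetical helper names, not part of any NXP header.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed DSR bit masks; verify against the MCF523x reference manual. */
#define DSR_DONE 0x01u /* transfer complete */
#define DSR_BSY  0x02u /* channel busy */
#define DSR_REQ  0x04u /* request pending */
#define DSR_BED  0x10u /* bus error on destination */
#define DSR_BES  0x20u /* bus error on source */
#define DSR_CE   0x40u /* configuration error */

#define DSR_ERROR_MASK (DSR_BED | DSR_BES | DSR_CE)

/* Returns nonzero if the sampled DSR value reports any error flag. */
int dsr_has_error(uint8_t dsr)
{
    return (dsr & DSR_ERROR_MASK) != 0u;
}

/* Prints the sampled value and one line per flag that is set. */
void dsr_report(const char *when, uint8_t dsr)
{
    printf("DSR %s = 0x%02X\n", when, (unsigned)dsr);
    if (dsr & DSR_CE)   { printf("  CE:   configuration error\n"); }
    if (dsr & DSR_BES)  { printf("  BES:  bus error on source\n"); }
    if (dsr & DSR_BED)  { printf("  BED:  bus error on destination\n"); }
    if (dsr & DSR_REQ)  { printf("  REQ:  request pending\n"); }
    if (dsr & DSR_BSY)  { printf("  BSY:  channel busy\n"); }
    if (dsr & DSR_DONE) { printf("  DONE: transfer complete\n"); }
}
```

On the target this would be called with a single read of the status register, for example `dsr_report("before", MCF_DMA_DSR(3U))` just before enabling EEXT and again in the completion ISR before DONE is cleared, so the channel is sampled rather than polled in a loop.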
Third, there's one big trap with all the MCF52xx chips that everybody falls into:
13.2.1.6 Interrupt Control Register (ICRnx, (x = 1, 2,..., 63))
Note: It is the responsibility of the software to program the ICRnx
registers with unique and non-overlapping level and priority definitions.
Failure to program the ICRnx registers in this manner can result in
undefined behavior. If a specific interrupt request is completely unused,
the ICRnx value can remain in its reset (and disabled) state.
Read the above carefully. That means that if you have two (or more) devices requesting on the same LEVEL (on the same controller - there are two of these) then they must have different PRIORITIES. If you have 9 devices on the same level on the same controller - well you just can't have that. You have to move some of them to different levels. Getting this wrong causes nasty intermittent problems.
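A minimal sketch of how the uniqueness rule above could be checked at start-up, assuming the ICR layout is level in bits 5:3 and priority in bits 2:0, with level 0 meaning "disabled" (please confirm against the Interrupt Controller chapter). `icr_count_conflicts` is a hypothetical helper; it would be fed a copy of one controller's ICR values.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed ICR field layout: IL in bits 5:3, IP in bits 2:0. */
#define ICR_LEVEL(icr) (((unsigned)(icr) >> 3) & 0x7u)

/*
 * Counts the enabled sources whose (level, priority) pair duplicates one
 * already used by an earlier entry. A result of 0 means every enabled
 * source on this controller has a unique pair.
 */
int icr_count_conflicts(const uint8_t *icr, size_t count)
{
    uint8_t seen[64] = {0}; /* indexed by (level << 3) | priority */
    int conflicts = 0;

    for (size_t i = 0; i < count; i++) {
        unsigned key = (unsigned)icr[i] & 0x3Fu;

        if (ICR_LEVEL(icr[i]) == 0u) {
            continue; /* level 0: source is disabled, ignore it */
        }
        if (seen[key]) {
            conflicts++;
        } else {
            seen[key] = 1u;
        }
    }
    return conflicts;
}
```

With only eight priorities per level, nine enabled sources on one level of one controller will always produce at least one conflict here, which is the "you just can't have that" case. Each controller's table would be checked separately.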
Searching for "overlapping levels and priority definitions" in this forum finds the following that complains that the NXP Demo code doesn't follow these requirements, and is unreliable as a result:
Here's another one from nearly 11 years ago, including a warning about a whole missing section in the manual that was fixed in 2015:
Tom
Hi,
There is an application note AN2168 about DMA module.
Please refer to page 18, section "DMA and Bus Prioritization".
Hope it helps.
Mike
Hi Mike, @Hui_Ma
Thank you for the document. I will go through it and see if I missed anything.
Also, at a quick glance, I saw that the examples given in that document use chip select registers along with the DMA configuration registers to configure transfers. How do I determine if this is something that I need to do as well?
I am not specifically configuring any chip select registers in my code, but all the memory is properly mapped in the linker and it seems to be working without them. I was just wondering if this could cause delays, etc.
Thank you again for the quick responses.
~Indula
Hi,
AN2168 just gives an example of what a DMA bus cycle looks like. It uses DMA together with the DRAMC (DRAM controller) so that a logic analyzer can capture the external DRAM memory timing and show the DMA bus cycle. If you are using on-chip RAM, there is no need to set up any chip select.
Thanks for the attention.
Mike
Thanks a lot Mike @Hui_Ma and Tom @TomE for the explanation. It makes sense now.
I also found the cause of the random crashes. It was me not paying attention to previous code that was already working with interrupts. Although I thought I had #ifdef'd out all the previous code, I had accidentally left some parts enabled, and those were causing the crashes.
I have fixed that and there is no more crashing with DMA now. The reason I wanted to try DMA for the eTPU UARTs was to improve CPU performance. With interrupts, the CPU gets pretty loaded with multiple eTPU UARTs and the rest of the application code. But when I measured the execution times to evaluate the performance, I ended up with the same figures for DMA as for interrupts.
As Tom mentioned, I made sure the BDE and SPV bits in both RAMBARs are enabled and located the eTPU DMA source buffer (since I am testing Tx for now) in SRAM. The DMA channel 3 interrupt also has its own unique interrupt level and priority. I also played with the MPARK register to fix the bus access priority so that DMA has an edge over the CPU. Despite all the changes, I am not seeing any drastic improvements.
I am going to try executing the DMA related code from SRAM rather than from SDRAM to see if that makes a difference.
If you have any other suggestions to possibly improve the DMA performance, please feel free to share with me.
Thank you
~Indula
> Despite all the changes, I am not seeing any drastic improvements.
That usually means you're making code faster where the CPU isn't spending much time anyway. I remember the story of someone who spent a week optimising a function. Over a year later it ran for the first time and crashed because it was seriously buggy.
That means you have to MEASURE and preferably PROFILE your code. I have profiling built into the code I support (as a Makefile option for testing). I have a timer interrupt at Level 7 (NMI) at 13us. The service routine extracts the interrupted program counter from the stack and uses it as an index into a big 32-bit integer array the same size as the code. It increments that location. It does that for a minute or so, and then dumps the array. So that captures how many times each instruction was executed. I then combine that data with the symbol table and get a list of how much time it spent in each function. When I know the problematic functions, I can interleave the counts with a disassembly. That usually shows me which loops in which functions are taking the most time.
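The sampling scheme described above can be sketched roughly as follows. This is only an outline under stated assumptions: `CODE_BASE` and `CODE_SIZE` are made-up values, there is one counter per 16-bit instruction word, and pulling the interrupted program counter out of the exception stack frame is hardware-specific and not shown. The names `profile_sample` and `profile_range_hits` are hypothetical.

```c
#include <stdint.h>

#define CODE_BASE  0x80100000u        /* hypothetical start of the code region */
#define CODE_SIZE  0x00001000u        /* hypothetical size of the code region  */
#define SLOT_COUNT (CODE_SIZE / 2u)   /* one counter per 16-bit word           */

uint32_t profile_hits[SLOT_COUNT];
uint32_t profile_out_of_range;

/* Called from the profiling timer ISR with the interrupted PC. */
void profile_sample(uint32_t pc)
{
    /* Unsigned subtraction: a PC below CODE_BASE wraps to a huge offset
     * and is rejected by the same range test. */
    uint32_t offset = pc - CODE_BASE;

    if (offset < CODE_SIZE) {
        profile_hits[offset / 2u]++;
    } else {
        profile_out_of_range++;
    }
}

/* Sums the samples in [start, end), e.g. one function from the symbol table. */
uint32_t profile_range_hits(uint32_t start, uint32_t end)
{
    uint32_t total = 0u;

    for (uint32_t addr = start; addr < end; addr += 2u) {
        uint32_t offset = addr - CODE_BASE;

        if (offset < CODE_SIZE) {
            total += profile_hits[offset / 2u];
        }
    }
    return total;
}
```

Dumping `profile_hits` and summing it per symbol gives the per-function list; keeping the per-address counts is what allows interleaving with a disassembly afterwards.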
A simpler approach is to configure a spare 32-bit DMA Timer to free-run at 1MHz or 10MHz. In a piece of code you want to measure (like the UART interrupt routine), read the running timer value at the start of the function, at the end, subtract the two and add to an accumulator. Then once per second (or whatever), read and print and clear that counter. That tells you how many microseconds per second that function took. You can also save how long one run took. You can add a bunch of different counters and measure the execution time of a lot of different functions at once.
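A minimal sketch of that timer-based accumulator, assuming a free-running 32-bit counter at 1MHz. `read_timer_us` is a hypothetical stand-in (here backed by a plain variable so the logic can be exercised off-target); on the real chip it would return the spare DMA Timer's counter register.

```c
#include <stdint.h>

/* Hypothetical stand-in for the free-running 1MHz timer counter. */
uint32_t fake_timer_us;

uint32_t read_timer_us(void)
{
    return fake_timer_us;
}

typedef struct {
    uint32_t start_us; /* timer value at function entry              */
    uint32_t last_us;  /* duration of the most recent run            */
    uint32_t total_us; /* accumulated duration since the last report */
} exec_timer_t;

/* Call at the start of the code being measured. */
void exec_timer_enter(exec_timer_t *t)
{
    t->start_us = read_timer_us();
}

/* Call at the end of the code being measured. */
void exec_timer_exit(exec_timer_t *t)
{
    /* Unsigned subtraction stays correct across one counter wraparound. */
    uint32_t elapsed = read_timer_us() - t->start_us;

    t->last_us = elapsed;
    t->total_us += elapsed;
}

/* Call once per second (or whatever): returns and clears the accumulator. */
uint32_t exec_timer_report(exec_timer_t *t)
{
    uint32_t total = t->total_us;

    t->total_us = 0u;
    return total;
}
```

One `exec_timer_t` per function of interest gives the "bunch of different counters". If the measured code is an ISR, the read-and-clear in `exec_timer_report` would need to run with that interrupt masked, or a small error has to be accepted.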
There's so much overhead loading up all the DMA Controller registers that you have to be transmitting a lot of data (20 or more bytes) to make it worthwhile over the small number of UART register writes needed in an interrupt routine for non-DMA handling. But unless you're running the serial ports at megabit rates, normal interrupts shouldn't take much time. For instance I've measured our code sending at 230,400 bits/sec. That's only one interrupt per 43us or 6510 CPU clocks/instructions. The code was using inefficient multiple function callbacks from the interrupt service routine and was taking 4.3% of the CPU. After a simplifying rewrite it now takes only 2.2%. Here's an example of the profiling:
can_comm                        28332    1.22%
qspi_isr.lto_priv.585           33193    1.43%
t2.part.2                       36818    1.59%
uart_isr1                       51369    2.22%  <<<<<<<<<< This one
vimcom_cycle                    62968    2.72%
adc_isr.lto_priv.597           275740   11.94%
main_proc_loop.lto_priv.598   1622792   70.32%
Controller registers (Uart, TPU or DMA) typically take 10 CPU clocks or more for one read or write. Remember that. That's a "Dirty Secret" that the data books never mention. So minimise controller register reads and writes. Unless the Cache is off and then they're probably faster than your SDRAM.
> I am going to try executing the DMA related code from SRAM rather than from SDRAM
That shouldn't make too much difference as long as you have the Instruction Cache turned on. You DO have that on I hope? If you don't, then you don't have a 150MHz CPU. You have a 20MHz CPU. Maybe only 10MHz if you have a 16 bit SDRAM data bus. Read my second reply here:
https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/MCF5234-QSPI-via-DMA/m-p/1533682#M14284
Make sure you have the CPU Stack in SRAM, and commonly accessed data too if you can. I've measured an almost doubling of speed for a thread with its stack in SRAM versus SDRAM (413ms, down from 732ms).
You also want the Data Cache enabled. It makes a difference too. But then you either want a block of SDRAM programmed as "uncached" or have all your DMA Buffers (DMA, Ethernet, USB) in SRAM. Your chip doesn't have Ethernet or USB, so you don't have to worry about those. If you want cached data and DMA buffers in SDRAM you need to cache-flush buffers before use.
With our code (on 150MHz MCF5235), the CPU was 47% "busy" with the 8k instruction cache. Changing to split 4k instruction and data changed the CPU to only being 27% "busy". That's a lot!
Tom
Hi Tom, @TomE
Thank you for the detailed explanation and for the suggestions.
I do have profiling built into the code, where I can measure how much the CPU is busy vs idle. I was getting the same idle time with DMA enabled for 2 UART Tx lines as with the old code using interrupts. That's why I said I don't see any improvements. I only need to Tx 6 bytes at 250Hz on each Tx line. Maybe that's the reason why (as you've said) the DMA does not have any advantage over interrupts.
The biggest issue is that the MCF523x only has one DMA channel for eTPU-related work, so when you use DMA for more than one eTPU UART, you have to reprogram the registers for the new line after each successful transfer. That could be a considerable amount of overhead.
"That shouldn't make too much difference as long as you have the Instruction Cache turned on"
- You are correct, I did not see any improvement at all after moving the functions to execute from SRAM. I have the split instruction / data cache option enabled (i.e., CACR is set to 0x80000000 using movec).
I am already building with -O3 optimizations, and maybe this is the fastest that I can get unless I can somehow improve the ISR logic to get some time savings there, or improve the logic flow of the protocol that I am executing.
Thank you
~Indula
> I need to only Tx 6 bytes at 250Hz for each Tx line
I don't know how many transmit channels you have, but that's 666 us/byte, or 100,000 CPU clocks per byte at 150MHz. In order for the service routine to take 1% of the CPU it would have to take 1000 clocks. I'd be surprised if it took 100. Looking at my ISR, it takes about 31 instructions.
I don't know where your code is spending its time, but unless something is badly wrong, it isn't spending it in that code. Maybe you have somewhere in the code where it is looping, waiting for the previous data to be sent, that is being counted as "busy" instead of "idle"?
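The arithmetic above can be checked with a few lines of integer math. This is just a sketch: the 150MHz core clock is an assumption (it is what makes 43us correspond to about 6510 clocks in the earlier figures), and the function names are hypothetical.

```c
#include <stdint.h>

/* CPU clocks available between successive bytes on one channel. */
uint32_t clocks_per_byte(uint32_t cpu_hz, uint32_t bytes_per_sec)
{
    return cpu_hz / bytes_per_sec;
}

/*
 * CPU load of a per-byte ISR in hundredths of a percent
 * (100 means 1.00%). 64-bit intermediate avoids overflow.
 */
uint32_t isr_load_centipercent(uint32_t cpu_hz,
                               uint32_t bytes_per_sec,
                               uint32_t isr_clocks)
{
    uint64_t busy_clocks_per_sec = (uint64_t)bytes_per_sec * isr_clocks;

    return (uint32_t)((busy_clocks_per_sec * 10000u) / cpu_hz);
}
```

For 6 bytes at 250Hz (1500 bytes/second) and an assumed 150MHz clock, `clocks_per_byte(150000000u, 1500u)` gives 100000, and `isr_load_centipercent(150000000u, 1500u, 1000u)` gives 100, i.e. a 1000-clock service routine costs 1.00% of the CPU per channel, while a 100-clock one costs 0.10%.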
The first question should be "do you need to make it faster?". If there's still some idle time, why does it need to be more efficient?
Assuming you do need it to be more efficient, you need to find out what functions are really taking the time.
> I do have profiling built into the code where I can measure how much CPU is busy vs idle.
Since I'm Australian, I can quote Crocodile Dundee and say "That's not Profiling. THIS is Profiling!":
801121c2 <uart_isr1>:
{
       0: 801121c2: 4e56 ffec          linkw %fp,#-20
   12102: 801121c6: 48d7 0307          moveml %d0-%d2/%a0-%a1,%sp@
    INTERRUPT_ENTRY_TIMER;
     468: 801121ca: 2039 2000 07d0     movel 200007d0 <t_stop_exit>,%d0
     808: 801121d0: 660c               bnes 801121de <uart_isr1+0x1c>
       2: 801121d2: 2039 4000 044c     movel 4000044c <IPSBAR+0x44c>,%d0
      43: 801121d8: 23c0 2000 07d0     movel %d0,200007d0 <t_stop_exit>
    uint8_t usr = *(p->pUsr);
      79: 801121de: 2079 801c b510     moveal 801cb510 <uart.lto_priv.493+0x3c>,%a0
     382: 801121e4: 1410               moveb %a0@,%d2
    if ( usr & MCF_UART_USR_RXRDY &&
   10877: 801121e6: 0802 0000          btst #0,%d2
       1: 801121ea: 6720               beqs 8011220c <uart_isr1+0x4a>
         p->imr & MCF_UART_UIMR_RXRDY_FU && // rxing
       0: 801121ec: 1039 801c b518     moveb 801cb518 <uart.lto_priv.493+0x44>,%d0
    if ( usr & MCF_UART_USR_RXRDY &&
       0: 801121f2: 0800 0001          btst #1,%d0
       0: 801121f6: 6714               beqs 8011220c <uart_isr1+0x4a>
         p->rx_callback ) // callback valid
       0: 801121f8: 2079 801c b50c     moveal 801cb50c <uart.lto_priv.493+0x38>,%a0
         p->imr & MCF_UART_UIMR_RXRDY_FU && // rxing
       0: 801121fe: 4a88               tstl %a0
       0: 80112200: 670a               beqs 8011220c <uart_isr1+0x4a>
        p->rx_callback(p->interface);
       0: 80112202: 2f39 801c b4f8     movel 801cb4f8 <uart.lto_priv.493+0x24>,%sp@-
       0: 80112208: 4e90               jsr %a0@
       0: 8011220a: 588f               addql #4,%sp
    if ( usr & MCF_UART_USR_TXRDY && // tx buf empty
     461: 8011220c: 44c2               movew %d2,%ccr
       1: 8011220e: 663c               bnes 8011224c <uart_isr1+0x8a>
         p->imr & MCF_UART_UIMR_TXRDY ) // txing
     551: 80112210: 1039 801c b518     moveb 801cb518 <uart.lto_priv.493+0x44>,%d0
    if ( usr & MCF_UART_USR_TXRDY && // tx buf empty
     194: 80112216: 0800 0000          btst #0,%d0
     287: 8011221a: 6730               beqs 8011224c <uart_isr1+0x8a>
        if (p->ppBuf)
       2: 8011221c: 2079 801c b504     moveal 801cb504 <uart.lto_priv.493+0x30>,%a0
    1074: 80112222: 4a88               tstl %a0
      87: 80112224: 6766               beqs 8011228c <uart_isr1+0xca>
            if (*p->pLen)
       5: 80112226: 2279 801c b500     moveal 801cb500 <uart.lto_priv.493+0x2c>,%a1
     132: 8011222c: 4a91               tstl %a1@
     719: 8011222e: 6726               beqs 80112256 <uart_isr1+0x94>
                *p->pUtb = **p->ppBuf;
       9: 80112230: 2050               moveal %a0@,%a0
     450: 80112232: 1010               moveb %a0@,%d0
    2899: 80112234: 2079 801c b514     moveal 801cb514 <uart.lto_priv.493+0x40>,%a0
     279: 8011223a: 1080               moveb %d0,%a0@
                *p->ppBuf += 1;
     437: 8011223c: 2079 801c b504     moveal 801cb504 <uart.lto_priv.493+0x30>,%a0
    8290: 80112242: 5290               addql #1,%a0@
                *p->pLen -= 1;;
     129: 80112244: 2079 801c b500     moveal 801cb500 <uart.lto_priv.493+0x2c>,%a0
     609: 8011224a: 5390               subql #1,%a0@
}
     920: 8011224c: 4cee 0307 ffec     moveml %fp@(-20),%d0-%d2/%a0-%a1
    1705: 80112252: 4e5e               unlk %fp
    2182: 80112254: 4e73               rte
The first field is the count of how many times the previous instruction was sampled by the profiling interrupt. The "busiest" instruction is supposedly the "link" one at the top (12102), but that is counting all of the interrupt starting overhead as well, and so doesn't really count. The next most busy one (10877) is the one reading the UART status register. I TOLD you these reads were slow! In total (measured separately), the above, transmitting at over 20,000 bytes/second takes 2% of the CPU's time. So your code, sending 1500 bytes/second should be taking only 0.15% of the CPU per transmit channel.
The other problem with DMA is you can only transmit on one channel at once. With interrupts you can have them all transmitting at the same time. That might or might not matter in your case.
Tom