> I need to only Tx 6 bytes at 250Hz for each Tx line
I don't know how many transmit channels you have, but that's 666 us/byte, or about 100,000 CPU clocks per byte. For the service routine to take 1% of the CPU it would have to take 1000 clocks. I'd be surprised if it took 100. Looking at my ISR, it takes about 31 instructions.
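For anyone who wants to check that arithmetic, here is a minimal sketch in C. The 150 MHz core clock is an assumption, back-calculated from 666 us coming out at roughly 100,000 clocks; substitute your own clock. The function names are made up for illustration, and it assumes one transmit interrupt per byte.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: 150 MHz core clock, inferred from 666 us ~= 100,000 clocks. */
#define CPU_HZ_ASSUMED 150000000u

/* CPU clocks available between two transmit interrupts,
 * assuming one interrupt per byte sent. */
uint32_t clocks_per_byte(uint32_t cpu_hz, uint32_t bytes_per_sec)
{
    assert(bytes_per_sec != 0u);
    return cpu_hz / bytes_per_sec;
}

/* Share of the CPU used by the ISR, in hundredths of a percent
 * (100 means 1%), for an ISR costing isr_clocks per byte. */
uint32_t isr_load_bp(uint32_t isr_clocks, uint32_t cpu_hz,
                     uint32_t bytes_per_sec)
{
    assert(cpu_hz != 0u);
    /* 64-bit intermediate so the product cannot overflow. */
    return (uint32_t)(((uint64_t)isr_clocks * bytes_per_sec * 10000u)
                      / cpu_hz);
}
```

With 6 bytes at 250 Hz (1500 bytes/second), `clocks_per_byte(CPU_HZ_ASSUMED, 1500u)` gives the 100,000 clocks per byte quoted above, and an ISR would need to cost 1000 clocks before `isr_load_bp` reports 100, i.e. 1%.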
I don't know where your code is spending its time, but unless something is badly wrong, it isn't spending it in that code. Maybe somewhere in the code there is a loop waiting for the previous data to be sent, and that is being counted as "busy" instead of "idle"?
The first question should be "do you need to make it faster?". If there's still some idle time, why does it need to be more efficient?
Assuming you do need it to be more efficient, you need to find out what functions are really taking the time.
> I do have profiling built into the code where I can measure how much CPU is busy vs idle.
Since I'm Australian, I can quote Crocodile Dundee and say "That's not Profiling. THIS is Profiling!":
801121c2 <uart_isr1>:
{
0: 801121c2: 4e56 ffec linkw %fp,#-20
12102: 801121c6: 48d7 0307 moveml %d0-%d2/%a0-%a1,%sp@
INTERRUPT_ENTRY_TIMER;
468: 801121ca: 2039 2000 07d0 movel 200007d0 <t_stop_exit>,%d0
808: 801121d0: 660c bnes 801121de <uart_isr1+0x1c>
2: 801121d2: 2039 4000 044c movel 4000044c <IPSBAR+0x44c>,%d0
43: 801121d8: 23c0 2000 07d0 movel %d0,200007d0 <t_stop_exit>
uint8_t usr = *(p->pUsr);
79: 801121de: 2079 801c b510 moveal 801cb510 <uart.lto_priv.493+0x3c>,%a0
382: 801121e4: 1410 moveb %a0@,%d2
if ( usr & MCF_UART_USR_RXRDY &&
10877: 801121e6: 0802 0000 btst #0,%d2
1: 801121ea: 6720 beqs 8011220c <uart_isr1+0x4a>
p->imr & MCF_UART_UIMR_RXRDY_FU && // rxing
0: 801121ec: 1039 801c b518 moveb 801cb518 <uart.lto_priv.493+0x44>,%d0
if ( usr & MCF_UART_USR_RXRDY &&
0: 801121f2: 0800 0001 btst #1,%d0
0: 801121f6: 6714 beqs 8011220c <uart_isr1+0x4a>
p->rx_callback ) // callback valid
0: 801121f8: 2079 801c b50c moveal 801cb50c <uart.lto_priv.493+0x38>,%a0
p->imr & MCF_UART_UIMR_RXRDY_FU && // rxing
0: 801121fe: 4a88 tstl %a0
0: 80112200: 670a beqs 8011220c <uart_isr1+0x4a>
p->rx_callback(p->interface);
0: 80112202: 2f39 801c b4f8 movel 801cb4f8 <uart.lto_priv.493+0x24>,%sp@-
0: 80112208: 4e90 jsr %a0@
0: 8011220a: 588f addql #4,%sp
if ( usr & MCF_UART_USR_TXRDY && // tx buf empty
461: 8011220c: 44c2 movew %d2,%ccr
1: 8011220e: 663c bnes 8011224c <uart_isr1+0x8a>
p->imr & MCF_UART_UIMR_TXRDY ) // txing
551: 80112210: 1039 801c b518 moveb 801cb518 <uart.lto_priv.493+0x44>,%d0
if ( usr & MCF_UART_USR_TXRDY && // tx buf empty
194: 80112216: 0800 0000 btst #0,%d0
287: 8011221a: 6730 beqs 8011224c <uart_isr1+0x8a>
if (p->ppBuf)
2: 8011221c: 2079 801c b504 moveal 801cb504 <uart.lto_priv.493+0x30>,%a0
1074: 80112222: 4a88 tstl %a0
87: 80112224: 6766 beqs 8011228c <uart_isr1+0xca>
if (*p->pLen)
5: 80112226: 2279 801c b500 moveal 801cb500 <uart.lto_priv.493+0x2c>,%a1
132: 8011222c: 4a91 tstl %a1@
719: 8011222e: 6726 beqs 80112256 <uart_isr1+0x94>
*p->pUtb = **p->ppBuf;
9: 80112230: 2050 moveal %a0@,%a0
450: 80112232: 1010 moveb %a0@,%d0
2899: 80112234: 2079 801c b514 moveal 801cb514 <uart.lto_priv.493+0x40>,%a0
279: 8011223a: 1080 moveb %d0,%a0@
*p->ppBuf += 1;
437: 8011223c: 2079 801c b504 moveal 801cb504 <uart.lto_priv.493+0x30>,%a0
8290: 80112242: 5290 addql #1,%a0@
*p->pLen -= 1;;
129: 80112244: 2079 801c b500 moveal 801cb500 <uart.lto_priv.493+0x2c>,%a0
609: 8011224a: 5390 subql #1,%a0@
}
920: 8011224c: 4cee 0307 ffec moveml %fp@(-20),%d0-%d2/%a0-%a1
1705: 80112252: 4e5e unlk %fp
2182: 80112254: 4e73 rte
The first field is the count of how many times the previous instruction was sampled by the profiling interrupt. The "busiest" instruction is supposedly the "link" one at the top (12102), but that is counting all of the interrupt starting overhead as well, and so doesn't really count. The next most busy one (10877) is the one reading the UART status register. I TOLD you these reads were slow! In total (measured separately), the above, transmitting at over 20,000 bytes/second, takes 2% of the CPU's time. So your code, sending 1500 bytes/second, should be taking only 0.15% of the CPU per transmit channel.
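In case it isn't obvious how counts like those get collected, here is a rough sketch of a PC-sampling histogram. This is not the code that produced the listing above; the names (`prof_sample` and friends) are hypothetical, and it assumes a periodic profiling interrupt that can read the interrupted program counter from its exception frame. The window only covers the part of `uart_isr1` shown in the listing. If the saved PC is the address of the next instruction to run, a hit lands on the line after the one that was actually executing, which would explain why each count belongs to the previous instruction.

```c
#include <stdint.h>

/* Window taken from the listing: uart_isr1 starts at 0x801121c2 and
 * the rte at 0x80112254 ends at 0x80112256. */
#define PROF_BASE  0x801121c2u
#define PROF_SIZE  0x94u
#define PROF_SLOTS (PROF_SIZE / 2u)  /* opcodes are 16-bit aligned */

static uint32_t prof_hist[PROF_SLOTS]; /* one counter per 16-bit word */
static uint32_t prof_outside;          /* samples outside the window  */

/* Hypothetical hook: call from the profiling timer interrupt with
 * the PC saved in the exception frame. */
void prof_sample(uint32_t pc)
{
    uint32_t off = pc - PROF_BASE; /* wraps to a huge value if pc < base */

    if (off < PROF_SIZE)
        prof_hist[off / 2u]++;
    else
        prof_outside++;
}

/* Number of samples recorded at a given instruction address. */
uint32_t prof_count_at(uint32_t addr)
{
    uint32_t off = addr - PROF_BASE;

    return (off < PROF_SIZE) ? prof_hist[off / 2u] : 0u;
}

/* Number of samples that fell outside the profiled window. */
uint32_t prof_outside_count(void)
{
    return prof_outside;
}
```

Dumping `prof_hist` next to a disassembly of the same address range gives a listing of the same shape as the one above. The sampling interrupt has to run at a higher priority than the code being profiled, otherwise it never catches an ISR in the act.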
The other problem with DMA is you can only transmit on one channel at once. With interrupts you can have them all transmitting at the same time. That might or might not matter in your case.
Tom