I don't have any particular experience using SPI for an SD card, but I can offer two comments:
All SPI transactions are simultaneously a 'read' and a 'write'. The only difference is what you 'keep'. I think you will find that the DMA to 'empty' the RX-FIFO will 'naturally' come after the requests to fill the TX-FIFO since you can't have an RX-request until a TX-send has completed.
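To illustrate the first point, here is a toy model in plain C (no hardware involved) of one 8-bit frame, shifted MSB first. The names ex_spi_result and ex_spi_exchange are made up for this sketch and are not part of any Freescale header. It only shows that clocking a byte out necessarily clocks a byte in, which is why an RX request can only follow a completed TX send.

```c
#include <stdint.h>

/* Hypothetical model of one 8-bit SPI frame, MSB first. */
typedef struct {
    uint8_t master_rx;  /* what the master's shift register holds afterwards */
    uint8_t slave_rx;   /* what the slave's shift register holds afterwards  */
} ex_spi_result;

static ex_spi_result ex_spi_exchange(uint8_t master_tx, uint8_t slave_tx)
{
    uint8_t m = master_tx;
    uint8_t s = slave_tx;

    for (int bit = 0; bit < 8; bit++) {
        uint8_t mosi = (uint8_t)((m >> 7) & 1u);  /* master drives its MSB */
        uint8_t miso = (uint8_t)((s >> 7) & 1u);  /* slave drives its MSB  */
        m = (uint8_t)((m << 1) | miso);           /* master shifts in MISO */
        s = (uint8_t)((s << 1) | mosi);           /* slave shifts in MOSI  */
    }

    ex_spi_result r = { m, s };
    return r;
}
```

After eight clocks the two shift registers have simply swapped contents; a 'write-only' transfer is one where you throw away master_rx.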
Freescale DSPI 'DMA' is a 'little bit inconvenient'. The SPIx_PUSHR FIFO register requires 32-bit writes, the top half of which are SPI controls. Thus, if you want to DMA-out a 'block', you have to intersperse your data as the 'bottom byte or word' of these 32-bit words, meaning your data block is 'non-contiguous'. This puts an 'extra step' in your data-block handling to interleave data and controls that MAY preclude any advantage you were hoping to gain from DMA, especially since I assume you would run SPI to SD at 'full hardware rate' (SPI clock = 1/2 busclk). At that rate each bit takes 2 bus clocks, so each byte out takes 16 bus clocks, or probably only 32 CPU cycles (assuming the core runs at twice the bus clock).
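The 'extra step' could look something like the sketch below. It is not code I have run against an SD card. The EX_PUSHR_* macros and ex_pack_pushr are names invented for this example; the bit positions (PCS in bits 21..16, CTAS in bits 30..28, data in bits 15..0) are my understanding of the PUSHR layout and should be checked against the reference manual for your part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed PUSHR field positions, to be verified against the reference manual. */
#define EX_PUSHR_PCS(x)   (((uint32_t)(x) & 0x3Fu) << 16)
#define EX_PUSHR_CTAS(x)  (((uint32_t)(x) & 0x07u) << 28)

/* Expand n contiguous data bytes into n 32-bit PUSHR words.
 * Every word gets the same control pattern in its top half
 * and one data byte in its bottom byte. */
static void ex_pack_pushr(uint32_t *dst, const uint8_t *src,
                          size_t n, uint32_t ctrl)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = (ctrl & 0xFFFF0000u) | src[i];
    }
}
```

For a 512-byte SD sector this means a 2 KB staging buffer and a copy pass over every byte before the DMA can start, which is the overhead to weigh against the 16 bus clocks each byte spends on the wire.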
I presently use SPI DMA to continually refresh a monochrome bitmap display. It is write-only, so I ignore RX requests (and overruns therefrom). And since the internal memory buffer is in a fixed location (specifically addressed at the top of RAM so I can use bit-banding access for individual pixels!), it is 'very little trouble' to work with using only the least-byte of the 32-bit words, with the rest pre-set for the proper SPI controls for each write.
Memory structure:
typedef union{ //Byte/DWord duality, little-endian (u8.lo is the least-significant byte of u32)
    struct{
        uint8_t lo;
        uint8_t mlo;
        uint8_t mhi;
        uint8_t hi;
    } u8;
    struct {
        uint16_t lo;
        uint16_t hi;
    }u16;
    uint32_t u32;
}u32_8_t;
#define Y_PITCH 132
typedef struct {
    u32_8_t OLED_CMDS[32]; //Precede the actual display RAM with room for commands to prefix the data
                           // in a contiguous block-write operation
    u32_8_t Display_RAM[Y_PITCH*8]; //Chip RAM is 132*8, only 128*8 is displayed
} OLED_RAM_Obj;
//OLED.Display_RAM[ ].u8.lo are the bytes for screen data
with this pre-set:
uint16_t foo;
//Preset SPI-port-required upper bits of Command/Display RAM
for(foo=32;foo>0;foo--) //Commands assert two CS, one of which is D/!C
    OLED.OLED_CMDS[foo-1].u32 = SPI_PUSHR_PCS(3) | SPI_PUSHR_CTAS(0);
for(foo=Y_PITCH*8;foo>0;foo--) //Data asserts just CS0, leaving D/!C high
    OLED.Display_RAM[foo-1].u32 = SPI_PUSHR_PCS(1) | SPI_PUSHR_CTAS(0);
#pragma location=0x2000EF00
__no_init OLED_RAM_Obj OLED; //Pre-allocated fixed-space for Display RAM
//Necessary, in SRAM-U, to use bit-banding!
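For anyone unfamiliar with bit-banding, the address arithmetic can be sketched as below. The Cortex-M SRAM bit-band region starts at 0x20000000 and its alias region at 0x22000000; each bit of the region maps to one 32-bit word in the alias. The helper name ex_bitband_alias and the EX_ constants are made up for this example; on real hardware you would cast the result to a volatile uint32_t pointer and write 0 or 1 through it.

```c
#include <stdint.h>

#define EX_SRAM_BITBAND_BASE   0x20000000u  /* start of bit-band region */
#define EX_SRAM_BITBAND_ALIAS  0x22000000u  /* start of alias region    */

/* Return the alias-word address that maps to one bit (0..7)
 * of the byte at byte_addr inside the SRAM bit-band region. */
static uint32_t ex_bitband_alias(uint32_t byte_addr, unsigned bit)
{
    return EX_SRAM_BITBAND_ALIAS
         + (byte_addr - EX_SRAM_BITBAND_BASE) * 32u
         + bit * 4u;
}
```

With OLED at 0x2000EF00, the first display byte (OLED.Display_RAM[0].u8.lo) sits 32*4 bytes further on at 0x2000EF80, so bit 0 of that byte would be the alias word at 0x221DF000 by this formula. RAM below 0x20000000 has no such alias, hence the fixed placement.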
TX channel initialization:
void DMA_Init_Tx(void)
{
    // Use DMA to blast out the display contents!
    SIM_SCGC6 |= SIM_SCGC6_DMAMUX_MASK;
    SIM_SCGC7 |= SIM_SCGC7_DMA_MASK;
    DMA_ERQ = DMA_ERQ_ERQ2_MASK; //channel 2
    DMAMUX_CHCFG2 = DMAMUX_CHCFG_ENBL_MASK | DMAMUX_CHCFG_SOURCE(DMA_SPI0_XMIT_CHAN) ; //Source 17 for SPI transmit
    /* Source address: the last OLED_cmd_cnt command words, which run straight on into Display_RAM */
    DMA_TCD2_SADDR = (uint32_t)&OLED.OLED_CMDS[32-OLED_cmd_cnt].u32;
    /* Destination address */
    DMA_TCD2_DADDR = (uint32_t)&SPI0_PUSHR;
    /* Source offset: advance one Dword (4 bytes) per transfer */
    DMA_TCD2_SOFF = 0x04;
    /* Source and Destination Modulo off, source and destination size 2 = 32 bits */
    DMA_TCD2_ATTR = DMA_ATTR_SSIZE(2) | DMA_ATTR_DSIZE(2);
    /* Transfer 4 bytes (one aligned Dword) per DMA request */
    DMA_TCD2_NBYTES_MLNO = 0x04;
    /* After the whole block, step the source address back to the start */
    DMA_TCD2_SLAST = -(4*(Y_PITCH*8 + OLED_cmd_cnt));
    /* Destination offset disabled: always write to PUSHR */
    DMA_TCD2_DOFF = 0x00;
    /* No channel-to-channel link; one major-loop iteration per Dword in the block */
    DMA_TCD2_CITER_ELINKNO = DMA_CITER_ELINKNO_CITER(((Y_PITCH*8 + OLED_cmd_cnt)));
    /* No adjustment to destination address */
    DMA_TCD2_DLASTSGA = 0x00;
    /* BITER must match CITER */
    DMA_TCD2_BITER_ELINKNO = DMA_BITER_ELINKNO_BITER(((Y_PITCH*8 + OLED_cmd_cnt)));
}
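The arithmetic behind SADDR, CITER/BITER and SLAST above can be checked off-target with something like this sketch. The struct ex_tcd_plan, the function ex_plan_refresh and the EX_ constants are invented for the illustration; they just restate the expressions used in DMA_Init_Tx with a given command count.

```c
#include <stdint.h>

#define EX_Y_PITCH    132u  /* same value as Y_PITCH above     */
#define EX_CMD_SLOTS  32u   /* size of the OLED_CMDS[] array   */

typedef struct {
    uint32_t first_cmd_index;  /* OLED_CMDS[] index where the block starts    */
    uint32_t word_count;       /* Dwords per block, i.e. CITER and BITER      */
    int32_t  slast;            /* source adjustment once the block completes  */
} ex_tcd_plan;

/* Commands are packed at the END of OLED_CMDS[] so that they are
 * contiguous with Display_RAM[] and one DMA block covers both. */
static ex_tcd_plan ex_plan_refresh(uint32_t cmd_cnt)
{
    ex_tcd_plan p;
    p.first_cmd_index = EX_CMD_SLOTS - cmd_cnt;
    p.word_count      = EX_Y_PITCH * 8u + cmd_cnt;
    p.slast           = -(int32_t)(4u * p.word_count);
    return p;
}
```

For example, with 3 command words the block would start at OLED_CMDS[29], run for 1059 Dwords, and SLAST would be -4236 so the source pointer lands back on OLED_CMDS[29] for the next refresh.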
Then this is called regularly to start the DMA-block to refresh:
void OLED_Refresh(void) //See that RAM contents get sent to the OLED display device
{
    // SPI0_MCR |= SPI_MCR_CLR_RXF_MASK; //Make sure RX Fifo is empty for us!
    //^^leave it full of garbage, there is just more to come (which we don't even offload)
    SPI0_RSER = SPI_RSER_TFFF_RE_MASK | SPI_RSER_TFFF_DIRS_MASK; //Set SPI TX to make DMA requests
    DMA_ERQ = DMA_ERQ_ERQ2_MASK; //channel 2
    DMA_TCD2_CSR = DMA_CSR_DREQ_MASK | DMA_CSR_START_MASK; //DREQ: stop taking requests once the block is done, so each call sends one block. START: kick off the first transfer.
}