Well, it's not just about SPI. Let me summarize the overheads I know about along with some hints:
1) RAM target (execution from RAM) => use flash target
2) higher priority tasks and interrupts => review your needs/architecture
3) MFS stack (caches, FAT read/write, single-sector-based processing) => set MFSCFG_FAT_CACHE_SIZE to 2 (minimum)
4) SD card layer (command issuing, busy waiting, CRC calculation) => N/A
5) SPI driver (IOCTLs, CS assertions) => N/A
6) debug code overhead => use release targets
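For hint 3, a minimal sketch of the setting. I'm assuming here that the override goes into your board's user_config.h and that the MFS library is rebuilt afterwards; check where your MQX version actually picks this option up.

```c
/* user_config.h -- assumed location for MQX configuration overrides */

/* Hint 3: set the MFS FAT cache size to 2, the minimum mentioned above. */
#define MFSCFG_FAT_CACHE_SIZE   2
```

A config change like this typically takes effect only after the affected libraries and the application are rebuilt.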
There's no doubt that the SD card could be operated more efficiently (even on this processor) and that some improvements could be made, but on the other hand, we're talking about a fairly universal stack running in a multitasking environment.
PetrM