Are you writing in 512-byte blocks? If so, that is probably the killer. Each time you write a single 512-byte logical sector, the flash card may have to erase and rewrite an entire internal "block" (whose size varies depending on the card's capacity, manufacturer, etc.). Add the card's internal wear-leveling code on top of that and you can get very slow write times.
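If you can write bigger chunks, ideally several sectors per write call, it should be faster, because the card does fewer erase/rewrite cycles for the same amount of data. It might also help to pre-allocate the file at a large size up front so that no cluster allocation has to happen while you are writing.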
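
To make the "bigger chunks" idea concrete, here is a rough sketch in C of a small accumulator that collects 512-byte writes and only hands data to the card in larger pieces. Everything named here is made up for illustration: `chunk_writer`, `cw_init`, `cw_write`, `cw_flush`, the `sink_fn` callback (a stand-in for whatever call really writes to your card or file system), and the 4096-byte `CHUNK_SIZE`, which is a guess you would tune for your card. Treat it as a starting point under those assumptions, not a drop-in driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed chunk size; tune it toward your card's internal block size. */
#define CHUNK_SIZE 4096u

/* Hypothetical sink: stands in for the real write call to the card.
   It must return the number of bytes it actually wrote. */
typedef size_t (*sink_fn)(const void *data, size_t len, void *ctx);

typedef struct {
    unsigned char buf[CHUNK_SIZE];
    size_t used;   /* bytes currently waiting in buf */
    sink_fn sink;
    void *ctx;
} chunk_writer;

static void cw_init(chunk_writer *w, sink_fn sink, void *ctx)
{
    assert(w != NULL && sink != NULL);
    w->used = 0;
    w->sink = sink;
    w->ctx = ctx;
}

/* Push whatever is buffered to the sink. Returns 0 on success, -1 on a short write. */
static int cw_flush(chunk_writer *w)
{
    if (w->used == 0)
        return 0;
    if (w->sink(w->buf, w->used, w->ctx) != w->used)
        return -1;
    w->used = 0;
    return 0;
}

/* Buffer small writes; the sink is only called once a full chunk is ready. */
static int cw_write(chunk_writer *w, const void *data, size_t len)
{
    const unsigned char *p = data;
    while (len > 0) {
        size_t room = CHUNK_SIZE - w->used;
        size_t n = len < room ? len : room;
        memcpy(w->buf + w->used, p, n);
        w->used += n;
        p += n;
        len -= n;
        if (w->used == CHUNK_SIZE && cw_flush(w) != 0)
            return -1;
    }
    return 0;
}

/* Demo sink that only counts calls and bytes; replace it with a real card write. */
typedef struct {
    size_t calls;
    size_t bytes;
} sink_stats;

static size_t counting_sink(const void *data, size_t len, void *ctx)
{
    sink_stats *s = ctx;
    (void)data;
    s->calls++;
    s->bytes += len;
    return len;
}
```

With these assumed sizes, sixteen 512-byte calls to `cw_write` reach the sink as two 4096-byte writes instead of sixteen small ones. Remember to call `cw_flush` before closing the file, otherwise a partial chunk stays in RAM and never reaches the card.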