Well, I am not an assembly language expert, but yes, as you can see from the code, you need to keep sending 0xFF until you get a status byte back. Unless you are clocking the card, nothing happens. Also pay close attention to the select line: you must not de-select during a command, and after you de-select, the card needs a few more clocks.
The select line acts like a command reset, so you must de-select between commands.
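To make that flow concrete, here is a rough sketch in C of one command cycle: select, send the six command bytes, keep clocking 0xFF until a status byte comes back, de-select, then give the card one more byte of clocks. The names cs_select, cs_deselect, spi_xfer and sd_command are made up for illustration, and the "card" here is simulated so the sketch runs on a PC; on real hardware the first three would drive your select pin and SPI port. Commands that are followed by a data phase would keep the card selected until the data is done, which this sketch does not cover.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_POLLS 16          /* give up after this many 0xFF polls (arbitrary choice) */

/* ---- Simulated hardware (hypothetical stand-ins for real pin/SPI code) ---- */
int      card_selected   = 0; /* 1 while the select line is asserted           */
int      reply_countdown = 0; /* bytes the fake card stays silent after select */
unsigned bytes_clocked   = 0; /* total bytes exchanged, 8 clocks each          */

void cs_select(void)
{
    card_selected = 1;
    reply_countdown = 6 + 2;  /* fake card: 6 command bytes, then 2 busy polls */
}

void cs_deselect(void) { card_selected = 0; }

/* Exchange one byte. The fake card answers 0xFF until it is ready, then 0x01. */
uint8_t spi_xfer(uint8_t out)
{
    (void)out;
    bytes_clocked++;
    if (!card_selected)
        return 0xFF;
    if (reply_countdown > 0) {
        reply_countdown--;
        return 0xFF;
    }
    return 0x01;              /* R1 "in idle state" */
}

/* ---- One complete command cycle ---- */
uint8_t sd_command(uint8_t cmd, uint32_t arg, uint8_t crc)
{
    uint8_t r1 = 0xFF;
    int i;

    assert(cmd < 64);         /* the command index is only 6 bits wide */

    cs_select();              /* stay selected for the whole command */

    spi_xfer((uint8_t)(0x40 | cmd));
    spi_xfer((uint8_t)(arg >> 24));
    spi_xfer((uint8_t)(arg >> 16));
    spi_xfer((uint8_t)(arg >> 8));
    spi_xfer((uint8_t)arg);
    spi_xfer(crc);

    /* Nothing happens unless we clock the card: send 0xFF until the
       reply arrives. A status (R1) byte has its top bit clear. */
    for (i = 0; i < MAX_POLLS; i++) {
        r1 = spi_xfer(0xFF);
        if ((r1 & 0x80) == 0)
            break;
    }

    cs_deselect();            /* de-select between commands */
    spi_xfer(0xFF);           /* extra clocks after de-select */

    return r1;                /* still 0xFF if the card never answered */
}
```

With the simulated card, sd_command(0, 0, 0x95) (the reset command with its usual CRC byte) comes back with 0x01 after three polls.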
Even if you do not use "C", that code will show you the overall flow of talking to the card.
I also recommend you carefully study Chan's FAT code. There are many man-hours of effort in it, and while you are perhaps very smart, so is Chan, and it has taken him quite some time to get this code to where it is.
You will also find that SD card access is very slow in the context you are in, because you do not have big buffers. Each time you write a sector, the card has to erase and re-write an entire block, and this takes time.
The block size varies from card to card, and smaller cards tend to have smaller blocks, so you may actually get better performance with smaller cards.
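As a back-of-the-envelope illustration of why buffer size matters, the sketch below counts block re-writes under a deliberately simplified model: every separate write makes the card re-write one whole block, while a buffered write touches each block only once. The function names and the 32-sectors-per-block figure used in the example are assumptions picked for illustration, not values from any particular card, and real card controllers are more complicated than this.

```c
#include <stdint.h>

/* Block re-writes if every 512-byte sector is sent as its own write:
   under the simplified model, one re-write per sector. */
uint32_t rewrites_unbuffered(uint32_t sector_count)
{
    return sector_count;
}

/* Block re-writes if a contiguous run of sectors is buffered and each
   block it touches is written once. */
uint32_t rewrites_buffered(uint32_t first_sector, uint32_t sector_count,
                           uint32_t sectors_per_block)
{
    uint32_t first_block, last_block;

    if (sector_count == 0 || sectors_per_block == 0)
        return 0;

    first_block = first_sector / sectors_per_block;
    last_block  = (first_sector + sector_count - 1) / sectors_per_block;
    return last_block - first_block + 1;
}
```

For example, 64 sectors written one at a time cost 64 re-writes in this model, while rewrites_buffered(0, 64, 32) is 2. The same run starting at sector 16 straddles three blocks, so rewrites_buffered(16, 64, 32) is 3.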