Hello,
This isn't really an answer to your problem, but it may help. I don't know how all the abstraction layers of the NXP library are built, but going through them will probably always be slower than accessing the 663 directly. On the PC side there are a lot of drivers, then there is the communication with the ARM processor, and only after all that does the ARM processor access the 663 over the SPI bus (in the case of the dev board). That is too many steps.
I'm working with 8-bit MCUs, and I can tell you that a Discover (REQA + SELECT) doesn't take more than 30 ms.
If you are looking for speed, you should probably try this approach (driving the 663 directly from a small MCU).
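To give an idea of how little data a Discover involves, here is a rough sketch of the ISO/IEC 14443-3 Type A frames behind REQA + SELECT for a 4-byte UID (cascade level 1). The function and macro names are my own, not from the NXP library, and the part that actually pushes the bytes through the 663 is left out because it depends on your driver. Between REQA and SELECT you would also send the anticollision command (0x93 0x20) to read the UID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ISO/IEC 14443-3 Type A command bytes (macro names are illustrative). */
#define ISO14443A_REQA        0x26u /* short frame: only 7 bits are sent */
#define ISO14443A_SEL_CL1     0x93u /* cascade level 1 */
#define ISO14443A_NVB_ANTICOL 0x20u /* anticollision: no UID bits sent yet */
#define ISO14443A_NVB_SELECT  0x70u /* select: full 40 bits (UID + BCC) follow */

/* CRC_A: initial value 0x6363, transmitted low byte first. */
uint16_t iso14443a_crc(const uint8_t *data, size_t len)
{
    uint16_t crc = 0x6363u;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = (uint8_t)(data[i] ^ (crc & 0xFFu));
        b = (uint8_t)(b ^ (uint8_t)(b << 4));
        crc = (uint16_t)((crc >> 8) ^ ((uint16_t)b << 8)
                         ^ ((uint16_t)b << 3) ^ ((uint16_t)b >> 4));
    }
    return crc;
}

/* Build the 9-byte SELECT frame for a 4-byte UID.
 * Layout: SEL, NVB, UID0..UID3, BCC, CRC_A low, CRC_A high.
 * Returns the number of bytes written to out. */
size_t build_select_cl1(const uint8_t uid[4], uint8_t out[9])
{
    assert(uid != NULL && out != NULL);

    out[0] = ISO14443A_SEL_CL1;
    out[1] = ISO14443A_NVB_SELECT;
    out[2] = uid[0];
    out[3] = uid[1];
    out[4] = uid[2];
    out[5] = uid[3];
    out[6] = (uint8_t)(uid[0] ^ uid[1] ^ uid[2] ^ uid[3]); /* BCC */

    uint16_t crc = iso14443a_crc(out, 7);
    out[7] = (uint8_t)(crc & 0xFFu);
    out[8] = (uint8_t)(crc >> 8);
    return 9;
}
```

So the whole exchange is a 7-bit REQA, a 2-byte anticollision command and a 9-byte SELECT, plus the card's answers. On an MCU talking straight to the 663 there is very little between your code and the air interface.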
Regards,
F. Coelho