A couple of notes/comments:
1. Are you using the fixed-point or the floating-point functions in CMSIS? The code setup in MonkeyJam is 100% fixed point, so you don't have to select any floating-point settings. Your instructions assume that one will use the floating-point functions and that no hardware floating point is available.
2. Some of the Cortex-M4s have hardware floating point. Selecting software floating point in the setup makes CMSIS slow. The functions will work, but there is no acceleration. If you have an M4 without an FPU, I would highly recommend using fixed point. The CMSIS code will then actually use the high-precision MAC instructions.
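To make the fixed-point recommendation a bit more concrete, here is a rough sketch in plain C of what a Q15 dot product boils down to. This is not the CMSIS source and not the MonkeyJam code. The names `float_to_q15_sketch` and `dot_prod_q15_sketch` are made up for illustration, and the sketch only assumes the usual Q15 convention (16-bit signed samples with 15 fractional bits). The point is that the inner loop is nothing but integer multiply-accumulates into a wide accumulator, which is the kind of work the M4's MAC instructions handle without any FPU.

```c
#include <stdint.h>

/* Hypothetical helper: convert a float in [-1, 1) to Q15 with saturation.
 * Q15 stores value * 2^15 in a signed 16-bit integer. */
int16_t float_to_q15_sketch(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) {
        return 32767;           /* saturate at the largest Q15 value */
    }
    if (scaled <= -32768.0f) {
        return -32768;          /* saturate at the smallest Q15 value */
    }
    /* round to nearest before truncating to an integer */
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Hypothetical sketch of a Q15 dot product.
 * Each Q15 x Q15 product has 30 fractional bits; summing into a 64-bit
 * accumulator leaves plenty of headroom, so no precision is thrown away
 * inside the loop. */
int64_t dot_prod_q15_sketch(const int16_t *a, const int16_t *b, uint32_t n)
{
    int64_t acc = 0;
    for (uint32_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];   /* integer multiply-accumulate */
    }
    return acc;   /* result has 30 fractional bits */
}
```

To read the result back as a real number, divide the accumulator by 2^30. For example, 0.5 times 0.5 in Q15 is 16384 * 16384 = 268435456, which is 0.25 with 30 fractional bits.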