Semihosting is going to be slow. It works by stopping the program, transferring the data to the debugger (which then performs the operation on the host) and, once the I/O has completed, restarting the application. It is not designed for real-time applications! You may want to add some buffering to your application so that you transfer several results at once - the (relatively) slow part is the transfer to the debugger, so fewer, larger transfers should help. I guess you will need to experiment to see whether it meets your needs.
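To illustrate the buffering idea, here is a minimal sketch in C. It assumes your toolchain routes stdout through semihosting; all the names (results_add, results_flush, stdout_sink and so on) and the 256-byte buffer size are made up for the example, so adapt them to your project. Results are collected in RAM and handed to the host in one write, rather than one write per result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define RESULT_BUF_SIZE 256u   /* assumed size; tune for your RAM budget */

/* Hypothetical sink type: whatever finally hands the bytes to the host. */
typedef void (*flush_fn)(const char *data, size_t len);

static char     result_buf[RESULT_BUF_SIZE];
static size_t   result_len   = 0;
static flush_fn result_sink  = NULL;
static unsigned flush_count  = 0;   /* number of transfers actually made */

/* Example sink: a single stdout write. This assumes stdout is retargeted
   to semihosting, in which case this is where the target gets halted. */
void stdout_sink(const char *data, size_t len)
{
    fwrite(data, 1, len, stdout);
    fflush(stdout);
}

void results_init(flush_fn sink)
{
    assert(sink != NULL);
    result_sink = sink;
    result_len  = 0;
    flush_count = 0;
}

/* Send everything collected so far in one transfer. */
void results_flush(void)
{
    if (result_len > 0) {
        result_sink(result_buf, result_len);
        result_len = 0;
        flush_count++;
    }
}

/* Format one result into the buffer; only flush when it would overflow. */
void results_add(unsigned long value)
{
    char line[24];
    int n = snprintf(line, sizeof line, "%lu\n", value);
    if (n <= 0 || (size_t)n >= sizeof line) {
        return;                      /* formatting failed; drop the value */
    }
    if (result_len + (size_t)n > RESULT_BUF_SIZE) {
        results_flush();
    }
    memcpy(&result_buf[result_len], line, (size_t)n);
    result_len += (size_t)n;
}

unsigned results_flush_count(void)
{
    return flush_count;
}
```

Call results_init(stdout_sink) once, results_add() for each measurement, and results_flush() at the end of the run (or at any point where a pause does not matter). Whether this is fast enough is still something you would have to measure on your own setup.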
And, of course, semihosting will only work when a debugger is attached - your application will halt, and not restart, if no debugger is attached. However, semihosting will still be quicker than any of the other methods you suggested!
If you need to transfer large amounts of data, or need it to be quick, it may be best to use a serial port.
Regarding CycleDelta - make sure you are measuring what you think you are measuring! It counts every cycle between successive breakpoints, so it will include any interrupts, any library calls, etc. Take a look at the assembler output and make sure you really are executing only a MUL instruction - there may be a library call involved.
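As a rough sketch of how you might isolate the multiply before checking the assembler output (the names here are hypothetical, and what the compiler actually emits depends on your core, compiler and optimisation settings, so treat the comments as things to verify rather than guarantees):

```c
#include <stdint.h>

/* volatile operands stop the compiler from working out the product at
   build time, so a real multiply has to be executed at run time. */
volatile uint32_t mul_a = 6u;
volatile uint32_t mul_b = 7u;
volatile uint32_t mul_result;

/* Hypothetical test function: put the breakpoints around the multiply
   line, then check in the disassembly what lies between them. */
void measure_mul32(void)
{
    uint32_t a = mul_a;      /* load operands first ...              */
    uint32_t b = mul_b;
    mul_result = a * b;      /* ... so this line is the part to time */
}

/* For comparison: a 64-bit product. On a 32-bit core this may turn into
   several instructions or a call to a runtime library helper. */
uint64_t mul64(uint64_t a, uint64_t b)
{
    return a * b;
}
```

If the disassembly between your breakpoints shows loads, stores or a branch to a helper routine as well as the MUL, those cycles are included in the CycleDelta figure too.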