Right, there is no obvious reason for flash to be slower or RAM to be faster, and manufacturers try to match the speeds of flash, RAM and CPU. However, because RAM arrays are smaller than flash arrays, some performance tricks are possible:
- on an S12 MCU, a 16-bit memory access to a word-aligned address in flash takes 1 bus cycle, while a misaligned word access takes 2 bus cycles. S12 RAM allows both aligned and misaligned write accesses in 1 bus cycle. So RAM is in fact a bit faster. This doesn't make it worthwhile to move code to RAM, because the S12 instruction queue is there to make sure code is fetched by reading aligned words only. And if you have some data in flash, you can align it to make reads faster.
- on an S12X MCU operating at a 40MHz bus clock, the XGATE core (interrupt coprocessor) is able to execute up to 2 instructions per bus cycle when executing from RAM, while it can do only up to 1 instruction per bus cycle when executing from flash. So XGATE code runs about 2 times faster from RAM.
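To make the S12 alignment rule concrete, here is a minimal sketch in C that just encodes the cycle counts quoted above for a single 16-bit access. The names `mem_kind_t` and `word_access_cycles` are made up for illustration and do not come from any vendor header; it is a model of the numbers in this post, not a cycle-accurate simulator.

```c
#include <stdint.h>

/* Hypothetical memory kinds for this sketch (not from any vendor header). */
typedef enum { MEM_FLASH, MEM_RAM } mem_kind_t;

/*
 * Bus cycles for one 16-bit access, using only the figures quoted above:
 *   flash, word-aligned address -> 1 cycle
 *   flash, misaligned address   -> 2 cycles
 *   RAM, aligned or misaligned  -> 1 cycle
 */
static unsigned word_access_cycles(mem_kind_t mem, uint16_t addr)
{
    int misaligned = (addr & 1u) != 0;  /* odd address = not word aligned */

    if (mem == MEM_FLASH && misaligned) {
        return 2u;
    }
    return 1u;
}
```

For example, a table of words placed at an even flash address costs 1 cycle per entry under this model, while the same table starting at an odd address costs 2 cycles per entry.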
Regarding "different number of execution cycles for the same instruction and addressing mode": no, the number of cycles is the same, but some cycles are stretched with waitstates.
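One way to picture this is the simplified model below, where the nominal cycle count never changes and only the elapsed time grows. It assumes that each stretched cycle is lengthened by a whole number of extra clock periods; the function name and parameters are hypothetical, and real waitstate behaviour depends on the particular part and its configuration.

```c
/*
 * Elapsed clock periods for one instruction under a simplified model:
 *   nominal_cycles   - cycle count of the instruction (does not change)
 *   stretched_cycles - how many of those cycles hit a slow memory
 *   waitstates       - extra clock periods added to each stretched cycle
 * All three parameters are assumptions of this sketch.
 */
static unsigned elapsed_clock_periods(unsigned nominal_cycles,
                                      unsigned stretched_cycles,
                                      unsigned waitstates)
{
    return nominal_cycles + stretched_cycles * waitstates;
}
```

So a 3-cycle instruction with no stretched cycles lasts 3 clock periods, and the same 3-cycle instruction with one cycle stretched by one waitstate lasts 4 clock periods, while still counting as 3 cycles.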