Easy one. You're assuming the GPIO pins can toggle quickly. They can't.
There are "bridges" between the CPU (running on a fast clock) and the peripherals, running on way slower clocks. In the worst case I know of (PXA270) it takes 200 CPU clocks to toggle a GPIO pin.
It isn't so bad on Coldfire, but on some of them it can take 15 clocks.
Some of the chips have "RGPIO" or "Rapid GPIO", which tells you the normal GPIO isn't rapid. I don't think your chip has this.
So read these for details:
https://community.freescale.com/message/328081#328081
The above also links to:
Re: Help with MCF5475 speed problem.
On V2 processors all accesses to GPIO registers take, I think, 12 wait states. On V4 it may be even worse.
overcoming the 12 cycle GPIO waitstate for TFT LCD
Re: MCF5307, execution speed question
Re: excution time
The above should help you to get your test code running sensibly.
You should be writing longer loops - do something in a loop 100 times and measure that.
Everything should run in one clock per bus cycle, maybe less sometimes. The only common outliers are the DIV and REMS instructions, which take 20-35 clocks.
Read the Flash chapter. Flash runs at one cycle per clock. Sort of. One or two. After a 2 clock latency. The SRAM should be single clock.
You should ignore "waving GPIO pins" for testing timing. The best approach is to program a DMA Timer to free-run at 1 MHz, and then have your code read the free-running (microsecond) counter at the start of a test, read it again at the end, subtract and print. Note that it takes about as long to read a register in the DMA Timer as it does to write to a GPIO port, so you need to calibrate how long the timer reads take (in a loop that does only that) and then subtract that from your other tests.
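Here's a rough sketch of that measure-and-calibrate idea in C. It is not taken from any Freescale code. The us_counter_fn hook, sim_now() and sim_work() are names I made up: on the real chip the hook would read the DMA Timer's counter register, and here it is replaced by a simulated counter so the arithmetic can be checked on a PC first.

```c
#include <stdint.h>

/* Hypothetical hook returning a free-running 1 MHz, 32-bit count. On the
 * target this would read the DMA Timer counter register; the name and
 * signature are assumptions made for this sketch. */
typedef uint32_t (*us_counter_fn)(void);

/* Host-side stand-ins (assumptions, not hardware): every simulated timer
 * read costs 1 us, and every call of the code under test costs 3 us. */
static uint32_t sim_count;
static uint32_t sim_now(void) { sim_count += 1u; return sim_count; }
static void sim_work(void) { sim_count += 3u; }

/* Unsigned subtraction stays correct across one 32-bit wraparound. */
static uint32_t elapsed_us(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Calibration: cost of one timer read in ns, averaged over a loop that
 * does nothing but read the timer. */
static uint32_t timer_read_ns(us_counter_fn now, uint32_t reads)
{
    uint32_t start = now();
    for (uint32_t i = 0; i < reads; i++) {
        (void)now();
    }
    uint32_t total = elapsed_us(start, now());
    /* reads + 1 reads completed after `start` was sampled. */
    return (uint32_t)(((uint64_t)total * 1000u) / ((uint64_t)reads + 1u));
}

/* Uncorrected total in us for `iterations` calls of `work`. */
static uint32_t time_work_us(us_counter_fn now, void (*work)(void),
                             uint32_t iterations)
{
    uint32_t start = now();
    for (uint32_t i = 0; i < iterations; i++) {
        work();
    }
    return elapsed_us(start, now());
}

/* Per-iteration cost in ns after subtracting the one timer read that
 * closes the measurement. Returns 0 if the inputs make no sense. */
static uint32_t net_ns_per_iteration(uint32_t total_us, uint32_t read_ns,
                                     uint32_t iterations)
{
    uint64_t total_ns = (uint64_t)total_us * 1000u;
    if (iterations == 0u || total_ns < read_ns) {
        return 0u;
    }
    return (uint32_t)((total_ns - read_ns) / iterations);
}
```

With a 1 MHz counter a single short test rounds to 0 or 1 microseconds, which is why both the calibration and the test run in loops and the result is divided down afterwards. The loop overhead itself is not subtracted here; time an empty work function the same way if you want to remove that too.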
You test that code by waiting until 1,000,000 microsecond counts have happened and then print (or toggle a GPIO), and repeat. It should print once per second.
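A minimal sketch of that once-per-second check, again in C. The second_ticker type and its two functions are hypothetical names, and the counter value passed in is assumed to come from the free-running 1 MHz timer described above.

```c
#include <stdbool.h>
#include <stdint.h>

#define US_PER_SECOND 1000000u

/* Remembers the counter value at which the last one-second tick fired. */
typedef struct {
    uint32_t last_tick;
} second_ticker;

static void ticker_init(second_ticker *t, uint32_t now)
{
    t->last_tick = now;
}

/* Returns true once for every 1,000,000 counts that have elapsed.
 * Stepping last_tick by exactly one second (instead of resyncing to
 * `now`) stops polling delays from accumulating as drift, and the
 * unsigned subtraction keeps working when the 32-bit counter wraps. */
static bool ticker_poll(second_ticker *t, uint32_t now)
{
    if ((uint32_t)(now - t->last_tick) >= US_PER_SECOND) {
        t->last_tick += US_PER_SECOND;
        return true;
    }
    return false;
}
```

In the main loop you would call ticker_poll() with a fresh counter read and print (or toggle a GPIO) whenever it returns true. If the output doesn't appear once per second against a watch, the timer prescaler or clock setup is wrong and none of the other measurements can be trusted.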
Assuming all that is OK, are you running from RAM or Flash? At least you don't have a cache to worry about. Getting that wrong on the faster chips can really slow the CPU down.
I'm guessing you're running in RAM. Read "Table 13-3. RAMBAR Field Description" and "Table 11-2. RAMBAR Field Descriptions". Try changing the SRAM priority so the CPU is higher.
Some CPUs need you to set a bit somewhere to let the CPU get to the SRAM via the "back door", or it runs really slowly. Yours doesn't seem to need that (luckily), but it is mentioned in this section, which you should read: "13.6 Internal Bus Arbitration".