> in fact I have my own memmove() who checks if the blocks are in multiples
> of 2 or 4 bytes and the addresses are paired to get profit of the word or
> long move, without MOVEM L/W , which is not interrumpible.
The first thing the code should do is check whether the copy is long enough (16 bytes or more) to justify running extra code to work out how best to do it. The usual approach is then byte copies until the source or destination (it doesn't matter which in your case) is 4-byte aligned, followed by as many 4/8/16-byte unrolled or MOVEML copies as it can manage, finishing the tail with word and then byte copies.
You don't worry about misalignment with this CPU. It can perform misaligned transfers faster than anything you can do. If you're performing 8 or 16-bit copies because of 8 or 16-bit misalignment, then you're going really slowly for no reason.
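As a rough sketch of the copy strategy described above (length check first, optional alignment bytes, unrolled long moves, then the tail), something like the following could be a starting point. The names copy_sketch, load32 and store32 are made up for illustration, and the fixed-size memcpy() calls are only there to keep the example portable C; on this CPU a compiler would be expected to turn each one into a single long move. The alignment loop is the "usual" step and can be dropped on a core that handles misaligned longs in hardware.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: portable 32-bit load/store that tolerate any alignment. */
static inline uint32_t load32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static inline void store32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

/* Non-overlapping copy; illustrative only, not a drop-in memmove(). */
void *copy_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n >= 16) {              /* short copies skip straight to the byte loop */
        /* Optional step: byte copies until the destination is 4-byte aligned. */
        while (((uintptr_t)d & 3u) != 0) {
            *d++ = *s++;
            n--;
        }
        /* Unrolled body: four long moves (16 bytes) per iteration. */
        while (n >= 16) {
            store32(d,      load32(s));
            store32(d + 4,  load32(s + 4));
            store32(d + 8,  load32(s + 8));
            store32(d + 12, load32(s + 12));
            d += 16;
            s += 16;
            n -= 16;
        }
        /* Remaining whole longs. */
        while (n >= 4) {
            store32(d, load32(s));
            d += 4;
            s += 4;
            n -= 4;
        }
    }
    /* Tail (or the whole copy, if it was short): byte at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Whether the alignment loop pays for itself is exactly the sort of thing to measure rather than assume.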
What's the problem with not being interruptible? A four-register MOVEML takes one clock for the instruction fetch and four for the reads or writes. Lots of normal instructions take the same. MOVEC takes NINE clocks. DIVSW takes 20 and DIVSL takes 35. Your function preambles are probably using MOVEML to stash the registers on the stack they're going to use in the function. At least MUL only takes 4 clocks on this CPU. Can you put the EMAC to any good use?
> sometimes with surprises in execution time depending upon the address
That may be the compiler doing something unexpected that makes it take longer. Are you using CW or are you using GCC?
You might be able to get your extra speed by using gcc (with it unrolling everything).
> but I suspect than the execution can speed-up in flash and
> much more in internal ram,
80MHz is hard. The higher clock rates you see in more modern parts are hard won, and often due to pipelining. Yes, modern DDR3 runs at 266MHz, but it takes 6 or more clocks before a read returns data. This SRAM is returning data on the SAME clock. Go look for CFMCLKSEL. Note the Flash is already set at the factory for 1-1-1-1 or 2-1-1-1? Someone's been working hard to make it as fast as possible. Note how the FLASH is double-bank-interleaved? That means its access time is really 40MHz and it reads 64 bits at once to be able to read at 80MHz. Also the FLASHBAR[AFS] bit where it prefetches? That means it is trying hard to keep up. That means two things - linear code is faster than code with a lot of branches (where the prefetch fails) and "they're already pushing it at 80MHz". I wouldn't try to go any faster EXCEPT as a "hobby" and not for "production". Remember it is going to start by randomly corrupting data rather than hard failing when it can't keep up at some temperature or some data access pattern.
> For checking the speed of a function, I use a timer with ns
> resolution, I imagine is that what you are saying.
What I'm saying is that I have an interrupt service routine sample the program counter, and then print out the results, like this:
log_filt_init_normal 1034 0.54%
memcmp 1074 0.56%
mip_udpconn_init 2141 1.12%
can_comm 2255 1.18%
mip_pbuf_decref 3274 1.72%
pit0_isr 3336 1.75%
do_pin_cal 3444 1.81%
get_time_us 3620 1.90%
vimcom_cycle 7315 3.85%
log_usb_data_send 8458 4.45%
log_to_flash 9311 4.90%
main_loop 10279 5.41%
memcpy 10919 5.74%
fec_rx_isr 21048 11.08%
main_proc_loop.lto_priv.383 93529 49.23%
The function names are on the left. The next column is the number of samples. The percentage is the execution time of that function as a percentage of the whole. The Ethernet interrupt is the function taking the most time (the 49% item is the idle loop).
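For anyone wanting to try this, here is a minimal sketch of how the sampling and the per-function roll-up could be done. It is not my actual code: PROF_BASE, PROF_SLOTS, prof_sample(), prof_count_for(), prof_report() and struct prof_sym are all names invented for the example, and the symbol ranges would have to come from your own map file. How the ISR digs the interrupted PC out of the exception stack frame is target-specific and is not shown; the sketch just takes the PC as a parameter.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PROF_BASE   0x80130000u  /* assumed start of the profiled code region */
#define PROF_SLOTS  4096u        /* one counter per 16-bit instruction word */

static uint32_t prof_hist[PROF_SLOTS];  /* samples per instruction address */
static uint32_t prof_total;             /* every sample, in range or not */

/* Call from a periodic timer ISR with the interrupted program counter. */
void prof_sample(uint32_t pc)
{
    uint32_t slot = (pc - PROF_BASE) / 2u;

    if (pc >= PROF_BASE && slot < PROF_SLOTS) {
        prof_hist[slot]++;
    }
    prof_total++;
}

/* Hypothetical symbol record: [start, end) address range of one function. */
struct prof_sym {
    const char *name;
    uint32_t start;
    uint32_t end;
};

/* Sum the per-address counters that fall inside one function. */
uint32_t prof_count_for(const struct prof_sym *sym)
{
    uint32_t sum = 0;

    for (uint32_t pc = sym->start; pc < sym->end; pc += 2u) {
        uint32_t slot = (pc - PROF_BASE) / 2u;
        if (pc >= PROF_BASE && slot < PROF_SLOTS) {
            sum += prof_hist[slot];
        }
    }
    return sum;
}

/* Print name, sample count and share of all samples for each function. */
void prof_report(const struct prof_sym *syms, size_t nsyms)
{
    for (size_t i = 0; i < nsyms; i++) {
        uint32_t c = prof_count_for(&syms[i]);
        double pct = prof_total ? 100.0 * (double)c / (double)prof_total : 0.0;
        printf("%-28s %8lu %6.2f%%\n", syms[i].name, (unsigned long)c, pct);
    }
}
```

The counters are indexed by instruction word because ColdFire instructions start on 16-bit boundaries, so the same table gives both the per-function totals and the per-address detail.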
The raw data (that is captured and is used to generate reports like the above) looks like this:
0x80130734 = 70
0x80130736 = 748
0x8013073a = 3
0x8013073e = 166
0x80130742 = 83
0x80130746 = 538
0x80130748 = 3
0x8013074c = 460
0x8013074e = 47
0x80130752 = 33730
0x80130756 = 114427
0x80130758 = 9881
0x8013075e = 31205
0x80130760 = 21261
0x80130762 = 9375
0x80130764 = 39
0x80130768 = 26
0x8013076e = 388
0x80130770 = 18
0x80130774 = 27
0x8013077a = 273
0x8013077e = 110
0x8013078c = 28
That says the instruction at "0x80130756" is taking the most time (it was sampled 114427 times), so then I look at a disassembly to find out why that one is taking so long. That lets me work out exactly where within a function the time is going. I can then make changes and measure the difference each change made. The above often finds that the function preamble and postamble (pushing registers to the stack and getting them back) take longer to execute than anything else in some functions. That's why it is worth in-lining simple functions if the compiler isn't smart enough to do that automatically. Here's an example of that, where the samples per instruction are in the left column (the 287 samples are the time taken by the "moveml" on the previous line):
801110fa <rijndael_ecb_encrypt>:
59: 801110fa: 720a moveq #10,%d1
42: 801110fc: 4e56 ff94 linkw %fp,#-108
56: 80111100: 226e 0010 moveal %fp@(16),%a1
56: 80111104: 48d7 3cfc moveml %d2-%d7/%a2-%a5,%sp@
287: 80111108: 206e 0008 moveal %fp@(8),%a0
60: 8011110c: 2029 01e0 movel %a1@(480),%d0
59: 80111110: b280 cmpl %d0,%d1
9: 80111112: 6700 063a beqw 8011174e <rijndael_ecb_encrypt+0x654>
Tom