The following loop:
executes 799 iterations per (arbitrary) unit of time when it lies entirely within one L1 cache line (32-byte address boundaries), but only 531 iterations in the same unit of time when an L1 cache line boundary falls inside the loop code.
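To make the two cases concrete, here is a minimal sketch of the check I have in mind. The 32-byte line size is the one quoted above; the helper names `cache_line_index` and `crosses_cache_line` and the example addresses are hypothetical, chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* L1 cache line size in bytes, as stated in the post. */
#define CACHE_LINE_BYTES 32u

/* Index of the cache line that contains the given address. */
uint32_t cache_line_index(uint32_t addr)
{
    return addr / CACHE_LINE_BYTES;
}

/* True when the byte range [start, start + size) spans more than one
 * cache line, i.e. a line boundary lies inside the code. */
bool crosses_cache_line(uint32_t start, uint32_t size)
{
    if (size == 0u) {
        return false;
    }
    return cache_line_index(start) != cache_line_index(start + size - 1u);
}
```

For example, a 12-byte loop starting at 0x1018 ends at 0x1023 and therefore straddles the boundary at 0x1020, while the same loop placed at 0x1000 stays inside a single line.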
My hypothesis is that this performance difference is caused by the loop straddling a cache line boundary. I could not confirm it using the Performance Monitor Counters.
Could you confirm that the change in loop speed is related to cache line alignment?
I have read AN2665 (Fetch Fact 2).
I suspect the effect is so visible because the loop is very short. What about longer loops or longer code?
Should I use the compiler's function/loop/label alignment options? (This would insert many nop instructions.)
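As an illustration of the kind of alignment I mean, here is a sketch that assumes GCC or a compatible compiler (my toolchain may differ). The function `sum_words` is a hypothetical stand-in for the real loop. GCC's `aligned` attribute requests a minimum alignment for the function entry only; aligning loops or labels inside the function would need the command-line options instead (in GCC: `-falign-loops=32`, `-falign-labels=32`, and `-falign-functions=32` for all functions).

```c
#include <stdint.h>

/* Assumption: GCC-style attribute syntax. Requests that the function's
 * entry point be placed on a 32-byte (cache line) boundary. This does
 * not by itself align the loop inside the function. */
__attribute__((aligned(32)))
uint32_t sum_words(const uint32_t *data, uint32_t count)
{
    uint32_t total = 0u;

    /* Short loop standing in for the loop being measured. */
    for (uint32_t i = 0u; i < count; ++i) {
        total += data[i];
    }
    return total;
}
```

The attribute limits the padding to selected functions, whereas the global options pad every function, loop, or label, which is where the many nop instructions would come from.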