I've been chasing a problem where some graphics code we use runs slow when a rotated object is at a particular orientation. The rotation code then runs about 3 times slower than normal.
The problem was that, at that orientation, the code was stepping through memory in 1724-byte increments, which is almost exactly 1/19 of the 32k L1 cache size and 1/152 of the 256k L2 cache.
So it was a "Cache Buster", forcing reads back to main memory once it wrapped both caches.
So what are the relative latencies of L1, L2 and Memory on the i.MX53?
The ARM chapter of the Freescale Reference Manual says "go read the ARM manuals", and those say "the L2 delay is programmable to suit the L2 RAM you're using", but the Freescale manual doesn't say what that delay is actually set to.
There's nothing I can find on the RAM timing or latency, but I'd hope it would be less than 100ns, as Intel chips can manage that:
So time to reverse-engineer... Anybody who wants to run these tests on their own hardware should download the following program and compile it:
Running it on our 1GHz Freescale i.MX53 Evaluation Board gives:
root@lucid-desktop:/tmp# nice --20 ./calibrator 1000 10M report
CPU loop + L1 access:    3.12 ns =   3 cy
            ( delay:     0.37 ns =   0 cy )

level  size     linesize   miss-latency          replace-time
  1     32 KB   64 bytes     9.95 ns =  10 cy     10.54 ns =  11 cy
  2    256 KB   64 bytes   178.68 ns = 179 cy    179.07 ns = 179 cy

level  #entries  pagesize  miss-latency
  1       32       4 KB     46.02 ns =  46 cy
In order to compile it, you may need to rename all instances of the "round" function in the source.
The program even generates gnuplot files of the results which can then be graphed:
The above shows the L1 cache running out at 32k and the L2 running out at 256k.
The L1 miss isn't so bad, but the roughly 189 clock L2 miss penalty (10 for L1 then another 179 for L2) is a lot longer than I'd expected.
It is also pretty easy to have code that gets TLB miss penalties too.