
i.MX53 Cache and Memory Speeds and Latency

Question asked by TomE on Nov 21, 2013
Latest reply on Feb 26, 2014 by Yixing Kong

I've been chasing a problem where some graphics code we use runs slowly when a rotated object is at a particular orientation. At that orientation the rotation code runs about 3 times slower than normal.

 

The problem was that at that orientation, the code was stepping through memory in 1724-byte increments, which is almost exactly 1/19 of the 32 KB L1 cache size and about 1/152 of the 256 KB L2 cache.


So it was a "Cache Buster", forcing reads back to main memory once it wrapped both caches.
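
To make the arithmetic concrete, here is a rough Python sketch of how many cache lines a strided walk like this touches. The 64-byte line size and the 32 KB / 256 KB capacities are the ones the Calibrator run later in this post reports; the function names and the 600-step walk length are made up for illustration, and the model ignores associativity and replacement policy entirely.

```python
STRIDE = 1724            # bytes between successive accesses (from the post)
LINE = 64                # cache line size in bytes (Calibrator report)
L1_SIZE = 32 * 1024      # bytes
L2_SIZE = 256 * 1024     # bytes


def lines_touched(stride, steps, line=LINE):
    """Count the distinct cache lines hit by `steps` accesses `stride` bytes apart."""
    return len({(i * stride) // line for i in range(steps)})


def capacity_in_lines(cache_bytes, line=LINE):
    """Number of lines a cache of `cache_bytes` can hold in the best case."""
    return cache_bytes // line


if __name__ == "__main__":
    print(f"L1 size / stride = {L1_SIZE / STRIDE:.2f}")
    print(f"L2 size / stride = {L2_SIZE / STRIDE:.2f}")
    steps = 600  # hypothetical walk length, not taken from the real code
    print(lines_touched(STRIDE, steps), "lines touched with the 1724-byte stride")
    print(lines_touched(4, steps), "lines touched with a 4-byte stride")
    print(capacity_in_lines(L1_SIZE), "lines fit in L1 at best")
```

Because the stride is larger than a cache line, every access lands on a fresh line: a 600-step walk needs 600 lines while L1 holds at most 512, whereas the same number of accesses at a 4-byte stride touches only 38 lines.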

 

So what are the relative latencies of L1, L2 and Memory on the i.MX53?

 

The ARM chapter of the Freescale Reference Manual says "go read the ARM manuals". The ARM manuals say "the L2 delay is programmable to suit the L2 RAM you're using", and the Freescale manual doesn't say what that delay is set to.


There's nothing I can find on the RAM timing or latency, but I'd hope it would be less than 100ns, as Intel chips can manage that:

 

http://www.xbitlabs.com/articles/memory/display/core2duo-memory-guide_2.html

 

So time to reverse-engineer... Anybody who wants to run these tests on their own hardware should download the following program and compile it:

 

The Calibrator (v0.9e), a Cache-Memory and TLB Calibration Tool

 

Running it on our 1GHz Freescale i.MX53 Evaluation Board gives:

 

root@lucid-desktop:/tmp# nice --20 ./calibrator 1000 10M report

 

Calibrator v0.9e
(by Stefan.Manegold@cwi.nl, http://www.cwi.nl/~manegold/)

CPU loop + L1 access:       3.12 ns =   3 cy
             ( delay:       0.37 ns =   0 cy )

caches:
level  size    linesize   miss-latency        replace-time
  1     32 KB   64 bytes    9.95 ns =  10 cy   10.54 ns =  11 cy
  2    256 KB   64 bytes  178.68 ns = 179 cy  179.07 ns = 179 cy

TLBs:
level #entries  pagesize  miss-latency
  1       32       4 KB    46.02 ns =  46 cy

 

In order to compile it you may need to rename all instances of the "round" function in the source.
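
If you'd rather not do the rename by hand, something like the following Python sketch should work. The replacement name cal_round is an arbitrary choice of mine, and the sample line in the demo is invented, not copied from the Calibrator source. Note that it renames the word wherever it appears, including inside comments and strings.

```python
import re


def rename_round(source, new_name="cal_round"):
    """Rename the identifier `round`, leaving names such as `around` or `round_up` alone."""
    # \b only matches at a word boundary, and '_' counts as a word character,
    # so longer identifiers that merely contain "round" are not touched.
    return re.sub(r"\bround\b", new_name, source)


if __name__ == "__main__":
    sample = "x = round(y) + around(z);"  # made-up line for demonstration
    print(rename_round(sample))
```

To apply it for real, read the C source file into a string, pass it through rename_round, and write the result back before compiling.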

 

The program even generates gnuplot files of the results which can then be graphed:

 

report.cache-miss-latency.gif

The above shows the L1 cache running out at 32k and the L2 running out at 256k.


The L1 miss isn't so bad, but the 189-clock L2 miss penalty (10 for the L1 miss, then another 179 for the L2 miss) is a lot longer than I'd expected.
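
For anyone who wants to redo the sums with their own numbers, here is a small Python sketch. It assumes the miss latencies Calibrator reports simply add on top of each other, which is how I'm reading the report rather than something I've verified, and the function names are mine. Bear in mind that the 3.12 ns base figure includes the CPU loop overhead as well as the L1 access.

```python
CLOCK_MHZ = 1000  # the first argument given to calibrator in the run above

# Figures copied from the Calibrator report, in nanoseconds.
L1_HIT_NS = 3.12     # "CPU loop + L1 access"
L1_MISS_NS = 9.95    # extra cost when L1 misses and L2 hits
L2_MISS_NS = 178.68  # extra cost when L2 misses as well


def ns_to_cycles(ns, mhz=CLOCK_MHZ):
    """Convert nanoseconds to CPU cycles at the given clock rate in MHz."""
    return ns * mhz / 1000.0


def access_cost_ns(level):
    """Approximate cost of one access served by 'L1', 'L2' or 'RAM',
    assuming the reported miss latencies are additive."""
    cost = L1_HIT_NS
    if level in ("L2", "RAM"):
        cost += L1_MISS_NS
    if level == "RAM":
        cost += L2_MISS_NS
    return cost


if __name__ == "__main__":
    for level in ("L1", "L2", "RAM"):
        ns = access_cost_ns(level)
        print(f"{level}: {ns:.2f} ns, about {ns_to_cycles(ns):.0f} cycles")
```

On these figures the two miss penalties together come to about 188.6 ns, and an access that goes all the way to RAM costs roughly 192 ns including the loop and L1 access time.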

 

It is also pretty easy to write code that incurs TLB miss penalties.
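
As a rough illustration, using the 32 entries and 4 KB page size from the Calibrator report and the 1724-byte stride from my rotation code, this Python sketch works out how far a strided walk gets before it needs more pages than the TLB has entries. It assumes the data starts on a page boundary and that nothing else (code, stack, other data) competes for TLB entries, so treat the result as a best case. The function names are mine.

```python
TLB_ENTRIES = 32   # from the Calibrator report
PAGE = 4096        # 4 KB pages, from the Calibrator report
STRIDE = 1724      # bytes between accesses, from the rotation code


def tlb_reach(entries=TLB_ENTRIES, page=PAGE):
    """Bytes of address space the TLB can map at once."""
    return entries * page


def pages_touched(stride, steps, page=PAGE):
    """Count the distinct pages hit by `steps` accesses `stride` bytes apart."""
    return len({(i * stride) // page for i in range(steps)})


def steps_until_tlb_exceeded(stride, entries=TLB_ENTRIES, page=PAGE):
    """Smallest number of accesses that touches more pages than the TLB holds."""
    steps = 1
    while pages_touched(stride, steps, page) <= entries:
        steps += 1
    return steps


if __name__ == "__main__":
    print("TLB reach:", tlb_reach() // 1024, "KB")
    print("Accesses before the walk outgrows the TLB:",
          steps_until_tlb_exceeded(STRIDE))
```

The TLB covers only 128 KB, which is half the size of the L2 cache, so a walk at this stride outgrows it after fewer than 80 accesses.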

 

Tom
