IMX7 M4 caching and execution speed

arnoutdiels
Contributor III

Hi,

We are writing bare metal code for the M4 on the IMX7.

We noticed a huge difference in execution speed between TCM code execution, OCRAM and DDR.

A simple while loop of a = a + 1 (a being a volatile long long int) results in the following measurements (no lmem caching done)

- 24M/s loops in TCM

- 3M/s loops in OCRAM

- 0.7M/s loops in DDR
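
To put these rates in terms of core cycles, the sketch below divides an assumed M4 core clock by each measured loop rate. The 200 MHz figure is an assumption (the post does not state the configured clock), and `cycles_per_loop`, `print_loop_costs` and `ASSUMED_M4_CORE_HZ` are illustrative names, not part of any NXP code.

```c
#include <stdio.h>

/* Assumption: the M4 core clock. The thread does not state it. */
#define ASSUMED_M4_CORE_HZ 200e6

/* Core cycles spent per loop iteration, given the core clock and the
 * measured iteration rate (both in Hz). */
double cycles_per_loop(double core_hz, double loops_per_sec)
{
    return core_hz / loops_per_sec;
}

/* Print the per-iteration cost for the three rates measured in the post. */
void print_loop_costs(void)
{
    const char *names[] = { "TCM", "OCRAM", "DDR" };
    const double rates[] = { 24e6, 3e6, 0.7e6 };   /* loops per second */

    for (int i = 0; i < 3; ++i) {
        printf("%-5s %.1f cycles per loop\n",
               names[i], cycles_per_loop(ASSUMED_M4_CORE_HZ, rates[i]));
    }
}
```

Under that assumed clock, the TCM figure works out to roughly 8 cycles per iteration and the DDR figure to roughly 286, which is the gap the question is about.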

> From the reference manual, it seems that the LMEM controller can only cache DDR in the range 0x8000_0000 to 0x801F_FFFF. Is this correct? If so, is there an application note or other information on how to prevent Linux from using this memory?

OCRAM is fast, but still a lot slower than TCM. This memory (0x2020_0000 - 0x203F_FFFF) should be able to be cached though. In order to try this we:

- Use the imx_driver code, and launch:

LMEM_EnableSystemCache(LMEM);
LMEM_EnableCodeCache(LMEM);

This has no significant effect though.

- We also tried to configure the MPU, marking the OCRAM region as cacheable and all other regions as non-cacheable. This did not have any positive effect either.
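
For reference, an MPU attribute word for the OCRAM region mentioned above could be built roughly as follows. This is a minimal sketch that only computes the value on the host. The field positions are those of the ARMv7-M `MPU_RASR` register, while `mpu_rasr_cacheable` and the `RASR_*` macro names are made up for illustration. Whether the LMEM cache on this SoC honors these MPU attributes is not established anywhere in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-M MPU_RASR fields (macro names are illustrative). */
#define RASR_ENABLE   (1u << 0)
#define RASR_SIZE(n)  ((uint32_t)(n) << 1)  /* region is 2^(n+1) bytes */
#define RASR_B        (1u << 16)            /* bufferable */
#define RASR_C        (1u << 17)            /* cacheable */
#define RASR_AP_FULL  (3u << 24)            /* read/write at any privilege */

/* Integer log2 for a power-of-two value. */
static uint32_t log2_u32(uint32_t v)
{
    uint32_t n = 0;
    while (v > 1u) {
        v >>= 1;
        ++n;
    }
    return n;
}

/* RASR value for a normal-memory, write-back cacheable region.
 * The size must be a power of two (32 bytes minimum) and the base
 * must be aligned to the size, as the MPU requires. */
uint32_t mpu_rasr_cacheable(uint32_t base, uint32_t size_bytes)
{
    assert(size_bytes >= 32u && (size_bytes & (size_bytes - 1u)) == 0u);
    assert((base & (size_bytes - 1u)) == 0u);

    return RASR_AP_FULL | RASR_C | RASR_B |
           RASR_SIZE(log2_u32(size_bytes) - 1u) | RASR_ENABLE;
}
```

For the 2 MiB OCRAM range starting at 0x2020_0000, `mpu_rasr_cacheable(0x20200000u, 0x00200000u)` yields 0x03030029. On the target this value would be written to the MPU region attribute register after selecting a region and setting its base address.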

> Are we missing something here? Is there example code available that correctly uses the LMEM imx driver code to accelerate OCRAM (or any other region)?

Thanks in advance for your reply.

10 Replies

TomE
Specialist II

> - 24M/s loops in TCM

> - 3M/s loops in OCRAM

> - 0.7M/s loops in DDR

The OCRAM seems to be a lot slower than you'd expect. My experience is with the i.MX53. The OCRAM is documented in the manual as having a "one or two clock access". Great, that looks FAST!

Except it doesn't say anywhere that the OCRAM is only being clocked at 133MHz, compared to the CPU at 800MHz or faster. OK, so it is 6 or 12 times slower than you'd expect from reading the manual, and you'd expect 6 to 12 CPU clocks for a read.

Except that testing showed it takes SEVENTEEN 133MHz clocks, or 103 CPU clocks! That's about 130ns. That's a 7.7MHz memory system. I haven't seen memory that slow in about 20 years. Details, tables, measurements of various memory systems here:

https://community.nxp.com/message/514260
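
The clock arithmetic in this post can be checked with a short sketch. It assumes exactly 133 MHz for the OCRAM clock and 800 MHz for the CPU, as quoted above; the function names are illustrative.

```c
#include <stdio.h>

/* Duration in nanoseconds of `clocks` cycles at `clock_hz`. */
double clocks_to_ns(double clocks, double clock_hz)
{
    return clocks * 1e9 / clock_hz;
}

/* Number of cycles at `clock_hz` that fit into `ns` nanoseconds. */
double ns_to_clocks(double ns, double clock_hz)
{
    return ns * clock_hz / 1e9;
}

/* Print the cost of one OCRAM access using the figures from the post. */
void print_ocram_access_cost(void)
{
    const double ns = clocks_to_ns(17.0, 133e6);     /* 17 bus clocks */
    const double cpu_clocks = ns_to_clocks(ns, 800e6);

    printf("access time: %.1f ns\n", ns);
    printf("CPU clocks:  %.1f\n", cpu_clocks);
    printf("access rate: %.2f MHz\n", 1e3 / ns);
}
```

Seventeen clocks at 133 MHz come to about 127.8 ns, or about 102 CPU clocks at 800 MHz, which is within rounding of the figures quoted above.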


Freescale's responses were firstly "that is explained by the 133MHz clock" (which it isn't) and "you're accessing it through the Linux File System" (which I wasn't), then finally "The i.MX53 has complicated structure, it includes many peripheral modules, several internal buses, as result some delays may be observed because of arbitration,bus turn-arounds, etc.".

This sort of memory only seems to be useful for initial bootstrapping. You only find out it is slowing things down badly (like the NFC NAND Flash controller reading at 4.8 MB/s) when you measure it.

Tom

arnoutdiels
Contributor III

Hi,

Do you have any feedback regarding the configuration of the LMEM cache?

Kind regards,

Arnout

Yuri
NXP Employee

Hello,

  The CACHE for the CM4 is not functioning on the rev 1.0 silicon.

 Also, please refer to the RTOS sources for how to work with LMEM.

http://www.nxp.com/webapp/Download?colCode=FreeRTOS_iMX7D_1.0.1_LINUX&appType=license&location=null&... 

i.MX 6 / i.MX 7 Series Software and Development Tool|NXP 

Regards,

Yuri.

arnoutdiels
Contributor III

Hi,

Thanks for your reply, but can you elaborate a bit more on that?

You refer to the FreeRTOS sources. These have functions in lmem.c, but they are not used in any of the examples/code. The only place where the LMEM macros seem to be accessed is in startup.c:

 /* Initialize Cache */
 /* Enable System Bus Cache */
 /* set command to invalidate all ways, enable write buffer
 and write GO bit to initiate command */
 LMEM_PSCCR = LMEM_PSCCR_INVW1_MASK | LMEM_PSCCR_INVW0_MASK;
 LMEM_PSCCR |= LMEM_PSCCR_GO_MASK;
 /* wait until the command completes */
 while (LMEM_PSCCR & LMEM_PSCCR_GO_MASK);
 /* Enable cache, enable write buffer */
 LMEM_PSCCR = (LMEM_PSCCR_ENWRBUF_MASK | LMEM_PSCCR_ENCACHE_MASK);

This however seems only to initialize the system cache, not the code cache. 
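
For comparison, the same invalidate-then-enable sequence applied to the code cache register (PCCCR) might look like the sketch below. It is written against register access hooks so it can run on a host with a mock. The bit positions are assumptions that mirror the PSCCR masks; on the target, the `LMEM_PCCCR_*` masks from the device header would be the ones to use. `reg_ops_t`, `code_cache_enable` and the mock helpers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed PCCCR bit positions, mirroring the PSCCR layout. */
#define PCCCR_ENCACHE (1u << 0)
#define PCCCR_ENWRBUF (1u << 1)
#define PCCCR_INVW0   (1u << 24)
#define PCCCR_INVW1   (1u << 26)
#define PCCCR_GO      (1u << 31)

/* Register access hooks: real hardware on target, a mock on the host. */
typedef struct {
    uint32_t (*read)(void);
    void (*write)(uint32_t value);
} reg_ops_t;

/* Invalidate both ways, wait for the command to finish, then enable the
 * cache and write buffer. Returns 0 on success, -1 if GO never clears. */
int code_cache_enable(const reg_ops_t *pcccr, uint32_t max_polls)
{
    assert(pcccr && pcccr->read && pcccr->write);

    pcccr->write(PCCCR_INVW1 | PCCCR_INVW0);
    pcccr->write(pcccr->read() | PCCCR_GO);

    while (pcccr->read() & PCCCR_GO) {
        if (max_polls-- == 0u) {
            return -1;  /* command did not complete in time */
        }
    }

    pcccr->write(PCCCR_ENWRBUF | PCCCR_ENCACHE);
    return 0;
}

/* Host-side mock: models hardware that completes a GO command at once. */
static uint32_t mock_pcccr;
static uint32_t mock_read(void) { return mock_pcccr; }
static void mock_write(uint32_t v) { mock_pcccr = v & ~PCCCR_GO; }
```

The bounded poll is a design choice for the sketch; the startup.c loop above waits without a timeout.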

The non-functioning part you mention:

- Is this code cache?

- Is this system cache?

- Are all memory regions affected, or not?

- What is the easiest way to determine the silicon version, and in which part numbers is this fixed/not fixed/will this be fixed?

Thanks in advance for your reply

Yuri
NXP Employee

Hello,

  Is it possible to look at your code with the cache enabled?

You may create a support request to send it.

Sales and Support|NXP 

Regards,

Yuri.

arnoutdiels
Contributor III
Yuri,

1) How exactly does the code snippet from my previous post enable the -code- cache? PSCCR refers to the system cache, PCCCR refers to the code cache. The latter is not referenced anywhere except in the function LMEM_EnableCodeCache, which is not called.

2) Thanks for the pointer. My IMX7D has revision 1.2 (C as last character in the name), so I assume the cache should work.

3) I don't know what you mean by "create a request to send it".

Anyway, this is pretty easy to test yourself.

Just take the FreeRTOS_BSP_1.0.1_iMX7D, take the hello_world and hello_world_ddr examples. Add the following code:

////////////////////////////////////////////////////////////////////////////////
// Code
////////////////////////////////////////////////////////////////////////////////
volatile long long int a;

/*!
 * @brief A basic user-defined task
 */
void HelloTask(void *pvParameters)
{
    uint8_t receiveBuff;

    /* Increment the counter as fast as the memory system allows. */
    while (1)
    {
        a = a + 1;
    }
}

And just see how fast a increments in e.g. 10 seconds.
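
One way to turn that into a number is sketched below. On the target a hardware timer would provide the time base; here the standard `clock()` function stands in for it so the sketch runs on a host. `increments_per_second` is an illustrative name, and the batch size of 1000 is an arbitrary choice to keep timing overhead small.

```c
#include <assert.h>
#include <time.h>

volatile long long int a;

/* Increment `a` for roughly `seconds` of CPU time and return the
 * observed number of increments per second. */
double increments_per_second(double seconds)
{
    assert(seconds > 0.0);

    clock_t budget = (clock_t)(seconds * CLOCKS_PER_SEC);
    if (budget < 1) {
        budget = 1;  /* guarantees a non-zero elapsed time below */
    }

    const long long first = a;
    const clock_t start = clock();
    clock_t now;

    do {
        /* Batch the increments so clock() is not called every iteration. */
        for (int i = 0; i < 1000; ++i) {
            a = a + 1;
        }
        now = clock();
    } while (now - start < budget);

    return (double)(a - first) /
           ((double)(now - start) / (double)CLOCKS_PER_SEC);
}
```

Calling `increments_per_second(10.0)` corresponds to the 10-second measurement described above.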

In DDR, this is about 0.7M times per second, which is waaay lower than TCM.

The main function is called from system_MCIMX7D_M4.c, which 'should' initialize all caches as you say.

Even when I call  LMEM_EnableSystemCache(LMEM) and LMEM_EnableCodeCache(LMEM), the result stays the same.

I hope you can now also test this yourself, and are able to provide answers to my initial questions.

Yuri
NXP Employee

Hello,

  "As it turns out, the M4 cache has been optimized for qspi operation and does not have a performance effect on ddr memory accesses. Basically the cache-able memory does not include the ddr. And therefore there will be no difference in applications operating from ddr with and without the caches turned on."

Regards,

Yuri.

TomE
Specialist II

Yuri wrote:

> As it turns out, the M4 cache has been optimized for qspi operation and does not have a

> performance effect on ddr memory accesses.

I thought that response had been proved to be wrong in this post:

https://community.nxp.com/message/953330?commentID=953330#comment-953330

Tom

Yuri
NXP Employee

Hello

 

  Please look at my comments below.

1. The mentioned code enables the LMEM cache, assuming both cache controllers reside within the LMEM. “Low-order addresses (0x0000_0000 through 0x1FFF_FFFF) use the Processor Code (PC) bus, and high-order addresses (0x2000_0000 through 0xFFFF_FFFF) use the Processor System (PS) bus”.
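
The address split quoted in point 1 can be written as a small helper. This is a sketch with illustrative names (`lmem_bus_t`, `lmem_bus_for_address`); it encodes only the quoted rule and says nothing about which regions are actually cacheable.

```c
#include <stdint.h>

/* Which processor bus an address is routed to, per the quoted rule. */
typedef enum {
    LMEM_BUS_PC,  /* Processor Code bus:   0x0000_0000 to 0x1FFF_FFFF */
    LMEM_BUS_PS   /* Processor System bus: 0x2000_0000 to 0xFFFF_FFFF */
} lmem_bus_t;

lmem_bus_t lmem_bus_for_address(uint32_t addr)
{
    return (addr <= 0x1FFFFFFFu) ? LMEM_BUS_PC : LMEM_BUS_PS;
}
```

By this rule, both the OCRAM range starting at 0x2020_0000 and the DDR range starting at 0x8000_0000 from the original question fall on the PS bus, so the system cache (PSCCR) rather than the code cache (PCCCR) would be the one involved for them.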

 

 

2. As for the silicon revision, please refer to the i.MX7 datasheet(s), Figure 1 (Part number nomenclature). The last letter (A/B/C) defines the revision.

http://www.nxp.com/assets/documents/data/en/data-sheets/IMX7DCEC.pdf 

Have a great day,
Yuri

Yuri
NXP Employee

Hello,

 

  We do not have performance estimates for the CM4 with the different kinds of i.MX7 memory. The TCM is mainly intended for the CM4.

Regards,

Yuri
