Enable Code Bus Cache for Cortex-M4

falstaff · ‎05-23-2017

Hello,

I am trying to enable caches for DDR memory on the Cortex-M4. First, I was surprised to see that only the first 2MB of DDR memory seems to be cacheable (according to 4.2.9.3.5 Cache Function). Does this also apply to the code bus? This is somewhat unfortunate since the default Linux relocation address is in this area... I continued the tests without Linux (just leaving the A7 in U-Boot). I set up my linker file so that the firmware uses the code bus for code and the data bus for data:

/* Specify the memory areas */
MEMORY
{
    m_interrupts (RX) : ORIGIN = 0x10000000, LENGTH = 0x00000240
    m_text (RX) :       ORIGIN = 0x10000240, LENGTH = 0x00070000
    m_data (RW) :       ORIGIN = 0x80080000, LENGTH = 0x00080000
}

I added an MPU entry like the one already present for the system bus (0x80000000) and enabled the code cache (LMEM_PCCCR) in platform/devices/MCIMX7D/startup/system_MCIMX7D_M4.c, the same way the system cache is already enabled (LMEM_PSCCR).
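For readers following along, here is a minimal sketch of the invalidate-then-enable sequence described above. It is not the code from system_MCIMX7D_M4.c. The function name `lmem_cache_enable`, the `CCR_*` macros, and the bit positions are assumptions (the layout is taken from the Kinetis/Vybrid-style LMEM controller and should be checked against the i.MX7D header), and the bounded poll is an addition so the routine cannot hang.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED bit layout of a cache control register (PCCCR / PSCCR).
 * Verify against the device header before using on hardware. */
#define CCR_ENCACHE (1u << 0)   /* cache enable */
#define CCR_ENWRBUF (1u << 1)   /* write buffer enable */
#define CCR_INVW0   (1u << 24)  /* invalidate way 0 */
#define CCR_INVW1   (1u << 26)  /* invalidate way 1 */
#define CCR_GO      (1u << 31)  /* start command, cleared by hardware when done */

/* Hypothetical helper: invalidate both ways, wait for completion, then
 * enable the cache. Returns false if GO does not clear within max_polls. */
bool lmem_cache_enable(volatile uint32_t *ccr, uint32_t max_polls)
{
    *ccr = CCR_INVW0 | CCR_INVW1;   /* select "invalidate all" */
    *ccr |= CCR_GO;                 /* kick off the command */
    while (*ccr & CCR_GO) {         /* hardware clears GO on completion */
        if (max_polls-- == 0) {
            return false;           /* give up, cache stays disabled */
        }
    }
    *ccr |= CCR_ENCACHE | CCR_ENWRBUF;
    return true;
}
```

On the target this would be called with the address of the code cache control register, for example `lmem_cache_enable(&LMEM_PCCCR, 100000)` if the header defines `LMEM_PCCCR` as an lvalue. The poll limit is an arbitrary illustrative value.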

The results were rather disappointing: I measured 1296ms without caches and 1638ms with the code cache enabled! When I use a linker file that places code and data in the system bus region (0x80000000), the execution time drops to 87ms! Somehow it seems that the code cache is not working properly...

How can I use the code bus and enable the code cache?

Thanks

Stefan

Note: I use a C micro-benchmark and task ticks to measure execution speed (benchmark courtesy of the gist).

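#include <math.h>   /* fabs */

/* Newton's method for sqrt(x); iterates until it*it is within 1e-13 of x. */
double squareroot(double x)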
{
    double it = x;
    while (fabs(it*it - x) > 1e-13) {
       it = it - (it*it-x)/(2*it);
    }
    return it;
}

void benchmark(void)
{
    const int num_iter = 1000;
    TickType_t t = xTaskGetTickCount();
    volatile double sum_real = 0;
    for (int i = 0; i < num_iter; i++) {
        sum_real += squareroot(10000.0);
    }
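    TickType_t ttot = xTaskGetTickCount() - t;
    /* Convert ticks to milliseconds; a no-op when the tick rate is 1 kHz. */
    PRINTF("%d milliseconds\n\r", (int)(ttot * portTICK_PERIOD_MS));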
}
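To get a feel for how much work one benchmark call does, the following host-side sketch counts the Newton iterations. It uses the same update step and the same 1e-13 tolerance as `squareroot` above. The function name `squareroot_counted` and the `max_iter` cap are additions for illustration and are not part of the original benchmark.

```c
/* Same Newton iteration as squareroot(), but it reports how many update
 * steps were taken and stops after max_iter steps at the latest. */
double squareroot_counted(double x, int max_iter, int *iters)
{
    double it = x;
    int n = 0;
    while (n < max_iter) {
        double err = it * it - x;
        if (err < 0) {
            err = -err;             /* absolute error, as fabs() would give */
        }
        if (err <= 1e-13) {
            break;                  /* same stopping rule as the original */
        }
        it = it - (it * it - x) / (2 * it);
        n++;
    }
    *iters = n;
    return it;
}
```

Every step is double-precision arithmetic. The Cortex-M4 FPU is single-precision only, so on the target these operations normally run as software routines, which makes the benchmark sensitive to instruction fetch speed.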

Yuri · ‎05-23-2017

Hello,

From https://community.nxp.com/thread/446985

"As it turns out, the M4 cache has been optimized for qspi operation and does not have a performance effect on ddr memory accesses. Basically the cache-able memory does not include the ddr. And therefore there will be no difference in applications operating from ddr with and without the caches turned on."

Have a great day,
Yuri


davidwightman · ‎11-28-2017

Hi Yuri,

I repeated Stefan's test, but included QSPI. I saw the same behavior where benchmark times running code from the Master0 PC bus code cache are extremely high compared to running code through the Master1 PS bus system cache.

I think that both in this thread and in the thread you pointed Stefan to above ("IMX7 M4 caching and execution speed"), your answer about the cache being optimized for QSPI doesn't really address the question being asked. Note that I do see similar benchmark times for DDR and QSPI when I use the Master1 PS system bus and cache for both data and code, which is where your answer would make more sense. The question is whether the code cache is usable or not.

The ambiguity in the Reference Manual is whether the listed cacheable addresses are the addresses as seen by the main core or as seen by the M4 core. It seems to be the latter, and because all of those addresses are above 0x2000_0000, they are only usable through the system bus.

Note that I am also using FreeRTOS and the LMEM driver that comes with it, so perhaps there is an issue with the driver. However, there is also a bare-metal M4 bring-up and benchmark test where the same behavior is seen.

It would be nice to get a confirmation that code cache isn't usable and developers should stick to the System cache, if it is in fact the case, because there seem to be a lot of people falling into this trap.

Br,

David

falstaff · ‎05-24-2017

According to my measurements, when using the system bus for code and data, caches also make a difference for DDR memory. E.g. when linking the .text/.data sections to 0x80000000 (must be within 0x80000000-0x80200000 according to 4.2.9.3.5 Cache Function), the performance with cache disabled/enabled is 1357ms/72ms!

I was just wondering how to make use of the code cache, since it seems rather useless even for QSPI: according to the mentioned chapter 4.2.9.3.5 Cache Function, only 0x60000000-0x603FFFFF is supported for QSPI. In fact, none of the listed ranges lies in the code bus region. So it seems that the code bus cache is essentially useless?
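The observation can be made concrete with a small check. The sketch below encodes the two cacheable ranges quoted in this thread (first 2MB of DDR and 0x60000000-0x603FFFFF for QSPI) and the ARMv7-M rule that addresses below 0x20000000 go over the code bus. The function and type names are hypothetical, and the ranges are the ones cited here rather than an independent reading of the reference manual.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { BUS_CODE, BUS_SYSTEM } m4_bus_t;

/* Cortex-M4: addresses below 0x20000000 are fetched over the code bus,
 * everything above goes over the system bus. */
m4_bus_t m4_bus_for(uint32_t addr)
{
    return (addr < 0x20000000u) ? BUS_CODE : BUS_SYSTEM;
}

/* Cacheable ranges as quoted in this thread from 4.2.9.3.5 Cache Function. */
bool in_listed_cacheable_range(uint32_t addr)
{
    bool qspi = (addr >= 0x60000000u) && (addr <= 0x603FFFFFu);
    bool ddr  = (addr >= 0x80000000u) && (addr <= 0x801FFFFFu); /* first 2MB */
    return qspi || ddr;
}
```

With the linker file from the first post, `m_text` at 0x10000240 is on the code bus but outside every listed range, while `m_data` at 0x80080000 is on the system bus and inside the DDR range. That matches the measurements reported in this thread.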

Yuri · ‎05-30-2017

Hello,

You may look at app note AN4947 (Understanding Vybrid Architecture), which covers a similar asymmetric multiprocessing architecture. According to its conclusion, the Cortex-M4 core is intended to work with TCM.

http://www.nxp.com/assets/documents/data/en/application-notes/AN4947.pdf

Regards,

Yuri.

i.MX7Dual