RT1050 is slower than RT1010

laszlo · ‎01-20-2020

With MCUXpresso I developed an application that takes about 2.5 million CPU cycles on an i.MX RT1010 MCU running at 500 MHz. I then compiled and ran the same C source code on the i.MX RT1050 with the same optimization level (-Os). The code executes from flash with the cache enabled.

Surprisingly, the RT1050 takes 3 million CPU cycles for the same task (measured with the DWT registers).

- What can be the reason for the larger CPU cycle counts on the RT1050?

- What can I do to get the smaller CPU cycle counts of the RT1010 on the RT1050?

laszlo · ‎02-10-2020

Sorry for bumping this thread up. Does anyone have a clue why the CPU cycle counts are so different between the RT1010 and RT1050 evaluation kits? I am about to give up and use another vendor.

victorjimenez · ‎02-10-2020

Hello Laszlo,

Sorry for the late response, we have been under a heavy workload. Could you please confirm whether or not you are using our evaluation boards for these tests? Also, during the first test, both boards were running at 500 MHz, correct? If so, could you please share the clock diagram of the RT1050-EVK?

Best Regards,

Victor

laszlo · ‎02-10-2020

- Of course, I was using the evaluation kits, which I got from NXP.

- During the 2 main tests I used the default clock configurations of MCUXpresso, which you can easily look at. In case of the RT1010 it is 500 MHz, in case of the RT1050 it is 600 MHz. These should not make a difference for the CPU cycle counts - unless the TCM or the cache has wait states, or the processor instruction pipelines are different.
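Because the two boards run at different default core clocks, it can help to convert the cycle counts into wall-clock time before comparing them. The sketch below does that arithmetic for the rounded figures quoted in this thread (about 2.5 million cycles at 500 MHz and about 3 million cycles at 600 MHz). The helper name `cycles_to_us` is made up for illustration.

```c
#include <stdint.h>

/* Convert a cycle count to microseconds for a given core clock in Hz.
   Hypothetical helper, not part of the SDK. */
static double cycles_to_us(uint64_t cycles, uint64_t core_hz)
{
    return (double)cycles * 1e6 / (double)core_hz;
}

/* Figures quoted in the thread (rounded by the poster):
   RT1010: cycles_to_us(2500000, 500000000)
   RT1050: cycles_to_us(3000000, 600000000) */
```

With those rounded figures both runs take roughly 5 ms. That would be consistent with part of the workload being limited by something that does not scale with the core clock (flash or bus accesses, for example), but it is only one possible reading of the numbers, not a confirmed diagnosis.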

mjbcswitzerland · ‎02-11-2020

Hi Laszlo

Can you compare running from tightly coupled internal RAM (TCM) only? This would give an exact CPU performance comparison by excluding the somewhat random effects of caching.
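One way to try this is to place the measured function in a RAM section with a GCC section attribute. The sketch below is a minimal illustration under stated assumptions: the section name `.itcm_text` and the macro `RUN_FROM_ITCM` are hypothetical, and the linker script and startup code must actually map that section to ITCM and copy it from flash. MCUXpresso's own section macros (in `cr_section_macros.h`) may offer an equivalent for a given project.

```c
#include <stdint.h>

/* Hypothetical section name: it must match an output section that the
   linker script places in ITCM and that startup code copies from flash. */
#if defined(__arm__) && defined(__GNUC__)
#define RUN_FROM_ITCM __attribute__((section(".itcm_text"), noinline))
#else
#define RUN_FROM_ITCM /* host build: no special placement */
#endif

/* Same style of loop as the hello_world test: sums the first n odd numbers. */
RUN_FROM_ITCM uint64_t sum_odd_numbers(uint32_t n)
{
    volatile uint64_t sum = 0; /* volatile keeps the loop from being folded away */
    for (uint32_t i = 0; i < n; ++i) {
        sum += 2u * (uint64_t)i + 1u;
    }
    return sum;
}
```

If the cycle counts of both boards converge when the loop runs from TCM, the difference seen when running from flash is more likely to come from the memory system than from the core.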

If using external XIP Flash (QSPI) check that both boards have the same type - some have hyper flash (faster) and others have QSPI Flash.

Verify that the SPI speed is set the same for both; otherwise one may take longer to load code and perform worse (even if the CPU is faster).

The i.MX RT 1011 has a single-precision FPU and the i.MX RT 1052 has a double-precision one. If interrupts are involved, the devices take longer to save and restore the interrupt context when the FPU is enabled than when it is not. With the FPU enabled and interrupts firing during the test, the double-precision context save/restore may be slightly slower than the single-precision one (although I never measured it), which could give the i.MX RT 1011 a slight advantage...

Regards

Mark

[uTasker project developer for Kinetis and i.MX RT]

victorjimenez · ‎01-27-2020

Hello Laszlo,

Are you able to reproduce this behavior with an example from the SDK? If so, could you please tell me which examples?

Regards,

Victor

laszlo · ‎01-28-2020

Hi Victor,

A simple loop added to the hello_world semihost SDK example gives the opposite results:

- RT1010: 11 million CPU cycles

- RT1050: 8 million CPU cycles

Here is the main file:

#include "fsl_device_registers.h"
#include "fsl_debug_console.h"
#include "board.h"

#include "pin_mux.h"
#include "clock_config.h"

#define CYCLES (*(volatile uint32_t*)0xE0001004)

int main(void) {
    volatile unsigned long long sum = 0; // volatile prevents compile-time evaluation

    BOARD_ConfigMPU();
    BOARD_InitPins();
    BOARD_BootClockRUN();         // same results with BOARD_InitBootClocks()
    BOARD_InitDebugConsole();

    (*(volatile uint32_t*)0xE000EDFC) |= 1UL<<24; // Enable DWT cycle counter
    (*(volatile uint32_t*)0xE0001000) |= 1UL;     // Start cycle counter

    CYCLES = 0;
    for (int i = 0; i < 1e6; ++i) sum += 2*i + 1;

    PRINTF("CPU Cycles = %lu (%llu)\r\n", CYCLES,sum);

    for(;;) putchar(getchar());
}

Both cases use the Release configuration. The optimization flag is either "-Os" or "-O3", giving virtually the same results.

- These CPU cycle differences are equally puzzling. The MCU cores are the same: Arm Cortex-M7. No data memory is used, and the program is so short that it will surely fit in the instruction cache.

- My real application is quite complex, using assembler-optimized NTT code for polynomial multiplications, computing SHA-3 variants, etc. The only difference from the hello_world setup is that for the RT1010 case I reconfigured the tightly coupled SRAM so that the SRAM_OC region is reduced to 32 KB and the SRAM_DTC region is increased to 64 KB.

- I also experimented with the system clocks. In the case of the RT1050 I reduced the CPU clock frequency to 150 MHz, keeping all other clocks the same. This reduced the CPU cycle count from 3 to 2.5 million, but the board was not stable. (It occasionally crashed, and after a power cycle the debugger did not find the evaluation board, so I had to manually erase its external flash.) This points to a hardware issue rather than different compiler behavior.

victorjimenez · ‎01-31-2020

Hello Laszlo,

Just to confirm, with this new test the results are the expected ones, correct? You also mentioned that you are using Semihosting, is this correct? If so, did you consider the following:

It is fair to say that the semihosting mechanism does not provide a high-performance I/O system. Each time a semihosting operation takes place, the processor is basically stopped while the data transfer takes place. The time this takes depends somewhat on the target CPU, the debug probe being used, the PC hardware and the PC operating system. But it takes a definite period of time, which may make your code appear to run more slowly.

Best regards,

Victor

laszlo · ‎01-31-2020

> with this new test the results are the expected ones

No! The CPU cycle counts should be equal, since the same program runs on the same Cortex-M7 cores. The system clock frequency should not matter, but it seemingly does. My questions were: why, how, and how can I eliminate the differences? So far no one has answered. It would also be helpful if someone could reproduce the results, so that I can at least rule out faulty hardware as the cause.

> Semihosting

The CPU cycle measurements I listed above exclude the I/O. The arguments of the PRINTF function are evaluated first, then the function is called. Only then is semihosting involved. This is the normal C behavior. (You can easily verify this by assigning CYCLES to a variable before PRINTF and printing that variable instead. The printed CPU cycles will be exactly the same.)
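The check described above can be sketched as a small helper that reads the counter once, before any I/O call is made. The function name `cycles_since` and its pointer parameter are illustrative assumptions. On the target, the pointer would be the DWT cycle counter at 0xE0001004, as in the `CYCLES` macro of the test program.

```c
#include <stdint.h>

/* Read the counter exactly once and return the cycles elapsed since `start`.
   Unsigned subtraction stays correct across a single 32-bit wraparound.
   Hypothetical helper, not part of the SDK. */
static inline uint32_t cycles_since(const volatile uint32_t *counter,
                                    uint32_t start)
{
    return *counter - start;
}

/* Intended use on the target (assumes the CYCLES macro from the test program):
 *
 *     uint32_t t0 = CYCLES;
 *     ... code under test ...
 *     uint32_t elapsed = cycles_since(&CYCLES, t0);  // snapshot before I/O
 *     PRINTF("CPU Cycles = %lu\r\n", (unsigned long)elapsed);
 */
```

Taking the snapshot into a local variable makes it explicit that the time spent inside PRINTF and the semihosting transfer is not part of the measurement.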

