With MCUXpresso I developed an application that takes about 2.5 million CPU cycles on an i.MX RT1010 MCU running at 500 MHz. Then I compiled and ran the same C source code on the i.MX RT1050, with the same optimization (-Os). The code is executed from flash, with the cache enabled.
Surprisingly, the RT1050 takes 3 million CPU cycles for the same task (measured with the DWT registers).
- What can be the reason for the larger CPU cycle counts on the RT1050?
- What can I do to get the smaller CPU cycle counts of the RT1010 on the RT1050?
Sorry for bumping this thread up. Does anyone have a clue why the CPU cycle counts are so different between the RT1010 and RT1050 evaluation kits? I am about to give up and use another vendor.
Hello Laszlo,
Sorry for the late response, we have been under a heavy load of work. Could you please confirm whether or not you are using our evaluation boards to make these tests? Also, during the first test, both boards were running at 500 MHz, correct? If so, could you please share the clock diagram of the RT1050-EVK?
Best Regards,
Victor
- Of course, I was using the evaluation kits, which I got from NXP.
- During the 2 main tests I used the default clock configurations of MCUXpresso, which you can easily look at. In case of the RT1010 it is 500 MHz, in case of the RT1050 it is 600 MHz. These should not make a difference for the CPU cycle counts - unless the TCM or the cache has wait states, or the processor instruction pipelines are different.
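For reference, the cycle counts quoted so far can be converted to elapsed time with a small host-side calculation. This is only a sketch: the helper name cycles_to_us is hypothetical, and the clock values are the ones quoted in this thread (500 MHz for the RT1010, 600 MHz for the RT1050), assumed to be the actual core clocks during the measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a CPU cycle count to whole microseconds.
 * core_mhz is the core clock in MHz, i.e. cycles per microsecond.
 * Hypothetical helper, not part of the SDK. */
static uint32_t cycles_to_us(uint32_t cycles, uint32_t core_mhz)
{
    assert(core_mhz > 0u);      /* avoid division by zero */
    return cycles / core_mhz;   /* integer division, truncates */
}
```

With the figures above, cycles_to_us(2500000, 500) and cycles_to_us(3000000, 600) both give 5000 microseconds, so the two boards may be spending the same wall-clock time on the task even though the cycle counts differ.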
Hi Laszlo
Can you compare running from tightly coupled internal RAM only? This would give an exact CPU performance comparison by excluding somewhat random caching improvements.
If using external XIP flash, check that both boards have the same type: some have HyperFlash (faster) and others have QSPI flash.
Verify that the SPI speed is set the same for both, otherwise one may take longer to load code and perform worse (even if the CPU were faster).
The i.MX RT 1011 has a single-precision FPU and the i.MX RT 1052 a double-precision one. If interrupts are involved, the devices take longer to save and restore the interrupt context when the FPU is enabled than when it is not. With the FPU enabled and interrupts occurring during the test, the double-precision context save/restore may be slightly slower than the single-precision one (although I never measured it), which could give the i.MX RT 1011 a slight advantage...
Regards
Mark
[uTasker project developer for Kinetis and i.MX RT]
Hello Laszlo,
Are you able to reproduce this behavior with an example from the SDK? If so, could you please tell me which examples?
Regards,
Victor
Hi Victor,
A simple loop added to the hello_world semihost SDK example gives the opposite results:
- RT1010: 11 million CPU cycles
- RT1050: 8 million CPU cycles
Here is the main file:
#include "fsl_device_registers.h"
#include "fsl_debug_console.h"
#include "board.h"
#include "pin_mux.h"
#include "clock_config.h"
#define CYCLES (*(volatile uint32_t*)0xE0001004)
int main(void) {
    volatile unsigned long long sum = 0; // volatile prevents compile time evaluations
    BOARD_ConfigMPU();
    BOARD_InitPins();
    BOARD_BootClockRUN(); // same results with BOARD_InitBootClocks()
    BOARD_InitDebugConsole();
    (*(volatile uint32_t*)0xE000EDFC) |= 1UL<<24; // Enable DWT cycle counter
    (*(volatile uint32_t*)0xE0001000) |= 1UL; // Start cycle counter
    CYCLES = 0;
    for (int i = 0; i < 1e6; ++i) sum += 2*i + 1;
    PRINTF("CPU Cycles = %lu (%llu)\r\n", CYCLES,sum);
    for(;;) putchar(getchar());
}
Both cases are in Release configuration. The optimization flag is either "-Os" or "-O3", giving virtually the same results.
- These CPU cycle differences are equally puzzling. The MCU cores are the same: Arm Cortex-M7. No data memory is used, and the program is so short that it will surely fit in the instruction cache.
- My real application is quite complex, using assembler-optimized NTT code for polynomial multiplications, computing SHA-3 variants, etc. The only difference from the hello_world setup is that for the RT1010 case I reconfigured the tightly coupled SRAM, such that the SRAM_OC region is reduced to 32 KB and the SRAM_DTC region is increased to 64 KB.
- I also experimented with the system clocks. In the case of the RT1050 I reduced the CPU clock frequency to 150 MHz, keeping all other clocks the same. This reduced the CPU cycle count from 3 to 2.5 million, but the board was not stable. (It occasionally crashed, and after a power cycle the debugger did not find the evaluation board, so I had to manually erase its external flash.) These indicate some hardware issues, not different compiler behaviors.
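As a sanity check that both boards execute the same work, the sum printed by the test program can be compared against a host-side computation. A minimal sketch, assuming the loop bound of one million from the program above; the function name odd_sum is hypothetical, and the volatile qualifier and timing are deliberately left out.

```c
#include <stdint.h>

/* Same arithmetic as the benchmark loop: adds 2*i + 1 for i = 0 .. n-1.
 * Hypothetical host-side helper, not part of the SDK example. */
static uint64_t odd_sum(uint32_t n)
{
    uint64_t sum = 0;
    for (uint32_t i = 0; i < n; ++i) {
        sum += 2u * (uint64_t)i + 1u;
    }
    return sum;
}
```

The sum of the first n odd numbers is n squared, so with n = 1000000 both boards should print 1000000000000 in the parentheses.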
Hello Laszlo,
Just to confirm, with this new test the results are the expected ones, correct? You also mentioned that you are using Semihosting, is this correct? If so, did you consider the following:
It is fair to say that the semihosting mechanism does not provide a high-performance I/O system. Each time a semihosting operation takes place, the processor is basically stopped whilst the data transfer takes place. The time this takes depends somewhat on the target CPU, the debug probe being used, the PC hardware and the PC operating system. But it takes a definite period of time, which may make your code appear to run more slowly.
Best regards,
Victor
> with this new test the results are the expected ones
No! The CPU cycle counts should be equal, since the same program runs on the same Cortex-M7 cores. The system clock frequency should not matter, but it seemingly does. My questions were: why, how, and how can I eliminate the differences? So far no one has given answers. It would also be helpful if someone could reproduce the results, so I can at least know that faulty hardware is not causing the differences.
> Semihosting
The CPU cycle measurements I listed above exclude the I/O. The parameters of the PRINTF function are evaluated first, then the function is called. Only then is semihosting involved. This is the normal C behavior. (You can easily verify this by assigning CYCLES to a variable before PRINTF and printing that variable instead. The printed CPU cycles will be exactly the same.)
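The snapshot approach described above can be sketched as follows. The helper names read_counter and elapsed_cycles are hypothetical; the counter address is the one used in the test program. Passing the counter as a pointer is only an assumption made here so that the helpers can also be exercised off-target.

```c
#include <stdint.h>

/* DWT cycle counter address, as used in the test program above. */
#define DWT_CYCCNT ((const volatile uint32_t *)0xE0001004u)

/* Read the counter once, so that later code (e.g. PRINTF) cannot
 * influence the value that gets reported. */
static inline uint32_t read_counter(const volatile uint32_t *counter)
{
    return *counter;
}

/* Unsigned subtraction yields the correct difference even if the
 * 32-bit counter wrapped around once between the two readings. */
static inline uint32_t elapsed_cycles(uint32_t start, uint32_t stop)
{
    return stop - start;
}
```

On the target this would be used by calling read_counter(DWT_CYCCNT) immediately before and after the loop, storing both results in local variables, and passing elapsed_cycles(start, stop) to PRINTF afterwards.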