I am working on a bare metal project which involves running a real time protocol encode/decode application. The project originally started using TI’s OMAP4460 device (containing a dual-core ARM Cortex-A9 clocked at 1.2 GHz) but for a number of reasons (mostly hardware related) we have moved to the SCM-i.MX6Q. I am evaluating the performance of the SCM-i.MX6Q using a QWKS-SCMIMX6 off-the-shelf development board, comparing it to the OMAP.
Our evaluation involves running some sample protocol encode/decode routines on arrays of data in memory (so there is no external I/O). It all runs on a single ARM core, the others being disabled. We only have access to the object libraries for this evaluation code (provided by a partner company in the project); these are built using TI Code Composer, and are in fact the same object code whether I run on the OMAP or the SCM-i.MX6Q (i.e. I can link exactly the same libraries into my OMAP project as into my SCM-i.MX6Q project). In both cases the device initialisations have come from the standard U-Boot sources for each device type (clock and memory configurations, DCD configuration etc.).
In the SCM-i.MX6Q the ARM is clocked at 800 MHz and in the OMAP at 1200 MHz. The OMAP also uses the same PoP LPDDR2 RAM as the SCM-i.MX6Q, so running the same code under the same conditions I would expect roughly two thirds of the OMAP's performance on the SCM-i.MX6Q. Unfortunately the performance difference I see is huge:
OMAP4460: Overall decoding and encoding finished: 16757906 = 16.757 ms
SCM-i.MX6Q: Overall decoding and encoding finished: 136581353 = 136.581 ms
The OMAP is more than 8 times faster! These timings are taken using the internal ARM performance counter. The test routines run exclusively on the processor with nothing else running and interrupts disabled, so it is pure “number crunching”. The test is very processor and memory intensive.
I have the L1 I/D caches enabled, the L2 cache is enabled, and the MMU is configured to map all of the LPDDR2 RAM addresses as cacheable (TTB_ENTRY_SUPERSEC_NORM equ 0x55C06). The clock settings appear to match what I see if I boot Linux then stop in U-Boot and display the clock settings. Using the same display code from U-Boot built into my project (after my initialisation of the hardware) I see these clock settings, which match what I see if U-Boot does the initialisation.
PLL_SYS 792 MHz
PLL_BUS 528 MHz
PLL_OTG 480 MHz
PLL_NET 50 MHz
ARMCLK 792000 kHz
IPG 66000 kHz
UART 80000 kHz
CSPI 60000 kHz
AHB 132000 kHz
AXI 198000 kHz
DDR 396000 kHz
USDHC1 198000 kHz
USDHC2 198000 kHz
USDHC3 198000 kHz
USDHC4 198000 kHz
EMI SLOW 99000 kHz
IPG PERCLK 66000 kHz
I am beginning to think that the problem has something to do with the L1 cache in the SCM-i.MX6Q. If I do not enable the L1 cache in the SCM-i.MX6Q I see only a small difference in performance, however if I do the same in my test on the OMAP there is a huge difference (the OMAP's encode/decode time becomes 85 ms). Is there something I am missing about configuring the L1 cache which is different to the ARM in the OMAP?
Clearly I am using different build environments, Code Composer for the OMAP and IAR Workbench for the SCM-i.MX6Q. So, just in case the different C libraries were the cause of the problem, I have tried building my project for the SCM-i.MX6Q using TI's C library instead of the one provided by IAR. It makes no difference at all to the timings.
I have been investigating this problem for some time now and am really running out of ideas as to why there is such a large difference in performance. Either there is some device configuration I have overlooked, or there really is a big difference in the architecture between these two devices which is beyond my control. Any help would be much appreciated!
Original Attachment has been moved to: Boot_DCD.c.zip