
Performance problems on SCM-i.MX6Q (as compared to TI OMAP4460)

Question asked by Neil Turner on Mar 23, 2017
Latest reply on May 26, 2017 by Michael Guntli



I am working on a bare-metal project that runs a real-time protocol encode/decode application. The project originally used TI’s OMAP4460 (a dual-core ARM Cortex-A9 clocked at 1.2 GHz), but for a number of reasons, mostly hardware related, we have moved to the SCM-i.MX6Q. I am evaluating the performance of the SCM-i.MX6Q on an off-the-shelf QWKS-SCMIMX6 development board and comparing it to the OMAP.


Our evaluation runs some sample protocol encode/decode routines on arrays of data in memory, so there is no external I/O. It all runs on a single ARM core, the others being disabled. We only have access to the object libraries for this evaluation code, which is provided by a partner company in the project. It is built using TI Code Composer, and the same object code runs on both the OMAP and the SCM-i.MX6Q (i.e. I link exactly the same libraries into my OMAP project as into my SCM-i.MX6Q project). In both cases the device initialisation (clock and memory configuration, DCD configuration etc.) comes from the standard U-Boot sources for each device type.


On the SCM-i.MX6Q the ARM core is clocked at 800 MHz and on the OMAP at 1200 MHz. The OMAP also uses the same PoP LPDDR2 RAM as the SCM-i.MX6Q, so running the same code under the same circumstances I would expect to see roughly two thirds of the OMAP's performance on the SCM-i.MX6Q. Unfortunately the performance difference I see is huge:


OMAP4460 Performance:

Overall decoding and encoding finished: 16757906 = 16.757mS


NXP Performance:

Overall decoding and encoding finished: 136581353 = 136.581mS


The OMAP is more than 8 times faster! These timings are taken using the internal ARM performance counter. The test routines run exclusively on the processor, with nothing else running and interrupts disabled, so it is pure “number crunching”. The test is very processor and memory intensive.


I have the L1 I/D caches enabled, the L2 cache enabled, and the MMU configured to map all of the LPDDR2 RAM addresses as cacheable (TTB_ENTRY_SUPERSEC_NORM equ    0x55C06). The clock settings appear to match what I see if I boot Linux, then stop in U-Boot and display the clock settings. Using the same display code from U-Boot built into my project (after my own initialisation of the hardware), I see the clock settings below, which match what I see when U-Boot does the initialisation.


Clock Settings:

PLL_SYS         792 MHz

PLL_BUS         528 MHz

PLL_OTG         480 MHz

PLL_NET          50 MHz

ARMCLK       792000 kHz

IPG           66000 kHz

UART          80000 kHz

CSPI          60000 kHz

AHB          132000 kHz

AXI          198000 kHz

DDR          396000 kHz

USDHC1       198000 kHz

USDHC2       198000 kHz

USDHC3       198000 kHz

USDHC4       198000 kHz

EMI SLOW      99000 kHz

IPG PERCLK    66000 kHz


I am beginning to think that the problem has something to do with the L1 cache in the SCM-i.MX6Q. If I do not enable the L1 cache on the SCM-i.MX6Q, I see only a small difference in performance. If I do the same in my test on the OMAP, there is a huge difference (the OMAP's encode/decode time becomes 85 ms). Is there something I am missing about configuring the L1 cache that differs from the ARM in the OMAP?


Clearly I am using different build environments: Code Composer for the OMAP and IAR Workbench for the SCM-i.MX6Q. In case the different C libraries were the cause of the problem, I have tried building my project for the SCM-i.MX6Q using TI’s C library instead of the one provided by IAR. It makes no difference at all to the timings.


I have been investigating this problem for some time now and am running out of ideas as to why there is such a large difference in performance. Either there is some device configuration I have overlooked, or there really is a big architectural difference between these two devices that is beyond my control. Any help would be much appreciated!



