I am working on a bare-metal project which involves running a real-time protocol encode/decode application. The project originally started using TI's OMAP4460 device (containing a dual-core ARM Cortex-A9 clocked at 1.2 GHz) but for a number of reasons (mostly hardware related) we have moved to the SCM-i.MX6Q. I am evaluating the performance of the SCM-i.MX6Q using a QWKS-SCMIMX6 off-the-shelf development board, comparing it to the OMAP.
Our evaluation involves running some sample protocol encode/decode routines on arrays of data in memory (so it does no external I/O). It all runs on a single ARM core, the others being disabled. We only have access to the object libraries for this evaluation code (which are provided by a partner company in the project); these are built using TI Code Composer and are in fact the same object code whether I run on the OMAP or the SCM-i.MX6Q (i.e. I can link exactly the same libraries into my OMAP project as into my SCM-i.MX6Q project). In both cases the device initialisations have come from the standard U-Boot sources for each device type (clock and memory configurations, DCD configuration etc.).
In the SCM-i.MX6Q the ARM is clocked at 800 MHz and in the OMAP at 1200 MHz. The OMAP also uses the same PoP LPDDR2 RAM as the SCM-i.MX6Q, so running the same code in the same circumstances I would expect roughly two thirds of the OMAP's performance on the SCM-i.MX6Q. Unfortunately the difference I see is huge:
OMAP4460:   Overall decoding and encoding finished: 16757906 = 16.757 ms
SCM-i.MX6Q: Overall decoding and encoding finished: 136581353 = 136.581 ms
The OMAP is more than 8 times faster! These timings are taken using the internal ARM performance counter. The test routines run exclusively on the processor with nothing else running and interrupts disabled, so it is pure "number crunching". The test is very processor and memory intensive.
I have the L1 I/D caches enabled, the L2 cache is enabled, and the MMU is configured to map all of the LPDDR2 RAM addresses as cacheable (TTB_ENTRY_SUPERSEC_NORM equ 0x55C06). The clock settings appear to match what I see if I boot Linux and then stop in U-Boot and display the clock settings. Using the same display code from U-Boot built into my project (after my initialisation of the hardware) I see these clock settings, which match what I see if U-Boot does the initialisation:
PLL_SYS 792 MHz
PLL_BUS 528 MHz
PLL_OTG 480 MHz
PLL_NET 50 MHz
ARMCLK 792000 kHz
IPG 66000 kHz
UART 80000 kHz
CSPI 60000 kHz
AHB 132000 kHz
AXI 198000 kHz
DDR 396000 kHz
USDHC1 198000 kHz
USDHC2 198000 kHz
USDHC3 198000 kHz
USDHC4 198000 kHz
EMI SLOW 99000 kHz
IPG PERCLK 66000 kHz
I am beginning to think that the problem has something to do with the L1 cache in the SCM-i.MX6Q. If I do not enable the L1 cache in the SCM-i.MX6Q I see only a small difference in performance; however if I do the same in my test on the OMAP there is a huge difference in performance (the OMAP's encode/decode time becomes 85 ms). Is there something I am missing about configuring the L1 cache which is different from the ARM in the OMAP?
Clearly I am using different build environments: Code Composer for the OMAP and IAR Workbench for the SCM-i.MX6Q. So, just in case the different C libraries were the cause of the problem, I have tried building my project for the SCM-i.MX6Q using TI's C library instead of the one provided by IAR. It makes no difference at all to the timings.
I have been investigating this problem for some time now and am really running out of ideas as to why there is such a large difference in performance. Either there is some device configuration I have overlooked, or there really is a big difference in the architecture of these two devices which is beyond my control. Any help would be much appreciated!
Original Attachment has been moved to: Boot_DCD.c.zip
My page table entries are actually using a different way to define the cache control which is not shown in your table; there is one more entry for the TEX field, as follows.
TEX  C  B
1BB  A  A    Cached memory

BB = outer policy
AA = inner policy

See Table 6-3, which then defines:

BB or AA bits   Cache policy
b01             Write-Back cached, Write Allocate
b10             Write-Through cached, No Allocate on Write
b11             Write-Back cached, No Allocate on Write
So my entry of 0x55C06 defines both inner and outer as "Write-Back cached, Write Allocate". This is also what I used on the OMAP.
I tried changing to 0x50C0E, which sets TEX=b000, C=1, B=1, and it makes no difference to the speed. If I change to 0x50C0A (TEX=b000, C=1, B=0) then my test runs at about half the speed (around 270 ms).
It seems the bufferable option was enabled already. One last thing: did you enable branch prediction in the Cortex?
This helps to improve the performance in routines that consist of copying big chunks of data.
Without the code and a way to test it on my side, it is not easy to find the root cause. Is it possible to share the code? The DCD and initialization seem to be OK.
Alejandro, I think we observed something similar in the SCM module last year (July 2016): the SCM-i.MX6D was around 8x slower than the Sabre-SDB i.MX6D at similar CPU frequencies.
After the patches to the SCM-i.MX6 you have made for us the performance is comparable to the Sabre-SDB platforms.
Sorry for the delay in responding. The branch prediction is enabled (as it was in the OMAP).
I am trying to get permission to provide you with a cut-down version of our code which displays this issue (I assume you want the source). This may take a few days, I'll be in touch.
Thanks for your help so far!
The LPDDR2 is clocked at 400MHz in the OMAP4460 (same as the NXP).
I'm not sure what you mean by the "bufferable flag in L1". I am currently configuring the page tables to map the whole of the LPDDR2 using 16 MB supersections, with the following entry in all page table slots.
; L1 and L2 Write-Back cached, Write Allocate (see DDI0333H_arm1176jzs_r0p7_trm.pdf, Tables 6.2 and 6.3 and Figure 6.7 in Section 6.11.2)
TTB_ENTRY_SUPERSEC_NORM equ 0x55C06
What else do I need to set?
What is the operating frequency of the LPDDR2 in the OMAP?
One suggestion is that you can try to set the bufferable flag in the L1 in addition to the cacheable flag.
Please let me know if that makes a difference.
Do you know what configuration of the LPDDR2 you are using?
The default U-Boot image works in fixed mode.
I wonder if you can re-compile the U-Boot source code with the following defconfig?
That should change the DCD to configure the LPDDR2 in interleaving mode.
You will have to change the boot mode on SW1:
SW1-1 --> ON
SW1-2 --> OFF
SW1-3 --> ON
Please let me know if that makes a difference.
I will look at whether there are other points that could help to improve the performance. Is it possible for you to share your project/test so I can test it on my side?