Performance problems on SCM-i.MX6Q (as compared to TI OMAP4460)

neilturner · ‎03-23-2017

Hi,

I am working on a bare-metal project that runs a real-time protocol encode/decode application. The project originally used TI's OMAP4460 (a dual-core ARM Cortex-A9 clocked at 1.2GHz), but for a number of reasons (mostly hardware related) we have moved to the SCM-i.MX6Q. I am evaluating the performance of the SCM-i.MX6Q on an off-the-shelf QWKS-SCMIMX6 development board and comparing it to the OMAP.

Our evaluation runs some sample protocol encode/decode routines on arrays of data in memory, so there is no external I/O. It all runs on a single ARM core; the others are disabled. We only have access to the object libraries for this evaluation code, which is provided by a partner company in the project. It is built using TI Code Composer, and I link exactly the same object libraries into my OMAP project and my SCM-i.MX6Q project. In both cases the device initialisation (clock and memory configuration, DCD configuration, etc.) comes from the standard U-Boot sources for each device type.

On the SCM-i.MX6Q the ARM is clocked at 800MHz and on the OMAP at 1200MHz. The OMAP also uses the same PoP LPDDR2 RAM as the SCM-i.MX6Q, so running the same code under the same conditions I would expect the SCM-i.MX6Q to deliver roughly two thirds of the OMAP's performance. Unfortunately the performance difference I see is huge:

OMAP4460 Performance:

Overall decoding and encoding finished: 16757906 = 16.757ms

NXP Performance:

Overall decoding and encoding finished: 136581353 = 136.581ms

The OMAP is more than 8 times faster! These timings are taken using the internal ARM performance counter. The test routines run exclusively on the processor with nothing else running and interrupts disabled, so it is pure "number crunching". The test is very processor and memory intensive.
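To put the gap in numbers, here is a small sketch. It is only an illustration: the function names are made up for this example, the two counts are the figures quoted above, and 1200 and 800 are the nominal ARM clocks in MHz.

```c
#include <assert.h>
#include <stdint.h>

/* Measured slowdown: elapsed time on the slow device divided by the
 * elapsed time on the fast device. Both counts must use the same unit. */
double slowdown(uint64_t slow_count, uint64_t fast_count)
{
    assert(fast_count != 0);
    return (double)slow_count / (double)fast_count;
}

/* Slowdown that the core clock difference alone would explain,
 * assuming identical work per cycle (an idealisation). */
double clock_ratio(uint32_t fast_mhz, uint32_t slow_mhz)
{
    assert(slow_mhz != 0);
    return (double)fast_mhz / (double)slow_mhz;
}
```

With the figures above, slowdown(136581353, 16757906) comes out at about 8.15, while clock_ratio(1200, 800) is 1.5, so a factor of roughly 5.4 is not accounted for by the clock difference.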

I have the L1 I/D caches enabled, the L2 cache is enabled, and the MMU is configured to map all of the LPDDR2 RAM addresses as cacheable (TTB_ENTRY_SUPERSEC_NORM equ 0x55C06). The clock settings appear to match what I see if I boot Linux, then stop in U-Boot and display the clock settings. Using the same display code from U-Boot built into my project (after my own initialisation of the hardware) I see the clock settings below, which match what I see when U-Boot does the initialisation.

Clock Settings:

PLL_SYS 792 MHz

PLL_BUS 528 MHz

PLL_OTG 480 MHz

PLL_NET 50 MHz

ARMCLK 792000 kHz

IPG 66000 kHz

UART 80000 kHz

CSPI 60000 kHz

AHB 132000 kHz

AXI 198000 kHz

DDR 396000 kHz

USDHC1 198000 kHz

USDHC2 198000 kHz

USDHC3 198000 kHz

USDHC4 198000 kHz

EMI SLOW 99000 kHz

IPG PERCLK 66000 kHz

I am beginning to think that the problem has something to do with the L1 cache in the SCM-i.MX6Q. If I do not enable the L1 cache on the SCM-i.MX6Q I see only a small difference in performance, whereas if I do the same on the OMAP there is a huge difference (the OMAP's encode/decode time becomes 85ms). Is there something I am missing about configuring the L1 cache that is different from the ARM in the OMAP?

Clearly I am using different build environments: Code Composer for the OMAP and IAR Workbench for the SCM-i.MX6Q. In case the different C libraries were the cause of the problem, I have tried building my project for the SCM-i.MX6Q using TI's C library instead of the one provided by IAR. It makes no difference at all to the timings.

I have been investigating this problem for some time now and am really running out of ideas as to why there is such a large difference in performance. Either there is some device configuration I have overlooked, or there is a big architectural difference between these two devices which is beyond my control. Any help would be much appreciated!

Regards,

Neil

Original Attachment has been moved to: Boot_DCD.c.zip

neilturner · ‎03-25-2017

Hi Alejandro,

My Page Table Entries are actually using a different way to define the cache control which is not shown in your table. There is one more entry for the TEX field as follows.

TEX  C  B
1BB  A  A   Cached memory

BB = Outer policy
AA = Inner policy

See Table 6-3.

Table 6-3 then defines:

BB or AA bits   Cache policy
b00             Noncacheable
b01             Write-Back cached, Write Allocate
b10             Write-Through cached, No Allocate on Write
b11             Write-Back cached, No Allocate on Write

So my entry of 0x55C06 defines both inner and outer as "Write-Back cached, Write Allocate". This is also what I used on the OMAP.

I tried changing to 0x50C0E, which sets TEX=b000, C=1, B=1, and it makes no difference to the speed. If I change to 0x50C0A (TEX=b000, C=1, B=0) then my test runs much slower (it takes about twice as long, 270ms).
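For anyone following along, the three values can be checked by pulling the fields out of the descriptor. This is a minimal sketch, assuming the usual ARM short-descriptor section layout (type in bits [1:0], B in bit 2, C in bit 3, TEX in bits [14:12], supersection flag in bit 18); the struct and function names are invented for this example and are not taken from my project.

```c
#include <assert.h>
#include <stdint.h>

/* Memory attribute fields of a first-level section descriptor. */
struct section_attrs {
    unsigned tex;          /* TEX[2:0], descriptor bits [14:12] */
    unsigned c;            /* C, descriptor bit 3 */
    unsigned b;            /* B, descriptor bit 2 */
    unsigned supersection; /* 1 if bit 18 is set (16MB supersection) */
};

/* Extract the attribute fields. The value must be a section or
 * supersection descriptor, i.e. bits [1:0] must be b10. */
struct section_attrs decode_section(uint32_t desc)
{
    struct section_attrs a;
    assert((desc & 0x3u) == 0x2u);
    a.tex = (desc >> 12) & 0x7u;
    a.c = (desc >> 3) & 0x1u;
    a.b = (desc >> 2) & 0x1u;
    a.supersection = (desc >> 18) & 0x1u;
    return a;
}

/* For the TEX=1BB encoding, report the outer (BB) and inner (AA) policy
 * codes from Table 6-3. Returns 0 if TEX[2] is clear, in which case the
 * 1BB/AA scheme does not apply and the outputs are left untouched. */
int cache_policy(struct section_attrs a, unsigned *outer, unsigned *inner)
{
    if ((a.tex & 0x4u) == 0)
        return 0;
    *outer = a.tex & 0x3u;        /* BB */
    *inner = (a.c << 1) | a.b;    /* AA */
    return 1;
}
```

Decoding 0x55C06 this way gives TEX=b101, C=0, B=1 with the supersection bit set, so BB=b01 and AA=b01. 0x50C0E gives TEX=b000, C=1, B=1, and 0x50C0A gives TEX=b000, C=1, B=0.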

Regards,

Neil

alejandrolozan1 · ‎03-27-2017

Hi Neil,

It seems the bufferable option was already enabled. One last thing: did you enable branch prediction in the Cortex?

This helps to improve the performance of routines that copy big chunks of data.
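One way to double-check this is to read back the System Control Register (SCTLR) after initialisation and test the relevant enable bits. The following is only a sketch under assumptions: the bit positions are the ARMv7-A SCTLR ones (M bit 0, C bit 2, Z bit 11, I bit 12), the function names are hypothetical, and the register value itself has to be obtained separately on the target (for example with MRC p15, 0, Rt, c1, c0, 0).

```c
#include <assert.h>
#include <stdint.h>

#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data cache enable */
#define SCTLR_Z (1u << 11)  /* branch prediction enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* Return a mask of the performance-related enable bits that are NOT set
 * in the given SCTLR value. A result of 0 means all four are enabled. */
uint32_t sctlr_missing_perf_bits(uint32_t sctlr)
{
    const uint32_t wanted = SCTLR_M | SCTLR_C | SCTLR_Z | SCTLR_I;
    return wanted & ~sctlr;
}

/* Debug-build check: stops in assert() if any of the bits is missing. */
void check_sctlr(uint32_t sctlr)
{
    assert(sctlr_missing_perf_bits(sctlr) == 0u);
}
```

If the returned mask contains 0x800, the Z bit is clear and branch prediction is off. This only covers SCTLR; it says nothing about the L2 cache controller or other configuration registers.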

Without the code and a way to test on my side it is not easy to find the root cause. Is it possible to share the code? The DCD and initialization seem to be OK.

Best Regards,

Alejandro

michaelguntli · ‎05-26-2017

alejandrolozano‌:

Alejandro, I think we observed something similar on the SCM module last year (July 2016): the SCM-i.MX6D was around 8x slower than the Sabre-SDB i.MX6D at similar CPU frequencies.

After the patches you made for us to the SCM-i.MX6, the performance is comparable to the Sabre-SDB platforms. The patches were:

Interleaved LPDDR2 configuration
WA GPU3D OT patch (reduced the bus priority of the GPU3D so that it didn't preempt other bus participants as often)
--> since Neil is running bare metal, this one shouldn't have an impact

neilturner · ‎03-29-2017

Hi Alejandro,

Sorry for the delay in responding. The branch prediction is enabled (as it was in the OMAP).

I am trying to get permission to provide you with a cut-down version of our code which displays this issue (I assume you want the source). This may take a few days, I'll be in touch.

Thanks for your help so far!

Regards,

Neil

neilturner · ‎03-24-2017

Hi Alejandro,

The LPDDR2 is clocked at 400MHz in the OMAP4460 (same as the NXP).

I'm not sure what you mean by the "bufferable flag in L1". I am currently configuring the page tables to map the whole of the LPDDR2 using 16MB supersections, with the following page table entry used for all entries.

; L1 and L2 Write-Back cached, Write Allocate (see DDI0333H_arm1176jzs_r0p7_trm.pdf Table 6.2 and 6.3 and Figure 6.7 in Section 6.11.2)

TTB_ENTRY_SUPERSEC_NORM equ 0x55C06

What else do I need to set?

Regards,

Neil

alejandrolozan1 · ‎03-24-2017

Hi Neil,

In Table 6-2, which you already mentioned, there is a B bit. That bit enables the bufferable flag.

Please let me know how it goes.

Have a good day,

Alejandro

neilturner · ‎03-23-2017

Hi Alejandro,

Thanks for your response.

My configuration is already setting the memory as 4kB interleaved (I think!). I have:

SW1-2 --> OFF

SW1-3 --> ON

I have taken the DCD data from the U-Boot file imximage_csm_lpddr2.cfg and created my own initialization table from it. This table is then processed by my startup code (running in OCRAM) rather than directly by the Boot ROM. It uses the DCD data which has CONFIG_INTERLEAVING_MODE defined and neither CONFIG_SCM_LPDDR2_512MB nor CONFIG_SCM_LPDDR2_2GB defined (so I get the data for 1GB). I have attached my boot_DCD.c initialization code to the original post above.

Unfortunately, since we do not own the object libraries used in the test, it is difficult for me to make my application available to you.

Regards,

Neil

alejandrolozan1 · ‎03-23-2017

Hi again,

What is the operating frequency of the LPDDR2 in the OMAP?

One suggestion: try setting the bufferable flag for the L1 in addition to the cacheable flag.

Please let me know if that makes a difference.

Best Regards,

Alejandro

alejandrolozan1 · ‎03-23-2017

Hi Neil,

Do you know what configuration of the LPDDR2 you are using?

The default U-Boot image works in fixed mode.

Could you re-compile the U-Boot source code with the following defconfig?

mx6dqscm_1gb_interleaving_qwks_rev2_defconfig

That should change the DCD to configure the LPDDR2 in interleaving mode.

You will also have to change the boot mode on SW1. Make sure that:

SW1-1 --> ON

SW1-2 --> OFF

SW1-3 --> ON

Please let me know if that makes a difference.

I will check whether there are other points that could help to improve the performance. Is it possible for you to share your project/test so I can try it on my side?

Best Regards,

Alejandro

jamesbone · ‎03-23-2017

alejandrolozano, can you please take a look at this thread?
