imx6 running slow

5,559 Views
elijahbrown
Contributor III

Hi all, I have an i.MX6Q running the Freescale bare metal Platform SDK.  I'm using a Boundary Devices Nitrogen6X board, so I've had to modify the SDK a bit: I swapped out the IOMUX files and have only a subset of the drivers/tests compiled in.  What I'm seeing is that the processor seems to be running extremely slowly.  For instance, if I increment a counter in a loop, it takes about 1 second to count to 2,000,000.  I have the Freescale IPU demo hooked up to the Boundary Devices WVGA 480x800 display, and it takes about 800 ms to paint the image to the display (you can watch the screen 'wipe' down as the memset runs).  That's a memset() over the whole frame buffer followed by a memcpy() of the Freescale logo image.  This is supposed to be running at 792 MHz, which ought to be so quick it's indiscernible.  I suspect something is way off in the configuration but have not yet figured out what.  Here's what I've tried and measured so far:

 

I hooked up various internal clocks to the CLKO1 and CLKO2 pins and measured them with a scope.  Here are my measurements:

ipg_clk_root = 66 MHz

ahb_clk_root = 132 MHz

osc_clk = 24 MHz

axi_clk_root = 264 MHz

mmdc_ch0_axi_clk_root = 528 MHz

arm_clk_root = 396 MHz.

 

All of these are as expected except for arm_clk_root: it is supposed to be 792 MHz.  I verified the DIV_SELECT field of the CCM_ANALOG_PLL_ARM register is set to 0x42 (66 decimal), which per the formula Fout = 24 MHz * DIV_SELECT / 2 should give 24 MHz * 33 = 792 MHz.  ARM_PODF is set to 0 (divide by 1), and PLL1_SW_CLK_SEL is also set to 0.  I do not understand why I'm seeing half of what I expect.  However, that alone is not enough to explain the terribly slow performance. 

 

My other suspicion as to what might cause this is incorrect DDR settings.  I am running out of RAM with a Segger JTAG debugger and have a GDB init script to set up the DDR.  There are a lot of settings and I do not understand them all yet, so any help here is appreciated.  I obtained the settings from the Boundary Devices u-boot port that runs on this board.  I attached my gdbinit script so you can see what it sets.  The code is linked to start at 0x10000000, so it is running entirely out of DDR.  The MMU setup is exactly the same as in the Freescale SDK: 0x10000000 up to 2 GB is mapped 1:1 virtual to physical, cache policy kOuterInner_WB_WA. 

 

Is there anything else I've missed that could be causing this?  The chip also gets pretty hot compared to what it did running u-boot.  Perhaps that is another data point, but I am not sure what to do with it.  Thanks for any advice. 

Original Attachment has been moved to: gdbinit.cfg.zip

7 Replies
2,532 Views
igorpadykov
NXP Employee
NXP Employee

Hi Elijah

please enable caches and try running from an SD card (without JTAG)

Enabling MMU and Caches on i.MX6 Series Platform SDK

Best regards

igor

-----------------------------------------------------------------------------------------------------------------------

Note: If this post answers your question, please click the Correct Answer button. Thank you!

-----------------------------------------------------------------------------------------------------------------------

2,532 Views
elijahbrown
Contributor III

Thanks, I didn't think of enabling the cache; that sped it up by about a factor of 10 and also made it run cooler.  However, we are still seeing it take 65 ms to do a memcpy() into the LCD framebuffer.  We're using ARGB8888 pixel format on an 800x480 LCD, so 384,000 32-bit words to copy.  Even if it takes 10 instructions per word (which seems pessimistic), that works out to only about 60 million instructions per second.  We do not currently have a way to run from the SD card, but the JTAG adapter should not affect the speed, right?  If it does, we are going to have fairly serious problems using this part...  The main purpose of this design is to drive a display, but clearly a 7 Hz refresh rate is unacceptable, and that is only doing a memcpy of a static framebuffer.  It will get much worse once we start rendering graphics in real time.

Can you shed any light on why the arm_clk_root is half the expected speed?  Any other ideas for how we can speed this up? 

2,532 Views
igorpadykov
NXP Employee
NXP Employee

Hi Elijah

The JTAG adapter can affect the speed; please try to run the code without JTAG, with caches enabled.  The SDK was designed just for test, not for high performance.  Instead of memcpy one can try the procedure in the attached file and use the SDMA example mem_2_mem.c; this may increase memory copy speed.

Best regards

igor

2,532 Views
elijahbrown
Contributor III

Sorry for the delay, I was working on other priorities.  We do not currently have a way to run the code without JTAG: it's a Boundary Devices board with SPI flash, and we do not yet have a way to write our code to the flash.  I understand the SDK was not designed for speed, but we are developing a bare metal application where speed is important and need some support doing so.  Caches are now enabled and the clock settings are as described above, so I am not sure what else to look for.

How much will the JTAG slow it down?  What we are observing is not just a little slow; it's off by a couple orders of magnitude.  Surely it will not slow it down that badly?

Why is the arm_clk_root (as measured by my scope) half the expected speed?  I have the CLKO divider set to 8, so with the clock running at 792 MHz as configured, the pin should toggle at 99 MHz.  I am observing 49.5 MHz.  Is the pin even capable of going that fast? 


What sort of benchmarks have you done on this part?  If you can point me to something we can run bare metal, I'll run it on our setup and we can more accurately quantify how far off the performance is. 

2,532 Views
igorpadykov
NXP Employee
NXP Employee

Benchmarks link:

LMbench Benchmarks on i.MX

Regarding arm_clk_root, one can divide it further with CCM_CACRR, and try increasing the drive strength/slew rate with the IOMUXC_SW_PAD_CTL_PAD registers.

~igor

2,533 Views
elijahbrown
Contributor III

Sorry, but giving us a link to a benchmark that runs under Linux when I specifically said this is a bare metal application is not very helpful.  The issue is getting the part set up to run at max performance *without* a bootloader or Linux in the picture.  Increasing the drive strength did not make the CLKOUT pins toggle any faster, so I do not know why they are reading the incorrect clock frequency, but I am now convinced the ARM core is running at the expected speed. 

I wrote some assembly routines with a known number of instructions such that a GPIO would toggle every 50 ms or so, and found it is indeed running at close to 792 million instructions per second (I was using a bunch of subtracts to test with).  I also turned on branch prediction for the ARM core, which dramatically improves things like memcpy() since they are tight loops.  Branch prediction is not on by default, and with it off it appears the pipeline is flushed every time you hit a branch, a huge performance hit.  If JTAG is slowing it down, it certainly isn't by very much.  Branch prediction was the missing piece. 

2,532 Views
karina_valencia
NXP Apps Support
NXP Apps Support

igorpadykov, can you help to continue with the follow-up?
