i.MX53 Code Alignment and Execution Speed, explanation?

TomE
Specialist II

We're running Linux on an i.MX53 board where the CPU is running at 800MHz.

With some kernels this reports in the Linux startup:

Calibrating delay loop... 531.66 BogoMIPS (lpj=2658304)

With other ones it reports:

Calibrating delay loop... 795.44 BogoMIPS (lpj=3977216)
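Both printed figures can be reproduced from the lpj values in the log lines. The sketch below mirrors, as far as I understand it, the integer arithmetic the kernel uses when printing the calibration message. HZ = 100 is an assumption: the logs do not state it, but it is the value for which the printed figures and the lpj values agree. The function name is made up for illustration.

```python
# Assumption: HZ = 100 (not stated in the logs; chosen because it makes the
# printed BogoMIPS figures agree with the lpj values).
HZ = 100

def bogomips_string(lpj, hz=HZ):
    """Format lpj as 'whole.fraction' BogoMIPS using integer math only."""
    whole = lpj // (500000 // hz)        # integer part
    frac = (lpj // (5000 // hz)) % 100   # two decimal digits
    return f"{whole}.{frac:02d}"

slow = bogomips_string(2658304)
fast = bogomips_string(3977216)
print(slow, fast)
```

With these inputs the two strings match the "531.66" and "795.44" shown in the log lines.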

After hours I managed to track down the difference. The BogoMIPS are calculated from the execution time of a "__delay()" function in arch/arm/lib/delay.S, which looks like this:

c01415e8 <__delay>:
c01415e8:       e2500001        subs    r0, r0, #1
c01415ec:       8afffffd        bhi     c01415e8 <__delay>
c01415f0:       e1a0f00e        mov     pc, lr

The above version of that function runs at "800 BogoMIPS", which means it is executing the "subs" and the "bhi" in the same clock cycle and managing one loop per clock. Impressive!

If the code is instead located one 32-bit word higher or lower in memory (so starting at c01415e4), it now runs at "533 BogoMIPS" which means it is taking 3 clocks to run the loop twice. Something is preventing full-speed execution.
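The 2:3 ratio can be checked directly from the two lpj values. The sketch below converts lpj to loop iterations per second and then to CPU cycles per iteration. It assumes HZ = 100 (not stated in the logs) and that the core really runs at 800 MHz. If that HZ value is right, the figures work out to roughly two and three cycles per iteration rather than one and one and a half, because BogoMIPS counts twice the iteration rate; the 2:3 ratio is the same either way.

```python
HZ = 100               # assumed timer tick rate, not stated in the logs
CPU_HZ = 800_000_000   # 800 MHz, as given in the post

def cycles_per_iteration(lpj, cpu_hz=CPU_HZ, hz=HZ):
    """CPU cycles spent on one pass through the subs/bhi loop."""
    loops_per_second = lpj * hz
    return cpu_hz / loops_per_second

fast = cycles_per_iteration(3977216)   # "795.44 BogoMIPS" kernel
slow = cycles_per_iteration(2658304)   # "531.66 BogoMIPS" kernel
print(round(fast, 2), round(slow, 2), round(fast / slow, 3))
```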

There is lots of information in the ARM CPU manuals detailing instruction timing, and when things can and can't be double-issued, but there's nothing that I can see that explains different execution speed for different four-byte alignment of the instructions.

Does anyone know of a reference in the manuals that explains this?

This is measuring "Bogus MIPS", and it normally shouldn't matter, but we're passing the "lpj" (Loops per Jiffy) on the kernel command line to make it boot faster, and so we have to have this the same every time we build a kernel.

Tom

fabio_estevam
NXP Employee

Hi Tom,

Tested this on a mx53qsb running at 1GHz with kernel 3.12 and I got:

Calibrating delay loop... 663.55 BogoMIPS (lpj=3317760)

I get almost 1000 BogoMIPS with 2.6.35 though.

It would be nice if you could post your results at linux-arm-kernel@lists.infradead.org so that people could comment on this behavior.

Regards,

Fabio Estevam

TomE
Specialist II

The speed is a result of the "random" outcome of everything that goes into a build. Sometimes the two instructions end up in the same 64-bit word and sometimes they don't. I'm guessing there must be a 64-bit path in the instruction fetch logic somewhere that is acting as a bottleneck.
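The "same 64-bit word" guess can be tested against the addresses quoted in this thread. The sketch below only checks whether the two 4-byte loop instructions fall in the same naturally aligned 8-byte block. It illustrates the hypothesis and says nothing about how the fetch hardware actually works. The function name is made up.

```python
def pair_shares_dword(loop_addr):
    """True if the instructions at loop_addr and loop_addr + 4 lie in the
    same naturally aligned 8-byte block."""
    return loop_addr // 8 == (loop_addr + 4) // 8

# Addresses from the first post: the "fast" location and the one a word lower.
for addr in (0xC01415E8, 0xC01415E4):
    print(hex(addr), pair_shares_dword(addr))
```

The "fast" address passes the check and the address one word lower does not, which is consistent with the guess but does not prove it.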

I had to add an "align" directive to arch/arm/lib/delay.S to fix the alignment for our builds. You had two kernels behaving in the two ways. I have 14 kernel builds that run "slow" and two that run "fast".

> It would be nice if you could post your results at

I don't have an account there. I see you post there a lot, so please feel free to post this observation/question on my behalf.

Someone else spotted this problem here, didn't know why it was happening and didn't get any responses:

http://thread.gmane.org/gmane.linux.ports.arm.kernel/61420

There's a huge thread here with 80 posts debating whether to remove the "#if 0" code in delay.S, comment it, or make it a configuration item.

Gmane Loom

Tom

fabio_estevam
NXP Employee

Tom,

Can you please show me the exact location where you added the .align directive?

I would like to try it here as well. It looks interesting :smileywink:

Regards,

Fabio Estevam

TomE
Specialist II

> Can you please show me the exact location where you added the .align directive?

The code is in arch/arm/lib/delay.S.

Between Kernel 3.5 and 3.6 this file was renamed to "delay-loop.S", so if you're using a new kernel you'll have to look at that file, and make sure the build is using that one and not "delay.c" and not replacing it with a timer-based one.

The function "__udelay()" falls into "__delay()" so you can't directly align the __delay() function.

I added an "align" directive as shown below. With this code that results in "__delay()" being on an ODD 4-byte boundary and it runs slow. Adding a single "nop" just before "__delay()" changes it to the "fast" behaviour.

/*
 *  linux/arch/arm/lib/delay.S
 *
 *  Copyright (C) 1995, 1996 Russell King
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 */
#include <linux/linkage.h>
#include <asm/assembler.h>
#include <asm/param.h>

                .text
                .align 5  /* __delay() 2/3 speed on odd-32-bit alignment */

.LC0:          .word  loops_per_jiffy
.LC1:          .word  (2199023*HZ)>>11

/*
 * r0  <= 2000
 * lpj <= 0x01ffffff (max. 3355 bogomips)
 * HZ  <= 1000
 */
ENTRY(__udelay)
                ldr    r2, .LC1
                mul    r0, r2, r0
ENTRY(__const_udelay)                          @ 0 <= r0 <= 0x7fffff06
                mov    r1, #-1
                ldr    r2, .LC0
                ldr    r2, [r2]                @ max = 0x01ffffff
                add    r0, r0, r1, lsr #32-14
                mov    r0, r0, lsr #14        @ max = 0x0001ffff
                add    r2, r2, r1, lsr #32-10
                mov    r2, r2, lsr #10        @ max = 0x00007fff
                mul    r0, r2, r0              @ max = 2^32-1
                add    r0, r0, r1, lsr #32-6
                movs    r0, r0, lsr #6
                moveq  pc, lr

/*
 * loops = r0 * HZ * loops_per_jiffy / 1000000
 *
 * Oh, if only we had a cycle counter...
 */

/* Add one or more nops in here to change the code alignment of __delay() */

@ Delay routine
ENTRY(__delay)
                subs    r0, r0, #1
#if 0
                movls  pc, lr
                subs    r0, r0, #1
                movls  pc, lr
                subs    r0, r0, #1
                movls  pc, lr
                subs    r0, r0, #1
                movls  pc, lr
                subs    r0, r0, #1
                movls  pc, lr
                subs    r0, r0, #1
                movls  pc, lr
                subs    r0, r0, #1
                movls  pc, lr
                subs    r0, r0, #1
#endif
                bhi    __delay
                mov    pc, lr
ENDPROC(__udelay)
ENDPROC(__const_udelay)
ENDPROC(__delay)

I got rid of the "#if 0" and "#endif" above, enabling the unrolled code, and it got 1273 BogoMIPS at 800MHz. From the news posts mentioned previously, this unrolling was apparently meant for old and slow ARM CPUs.

Tom

fabio_estevam
NXP Employee

Hi Tom,

I tested your suggestion here and now I get 996 BogoMIPS for mx53 running at 1GHz.

Why the '.align 5' instead of '.align 4'? In my tests, if I only add the 'nop' I get the faster behaviour. I also tested '.align 4' and got the same faster result.
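For reference, on ARM the GNU assembler treats the operand of .align as a power-of-two exponent, so '.align 4' asks for a 16-byte boundary and '.align 5' for a 32-byte boundary. The sketch below lists the boundaries and checks that each is a multiple of 8. Under the 64-bit fetch guess from earlier in the thread (a guess, not a documented fact), any of these would keep the loop's two instructions in the same 64-bit word. The helper name is made up.

```python
def align_bytes(exponent):
    """Byte boundary requested by '.align <exponent>' when the operand is
    interpreted as a power-of-two exponent (as on ARM with GNU as)."""
    return 1 << exponent

for n in (3, 4, 5):
    size = align_bytes(n)
    print(f".align {n}: {size}-byte boundary, multiple of 8: {size % 8 == 0}")
```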

You don't need to have any special account to post to the ARM kernel list. Just plain text email would suffice.

Also tested on a mx6q (CortexA9) and the reported BogoMIPS did not change.

Of course I could post this there, but I think you are much more familiar with this piece of code.

Just post an RFC patch based on 3.12, so that people can comment. It would be good if we could really close this issue.

Thanks,

Fabio Estevam

fabio_estevam
NXP Employee

Tom,

I also noticed that adding only '.align 4' just before __loop_delay works fine:

diff --git a/arch/arm/lib/delay-loop.S b/arch/arm/lib/delay-loop.S

index 36b668d..756337a 100644

--- a/arch/arm/lib/delay-loop.S

+++ b/arch/arm/lib/delay-loop.S

@@ -41,6 +41,7 @@ ENTRY(__loop_const_udelay)                    @ 0 <= r0 <= 0x7fffff06

  * loops = r0 * HZ * loops_per_jiffy / 1000000

  */

+               .align 4

@ Delay routine

ENTRY(__loop_delay)

                subs    r0, r0, #1

Does this also work well for you?

Regards,

Fabio Estevam

TomE
Specialist II

What BogoMIPS values are you getting, and which CPU are you running the tests on? Are you getting the same results I'm getting?

> Also noticed that only adding .align 4 prior to __loop_delay works fine:

That requires the linker to pad code sections with NOP instructions. I wasn't sure it would do that, and still can't find any documentation stating what the default fill is (or if and where it is overridden anywhere in the Linux Kernel build system), so I used a simple approach that I knew would work.

Testing it the way you did it shows that code sections are padded with NOPs, so that is a better approach. The following is the output of "arm-cortexa8-linux-gnueabi-objdump -S vmlinux | less" and then searching for "__delay":

c014425c <__udelay>:
c014425c:       e51f200c        ldr     r2, [pc, #-12]  ; c0144258 <PRRR+0xc109c0b0>
c0144260:       e0000092        mul     r0, r2, r0

c0144264 <__const_udelay>:
c0144264:       e3e01000        mvn     r1, #0
c0144268:       e51f201c        ldr     r2, [pc, #-28]  ; c0144254 <PRRR+0xc109c0ac>
c014426c:       e5922000        ldr     r2, [r2]
c0144270:       e0800921        add     r0, r0, r1, lsr #18
c0144274:       e1a00720        lsr     r0, r0, #14
c0144278:       e0822b21        add     r2, r2, r1, lsr #22
c014427c:       e1a02522        lsr     r2, r2, #10
c0144280:       e0000092        mul     r0, r2, r0
c0144284:       e0800d21        add     r0, r0, r1, lsr #26
c0144288:       e1b00320        lsrs    r0, r0, #6
c014428c:       01a0f00e        moveq   pc, lr
c0144290:       e320f000        nop     {0}
c0144294:       e320f000        nop     {0}
c0144298:       e320f000        nop     {0}
c014429c:       e320f000        nop     {0}

c01442a0 <__delay>:
c01442a0:       e2500001        subs    r0, r0, #1
c01442a4:       8afffffd        bhi     c01442a0 <__delay>
c01442a8:       e1a0f00e        mov     pc, lr
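The number of padding nops in a listing like this can be worked out from the address that follows 'moveq pc, lr' and the requested boundary. The sketch below is plain address arithmetic on the linked addresses shown above; the helper name is made up. It does not by itself establish which directive produced the padding.

```python
def padding_nops(addr, exponent, insn_size=4):
    """Number of insn_size-byte nops needed to advance addr to the next
    2**exponent boundary (0 if addr is already aligned)."""
    boundary = 1 << exponent
    pad_bytes = (-addr) % boundary
    return pad_bytes // insn_size

after_moveq = 0xC014428C + 4   # first address after 'moveq pc, lr'
print(padding_nops(after_moveq, 4))   # nops needed to reach a 16-byte boundary
print(padding_nops(after_moveq, 5))   # nops needed to reach a 32-byte boundary
```

From that address, a 32-byte boundary needs four nops, which lands on c01442a0 as in the listing, while a 16-byte boundary would need none.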

Tom

fabio_estevam
NXP Employee

Tom,

I tested on mx51 and mx53 and I get the same results that you reported. Running mx53 at 1GHz just scales linearly.

I would like to send an RFC patch to the ARM list and would like to know if I can add your Reported-by tag.

If you are OK with that, I will contact you offline so that you can give me your email address.

Regards,

Fabio Estevam

TomE
Specialist II

> I would like to send a RFC patch to the ARM list

And you have. Thanks for doing that. Going back a bit...

> Why the '.align 5' instead of '.align 4'?

I was also testing larger alignments (8 bytes, 16 bytes, 32 bytes) to see if that made any difference. That code was also ending up near/over a cache-line boundary and I wanted to avoid that.

Would you believe I need guaranteed "pessimal" behavior? I need that code to always run in the slow mode as we've shipped lots of devices with a hardcoded "lpj=nnnn" on the Kernel Command Line that is stored in the Redboot Bootstrap configuration. We can upgrade the Apps and the Kernel in the field, but we can't change that command line. Why did we override this? To make it boot faster. All of 60ms faster.
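To illustrate why a fixed lpj matters here: a loop-based udelay() converts microseconds to loop iterations using the lpj it was told, so if the loop actually runs faster than that value implies, every delay comes out short. The sketch below uses the two lpj values from the first post and assumes the hardcoded command-line value is the "slow" one (the post only says "lpj=nnnn"). The function name is made up.

```python
def actual_delay_us(requested_us, lpj_cmdline, lpj_true):
    """Delay really produced by a loop-based udelay when the kernel trusts
    lpj_cmdline but the loop actually runs at lpj_true iterations per jiffy."""
    return requested_us * lpj_cmdline / lpj_true

SLOW_LPJ = 2658304   # from the "531.66 BogoMIPS" log line (assumed hardcoded value)
FAST_LPJ = 3977216   # from the "795.44 BogoMIPS" log line

short = actual_delay_us(100, SLOW_LPJ, FAST_LPJ)
print(round(short, 1))   # microseconds actually delayed for udelay(100)
```

Under these assumptions a requested 100 us delay would last only about two thirds of that on a "fast" kernel.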

> Just post a RFC patch based on 3.12,...

Unlikely to affect us or any of your customers, as Freescale is using 2.6 for i.MX53 and 3.0 for i.MX6. We're using code based off 3.4. This change is unlikely to get backported.

Tom

TomE
Specialist II

I wrote:

> Fabio wrote:

> > I would like to send a RFC patch to the ARM list

> And you have. Thanks for doing that. Going back a bit..

No takers. It looks like the post hasn't had any responses.

So it is still a mystery as to why the CPU does this and whether it has any other implications.

Tom

fabio_estevam
NXP Employee

This is normal. We should allow more time for people to comment.

If no comment is given after a week, then I will ping.

Regards,

Fabio Estevam

fabio_estevam
NXP Employee

Hi Tom,

Good news. Please check: 'Re: [RFC] ARM: lib: delay-loop: Add align directive to fix BogoMIPS calculation' - MARC

Russell is happy with the patch and I will submit it to his patch system.

Would it be possible for you to run the benchmark that Russell suggested?

Thanks,

Fabio Estevam

TomE
Specialist II

> Would it be possible for you to run the benchmark that Russell suggested?

Not easily. I'd have to compile from source as we're not running a standard "Package" system. It could take days to get it running.

Hackbench is a VERY high level system benchmark. We have found a looped pair of instructions where the CPU execution speed depends on alignment. The alignment of instruction pairs within functions is effectively "random", so for every pair that might get faster with that compiler alignment I'd expect another pair (aligned the other way) to get slower, cancelling out in the benchmark results. The code size increase from that extra alignment would also make the code slower due to cache issues, so I'd expect hackbench to show slightly slower results with this change.

We know from our testing what this alignment does to the BogoMIPS. We don't know why the ARM CPU is doing this. It is probably so esoteric and artificial (a two instruction loop can't do any useful work really) that it doesn't affect anything except this very public BogoMIPS value. So perhaps "interesting but not really important".

Tom

YixingKong
Senior Contributor IV

Tom

Has your issue been resolved? If yes, we are going to close the discussion in 3 days. If you still need help, please feel free to contact Freescale.

Thanks,
Yixing

fabio_estevam
NXP Employee

Yixing,

I have sent a fix for this, which already hit the mainline kernel:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/arch/arm/lib/delay-loop....

Regards,

Fabio Estevam
