i.MX53 Code Alignment and Execution Speed, explanation?

TomE · ‎11-19-2013

We're running Linux on an i.MX53 board where the CPU is running at 800MHz.

With some kernels this reports in the Linux startup:

Calibrating delay loop... 531.66 BogoMIPS (lpj=2658304)

With other ones it reports:

Calibrating delay loop... 795.44 BogoMIPS (lpj=3977216)

After hours I managed to track down the difference. The BogoMIPS are calculated from the execution time of a "__delay()" function in arch/arm/lib/delay.S, which looks like this:

c01415e8 <__delay>:
c01415e8: e2500001 subs r0, r0, #1
c01415ec: 8afffffd bhi c01415e8 <__delay>
c01415f0: e1a0f00e mov pc, lr

The above version of that function runs at "800 BogoMIPS". The kernel reports BogoMIPS as lpj * HZ / 500000, which counts two instructions per pass through the loop, so this works out to about 400 million loops per second: the "subs" and the "bhi" together take two clocks, or one instruction per clock.

If the code is instead located one 32-bit word higher or lower in memory (so starting at c01415e4), it now runs at "533 BogoMIPS", which means each loop is taking three clocks instead of two. Something is preventing full-speed execution.
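As a cross-check on the numbers above, here is a small sketch of the arithmetic. It assumes HZ=100, which is not stated anywhere in this thread but is the value that reproduces both boot messages, and it assumes the 800MHz core clock mentioned at the top. The function names are only for illustration.

```python
HZ = 100  # assumed jiffy rate; it is the value that reproduces the boot messages

def bogomips(lpj, hz=HZ):
    """Format BogoMIPS from lpj as lpj * HZ / 500000, truncated to two decimals."""
    whole = lpj // (500000 // hz)
    frac = (lpj // (5000 // hz)) % 100
    return f"{whole}.{frac:02d}"

def clocks_per_loop(lpj, cpu_hz, hz=HZ):
    """CPU clocks spent per iteration of the delay loop."""
    loops_per_second = lpj * hz
    return cpu_hz / loops_per_second

for lpj in (2658304, 3977216):
    print(bogomips(lpj), round(clocks_per_loop(lpj, 800_000_000), 2))
```

Under those assumptions the two kernels come out at roughly 3 and 2 clocks per loop, which is the same 2:3 speed ratio as the BogoMIPS figures.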

There is lots of information in the ARM CPU manuals detailing instruction timing, and when things can and can't be double-issued, but there's nothing that I can see that explains different execution speed for different four-byte alignment of the instructions.

Does anyone know of a reference in the manuals that explains this?

This is measuring "Bogus MIPS", and it normally shouldn't matter, but we're passing the "lpj" (Loops per Jiffy) on the kernel command line to make it boot faster, and so we have to have this the same every time we build a kernel.

Tom

fabio_estevam · ‎11-20-2013

Hi Tom,

Tested this on a mx53qsb running at 1GHz with kernel 3.12 and I got:

Calibrating delay loop... 663.55 BogoMIPS (lpj=3317760)

I get almost 1000 BogoMIPS with 2.6.35 though.

It would be nice if you could post your results to the ARM kernel mailing list so that people could comment on this behavior.

Regards,

Fabio Estevam

TomE · ‎11-20-2013

The speed depends on the effectively "random" code placement that comes out of everything that goes into a build. Sometimes the two instructions end up in the same 64-bit word and sometimes they don't. I'm guessing there must be a 64-bit path in the instruction fetch logic somewhere that is acting as a bottleneck.
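To make the "same 64-bit word" idea concrete, here is a minimal sketch that checks whether an instruction and the one after it share an aligned 8-byte block. The 64-bit fetch width is only the guess made above, not something confirmed from the manuals, and the helper name is made up for this example.

```python
def same_fetch_word(first_insn_addr, fetch_bytes=8, insn_bytes=4):
    """True if this instruction and the next one lie in the same aligned block."""
    second_insn_addr = first_insn_addr + insn_bytes
    return first_insn_addr // fetch_bytes == second_insn_addr // fetch_bytes

# __delay at c01415e8 was the "fast" build; c01415e4 is one word lower ("slow").
for addr in (0xC01415E8, 0xC01415E4):
    print(hex(addr), "same 64-bit word" if same_fetch_word(addr) else "split across two")
```

With the addresses from the first post, the "fast" placement keeps "subs" and "bhi" in one 64-bit word and the "slow" placement splits them, which fits the guess but does not prove it.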

I had to add an "align" directive to arch/arm/lib/delay.S to fix the alignment for our builds. You had two kernels behaving in the two ways. I have 14 kernel builds that run "slow" and two that run "fast".

> It would be nice if you could post your results at

I don't have an account there. I see you post there a lot, so please feel free to post this observation/question on my behalf.

Someone else spotted this problem here, didn't know why it was happening and didn't get any responses:

http://thread.gmane.org/gmane.linux.ports.arm.kernel/61420

There's a huge thread here with 80 posts debating whether the "#if 0" code in delay.S should be removed, commented or made a configuration item.

Gmane Loom

Tom

fabio_estevam · ‎11-20-2013

Tom,

Can you please show me the exact location where you added the .align directive?

I would like to try it here as well. It looks interesting ;)

Regards,

Fabio Estevam

TomE · ‎11-20-2013

> Can you please show me the exact location where you added the .align directive?

The code is in arch/arm/lib/delay.S.

Between Kernel 3.5 and 3.6 this file was renamed to "delay-loop.S", so if you're using a newer kernel you'll have to look at that file, and make sure the build is actually using that loop rather than replacing it with the timer-based delay in "delay.c".

The function "__udelay()" falls through (via "__const_udelay()") into "__delay()", so you can't directly align the __delay() function.

I added an "align" directive as shown below. With this code that results in "__delay()" being on an ODD 4-byte boundary and it runs slow. Adding a single "nop" just before "__delay()" changes it to the "fast" behaviour.

/*
 * linux/arch/arm/lib/delay.S
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 */
#include <linux/linkage.h>
#include <asm/assembler.h>
#include <asm/param.h>

		.text
		.align 5	/* __delay() 2/3 speed on odd-32-bit alignment */
.LC0:		.word	loops_per_jiffy
.LC1:		.word	(2199023*HZ)>>11

/*
 * r0 <= 2000
 * lpj <= 0x01ffffff (max. 3355 bogomips)
 * HZ <= 1000
 */
ENTRY(__udelay)
		ldr	r2, .LC1
		mul	r0, r2, r0
ENTRY(__const_udelay)				@ 0 <= r0 <= 0x7fffff06
		mov	r1, #-1
		ldr	r2, .LC0
		ldr	r2, [r2]		@ max = 0x01ffffff
		add	r0, r0, r1, lsr #32-14
		mov	r0, r0, lsr #14		@ max = 0x0001ffff
		add	r2, r2, r1, lsr #32-10
		mov	r2, r2, lsr #10		@ max = 0x00007fff
		mul	r0, r2, r0		@ max = 2^32-1
		add	r0, r0, r1, lsr #32-6
		movs	r0, r0, lsr #6
		moveq	pc, lr

/*
 * loops = r0 * HZ * loops_per_jiffy / 1000000
 *
 * Oh, if only we had a cycle counter...
 */

/* Add one or more nops in here to change the code alignment of __delay() */

@ Delay routine
ENTRY(__delay)
		subs	r0, r0, #1
#if 0
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
#endif
		bhi	__delay
		mov	pc, lr
ENDPROC(__udelay)
ENDPROC(__const_udelay)
ENDPROC(__delay)

I removed the "#if 0" and "#endif" lines above, which enables the unrolled loop, and it got 1273 BogoMIPS at 800MHz. According to the mailing list posts mentioned previously, the unrolled version was apparently meant for old and slow ARM CPUs.
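The statement that the listing leaves "__delay()" on an odd 4-byte boundary can be checked by counting the words emitted after the ".align 5". This is a rough sketch that assumes every item in the listing assembles to exactly one 32-bit word (ARM mode, no literal pools or extra padding); the names are invented for the example.

```python
# Words emitted between ".align 5" and __delay in the listing above.
WORDS_BEFORE_DELAY = (
    2      # .LC0 and .LC1 data words
    + 2    # __udelay: ldr, mul
    + 11   # __const_udelay: mov through moveq
)

def delay_offset(extra_nops=0):
    """Byte offset of __delay from the 32-byte aligned start of the block."""
    return 4 * (WORDS_BEFORE_DELAY + extra_nops)

# The start is 32-byte aligned, so offset % 8 equals address % 8.
for nops in (0, 1):
    off = delay_offset(nops)
    print(nops, hex(off), "odd 4-byte boundary" if off % 8 else "8-byte aligned")
```

With no extra nop the offset is 0x3c (an odd word, the "slow" case), and one nop moves it to 0x40 (the "fast" case), matching the behaviour described above.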

Tom

fabio_estevam · ‎11-21-2013

Hi Tom,

I tested your suggestion here and now I get 996 BogoMIPS for mx53 running at 1GHz.

Why the '.align 5' instead of '.align 4'? In my tests, if I only add the 'nop' I get the faster behaviour. I also tested '.align 4' and got the same faster result.

You don't need to have any special account to post to the ARM kernel list. Just plain text email would suffice.

Also tested on a mx6q (CortexA9) and the reported BogoMIPS did not change.

Of course I could post this there, but I think you are much more familiar with this piece of code.

Just post an RFC patch based on 3.12, so that people can comment. It is important if we can really close this issue.

Thanks,

Fabio Estevam

fabio_estevam · ‎11-21-2013

Tom,

I also noticed that just adding .align 4 before __loop_delay works fine:

diff --git a/arch/arm/lib/delay-loop.S b/arch/arm/lib/delay-loop.S

index 36b668d..756337a 100644

--- a/arch/arm/lib/delay-loop.S

+++ b/arch/arm/lib/delay-loop.S

@@ -41,6 +41,7 @@ ENTRY(__loop_const_udelay) @ 0 <= r0 <= 0x7fffff06

* loops = r0 * HZ * loops_per_jiffy / 1000000

*/

+ .align 4

@ Delay routine

ENTRY(__loop_delay)

subs r0, r0, #1

Does this also work well for you?

Regards,

Fabio Estevam

TomE · ‎11-21-2013

What BogoMIPS values are you getting, and which CPU are you running the tests on? Are you getting the same results I'm getting?

> Also noticed that only adding .align 4 prior to __loop delay works fine:

That requires the linker to pad code sections with NOP instructions. I wasn't sure it would do that, and still can't find any documentation stating what the default fill is (or if and where it is overridden anywhere in the Linux Kernel build system), so I used a simple approach that I knew would work.

Testing it the way you did it shows that code sections are padded with NOPs, so that is a better approach. The following is the output of "arm-cortexa8-linux-gnueabi-objdump -S vmlinux | less" and then searching for "__delay":

c014425c <__udelay>:
c014425c: e51f200c ldr r2, [pc, #-12] ; c0144258 <PRRR+0xc109c0b0>
c0144260: e0000092 mul r0, r2, r0

c0144264 <__const_udelay>:
c0144264: e3e01000 mvn r1, #0
c0144268: e51f201c ldr r2, [pc, #-28] ; c0144254 <PRRR+0xc109c0ac>
c014426c: e5922000 ldr r2, [r2]
c0144270: e0800921 add r0, r0, r1, lsr #18
c0144274: e1a00720 lsr r0, r0, #14
c0144278: e0822b21 add r2, r2, r1, lsr #22
c014427c: e1a02522 lsr r2, r2, #10
c0144280: e0000092 mul r0, r2, r0
c0144284: e0800d21 add r0, r0, r1, lsr #26
c0144288: e1b00320 lsrs r0, r0, #6
c014428c: 01a0f00e moveq pc, lr
c0144290: e320f000 nop {0}
c0144294: e320f000 nop {0}
c0144298: e320f000 nop {0}
c014429c: e320f000 nop {0}

c01442a0 <__delay>:
c01442a0: e2500001 subs r0, r0, #1
c01442a4: 8afffffd bhi c01442a0 <__delay>
c01442a8: e1a0f00e mov pc, lr
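For reference, on ARM the GNU assembler treats the ".align" argument as a power of two, so ".align 4" is 16 bytes and ".align 5" is 32 bytes. The sketch below (helper name made up) works out how many 4-byte NOPs of padding each would need after the "moveq" at c014428c in the dump above.

```python
def align_padding_nops(addr, align_exp, insn_bytes=4):
    """Number of 4-byte NOPs needed to reach the next 2**align_exp boundary."""
    boundary = 1 << align_exp
    pad_bytes = (-addr) % boundary
    return pad_bytes // insn_bytes

next_addr = 0xC014428C + 4  # first free address after "moveq pc, lr"
for exp in (3, 4, 5):
    print(f".align {exp}: {align_padding_nops(next_addr, exp)} nop(s)")
```

The four NOPs in the dump are what a 32-byte boundary (".align 5") would need at that address; c0144290 already sits on a 16-byte boundary, so ".align 4" would have needed none there.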

Tom

fabio_estevam · ‎11-21-2013

Tom,

I tested on mx51 and mx53 and I get the same results that you reported. Running mx53 at 1GHz just scales linearly.

I would like to send an RFC patch to the ARM list and would like to know if I can add your Reported-by tag.

If you are OK with that I will contact you offline so that you provide me your email address.

Regards,

Fabio Estevam

TomE · ‎11-22-2013

> I would like to send a RFC patch to the ARM list

And you have. Thanks for doing that. Going back a bit...

> Why the '.align 5' instead of '.align 4'?

I was also testing larger alignments (8 bytes, 16 bytes, 32 bytes) to see if that made any difference. That code was also ending up near/over a cache-line boundary and I wanted to avoid that.

Would you believe I need guaranteed "pessimal" behavior? I need that code to always run in the slow mode as we've shipped lots of devices with a hardcoded "lpj=nnnn" on the Kernel Command Line that is stored in the Redboot Bootstrap configuration. We can upgrade the Apps and the Kernel in the field, but we can't change that command line. Why did we override this? To make it boot faster. All of 60ms faster.
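A rough sketch of why the mismatch matters: the loop count for a delay is proportional to the lpj the kernel was told, while the time per loop depends on how fast the loop really runs. This assumes the hard-coded value equals the "slow" lpj from the first post and that a new build happens to land on the "fast" alignment; both lpj values are taken from the boot messages at the top, and the function name is invented.

```python
def actual_udelay_us(requested_us, lpj_cmdline, lpj_real):
    """Real duration of a delay when lpj is forced on the command line.

    loops = usecs * HZ * lpj_cmdline / 1000000, and each loop takes
    1 / (HZ * lpj_real) seconds, so HZ cancels out of the result.
    """
    return requested_us * lpj_cmdline / lpj_real

# Command line says "slow" lpj, but the loop actually runs at the "fast" rate.
print(round(actual_udelay_us(100, 2658304, 3977216), 1))
```

Under those assumptions every udelay() would finish in roughly two thirds of the requested time, which is the failure mode the fixed "slow" alignment avoids.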

> Just post a RFC patch based on 3.12,...

Unlikely to affect us or any of your customers, as Freescale is using 2.6 for i.MX53 and 3.0 for i.MX6. We're using code based off 3.4. This change is unlikely to get backported.

Tom

TomE · ‎11-26-2013

I wrote:

> Fabio wrote:

> > I would like to send a RFC patch to the ARM list

> And you have. Thanks for doing that. Going back a bit..

No takers. It looks like the post hasn't had any responses.

So it is still a mystery as to why the CPU does this and whether it has any other implications.

Tom

fabio_estevam · ‎11-26-2013

This is normal. We should allow more time for people to comment.

If no comment is given after a week, then I will ping.

Regards,

Fabio Estevam

fabio_estevam · ‎11-30-2013

Hi Tom,

Good news. Please check: 'Re: [RFC] ARM: lib: delay-loop: Add align directive to fix BogoMIPS calculation' on MARC

Russell is happy with the patch and I will submit it to his patch system.

Would it be possible for you to run the benchmark that Russell suggested?

Thanks,

Fabio Estevam

TomE · ‎11-30-2013

> Would it be possible for you to run the benchmark that Russell suggested?

Not easily. I'd have to compile from source as we're not running a standard "Package" system. It could take days to get it running.

Hackbench is a VERY high level system benchmark. We have found a looped pair of instructions where the CPU execution speed depends on alignment. The alignment of instruction pairs within functions is effectively "random", so for every pair that might get faster with that compiler alignment I'd expect the other 50% (aligned the other way) would get slower and cancel out in the benchmark results. The code size increase of that extra alignment would also make the code slower due to cache issues, so I'd expect hackbench to show slightly slower results with this change.

We know from our testing what this alignment does to the BogoMIPS. We don't know why the ARM CPU is doing this. It is probably so esoteric and artificial (a two instruction loop can't do any useful work really) that it doesn't affect anything except this very public BogoMIPS value. So perhaps "interesting but not really important".

Tom

YixingKong · ‎02-18-2014

Tom

Has your issue been resolved? If yes, we are going to close the discussion in 3 days. If you still need help please feel free to contact Freescale.

Thanks,
Yixing

fabio_estevam · ‎02-19-2014

Yixing,

I have sent a fix for this, which already hit the mainline kernel:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/arch/arm/lib/delay-loop....

Regards,

Fabio Estevam
