GCC Clobbering Branch Predict Bit in CF3 Cores

TomE
Specialist II

I've searched for this apparent bug on gcc.gnu.org/bugzilla, but didn't find anything.

 

Many of those reading this will be using CodeWarrior, and I assume this doesn't apply to you. I'd still be interested in seeing that compiler's generated code for bit testing, though, to see if it ever does this.

 

I'm using:

 

$ m68k-elf-gcc --version

m68k-elf-gcc.exe (Sourcery G++ Lite 4.3-208) 4.3.3

Copyright (C) 2008 Free Software Foundation, Inc.

 

The command line is:

 

m68k-elf-gcc -MD -MF adl3.d0 -gdwarf-2 -mcpu=5329 -Wall -std=c99 -g  -Os ...

 

Note the "-mcpu=5329" line.

 

A small snippet of source code:

 

#define cd_BUS_OFF 0x2000

#define cd_BUS_RWARN 0x1000

#define cd_OVERRUN 0x0008

 

    else if (cc->status & cd_BUS_OFF) {

40109020:       202a 003c       movel %a2@(60),%d0

40109024:       0800 000d       btst #13,%d0

40109028:       6710            beqs 4010903a <comm_check_status+0x58>

(other code)

    } else if (cc->status & cd_BUS_RWARN) {

4010903a:       0800 000c       btst #12,%d0

4010903e:       6604            bnes 40109044 <comm_check_status+0x62>

    } else if (cc->status & cd_OVERRUN) {

40109040:       44c0            movew %d0,%ccr

40109042:       6a02            bpls 40109046 <comm_check_status+0x64>

 

Note the weird word-saving trick in the last compare? It copies the data to the CCR and then tests the "N" bit.

 

It only seems to do this trick for the "N" and "Z" bits (data bits 3 and 2). It should be able to do it for "C" and "V" (bits 0 and 1), but I haven't seen any code doing this (a small test covering all five low bits is sketched after the next listing). I've also seen code like:

 

4012193e:       44c2            movew %d2,%ccr

40121940:       57c0            seq %d0

40121942:       44c2            movew %d2,%ccr

40121944:       5bc1            smi %d1
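
Purely for reference, here is a small test (my own sketch, the "hit()" helper is made up) that exercises each of the five low data bits. The CFPRM puts X, N, Z, V and C in CCR bits 4 down to 0, so after a move to the CCR only the bit-3 and bit-2 tests can be done with bmi/bpl and beq/bne; bits 1 and 0 would need bvs/bvc and bcs/bcc.

/* Sketch for checking which bit tests a compiler turns into the
 * move-to-CCR shortcut.  hit() is a dummy external so the tests
 * don't get optimised away. */
extern void hit(int which);

void check_bits(unsigned int status)
{
    if (status & 0x01) hit(0);   /* bit 0 -> C flag */
    if (status & 0x02) hit(1);   /* bit 1 -> V flag */
    if (status & 0x04) hit(2);   /* bit 2 -> Z flag */
    if (status & 0x08) hit(3);   /* bit 3 -> N flag */
    if (status & 0x10) hit(4);   /* bit 4 -> X flag */
}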

 

What's the problem? Bit 7 of the CCR is documented in the CFPRM and MCF5329 Reference Manual as:

 

    Branch prediction (Version 3 only). Alters the static

    prediction algorithm used by the branch acceleration

    logic in the instruction fetch pipeline on forward

    conditional branches. Refer to a V3 core or device

    user’s manual for further information on this bit.

 

So the use of the "movew %d0,%ccr" instruction is changing the CPU's branch prediction randomly. It shouldn't be doing that.

 

Does anyone know if this was ever fixed?

 

Tom

8 Replies

TomE
Specialist II

So why is this a problem?

With the prediction bit changed, all forward branches (such as the exits out of a loop) will be mis-predicted as taken. That makes them take FIVE times as long to execute as when they are correctly predicted as not taken.

This could make fast inner loops have variable execution times, depending on what the data in the last bit-test instruction happened to have in bit 7.
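
As a made-up illustration of the kind of loop that suffers, the early-exit test in a simple search loop normally compiles to a forward conditional branch that is not taken on most iterations; with the prediction bit flipped, it is predicted taken on every pass instead:

/* Illustration only - a search loop whose "found it" test is a
 * forward branch out of the loop.  With CCR bit 7 left set by an
 * earlier move-to-CCR, that branch is statically predicted taken
 * on every iteration, so each pass pays the mispredict penalty. */
int find_byte(const unsigned char *buf, int len, unsigned char key)
{
    for (int i = 0; i < len; i++) {
        if (buf[i] == key)       /* usually not taken */
            return i;
    }
    return -1;
}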

From the CodeSourcery Mailing List I received the following suggestion:

Have you tried instructing GCC to avoid the use of the CCR register
using "-ffixed-ccr"?

Thank you for that suggestion. Unfortunately:

$ m68k-elf-gcc -c -Os -ffixed-ccr -o mmOsccr m68kbits.c

cc1.exe: warning: unknown register name: ccr

That option works with other register names like "d0", "sp", "a1".

It still did this:

  1a:  44c1            movew %d1,%ccr

  1c:  6a02            bpls 20 <main+0x20>

It doesn't help that the libraries are all full of this construct.

$ m68k-elf-objdump -S libc.a | grep ccr | wc

    89    356    3470

So even if you try and actively bypass or correct for this corruption, lots of the standard C library functions will corrupt it back for you.
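
For what it's worth, the kind of "correction" I mean is just forcing the whole CCR (and with it bit 7) back to a known value at a point where the flags are dead. A minimal, untested sketch, using a data-register source since MOVE to CCR from Dy is supported on all ColdFire ISAs:

/* Untested sketch: clear the CF3 branch-predict bit (CCR bit 7) by
 * rewriting the whole CCR with zero.  This also clears X/N/Z/V/C,
 * so it is only usable where the condition codes are dead - gcc
 * doesn't carry flags across an asm statement, so a standalone
 * call like this should be safe. */
static inline void cf3_clear_branch_predict(void)
{
    unsigned short zero = 0;
    __asm__ __volatile__ ("move.w %0,%%ccr" : : "d" (zero) : "cc");
}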

Then another exchange:

Hmm, too bad.
If it's really an "over-optimization" of the compiler you can try to
disable optimizations (-O0) and check whether the problem persists.

The code gets horrible without any optimisations:

-O2:

   c:   0801 0001       btst #1,%d1

  10:   6702            beqs 14 <main+0x14>

No optimisation, note the TWO redundant instructions.

   8:   7001            moveq #1,%d0

   a:   c0ae 0008       andl %fp@(8),%d0

   e:   1000            moveb %d0,%d0

  10:   4a00            tstb %d0

  12:   6704            beqs 18 <main+0x18>

If it doesn't, turn optimizations back on

Another good suggestion. I'll try that tomorrow.

The problem is with whoever put this misguided "feature" into the CF3 Condition Code Register. They should have known at the time that compilers were free to use this trick, and probably had been doing so since the 68000. The feature should have been in a different special register, or been able to be enabled and disabled in a separate control register.

I'd be interested in seeing the assembly output or disassembly of the following code using "your favourite compiler":

int bitcount(int bits)

{

        int nRes = 0;

        if (bits & 0x01) nRes++;

        if (bits & 0x02) nRes++;

        if (bits & 0x04) nRes++;

        if (bits & 0x08) nRes++;

        if (bits & 0x10) nRes++;

        return nRes;

}

Compiled with the following gcc command line this results in:

$ m68k-elf-gcc --version

m68k-elf-gcc.exe (Sourcery G++ Lite 4.3-208) 4.3.3


$ m68k-elf-gcc -c -mcpu=5235 -O2 -o bits bits.c


$ m68k-elf-objdump -S bits

00000000 <bitcount>:

   0:   7001            moveq #1,%d0

   2:   4e56 0000       linkw %fp,#0

   6:   222e 0008       movel %fp@(8),%d1

   a:   c081            andl %d1,%d0

   c:   0801 0001       btst #1,%d1

  10:   6702            beqs 14 <bitcount+0x14>

  12:   5280            addql #1,%d0

  14:   44c1            movew %d1,%ccr

  16:   6602            bnes 1a <bitcount+0x1a>

  18:   5280            addql #1,%d0

  1a:   44c1            movew %d1,%ccr

  1c:   6a02            bpls 20 <bitcount+0x20>

  1e:   5280            addql #1,%d0

  20:   0801 0004       btst #4,%d1

  24:   6702            beqs 28 <bitcount+0x28>

  26:   5280            addql #1,%d0

  28:   4e5e            unlk %fp

  2a:   4e75            rts

The lines using "btst" are the "normal" bit tests that don't corrupt the CCR. The ones using "movew %d1,%ccr" are the one-word "shortcut".

Tom

matthey
Contributor II

I'd be interested in seeing the assembly output or disassembly of the following code using "your favourite compiler":

int bitcount(int bits)

{

        int nRes = 0;

        if (bits & 0x01) nRes++;

        if (bits & 0x02) nRes++;

        if (bits & 0x04) nRes++;

        if (bits & 0x08) nRes++;

        if (bits & 0x10) nRes++;

        return nRes;

}

Using vbcc I get:

_bitcount:

   subq.l #4,sp

   movem.l d2,(sp)

   move.l (8,sp),d2

   moveq #0,d1

   moveq #1,d0

   and.l d2,d0

   beq.b lab_3e8

   moveq #1,d1

lab_3e8:

   moveq #2,d0

   and.l d2,d0

   beq.b lab_3f0

   addq.l #1,d1

lab_3f0:

   moveq #4,d0

   and.l d2,d0

   beq.b lab_3f8

   addq.l #1,d1

lab_3f8:

   moveq #8,d0

   and.l d2,d0

   beq.b lab_400

   addq.l #1,d1

lab_400:

   moveq #$10,d0

   and.l d2,d0

   beq.b lab_408

   addq.l #1,d1

lab_408:

   move.l d1,d0

   movem.l (sp),d2

   addq.l #4,sp

   rts

I count 62 bytes so it did ok in size and avoids writing to the CCR. Performance would be better on a superscalar CPU like the 68060 or CF v5 with early instruction completion and forwarding of MOVEQ.

A couple of notes about vbcc:

It uses the stack instead of a stack frame by default as it generates better code (especially on the 68k with a MOVEM supporting pre-decrement and post-increment).

Vbcc has the best 68k (and maybe ColdFire) peephole optimizing assembler ever in vasm.

The vclib link library and inlines for vbcc are written by 68k enthusiasts and are becoming more optimized. Send in your optimized CF code to have it included.

The 68k/ColdFire backend is too simple and needs improvement but bugs are fixed. There is no instruction scheduler yet.

The original idea of vbcc was to let the peephole optimizing assembler optimize instructions like the MOVEM->MOVE above, but it can't for a data register because of the CC flags (MOVEM doesn't touch the condition codes, while MOVE does).

The source code for vbcc, vasm and vlink is online and there are few dependencies. I compile them with themselves on my 68060 Amiga.

I was able to compile with -cpu=5329 to generate CF code into my Amiga Hunk format executable as the backend for 68k and ColdFire is shared.

Vbcc was originally targeted at embedded but has become popular for retro computers and processors also. C99 support is pretty good and improving.

The Atari (ColdFire) Firebee is supported and an Amiga FPGA 68k processor should be mostly ColdFire compatible and at least partially supported.

TomE
Specialist II

Thanks for providing that example. Gcc is being a little "too smart" here.

The most amazing piece of code I've seen an older version of gcc generate was a "multiply by three" in ONE instruction, WITHOUT using "MUL" or the more usual shift-and-add sequence. "MUL" took 70 clocks on the 68000 and 41 clocks on a 68020, while shifts and adds take a minimum of 3 instructions and need another register. Instead it generated an "lea (a1,a1.l*2),a1" instruction!

> I count 62 bytes so it did ok in size and avoids writing to the CCR.

If gcc used the proper "btst" throughout, the function would be 46 bytes. That's 74% of what vbcc managed. Vbcc also takes four 16-bit instructions per test where gcc takes three (two 16-bit, one 32-bit), so the gcc sequence would also be faster on most CPUs.

I've just noticed that this corruption of the MCF53xx branch prediction is bad, but not as bad as it could be. It "only" reverses the prediction of Forward branches. The prediction of backward branches is unchanged. Still, that would slow down a lot of code.

Tom

TomE
Specialist II

I'm profiling some code to see why the serial port transmit interrupt routine is taking so much longer than it should. I was able to get it 30% faster by getting rid of a common interrupt routine that was being called with "interface indices" by the service routines for UARTs 0, 1 and 2, because the code was spending way too much time indexing various arrays. Then the profiling showed the CPU spending 8 times longer on one instruction than I expected it should. The specific instruction is "movew %d1,%ccr". So why is that (seemingly) taking so much longer?

Gcc is doing that to replace a two-word "btst" instruction with a one-word instruction. That might be a bit faster on a 68000 [1], but on a ColdFire, the immediate "btst" instruction takes ONE clock, so can't be replaced by anything faster.

The replacement "movew %d1,%ccr" instruction takes (now reading the Reference Manuals properly...) ONE clock on MCF51, MCF52, MCF53 and MCF54. So no difference (except being two bytes shorter and corrupting Branch Predict on MCF53).

Note 1: And how about the older chips? Surely this "trick" was written when going from 4 bytes to 2 bytes (for the instruction change) meant something?

68000 and 68010 took 10 clocks for BTST and 12 clocks for the CCR replacement. So the replacement would be slower.

68EC020 takes one clock for BTST and 4 clocks for the CCR replacement. Slower again.

So why do my measurements show it taking 8 times longer on that instruction? That's probably something weird in the profiling.

Tom

yibbidy
Contributor V

I know this is an old post, but I thought you might still be interested.

First of all, I tried my later version of CodeSourcery GCC and it produces output identical to yours. My version is:

m68k-elf-cpp (Sourcery CodeBench 2012.09-80) 4.7.2

Using CodeWarrior 10.2 for Linux, which uses the Metrowerks compiler, with no optimisations:

0x00000000                    _bitcount:

0x00000000  0x4E560000               link     a6,#0

0x00000004  0x518F                   subq.l   #8,a7

0x00000006  0x2D40FFF8               move.l   d0,-8(a6)

0x0000000A  0x7000                   moveq    #0,d0

0x0000000C  0x2D40FFFC               move.l   d0,-4(a6)

0x00000010  0x202EFFF8               move.l   -8(a6),d0

0x00000014  0x08000000               btst     #0,d0

0x00000018  0x670A                   beq.s    *+12                  ; 0x00000024

0x0000001A  0x202EFFFC               move.l   -4(a6),d0

0x0000001E  0x5280                   addq.l   #1,d0

0x00000020  0x2D40FFFC               move.l   d0,-4(a6)

0x00000024  0x202EFFF8               move.l   -8(a6),d0

0x00000028  0x08000001               btst     #1,d0

0x0000002C  0x670A                   beq.s    *+12                  ; 0x00000038

0x0000002E  0x202EFFFC               move.l   -4(a6),d0

0x00000032  0x5280                   addq.l   #1,d0

0x00000034  0x2D40FFFC               move.l   d0,-4(a6)

0x00000038  0x202EFFF8               move.l   -8(a6),d0

0x0000003C  0x08000002               btst     #2,d0

0x00000040  0x670A                   beq.s    *+12                  ; 0x0000004c

0x00000042  0x202EFFFC               move.l   -4(a6),d0

0x00000046  0x5280                   addq.l   #1,d0

0x00000048  0x2D40FFFC               move.l   d0,-4(a6)

0x0000004C  0x202EFFF8               move.l   -8(a6),d0

0x00000050  0x08000003               btst     #3,d0

0x00000054  0x670A                   beq.s    *+12                  ; 0x00000060

0x00000056  0x202EFFFC               move.l   -4(a6),d0

0x0000005A  0x5280                   addq.l   #1,d0

0x0000005C  0x2D40FFFC               move.l   d0,-4(a6)

0x00000060  0x202EFFF8               move.l   -8(a6),d0

0x00000064  0x08000004               btst     #4,d0

0x00000068  0x670A                   beq.s    *+12                  ; 0x00000074

0x0000006A  0x202EFFFC               move.l   -4(a6),d0

0x0000006E  0x5280                   addq.l   #1,d0

0x00000070  0x2D40FFFC               move.l   d0,-4(a6)

0x00000074  0x202EFFFC               move.l   -4(a6),d0

0x00000078  0x4E5E                   unlk     a6

0x0000007A  0x4E75                   rts    

Same again but with optimisation set to 2:

0x00000000                    _bitcount:

0x00000000  0x4E560000               link     a6,#0

0x00000004  0x518F                   subq.l   #8,a7

0x00000006  0x2D40FFF8               move.l   d0,-8(a6)

0x0000000A  0x7200                   moveq    #0,d1

0x0000000C  0x2D41FFFC               move.l   d1,-4(a6)

0x00000010  0x08000000               btst     #0,d0

0x00000014  0x6708                   beq.s    *+10                  ; 0x0000001e

0x00000016  0x7000                   moveq    #0,d0

0x00000018  0x7001                   moveq    #1,d0

0x0000001A  0x2D40FFFC               move.l   d0,-4(a6)

0x0000001E  0x202EFFF8               move.l   -8(a6),d0

0x00000022  0x08000001               btst     #1,d0

0x00000026  0x670A                   beq.s    *+12                  ; 0x00000032

0x00000028  0x202EFFFC               move.l   -4(a6),d0

0x0000002C  0x5280                   addq.l   #1,d0

0x0000002E  0x2D40FFFC               move.l   d0,-4(a6)

0x00000032  0x202EFFF8               move.l   -8(a6),d0

0x00000036  0x08000002               btst     #2,d0

0x0000003A  0x670A                   beq.s    *+12                  ; 0x00000046

0x0000003C  0x202EFFFC               move.l   -4(a6),d0

0x00000040  0x5280                   addq.l   #1,d0

0x00000042  0x2D40FFFC               move.l   d0,-4(a6)

0x00000046  0x202EFFF8               move.l   -8(a6),d0

0x0000004A  0x08000003               btst     #3,d0

0x0000004E  0x670A                   beq.s    *+12                  ; 0x0000005a

0x00000050  0x202EFFFC               move.l   -4(a6),d0

0x00000054  0x5280                   addq.l   #1,d0

0x00000056  0x2D40FFFC               move.l   d0,-4(a6)

0x0000005A  0x202EFFF8               move.l   -8(a6),d0

0x0000005E  0x08000004               btst     #4,d0

0x00000062  0x670A                   beq.s    *+12                  ; 0x0000006e

0x00000064  0x202EFFFC               move.l   -4(a6),d0

0x00000068  0x5280                   addq.l   #1,d0

0x0000006A  0x2D40FFFC               move.l   d0,-4(a6)

0x0000006E  0x202EFFFC               move.l   -4(a6),d0

0x00000072  0x4E5E                   unlk     a6

0x00000074  0x4E75                   rts    

Same again with optimisation set to 4:

0x00000000                    _bitcount:

0x00000000  0x4E560000               link     a6,#0

0x00000004  0x518F                   subq.l   #8,a7

0x00000006  0x2D40FFF8               move.l   d0,-8(a6)

0x0000000A  0x7200                   moveq    #0,d1

0x0000000C  0x2D41FFFC               move.l   d1,-4(a6)

0x00000010  0x08000000               btst     #0,d0

0x00000014  0x6706                   beq.s    *+8                   ; 0x0000001c

0x00000016  0x7201                   moveq    #1,d1

0x00000018  0x2D41FFFC               move.l   d1,-4(a6)

0x0000001C  0x202EFFF8               move.l   -8(a6),d0

0x00000020  0x08000001               btst     #1,d0

0x00000024  0x670A                   beq.s    *+12                  ; 0x00000030

0x00000026  0x202EFFFC               move.l   -4(a6),d0

0x0000002A  0x5280                   addq.l   #1,d0

0x0000002C  0x2D40FFFC               move.l   d0,-4(a6)

0x00000030  0x202EFFF8               move.l   -8(a6),d0

0x00000034  0x08000002               btst     #2,d0

0x00000038  0x670A                   beq.s    *+12                  ; 0x00000044

0x0000003A  0x202EFFFC               move.l   -4(a6),d0

0x0000003E  0x5280                   addq.l   #1,d0

0x00000040  0x2D40FFFC               move.l   d0,-4(a6)

0x00000044  0x202EFFF8               move.l   -8(a6),d0

0x00000048  0x08000003               btst     #3,d0

0x0000004C  0x670A                   beq.s    *+12                  ; 0x00000058

0x0000004E  0x202EFFFC               move.l   -4(a6),d0

0x00000052  0x5280                   addq.l   #1,d0

0x00000054  0x2D40FFFC               move.l   d0,-4(a6)

0x00000058  0x202EFFF8               move.l   -8(a6),d0

0x0000005C  0x08000004               btst     #4,d0

0x00000060  0x670A                   beq.s    *+12                  ; 0x0000006c

0x00000062  0x202EFFFC               move.l   -4(a6),d0

0x00000066  0x5280                   addq.l   #1,d0

0x00000068  0x2D40FFFC               move.l   d0,-4(a6)

0x0000006C  0x202EFFFC               move.l   -4(a6),d0

0x00000070  0x4E5E                   unlk     a6

0x00000072  0x4E75                   rts    

Interestingly, I tried using the Netburner GCC-based compiler and found this:

00000000 <bitcount>:

   0:    4e56 0000      linkw %fp,#0

   4:    222e 0008      movel %fp@(8),%d1

   8:    7001           moveq #1,%d0

   a:    c081           andl %d1,%d0

   c:    0801 0001      btst #1,%d1

  10:    6702           beqs 14 <bitcount+0x14>

  12:    5280           addql #1,%d0

  14:    0801 0002      btst #2,%d1

  18:    6702           beqs 1c <bitcount+0x1c>

  1a:    5280           addql #1,%d0

  1c:    0801 0003      btst #3,%d1

  20:    6702           beqs 24 <bitcount+0x24>

  22:    5280           addql #1,%d0

  24:    0801 0004      btst #4,%d1

  28:    6702           beqs 2c <bitcount+0x2c>

  2a:    5280           addql #1,%d0

  2c:    4e5e           unlk %fp

  2e:    4e75           rts

Netburner MacOS tools version string:

$ m68k-elf-gcc --version

m68k-elf-gcc (GCC) 4.2.1

The GCC option in the Codewarrior IDE is only for ARM targets.

Shaun

TomE
Specialist II

Shaun James wrote:

> Using the Codewarrior 10.2 for Linux which uses the Metrowerks compiler,

Thank you very much for running these tests. There are three interesting results.

Firstly it looks like this "misfeature" wasn't in gcc 4.2.1 (July 2007), but was added before 4.3.3 (Jan 2009) and is still there in 4.7.2 (Sep 2012).

Secondly, Metrowerks doesn't do this.

Thirdly, in your tests Metrowerks is generating 2.5 times as much code as gcc does - 114 bytes versus 46 bytes. Are you sure the optimisation was set properly and there wasn't something else (like debugging) enabled to make it generate code like that?

Tom

yibbidy
Contributor V

I think you are right; there must be some compiler option switch somewhere in the CW GUI that I am not finding. Here is the output from CodeWarrior when the project is created with the "wizard" set for full optimisation from the start:

0x00000000               _bitcount:
0x00000000  0x4E560000          link     a6,#0
0x00000004  0x7200              moveq    #0,d1
0x00000006  0x08000000          btst     #0,d0
0x0000000A  0x6702              beq.s    *+4              ; 0x0000000e
0x0000000C  0x7201              moveq    #1,d1
0x0000000E  0x08000001          btst     #1,d0
0x00000012  0x6702              beq.s    *+4              ; 0x00000016
0x00000014  0x5281              addq.l   #1,d1
0x00000016  0x08000002          btst     #2,d0
0x0000001A  0x6702              beq.s    *+4              ; 0x0000001e
0x0000001C  0x5281              addq.l   #1,d1
0x0000001E  0x08000003          btst     #3,d0
0x00000022  0x6702              beq.s    *+4              ; 0x00000026
0x00000024  0x5281              addq.l   #1,d1
0x00000026  0x08000004          btst     #4,d0
0x0000002A  0x6702              beq.s    *+4              ; 0x0000002e
0x0000002C  0x5281              addq.l   #1,d1
0x0000002E  0x2001              move.l   d1,d0
0x00000030  0x4E5E              unlk     a6
0x00000032  0x4E75              rts

You can see from the three CodeWarrior tests in the previous post that the code gets smaller by a few bytes for each increase in optimisation level, so changing the optimisation setting is definitely doing "something". The project for those tests was initially created with no optimisation and I increased it later; I've always started projects that way, then turned the optimisation up afterwards. It seems I've released quite a bit of bloated code :(. Anyway, it's nice to see that CW is not actually generating code 2.5 times larger than GCC.

TomE
Specialist II

> I'd be interested in seeing the assembly output or disassembly of the following code using "your favourite compiler":

According to this post, CodeWarrior 10.4 has a "gcc option".

https://community.freescale.com/thread/307204

I don't have CW. Does this mean there's an option to use either compiler for a project?

I would be interested to see if the gcc compiler provided by Freescale generates code with this problem for the CF3 parts.

Also what code the Freescale tools generate for the same source.

Tom
