GCC Clobbering Branch Predict Bit in CF3 Cores

TomE
Specialist II

I've searched for this apparent bug on gcc.gnu.org/bugzilla, but didn't find anything.

 

Many of those reading this will be using CodeWarrior, and I assume this doesn't apply to you. I'd still be interested in seeing that compiler's generated code for bit testing, though, to see if it ever does this.

 

I'm using:

 

$ m68k-elf-gcc --version

m68k-elf-gcc.exe (Sourcery G++ Lite 4.3-208) 4.3.3

Copyright (C) 2008 Free Software Foundation, Inc.

 

The command line is:

 

m68k-elf-gcc -MD -MF adl3.d0 -gdwarf-2 -mcpu=5329 -Wall -std=c99 -g  -Os ...

 

Note the "-mcpu=5329" line.

 

A small snippet of source code:

 

#define cd_BUS_OFF 0x2000

#define cd_BUS_RWARN 0x1000

#define cd_OVERRUN 0x0008

 

    else if (cc->status & cd_BUS_OFF) {

40109020:       202a 003c       movel %a2@(60),%d0

40109024:       0800 000d       btst #13,%d0

40109028:       6710            beqs 4010903a <comm_check_status+0x58>

(other code)

    } else if (cc->status & cd_BUS_RWARN) {

4010903a:       0800 000c       btst #12,%d0

4010903e:       6604            bnes 40109044 <comm_check_status+0x62>

    } else if (cc->status & cd_OVERRUN) {

40109040:       44c0            movew %d0,%ccr

40109042:       6a02            bpls 40109046 <comm_check_status+0x64>

 

Note the weird word-saving trick in the last compare? It copies the data to the CCR and then tests the "N" bit.

 

It only seems to do this trick for the "N" and "Z" bits (data bits 3 and 2). It should be able to do it for "C" and "V" (bits 0 and 1), but I haven't seen any code doing this (a small test covering all five low bits is sketched after the next listing). I've also seen code like:

 

4012193e:       44c2            movew %d2,%ccr

40121940:       57c0            seq %d0

40121942:       44c2            movew %d2,%ccr

40121944:       5bc1            smi %d1
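
Purely for reference, here is a small test (my own sketch, the "hit()" helper is made up) that exercises each of the five low data bits. The CFPRM puts X, N, Z, V and C in CCR bits 4 down to 0, so after a move to the CCR only the bit-3 and bit-2 tests can be done with bmi/bpl and beq/bne; bits 1 and 0 would need bvs/bvc and bcs/bcc.

/* Sketch for checking which bit tests a compiler turns into the
 * move-to-CCR shortcut.  hit() is a dummy external so the tests
 * don't get optimised away. */
extern void hit(int which);

void check_bits(unsigned int status)
{
    if (status & 0x01) hit(0);   /* bit 0 -> C flag */
    if (status & 0x02) hit(1);   /* bit 1 -> V flag */
    if (status & 0x04) hit(2);   /* bit 2 -> Z flag */
    if (status & 0x08) hit(3);   /* bit 3 -> N flag */
    if (status & 0x10) hit(4);   /* bit 4 -> X flag */
}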

 

What's the problem? Bit 7 of the CCR is documented in the CFPRM and MCF5329 Reference Manual as:

 

    Branch prediction (Version 3 only). Alters the static

    prediction algorithm used by the branch acceleration

    logic in the instruction fetch pipeline on forward

    conditional branches. Refer to a V3 core or device

    user’s manual for further information on this bit.

 

So the use of the "movew %d0,%ccr" instruction is changing the CPU's branch prediction randomly. It shouldn't be doing that.

 

Does anyone know if this was ever fixed?

 

Tom

8 Replies

TomE
Specialist II

So why is this a problem?

With the prediction bit changed, all forward branches (such as the exits out of a loop) will be mis-predicted as taken. That makes them take FIVE times as long to execute as when they are correctly predicted as not taken.

This could make fast inner loops have variable execution times, depending on what the data in the last bit-test instruction happened to have in bit 7.
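
As a made-up illustration of the kind of loop that suffers, the early-exit test in a simple search loop normally compiles to a forward conditional branch that is not taken on most iterations; with the prediction bit flipped, it is predicted taken on every pass instead:

/* Illustration only - a search loop whose "found it" test is a
 * forward branch out of the loop.  With CCR bit 7 left set by an
 * earlier move-to-CCR, that branch is statically predicted taken
 * on every iteration, so each pass pays the mispredict penalty. */
int find_byte(const unsigned char *buf, int len, unsigned char key)
{
    for (int i = 0; i < len; i++) {
        if (buf[i] == key)       /* usually not taken */
            return i;
    }
    return -1;
}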

From the CodeSourcery Mailing List I received the following suggestion:

Have you tried instructing GCC to avoid the use of the CCR register
using "-ffixed-ccr"?

Thank you for that suggestion. Unfortunately:

$ m68k-elf-gcc -c -Os -ffixed-ccr -o mmOsccr m68kbits.c

cc1.exe: warning: unknown register name: ccr

That option works with other register names like "d0", "sp", "a1".

It still did this:

  1a:  44c1            movew %d1,%ccr

  1c:  6a02            bpls 20 <main+0x20>

It doesn't help that the libraries are all full of this construct.

$ m68k-elf-objdump -S libc.a | grep ccr | wc

    89    356    3470

So even if you try and actively bypass or correct for this corruption, lots of the standard C library functions will corrupt it back for you.
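
For what it's worth, the kind of "correction" I mean is just forcing the whole CCR (and with it bit 7) back to a known value at a point where the flags are dead. A minimal, untested sketch, using a data-register source since MOVE to CCR from Dy is supported on all ColdFire ISAs:

/* Untested sketch: clear the CF3 branch-predict bit (CCR bit 7) by
 * rewriting the whole CCR with zero.  This also clears X/N/Z/V/C,
 * so it is only usable where the condition codes are dead - gcc
 * doesn't carry flags across an asm statement, so a standalone
 * call like this should be safe. */
static inline void cf3_clear_branch_predict(void)
{
    unsigned short zero = 0;
    __asm__ __volatile__ ("move.w %0,%%ccr" : : "d" (zero) : "cc");
}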

Then another exchange:

Hmm, too bad.
If it's really an "over-optimization" of the compiler you can try to
disable optimizations (-O0) and check whether the problem persists.

The code gets horrible without any optimisations:

-O2:

   c:   0801 0001       btst #1,%d1

  10:   6702            beqs 14 <main+0x14>

No optimisation, note the TWO redundant instructions.

   8:   7001            moveq #1,%d0

   a:   c0ae 0008       andl %fp@(8),%d0

   e:   1000            moveb %d0,%d0

  10:   4a00            tstb %d0

  12:   6704            beqs 18 <main+0x18>

If it doesn't, turn optimizations back on

Another good suggestion. I'll try that tomorrow.

The problem is with whoever put this misguided "feature" into the CF3 Condition Code Register. They should have known at the time that compilers were free to use this trick, and probably had been doing so since the 68000. The feature should have been in a different special register, or been able to be enabled and disabled in a separate control register.

I'd be interested in seeing the assembly output or disassembly of the following code using "your favourite compiler":

int bitcount(int bits)

{

        int nRes = 0;

        if (bits & 0x01) nRes++;

        if (bits & 0x02) nRes++;

        if (bits & 0x04) nRes++;

        if (bits & 0x08) nRes++;

        if (bits & 0x10) nRes++;

        return nRes;

}

Compiled with the following gcc command line this results in:

$ m68k-elf-gcc --version

m68k-elf-gcc.exe (Sourcery G++ Lite 4.3-208) 4.3.3


$ m68k-elf-gcc -c -mcpu=5235 -O2 -o bits bits.c


$ m68k-elf-objdump -S bits

00000000 <bitcount>:

   0:   7001            moveq #1,%d0

   2:   4e56 0000       linkw %fp,#0

   6:   222e 0008       movel %fp@(8),%d1

   a:   c081            andl %d1,%d0

   c:   0801 0001       btst #1,%d1

  10:   6702            beqs 14 <bitcount+0x14>

  12:   5280            addql #1,%d0

  14:   44c1            movew %d1,%ccr

  16:   6602            bnes 1a <bitcount+0x1a>

  18:   5280            addql #1,%d0

  1a:   44c1            movew %d1,%ccr

  1c:   6a02            bpls 20 <bitcount+0x20>

  1e:   5280            addql #1,%d0

  20:   0801 0004       btst #4,%d1

  24:   6702            beqs 28 <bitcount+0x28>

  26:   5280            addql #1,%d0

  28:   4e5e            unlk %fp

  2a:   4e75            rts

The lines using "btst" are the "normal" bit tests that don't corrupt the CCR. The ones using "movew %d1,%ccr" are the one-word "shortcut".

Tom

matthey
Contributor II

I'd be interested in seeing the assembly output or disassembly of the following code using "your favourite compiler":

int bitcount(int bits)

{

        int nRes = 0;

        if (bits & 0x01) nRes++;

        if (bits & 0x02) nRes++;

        if (bits & 0x04) nRes++;

        if (bits & 0x08) nRes++;

        if (bits & 0x10) nRes++;

        return nRes;

}

Using vbcc I get:

_bitcount:

   subq.l #4,sp

   movem.l d2,(sp)

   move.l (8,sp),d2

   moveq #0,d1

   moveq #1,d0

   and.l d2,d0

   beq.b lab_3e8

   moveq #1,d1

lab_3e8:

   moveq #2,d0

   and.l d2,d0

   beq.b lab_3f0

   addq.l #1,d1

lab_3f0:

   moveq #4,d0

   and.l d2,d0

   beq.b lab_3f8

   addq.l #1,d1

lab_3f8:

   moveq #8,d0

   and.l d2,d0

   beq.b lab_400

   addq.l #1,d1

lab_400:

   moveq #$10,d0

   and.l d2,d0

   beq.b lab_408

   addq.l #1,d1

lab_408:

   move.l d1,d0

   movem.l (sp),d2

   addq.l #4,sp

   rts

I count 62 bytes so it did ok in size and avoids writing to the CCR. Performance would be better on a superscalar CPU like the 68060 or CF v5 with early instruction completion and forwarding of MOVEQ.

A couple of notes about vbcc:

It uses the stack instead of a stack frame by default as it generates better code (especially on the 68k with a MOVEM supporting pre-decrement and post-increment).

Vbcc has the best 68k (and maybe ColdFire) peephole optimizing assembler ever in vasm.

The vclib link library and inlines for vbcc are written by 68k enthusiasts and are becoming more optimized. Send in your optimized CF code to have it included.

The 68k/ColdFire backend is too simple and needs improvement but bugs are fixed. There is no instruction scheduler yet.

The original idea of vbcc was to let the peephole optimizing assembler optimize instructions like the MOVEM->MOVE above, but it can't for a data register because of the CC flags (MOVEM doesn't touch the condition codes, while MOVE does).

The source code for vbcc, vasm and vlink is online and there are few dependencies. I compile them with themselves on my 68060 Amiga.

I was able to compile with -cpu=5329 to generate CF code into my Amiga Hunk format executable as the backend for 68k and ColdFire is shared.

Vbcc was originally targeted at embedded but has become popular for retro computers and processors also. C99 support is pretty good and improving.

The Atari (ColdFire) Firebee is supported and an Amiga FPGA 68k processor should be mostly ColdFire compatible and at least partially supported.

TomE
Specialist II

Thanks for providing that example. Gcc is being a little "too smart" here.

The most amazing piece of code I've seen an older version of gcc generate was a "multiply by three" in ONE instruction, WITHOUT using "MUL" or the more usual shift-and-add sequence. "MUL" took 70 clocks on the 68000 and 41 clocks on a 68020, while shifts and adds take a minimum of 3 instructions and need another register. Instead it generated an "lea (a1,a1.l*2),a1" instruction!

> I count 62 bytes so it did ok in size and avoids writing to the CCR.

If gcc used the proper "btst" throughout, the function would be 46 bytes. That's 74% of what vbcc managed. Vbcc also takes four 16-bit instructions per test where gcc takes three (two 16-bit, one 32-bit), so the gcc sequence would also be faster on most CPUs.

I've just noticed that this corruption of the MCF53xx branch prediction is bad, but not as bad as it could be. It "only" reverses the prediction of Forward branches. The prediction of backward branches is unchanged. Still, that would slow down a lot of code.

Tom

TomE
Specialist II

I'm profiling some code to see why the serial port transmit interrupt routine is taking so much longer than it should. I was able to get it 30% faster by getting rid of a common interrupt routine that was being called with "interface indices" by the service routines for UARTs 0, 1 and 2, because the code was spending way too much time indexing various arrays. Then the profiling showed the CPU spending 8 times longer on one instruction than I expected it should. The specific instruction is "movew %d1,%ccr". So why is that (seemingly) taking so much longer?

Gcc is doing that to replace a two-word "btst" instruction with a one-word instruction. That might be a bit faster on a 68000 [1], but on a ColdFire, the immediate "btst" instruction takes ONE clock, so can't be replaced by anything faster.

The replacement "movew %d1,%ccr" instruction takes (now reading the Reference Manuals properly...) ONE clock on MCF51, MCF52, MCF53 and MCF54. So no difference (except being two bytes shorter and corrupting Branch Predict on MCF53).

Note 1: And how about the older chips? Surely this "trick" was written when going from 4 bytes to 2 bytes (for the instruction change) meant something?

68000 and 68010 took 10 clocks for BTST and 12 clocks for the CCR replacement. So the replacement would be slower.

68EC020 takes one clock for BTST and 4 clocks for the CCR replacement. Slower again.

So why do my measurements show it taking 8 times longer on that instruction? That's probably something weird in the profiling.

Tom

yibbidy
Contributor V

I know this is an old post, but I thought you might still be interested.

First of all, I tried my later version of CodeSourcery GCC and it produces output identical to yours. My version is:

m68k-elf-cpp (Sourcery CodeBench 2012.09-80) 4.7.2

Using CodeWarrior 10.2 for Linux, which uses the Metrowerks compiler, with no optimisations:

0x00000000                    _bitcount:

0x00000000  0x4E560000               link     a6,#0

0x00000004  0x518F                   subq.l   #8,a7

0x00000006  0x2D40FFF8               move.l   d0,-8(a6)

0x0000000A  0x7000                   moveq    #0,d0

0x0000000C  0x2D40FFFC               move.l   d0,-4(a6)

0x00000010  0x202EFFF8               move.l   -8(a6),d0

0x00000014  0x08000000               btst     #0,d0

0x00000018  0x670A                   beq.s    *+12                  ; 0x00000024

0x0000001A  0x202EFFFC               move.l   -4(a6),d0

0x0000001E  0x5280                   addq.l   #1,d0

0x00000020  0x2D40FFFC               move.l   d0,-4(a6)

0x00000024  0x202EFFF8               move.l   -8(a6),d0

0x00000028  0x08000001               btst     #1,d0

0x0000002C  0x670A                   beq.s    *+12                  ; 0x00000038

0x0000002E  0x202EFFFC               move.l   -4(a6),d0

0x00000032  0x5280                   addq.l   #1,d0

0x00000034  0x2D40FFFC               move.l   d0,-4(a6)

0x00000038  0x202EFFF8               move.l   -8(a6),d0

0x0000003C  0x08000002               btst     #2,d0

0x00000040  0x670A                   beq.s    *+12                  ; 0x0000004c

0x00000042  0x202EFFFC               move.l   -4(a6),d0

0x00000046  0x5280                   addq.l   #1,d0

0x00000048  0x2D40FFFC               move.l   d0,-4(a6)

0x0000004C  0x202EFFF8               move.l   -8(a6),d0

0x00000050  0x08000003               btst     #3,d0

0x00000054  0x670A                   beq.s    *+12                  ; 0x00000060

0x00000056  0x202EFFFC               move.l   -4(a6),d0

0x0000005A  0x5280                   addq.l   #1,d0

0x0000005C  0x2D40FFFC               move.l   d0,-4(a6)

0x00000060  0x202EFFF8               move.l   -8(a6),d0

0x00000064  0x08000004               btst     #4,d0

0x00000068  0x670A                   beq.s    *+12                  ; 0x00000074

0x0000006A  0x202EFFFC               move.l   -4(a6),d0

0x0000006E  0x5280                   addq.l   #1,d0

0x00000070  0x2D40FFFC               move.l   d0,-4(a6)

0x00000074  0x202EFFFC               move.l   -4(a6),d0

0x00000078  0x4E5E                   unlk     a6

0x0000007A  0x4E75                   rts    

Same again but with optimisation set to 2:

0x00000000                    _bitcount:

0x00000000  0x4E560000               link     a6,#0

0x00000004  0x518F                   subq.l   #8,a7

0x00000006  0x2D40FFF8               move.l   d0,-8(a6)

0x0000000A  0x7200                   moveq    #0,d1

0x0000000C  0x2D41FFFC               move.l   d1,-4(a6)

0x00000010  0x08000000               btst     #0,d0

0x00000014  0x6708                   beq.s    *+10                  ; 0x0000001e

0x00000016  0x7000                   moveq    #0,d0

0x00000018  0x7001                   moveq    #1,d0

0x0000001A  0x2D40FFFC               move.l   d0,-4(a6)

0x0000001E  0x202EFFF8               move.l   -8(a6),d0

0x00000022  0x08000001               btst     #1,d0

0x00000026  0x670A                   beq.s    *+12                  ; 0x00000032

0x00000028  0x202EFFFC               move.l   -4(a6),d0

0x0000002C  0x5280                   addq.l   #1,d0

0x0000002E  0x2D40FFFC               move.l   d0,-4(a6)

0x00000032  0x202EFFF8               move.l   -8(a6),d0

0x00000036  0x08000002               btst     #2,d0

0x0000003A  0x670A                   beq.s    *+12                  ; 0x00000046

0x0000003C  0x202EFFFC               move.l   -4(a6),d0

0x00000040  0x5280                   addq.l   #1,d0

0x00000042  0x2D40FFFC               move.l   d0,-4(a6)

0x00000046  0x202EFFF8               move.l   -8(a6),d0

0x0000004A  0x08000003               btst     #3,d0

0x0000004E  0x670A                   beq.s    *+12                  ; 0x0000005a

0x00000050  0x202EFFFC               move.l   -4(a6),d0

0x00000054  0x5280                   addq.l   #1,d0

0x00000056  0x2D40FFFC               move.l   d0,-4(a6)

0x0000005A  0x202EFFF8               move.l   -8(a6),d0

0x0000005E  0x08000004               btst     #4,d0

0x00000062  0x670A                   beq.s    *+12                  ; 0x0000006e

0x00000064  0x202EFFFC               move.l   -4(a6),d0

0x00000068  0x5280                   addq.l   #1,d0

0x0000006A  0x2D40FFFC               move.l   d0,-4(a6)

0x0000006E  0x202EFFFC               move.l   -4(a6),d0

0x00000072  0x4E5E                   unlk     a6

0x00000074  0x4E75                   rts    

Same again with optimisation set to 4:

0x00000000                    _bitcount:

0x00000000  0x4E560000               link     a6,#0

0x00000004  0x518F                   subq.l   #8,a7

0x00000006  0x2D40FFF8               move.l   d0,-8(a6)

0x0000000A  0x7200                   moveq    #0,d1

0x0000000C  0x2D41FFFC               move.l   d1,-4(a6)

0x00000010  0x08000000               btst     #0,d0

0x00000014  0x6706                   beq.s    *+8                   ; 0x0000001c

0x00000016  0x7201                   moveq    #1,d1

0x00000018  0x2D41FFFC               move.l   d1,-4(a6)

0x0000001C  0x202EFFF8               move.l   -8(a6),d0

0x00000020  0x08000001               btst     #1,d0

0x00000024  0x670A                   beq.s    *+12                  ; 0x00000030

0x00000026  0x202EFFFC               move.l   -4(a6),d0

0x0000002A  0x5280                   addq.l   #1,d0

0x0000002C  0x2D40FFFC               move.l   d0,-4(a6)

0x00000030  0x202EFFF8               move.l   -8(a6),d0

0x00000034  0x08000002               btst     #2,d0

0x00000038  0x670A                   beq.s    *+12                  ; 0x00000044

0x0000003A  0x202EFFFC               move.l   -4(a6),d0

0x0000003E  0x5280                   addq.l   #1,d0

0x00000040  0x2D40FFFC               move.l   d0,-4(a6)

0x00000044  0x202EFFF8               move.l   -8(a6),d0

0x00000048  0x08000003               btst     #3,d0

0x0000004C  0x670A                   beq.s    *+12                  ; 0x00000058

0x0000004E  0x202EFFFC               move.l   -4(a6),d0

0x00000052  0x5280                   addq.l   #1,d0

0x00000054  0x2D40FFFC               move.l   d0,-4(a6)

0x00000058  0x202EFFF8               move.l   -8(a6),d0

0x0000005C  0x08000004               btst     #4,d0

0x00000060  0x670A                   beq.s    *+12                  ; 0x0000006c

0x00000062  0x202EFFFC               move.l   -4(a6),d0

0x00000066  0x5280                   addq.l   #1,d0

0x00000068  0x2D40FFFC               move.l   d0,-4(a6)

0x0000006C  0x202EFFFC               move.l   -4(a6),d0

0x00000070  0x4E5E                   unlk     a6

0x00000072  0x4E75                   rts    

Interestingly, I tried using the Netburner GCC-based compiler and found this:

00000000 <bitcount>:

   0:    4e56 0000      linkw %fp,#0

   4:    222e 0008      movel %fp@(8),%d1

   8:    7001           moveq #1,%d0

   a:    c081           andl %d1,%d0

   c:    0801 0001      btst #1,%d1

  10:    6702           beqs 14 <bitcount+0x14>

  12:    5280           addql #1,%d0

  14:    0801 0002      btst #2,%d1

  18:    6702           beqs 1c <bitcount+0x1c>

  1a:    5280           addql #1,%d0

  1c:    0801 0003      btst #3,%d1

  20:    6702           beqs 24 <bitcount+0x24>

  22:    5280           addql #1,%d0

  24:    0801 0004      btst #4,%d1

  28:    6702           beqs 2c <bitcount+0x2c>

  2a:    5280           addql #1,%d0

  2c:    4e5e           unlk %fp

  2e:    4e75           rts

Netburner MacOS tools version string:

$ m68k-elf-gcc --version

m68k-elf-gcc (GCC) 4.2.1

The GCC option in the Codewarrior IDE is only for ARM targets.

Shaun

TomE
Specialist II

Shaun James wrote:

> Using the Codewarrior 10.2 for Linux which uses the Metrowerks compiler,

Thank you very much for running these tests. There are three interesting results.

Firstly it looks like this "misfeature" wasn't in gcc 4.2.1 (July 2007), but was added before 4.3.3 (Jan 2009) and is still there in 4.7.2 (Sep 2012).

Secondly, Metrowerks doesn't do this.

Thirdly, in your tests Metrowerks is generating 2.5 times as much code as gcc does - 114 bytes versus 46 bytes. Are you sure the optimisation was set properly and there wasn't something else (like debugging) enabled to make it generate code like that?

Tom

yibbidy
Contributor V

I think you are right; there must be some compiler option switch somewhere in the CW GUI that I am not finding. Here is the output from CodeWarrior when the project is created with the "wizard" set for full optimisation from the start:

0x00000000               _bitcount:
0x00000000  0x4E560000          link     a6,#0
0x00000004  0x7200              moveq    #0,d1
0x00000006  0x08000000          btst     #0,d0
0x0000000A  0x6702              beq.s    *+4              ; 0x0000000e
0x0000000C  0x7201              moveq    #1,d1
0x0000000E  0x08000001          btst     #1,d0
0x00000012  0x6702              beq.s    *+4              ; 0x00000016
0x00000014  0x5281              addq.l   #1,d1
0x00000016  0x08000002          btst     #2,d0
0x0000001A  0x6702              beq.s    *+4              ; 0x0000001e
0x0000001C  0x5281              addq.l   #1,d1
0x0000001E  0x08000003          btst     #3,d0
0x00000022  0x6702              beq.s    *+4              ; 0x00000026
0x00000024  0x5281              addq.l   #1,d1
0x00000026  0x08000004          btst     #4,d0
0x0000002A  0x6702              beq.s    *+4              ; 0x0000002e
0x0000002C  0x5281              addq.l   #1,d1
0x0000002E  0x2001              move.l   d1,d0
0x00000030  0x4E5E              unlk     a6
0x00000032  0x4E75              rts

You can see from the three CodeWarrior tests in the previous post that the code gets smaller by a few bytes for each increase in optimisation level, so changing the optimisation setting is definitely doing "something". The project for those tests was initially created with no optimisation and I increased it later; I've always started projects that way, then turned the optimisation up afterwards. It seems I've released quite a bit of bloated code :(. Anyway, it's nice to see that CW is not actually generating code 2.5 times larger than GCC.

TomE
Specialist II

> I'd be interested in seeing the assembly output or disassembly of the following code using "your favourite compiler":

According to this post, CodeWarrior 10.4 has a "gcc option".

https://community.freescale.com/thread/307204

I don't have CW. Does this mean there's an option to use either compiler for a project?

I would be interested to see if the gcc compiler provided by Freescale generates code with this problem for the CF3 parts.

Also what code the Freescale tools generate for the same source.

Tom
