Converting a C function into inline assembly

lpcware · ‎06-15-2016

Content originally posted in LPCWare by micrio on Sun Feb 03 08:44:42 MST 2013
I have a relatively simple 5 statement C function that I would like to
convert to in-line assembly.   I have converted other C functions and find
it tricky each time.   The mnemonics are different.   Some in-line opcodes
result in multiple assembly opcodes.   The registers don't always map
the way that I want them to.

This seems like a simple process that could be automated.   If it has been, I have
not found it.

I do understand assembly having written a lot of Z80 and 8086 code in
the past.   I am fluent in C code.    I don't have my head around in-line
assembly yet.   That is the problem!

The reason I want to do this is that I want to move to the 5.0.14
release from my current 4.3.0.   The 5.0.14 release produces slower code
for one critical routine.   If I could convert that routine to assembly, I
would lock in the better optimization of the older release.

This is in reference to the problem that I described in this thread;
http://knowledgebase.nxp.com/showthread.php?t=3939&page=2

Thanks,
Pete.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cfb on Wed Feb 06 00:28:59 MST 2013

Quote: micrio
Are you suggesting that giving the -O3 is not actually
giving me full speed optimization?
Pete.

To identify the optimisations enabled by -O3, and to compare what -O3 gives you in LPCXpresso v4 and v5, run each version's gcc (arm-none-eabi-gcc.exe) with the following options:

-O3 -Q --help=optimizers

An explanation of the results is here:

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Rob65 on Tue Feb 05 01:17:36 MST 2013

Quote: micrio
Are you suggesting that giving the -O3 is not actually
giving me full speed optimization?

Never assume that a higher optimization level gives you a higher speed.
There are different settings for speed and size optimization and, as I stated before, new projects in the LPCXpresso 4 tools default to -O2 whereas LPCXpresso 5 defaults to -Os, which optimizes for size rather than speed.

It used to be such that -O1 did some optimization, -O2 optimized for speed, -O3 for size and speed and -Os was for the smallest code size.
-O3 delivered code that lies somewhere between -O2 and -Os, so not as small as -Os and not as fast as -O2.
But please remember that this is just a rule of thumb. Depending on your program this may differ.

When it comes to execution speed you should really try different settings and different structuring in your C program to see what is best for a specific function.
And then carefully check if a new LPCXpresso tool version incorporates a new gnu-arm compiler since optimizations may differ slightly between versions.

Rob
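
One way to confirm which class of optimization a given build actually used is to test the macros GCC predefines: `__OPTIMIZE__` is defined at any level above -O0, and `__OPTIMIZE_SIZE__` is additionally defined under -Os. The sketch below is only an illustration; the function name `build_opt_class` is made up for this example and is not part of LPCXpresso or any library.

```c
/* Reports which optimization class this translation unit was built with,
   based on macros that GCC predefines. The function name is hypothetical. */
const char *build_opt_class(void)
{
#if defined(__OPTIMIZE_SIZE__)
    return "size (-Os)";             /* -Os defines both macros; test this first */
#elif defined(__OPTIMIZE__)
    return "speed (-O1 or higher)";  /* any optimizing level other than -Os */
#else
    return "none (-O0)";
#endif
}
```

Printing or logging the returned string from both the v4 and the v5 build would show whether the two projects really compile the critical file at the same kind of setting. It does not distinguish -O2 from -O3.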

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cfb on Tue Feb 05 00:44:41 MST 2013

Quote: micrio
Are you suggesting that giving the -O3 is not actually
giving me full speed optimization?

The main difference I could see between -O2 and -O3 was that -O3 expanded the function inline rather than calling it as a function. As most of your time is spent in the loop I would not expect you would see much difference between -O2 and -O3 for the example that you supplied.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by micrio on Mon Feb 04 11:23:20 MST 2013

Quote: TheFallGuy
I guess this has been caused by the change in optimisation options between v4 and v5. v5 now uses space optimisation, which reduces code size but can reduce execution speed. If you want to build a function at a particular optimisation level, you can use
#pragma GCC optimize ("O3")
, or you can just change the optimisation for the whole project to match what you used in v4.

I did try -O3 (and every other optimization setting) in the version 5
environment. Are you suggesting that giving the -O3 is not actually
giving me full speed optimization?

Pete.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by TheFallGuy on Mon Feb 04 09:53:23 MST 2013
I guess this has been caused by the change in optimisation options between v4 and v5. v5 now uses space optimisation, which reduces code size but can reduce execution speed. If you want to build a function at a particular optimisation level, you can use
#pragma GCC optimize ("O3")
, or you can just change the optimisation for the whole project to match what you used in v4.
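
A minimal sketch of the per-function approach, assuming a GCC-based toolchain: wrapping the routine in `push_options` / `pop_options` keeps the -O3 setting from leaking into the rest of the file. The routine `sum_shifted` is a hypothetical placeholder, since the real function is not shown in this thread.

```c
#include <stdint.h>

/* Everything between push_options and pop_options is compiled at -O3,
   regardless of the optimisation level set for the project. */
#pragma GCC push_options
#pragma GCC optimize ("O3")

/* Hypothetical hot routine; the name and body are placeholders. */
uint32_t sum_shifted(const uint16_t *src, unsigned n)
{
    uint32_t total = 0;
    for (unsigned i = 0; i < n; i++) {
        total += (uint32_t)src[i] << 2;   /* shift each element, accumulate */
    }
    return total;
}

#pragma GCC pop_options   /* restore the project-wide options */
```

GCC also accepts `__attribute__((optimize("O3")))` on a single function declaration. Either way, it is worth measuring the result, since the pragma only selects the optimisation level and does not guarantee the same code the older compiler produced.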

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cfb on Mon Feb 04 05:31:48 MST 2013

Quote: micrio
I have not tried to reformulate
the C code to gain performance in the new revision. Perhaps that might
help. I suspect that the statements that do the array indexing and the
shifting call up some obscure optimizer quirk that previously worked
to my advantage.

What the v4 optimiser noticed was that the number of times your loop was executed was a multiple of two. It partially unrolled your loop so that it does two assignments in each iteration instead of one, so it only has to loop half as many times. If you have 20 assignments then the v4 version executes 10 instructions 10 times (= 100) but the new version executes 8 instructions 20 times (= 160). You can rewrite your C code to explicitly do the same.
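
The explicit rewrite could look roughly like this. The actual routine is not shown in this thread, so `scale_simple`, the element types and the shift by 4 are hypothetical stand-ins; only the unroll-by-two pattern is the point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the original loop: one assignment per iteration. */
void scale_simple(uint32_t *dst, const uint16_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint32_t)src[i] << 4;
    }
}

/* Same work, unrolled by two by hand: two assignments per iteration,
   so the counter update and branch run half as often. */
void scale_unrolled(uint32_t *dst, const uint16_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        dst[i]     = (uint32_t)src[i]     << 4;
        dst[i + 1] = (uint32_t)src[i + 1] << 4;
    }
    if (i < n) {                          /* leftover element when n is odd */
        dst[i] = (uint32_t)src[i] << 4;
    }
}

/* Returns 1 if both versions produce identical output for n elements
   (n must not exceed 32 in this sketch), 0 otherwise. */
int unrolled_matches_simple(size_t n)
{
    uint16_t src[32];
    uint32_t a[32], b[32];
    if (n > 32) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        src[i] = (uint16_t)(i * 3 + 1);   /* arbitrary test data */
    }
    scale_simple(a, src, n);
    scale_unrolled(b, src, n);
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            return 0;
        }
    }
    return 1;
}
```

If the element count is always even, the tail check can be dropped. Whether the hand-unrolled form actually matches the v4 timing still has to be confirmed by measurement on the target.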

lpcware · ‎06-15-2016

Content originally posted in LPCWare by micrio on Sun Feb 03 20:28:04 MST 2013

Quote: wrighflyer
Why not build the code with -save-temps and just carry the .S file over from 4.x to 5.x?

That is a great idea, I will give it a try.

Thanks,
Pete.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by micrio on Sun Feb 03 20:27:06 MST 2013

Quote: cfb
The v5 code might have fewer instructions but it appears to be looping twice as many times as the v4 code. Before attempting to use inline assembler, try rewriting the C code to be more efficient instead of relying on the optimiser to fix it up for you.

I have worked on the C code a great deal to improve performance.   What
you see is the best possible code for the version 4.3.0 environment.
When I moved to 5.0.14 it went backward.   I have not tried to reformulate
the C code to gain performance in the new revision.   Perhaps that might
help.   I suspect that the statements that do the array indexing and the
shifting call up some obscure optimizer quirk that previously worked
to my advantage.

Pete.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by micrio on Sun Feb 03 20:18:44 MST 2013

Quote: Rob65
Could you check the optimization settings in both LPCXpresso IDEs?

Creating a project in 4.2.2 sets the default optimization to -O2 (optimize more) but in 5.0.14 it is set to -Os (optimize for size).

-O2 results in code that runs faster than -Os.
Maybe this is your problem.

Rob

A little later in that thread I give the performance figures for all optimizer
settings for both compiler versions. At low settings they are about the
same. At -O2 and especially -O3 the older compiler was far better.

Pete.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cfb on Sun Feb 03 14:38:50 MST 2013
The v5 code might have fewer instructions but it appears to be looping twice as many times as the v4 code. Before attempting to use inline assembler, try rewriting the C code to be more efficient instead of relying on the optimiser to fix it up for you.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Rob65 on Sun Feb 03 14:37:57 MST 2013
Could you check the optimization settings in both LPCXpresso IDEs?

Creating a project in 4.2.2 sets the default optimization to -O2 (optimize more) but in 5.0.14 it is set to -Os (optimize for size).

-O2 results in code that runs faster than -Os.
Maybe this is your problem.

Rob

lpcware · ‎06-15-2016

Content originally posted in LPCWare by wrighflyer on Sun Feb 03 10:50:17 MST 2013
Why not build the code with -save-temps and just carry the .S file over from 4.x to 5.x?