Function calls from ASM

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Fklein23 on Wed Jul 23 13:03:08 MST 2014
I have the following assembly function:

        .thumb_func
        .global PixelRotDelay

PixelRotDelay:
        mov r1,r1             // approximately 12.3 nanosecs per mov
        mov r1,r1
        mov r1,r1
        mov r1,r1
        mov r1,r1
        mov r1,r1
        bx lr                 // function return

When I call this function from C code, this is the disassembly of the caller:

450           PixelRotDelay();
0000a306:   bl      0x68d4 <PixelRotDelay>

... and this works fine and does just what I want. The bx lr instruction returns to the right place and the stack is back to its original, pre-function-call state.

But, I need to embed calls to this function from inside assembly language, too, because the function injects a precisely calibrated delay, which is needed in dozens of places.

Unfortunately, this code, inside an assembly language function:

      bl PixelRotDelay

crashes the stack.

What am I doing wrong???

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Fklein23 on Mon Jul 28 09:22:20 MST 2014
Many thanks to MikeSimmonds and cfb both. Excellent advice all around.
- Frank

lpcware · ‎06-15-2016

Content originally posted in LPCWare by cfb on Sun Jul 27 04:41:51 MST 2014

Quote: Fklein23
I tried to get a handle on the difference between br and bl by reading the ARM manuals and I didn't find anything about being responsible for saving lr prior to the call

I recommend that you get yourself a copy of "The Definitive Guide to ARM® Cortex®-M3 and Cortex®-M4 Processors, Third Edition" By Joseph Yiu. I have found it to be invaluable for answering those sorts of questions.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by MikeSimmonds on Thu Jul 24 10:30:00 MST 2014
Using the LPCXpresso debugger -- in the register window is an item called 'cycle counter'

Set a bp (break point) at any bl TheDelayFunc and another at the line after that call.
When you get to the 1st bp, clear the cycle counter in the reg win [I think you right-click and select clear].
Click run (or whatever the kb shortcut is).
When it stops, multiply the value in the cycle counter by the clock period (i.e. divide it by the clock frequency).
[Or don't clear, and subtract one from t'other before multiplying.]

Of course, actual measurement with the silly-scope takes theory to practice.

Mike

lpcware · ‎06-15-2016

Content originally posted in LPCWare by MikeSimmonds on Thu Jul 24 10:20:39 MST 2014
Cycle counts per instruction (most non-memory instructions take one) are listed in the Cortex-M3 Technical Reference Manual rather than in the AA7M. To get cycles in a real time unit,
take the reciprocal of the cpu clock frequency. From your post, I infer that you run at 72 MHz.

So 1/72,000,000 times 10^9 (to get nanoseconds) gives 13.89 ns per cycle (13.888... recurring).

A quick cheat (by cancelling out loads of the zeros) is to take 1000 / Freq (in MHz) to get nanoseconds,
i.e. 1000/72 = 13.89 ns

Mike

lpcware · ‎06-15-2016

Content originally posted in LPCWare by MikeSimmonds on Thu Jul 24 10:13:49 MST 2014
The AA7M (ARM Architecture Reference Manual, version 7-M) is the 'cpu' reference and details all the instruction formats and syntaxes. It gives a programmer's overview, and more (that isn't of interest to programmers).
It is free from the ARM info site. But it is a lot to take at a gulp, so maybe google for a starter's page.

Anyway, the lr is just another 32-bit register BUT a bl or blx (branch with LINK) works just like a b (that's branch) or bx, but puts the 'return address' into the ONE AND ONLY link register (aka r14).

So a simple call can end with a simple bx lr (branch to the address in lr).

However, when you make a nested call, your new bl ... will OVERWRITE the ONE AND ONLY link register
with YOUR return address. The nested function returns to YOUR return address, but when you try to return to YOUR caller, its link address has been lost. Totally different scheme to X86, AVR, 8051, etc. etc. But much
more efficient for single level calls.

ARM never stacks/unstacks (pushes/pops) anything unless you (or the "C" compiler) tells it to.
[Interrupts are a different kettle of fish--they do push lots of stuff]

As for pop {...,pc} when you pushed lr: it is a shortcut for the longer sequence pop {...,lr} followed by bx lr.
You don't care at this point about what is actually in the lr, as long as you branch back to where it said.

Indeed (and somewhat dangerously) ANY instruction that updates the pc (r15) will cause a change of execution point, e.g. add pc, r7, r0 or sub pc, pc, 0x1000 or mov pc, any number. DON'T DO THIS -- there are gotchas. [1. The pc read in an instruction is here + 4 in Thumb code (it is + 8 only in classic ARM code).] [2. An address loaded into the pc by bx, blx, pop or ldr has to have the bottom bit set (to indicate Thumb-2 code, even though the Cortex-M can't run ARM code) -- forget and you get a usage fault.
So to jump to absolute 0x4100 you would branch to 0x4101. You don't normally bother with this, as the assembler knows about it (via .thumb_func) and deals with it.] That was a very long gotcha 2, wasn't it.

Anyway, now you know why you have to save the lr!

Cheers, Mike.

lpcware · ‎06-15-2016

Content originally posted in LPCWare by Fklein23 on Thu Jul 24 07:56:52 MST 2014
Thanks. I will try that. I am new to ARM. I am learning by doing.
I think that is really interesting that I have to push the return address register. I assumed the "call" pushed it as part of the call, i.e., as in the difference between br and bl.

I tried to get a handle on the difference between br and bl by reading the ARM manuals and I didn't find anything about being responsible for saving lr prior to the call, so I really appreciate this pointer.

As for the timing, I have used an oscilloscope to very accurately measure the time of the entire call, register moves and return. But thanks for pointing that out. It is important.

Incidentally, the way I do this is to assume the time is y = mx + b + g, where
- b is the call overhead,
- x is the time for a single move instruction
- m is the number of move instructions.
- g is the time it takes to toggle the GPIO pin

Now all I have to do is:
1. call the routine with 100 move instructions in it and
2. use an oscilloscope and toggle a GPIO pin before and after the call
3. call the routine with 200 move instructions
4. measure again with the oscilloscope
5. now just toggle the GPIO pin twice with no function call (which gives me the value of g)

Now I have three equations:
    y1 = 100x + b + g (from steps 1 & 2)
    y2 = 200x + b + g (from steps 3 & 4)
    y3 = g            (from step 5)

y1, y2 and y3 are the total times measured on the oscilloscope in each case.

Now some high school algebra solves for the time of each mov r1,r1 instruction (about 12.3 nsec) and the call overhead (about 71 nsec)

The gpio was the most expensive part, taking about 390 nsec!!

Thanks - Frank

lpcware · ‎06-15-2016

Content originally posted in LPCWare by MikeSimmonds on Wed Jul 23 15:54:57 MST 2014
Inside your asm functions that make any lower level calls (to other asm or to "C" functions) you
have to save the current value of the lr register before a call and restore it after.

This only needs to be done once (before any sub calls) and after the last (by whatever route you
end up at the return from this function).

You also have to preserve any registers other than r0 to r3 that you use (ie change).

This is done via a push ... at the start of the function and changing the bx lr at the end to a pop ... (with pc)

E.g.

        .thumb_func
        .global SomeAsmFunc

SomeAsmFunc:
        push {r4,r7,lr}   // will change r4 and r7. {r5-r7,lr} would preserve r5, r6 and r7. See asm manual
        // do stuff ...
        bl Func1
        // do stuff
        movs r0, 1
        movs r1, 5
        bl CFunc          // i.e. CFunc(long x, long y); r0 = x and r1 = y
        // do more stuff
        bl Func99
        // do stuff
        movs r0, 99       // set the return value; optional, but must be in r0
        pop {r4,r7,pc}    // instead of 'bx lr': function return. Items in {...} must match the push, with lr changed to pc

If no registers other than r0 to r3 are used, you only need push {lr} and pop {pc}

Get the "ARM v7-M Architecture Reference Manual" from the ARM website
Also check about Application Binary Interface on the site (How to pass parameter and set return value)

PS: if delays need to be 'really' precise, allow for call over head and stack/unstacking too.
PS: movs rather than mov. With a small immediate, movs assembles to 2 bytes, mov to 4. But note that movs does alter the flag bits (N and Z), whereas mov leaves them alone.
Some other instructions also have this feature -- see the asm manual.

Hope this helps, Mike.

LPC17xx