Direct Page limitations

Szilu · ‎11-16-2010

Hello,

I've been given a project to redesign from HC05 to HCS08, and during that process we need to merge two applications without too much tinkering with the code. Individually each application is fine, but once I merge the projects I run into serious problems with direct page memory.

After a static analysis we ran into the conditional branch limitations (not enough Z_RAM). We solved that first problem by creating macros; I'm still having a few problems with the CCR, but I will work them out. The INC, DEC, CLR, BSET and BCLR instructions aren't too much of a headache from a CCR or register preservation viewpoint.

Now, using these macros, we run into a ping-pong effect: replacing the branches with macros makes the code grow, and then the relative addressing problem arises. With every replacement the error propagates...

I was wondering if anyone has run into this limitation or knows a way to solve it?

Ultimately, if there isn't any solution, I will do it the hard way and stretch the code size.

I should mention that this is my first project in assembly.

bigmac · ‎11-17-2010

Hello, and welcome to the forum.

The fact that you are merging two different applications will probably mean that you will need to do more "tinkering" with the code than you might like. You do not mention which HC05 devices were previously used, and the HCS08 device to which you are migrating. Note that the use of macros will not produce more compact code - their one purpose is to make the code more understandable by virtue that a common sequence of instructions can be replaced by a more meaningful label. Do not confuse a macro with a sub-routine.

The most significant enhancements to the later devices would be:

The ability for the code to directly manipulate the stack, for the purpose of creating temporary variables required by a sub-routine, and
The index size for the X-indexed addressing modes is now 16 bits, rather than 8 bits, by virtue of the additional H-register. This means that an instruction such as LDA ,X can now fetch from anywhere in the normal addressing range, rather than only page 0 RAM. However, your code will now need to explicitly set the H-register value whenever indexed addressing is utilised.

There are numerous other enhancements, but these would seem to be the most important ones that will reduce the requirements for page 0 RAM, and to provide more efficient code. The adaption of the existing code should attempt to make use of these features.
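
As a hedged illustration of the H-register point (a sketch in CodeWarrior syntax; `Table` and `Index` are hypothetical names, not taken from the original applications): a ported HC05-style table lookup behaves as before only while H is zero, whereas new code can load a full 16-bit pointer into H:X.

```asm
; Ported HC05-style lookup: 8-bit offset in X, table base in the instruction.
        CLRH                ; H = 0, so H:X equals the old 8-bit X
        LDX   Index         ; Index is a hypothetical 8-bit offset variable
        LDA   Table,X       ; A = Table[Index]

; HCS08-style access through a full 16-bit pointer.
        LDHX  #Table        ; H:X = address of Table, anywhere in memory
        LDA   ,X            ; A = first byte of Table
```

The CLRH is only needed where H may have been changed since reset, for example after an LDHX or TSX elsewhere in the merged code.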

For the HC05, the stack will occupy the top of page 0 RAM. For the HCS08, the stack pointer should be explicitly initialised to the top of RAM, to maximize the page 0 resource. This is accomplished with the following code. Never use the RSP instruction.

_Startup:
    LDHX  #RAMEnd+1
    TXS             ; Adjust stack pointer

Whenever the code exceeds a conditional branch limit, consider incorporating some of the intervening code within a separate sub-routine (not a macro). It would seem that some of the original code must have been borderline for this to occur now.

The following generic examples give possible sub-routine frameworks for incorporating local variables on the stack. For the first example, intermediate stack operations within the sub-routine would affect the offset required for a variable, should the stack pointer alter.

VAR1  EQU  1
VAR2  EQU  2

FUNC1:
    PSHX        ; Save values to stack, as req'd.
    PSHH
    PSHA
    AIS   #-2   ; Create space for two 8-bit variables
    ; Body of sub-routine
    CLR   VAR1,SP
    LDA   #100
    STA   VAR2,SP
    LDA   3,SP  ; Original ACC value
    DEC   VAR2,SP
    ; Etc.
    AIS   #2    ; Adjust stack pointer
    PULA        ; Restore prior values
    PULH
    PULX
    RTS

Now a slightly different arrangement that allows intermediate stack operations, without affecting the offset for the variables.

VAR1  EQU  0
VAR2  EQU  1

FUNC1:
    PSHX        ; Save values to stack, as req'd.
    PSHH
    PSHA
    AIS   #-2   ; Create space for two 8-bit variables
    TSX
    ; Body of sub-routine
    CLR   VAR1,X
    LDA   #100
    STA   VAR2,X
    LDA   2,X   ; Original ACC value
    DEC   VAR2,X
    ; Etc.
    AIS   #2    ; Adjust stack pointer
    PULA        ; Restore prior values
    PULH
    PULX
    RTS

Application note AN1218, HC05 to HC08 Optimization, may provide further explanation of using the enhanced instructions. Additionally, there is AN2717, M68HC08 to HCS08 Transition, which describes the differences within the HC05 -> HC08 -> HCS08 migration.

Regards,

Mac

Szilu · ‎11-17-2010

Thanks for the replies. I will look through the macro code to see if I can cut something down, and I will post most of the macros here. I'm migrating from HC05B32 to 9S08DZ60.

I already did the analysis for the variables in the Z_PAGE, and we will try to order them using a weighted formula that assigns an overhead in bytes to every direct page instruction used, to figure out how much the code will grow.

Thank you for your prompt response; as soon as I finish I will post the macros here so others may use them as well. Yesterday I started to write some functions that use a memory location in the Z page, so that I can use the particular instruction and then the JMP instructions in order, but I find that most of my macros have the same overhead as the preparations I need to make to jump to the equivalent subroutine.

We have come up with a short version and a long version of BRSET and BRCLR: one uses relative addressing and one uses JMP.

I'll edit the post once I get to the bottom of it, probably Friday, since I have some other tasks.

bigmac · ‎11-17-2010

Hello,

In addition to the page 0 limits, it seems that you are also concerned about your code size. My impression would be that, if the code size will potentially exceed 60K of assembly code, this is indeed a very large project. Probably not something I would have been confident to tackle as a first assembly project.

With respect to the BRSET and BRCLR requirements, perhaps you should also consider macros that provide this functionality with extended memory, in addition to providing a long branch option. This would potentially ease the page 0 allocation issues. I do not know what assembler you are using, but the following examples make use of the CodeWarrior macro definition format.

; Direct memory, long branch:
L_BRSET:  MACRO             ; Bit,Reg,Branch
          BRCLR  \1,\2,*+6  ; Skip next if clear
          JMP    \3
          ENDM

L_BRCLR:  MACRO             ; Bit,Reg,Branch
          BRSET  \1,\2,*+6  ; Skip next if set
          JMP    \3
          ENDM

; Extended memory, short branch:
E_BRSET:  MACRO             ; Bit,Reg,Branch
          PSHA
          LDA    \2
          BIT    #(1<<\1)
          PULA
          BNE    \3
          ENDM

E_BRCLR:  MACRO             ; Bit,Reg,Branch
          PSHA
          LDA    \2
          COMA
          BIT    #(1<<\1)
          PULA
          BNE    \3
          ENDM

; Extended memory, long branch:
EL_BRSET: MACRO             ; Bit,Reg,Branch
          PSHA
          LDA    \2
          BIT    #(1<<\1)
          PULA
          BEQ    *+5        ; Skip next
          JMP    \3
          ENDM

EL_BRCLR: MACRO             ; Bit,Reg,Branch
          PSHA
          LDA    \2
          COMA
          BIT    #(1<<\1)
          PULA
          BEQ    *+5        ; Skip next
          JMP    \3
          ENDM

For the extended memory macros, the Z and N flags within the CCR will be affected, whereas they are not affected by the BRSET and BRCLR instructions. Additionally, the C flag is not set or cleared to reflect the bit state.
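
To make the size cost of these macros concrete, here is a hedged sketch of what two invocations expand to, with byte counts taken from the standard HCS08 instruction encodings. `Flags`, `ExtFlags` and `Target` are hypothetical labels.

```asm
; L_BRSET 3,Flags,Target   (Flags in direct page, Target out of branch range)
        BRCLR 3,Flags,*+6   ; 3 bytes, skips the JMP when the bit is clear
        JMP   Target        ; 3 bytes
                            ; total 6 bytes, versus 3 for a plain BRSET

; EL_BRSET 3,ExtFlags,Target   (ExtFlags outside direct page)
        PSHA                ; 1 byte
        LDA   ExtFlags      ; 3 bytes (extended addressing)
        BIT   #(1<<3)       ; 2 bytes, Z = 1 when bit 3 is clear
        PULA                ; 1 byte, does not change the CCR
        BEQ   *+5           ; 2 bytes, skips the JMP when the bit is clear
        JMP   Target        ; 3 bytes
                            ; total 12 bytes, versus 3 for a plain BRSET
```

Every replaced BRSET therefore pushes the following code 3 to 9 bytes further away, which is how one fix can push a neighbouring short branch out of range.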

Regards,

Mac

Szilu · ‎11-19-2010

OK, here we are with our macros. The applications aren't that big, only around 14K each, but both of them make heavy use of Z_RAM. Here are our macros and their usage; we decided on a hybrid solution:

We take care of CCR values only for A; for X you need to make sure you use a CPX or TSTX before using conditional branching on it. We tried to trim every byte we could, but this is where we stopped. We tried to make it as safe and as understandable as possible.

As I said, we ditched the subroutine implementation, as it would give us almost the same code size (+3/4 bytes) but with more hassle than this method, since you need to prepare for each call. With these macros and a few simple rules it should be all right.

Thanks for all the suggestions; they helped me a lot. I think there is nothing more to add.

tonyp · ‎11-19-2010

On first look, some of the possible issues I see with your macros:

By the way, as a general note, any time you replace a real instruction with a macro that does not behave 100% like the original instruction (and, in rare situations, even to the point of executing as an atomic instruction), you're headed for trouble, since you increase the probability of unforeseen side effects; even more so, if the original code wasn't yours. One example where this might happen: Using BRSET/BRCLR to branch to immediately next instruction just for the purpose of adjusting the Carry based on external pin input, commonly found in serial comms routines.
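
A sketch of the serial-input idiom mentioned above, assuming a hypothetical input pin `RxBit` on the port register `PTAD` (any direct-page register would do). The branch target is simply the next instruction, so BRSET is used only for its side effect on the Carry:

```asm
RxBit:  EQU   0                 ; hypothetical bit number of the input pin

        BRSET RxBit,PTAD,*+3    ; 3-byte instruction, so *+3 is the next one; C = pin state
        RORA                    ; rotate the sampled bit from C into bit 7 of A
```

A macro replacement that leaves C untouched would assemble without complaint here and silently shift in stale data.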

1. Special case BRxxx to self using BRxxx Bit,Address,* won't work with your current macros. For example, waiting for a status flag to settle, a quite common use of these instructions. (My own example typed in haste lacked this provision, also.) This may not apply if all such references are in zero page using normal BRxxx instruction (no need for macro use).

2. Using masks instead of bit positions. Why not follow bigmac's example and create the mask on the fly using #(1<<(\1))? (That way you maintain the use of bit positions and stay consistent with the original code. The fewer changes you make to the original code, the less room for error. If using plain numbers for masks, some numbers [like 1,2,4] may mean bit position in one place and mask in another. If the conversion isn't done in a most systematic way making sure you don't skip any instructions along the way, after a while you won't be sure, whether you had already changed the original number from position to mask or not; a really ugly situation.)

3. For BRxxx, not adjusting the carry according to the tested bit (as I mentioned earlier).
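
To illustrate item 1, here is a sketch using bigmac's L_BRCLR as posted, assuming macro arguments are substituted textually as in CodeWarrior; `Status` and `WaitRdy` are hypothetical names. A target of `*` is evaluated on the line where it finally lands, which is the JMP and not the start of the sequence.

```asm
; Plain instruction: wait until bit 5 of Status becomes set.
        BRCLR 5,Status,*        ; branches to itself while the bit is clear

; L_BRCLR 5,Status,* would expand to:
;       BRSET 5,Status,*+6
;       JMP   *                 ; jumps to itself forever, never re-testing the bit

; An explicit label in front of the macro avoids the problem:
WaitRdy:
        L_BRCLR 5,Status,WaitRdy
```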

Good luck with your porting project.

P.S. Although I understand it's not the way you want to go (possibly because of the extra work involved and you being new at assembly language), the hardcoded parameter method I mentioned previously gives the best per-call size optimization (which in turn reduces the number of out-of-range-branch side errors throughout the code), and the easiest way to avoid unwanted side effects by making sure the subroutine leaves all CCR bits as they were expected by the original programmer. This is easier done in a subroutine than inline inside a macro where you're trying to keep the size as short as possible. You can even optimize my previous example further (down to 4 extra bytes for BRxxx) by combining (at the macro level) BitPos and VariableAddress into a single word (BitPos is only 3 bits [0..7], leaving up to 13 bits for the address, which is good for any variable in the first 8K of the memory map). But I guess you don't care for more complications in your life already.
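
A hedged sketch of the packed-parameter idea in CodeWarrior syntax. The names `BRIFSET_P` and `EmuBRSET` are hypothetical, and the matching decode step inside the subroutine is not shown:

```asm
BRIFSET_P: MACRO                ; Bit,Variable,JumpTo
        JSR   EmuBRSET          ; 3 bytes
        DC.W  \3                ; 2 bytes: JumpTo
        DC.W  (\1<<13)|\2       ; 2 bytes: BitPos in bits 15..13, address in bits 12..0
        ENDM                    ; 7 bytes total, 4 more than a plain BRSET
```

This only works while the variable's address is below $2000, so that the top three bits of the word are free for the bit position.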

Szilu · ‎11-22-2010

Thanks, Tony, for the suggestions. I edited the macros using bigmac's tricks.

But I'd like to discuss a previous post of yours:

BRIFSET             macro     BitPos,Variable,JumpTo
          #ifz ]~2~
                    brset     ~@~         ;no emulation when possible
                    mexit
          #endif
          #if ~1~ > 7
                    #Error    BitPos not in range 0..7
                    mexit
          #endif
                    jsr       ~0~
                    dw        ~3~         ;JumpTo
                    dw        ~2~         ;Variable
                    fcb       ~1~         ;BitPos
                    endm

I wonder how you are passing the parameters to your subroutine.

At the end of the macro I think you meant:

dw ~3~
dw ~2~
fcb ~1~
jsr ~0~

But I would like to ask: how does my subroutine know the values it needs to work with? How does it work?

I figured that this is the only way to make a bullet-proof emulation:

1) Reserve 4 bytes of direct page for data:

2 bytes for the label I want to jump to
1 byte for the current value of the register
1 byte for the flag

2) Create a macro like this:

PSHH
PSHX
LDHX LABEL
STHX Direct_page_location_label
LDX Register_high_page
STX Direct_page_location_reg
LDX Flag
STX Reserved_location
JSR EXT_BRCLR

That is 20 bytes already.

3) The subroutine will emulate the instruction; since, as you said, it is written only once, its size doesn't really bother me.

I have taken all your hints into account. Waiting for a flag will mostly remain in direct page, since I will use these instructions only if the variable is pushed outside of direct page.

tonyp · ‎11-22-2010

In the example macro I posted, the parameters follow the JSR. First you call the emulation routine, then access the data from within that routine, and update the return PC, according to the execution path you want to take. (Placing the parameters first would cause their data to be interpreted as CPU instructions with highly undesirable results.)

OK, to make the whole thing a little more visual, I have prepared an example for you [and possibly others interested]. (It is written for ASM8 so a lot of my own syntax "peculiarities" won't work with CW, but at least you'll get the general idea of how to go about implementing it.) Still, don't bother with my solution unless you can fully understand it, because I've only given you BRSET, and you will still have to implement (correctly) the remaining ones, plus convert to CW syntax, on your own. Example in (source form, and) assembled listing form. This should serve as general example for any emulated instructions, where the parameters are passed as inline data following the subroutine call.

Your "bullet proof" macro has a few (bullet) holes. LOAD and STORE instructions affect the N and Z flags, while BRSET/BRCLRs only affect the C flag. You also need to balance the stack on return from the subroutine. (And, don't you need immediate mode for loading the LABEL value?)

tonyp · ‎11-17-2010

Szilu wrote:
Yesterday I started to write some functions that use a memory location in the Z page, so that I can use the particular instruction and then the JMP instructions in order, but I find that most of my macros have the same overhead as the preparations I need to make to jump to the equivalent subroutine.

Perhaps you're trying to do the macro in the straight-forward way, e.g., push whatever registers you need to protect, load the parameters to the registers, push some of those parameters on the stack (since registers can only hold so much), make the subroutine call, and finally reverse all pushes. Yes, that would take significant space.

But, there is a less obvious way, which produces much shorter code (still larger than the actual instruction you need to emulate, but significantly less than what I described above). Pass parameters as hardcoded data. Something like this example (for ASM8, adjust as needed for your assembler's syntax):

BRIFSET             macro     BitPos,Variable,JumpTo
          #ifz ]~2~
                    brset     ~@~         ;no emulation when possible
                    mexit
          #endif
          #if ~1~ > 7
                    #Error    BitPos not in range 0..7
                    mexit
          #endif
                    jsr       ~0~
                    dw        ~3~         ;JumpTo
                    dw        ~2~         ;Variable
                    fcb       ~1~         ;BitPos
                    endm

The subroutine will need some extra overhead to update the PC to return to (skipping the data if the branch is not taken, or using the value of the JumpTo parameter if the branch is to be taken). This approach uses just 5 more bytes (3 for BSET/BCLR) than the actual BRSET/BRCLR instruction. The subroutine can be as long as required to create a perfect emulation, including CCR register side effects (there will only be one copy of the subroutine, so assuming there are multiple BRSET/BRCLR in your app [or you wouldn't bother], the savings are significant).
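
What follows is not the ASM8 example referred to elsewhere in this thread, only a minimal sketch of such an emulation routine in CodeWarrior syntax. It assumes the parameter layout of the macro above (JumpTo word, Variable word, BitPos byte), an HCS08 core (it relies on the indexed and SP-relative LDHX/STHX modes), and the hypothetical name `EmuBRSET`. It preserves A and H:X, and changes only C in the caller's CCR.

```asm
EmuBRSET:
        PSHA                    ; save caller's A
        TPA                     ; A = caller's CCR
        PSHA                    ; save caller's CCR
        PSHX
        PSHH
; Stack now: 1,SP=H  2,SP=X  3,SP=CCR  4,SP=A  5,SP and 6,SP=return address
        LDHX  5,SP              ; H:X -> inline parameter block
        LDA   4,X               ; A = BitPos (0..7)
        INCA                    ; loop count 1..8
        PSHA                    ; temporary loop counter at 1,SP
        CLRA
        SEC
MkMask: ROLA                    ; first pass gives 1, then shifts left
        DBNZ  1,SP,MkMask       ; afterwards A = 1<<BitPos
        AIS   #1                ; drop the loop counter
        LDHX  2,X               ; H:X = address of Variable
        AND   ,X                ; Z = 1 when the tested bit is clear
        BEQ   NotSet
        LDHX  5,SP              ; bit set: fetch the JumpTo parameter
        LDHX  ,X
        STHX  5,SP              ; and return to JumpTo
        LDA   3,SP
        ORA   #$01              ; C = 1 in the CCR copy
        BRA   Done
NotSet: LDHX  5,SP              ; bit clear: skip the 5 parameter bytes
        AIX   #5
        STHX  5,SP
        LDA   3,SP
        AND   #$FE              ; C = 0 in the CCR copy
Done:   PULH                    ; restore caller's H:X
        PULX
        AIS   #1                ; discard the old CCR copy, A holds the new one
        TAP                     ; CCR = caller's flags with C updated
        PULA                    ; restore caller's A (PULA leaves the CCR alone)
        RTS
```

A BRCLR version would differ only in swapping the two outcomes after the AND. The routine is not atomic, so it is not a drop-in replacement where an interrupt may modify the same return path or where exact cycle timing matters.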

If not used for other purposes, you could save at least one byte per call if you used the SWI instruction (and related handler) with a single byte operand indicating what instruction to emulate plus whatever hardcoded parameters, as above. You will also save some push/pulls inside the subroutine (only H needs protection).

tonyp · ‎11-16-2010

A few suggestions (that you may already know about):

* Sort your variables by "frequency of use". Place the most frequently used variables from either application in zero page RAM. By most frequently I mean in terms of references within the code, not in terms of accesses (for example a variable accessed 1000 times -- inside a loop, perhaps -- in just one subroutine should probably go into non-zero-page RAM [except for cycle-tight loops], while a variable accessed once but in 100 different places should definitely go into zero-page RAM.) This will achieve the greatest possible average instruction length reduction. I think this is the single most important optimization you can do without getting into any code changes; just re-order the variable definitions.

* Given that the HC05 couldn't use the stack for local variables, everything had to be global. But with the HCS08, those variables can become truly local, allowing you to use PSH/PUL operations or even X index mode, which should produce shorter code for many common operations. You can also use INC / DEC etc. with X indexed mode as you do with zero page, while you can't use several zero-page capable instructions with extended addressing, forcing you to go through the A register, inflating your code (especially if you must also preserve A). (B[R]CLR/B[R]SET is an exception as it does not allow indexed mode.) Another size side-effect is that as some of these variables disappear from the global space, it increases the likelihood that other variables will end up in zero-page.

* Out-of-range branches should be easy to deal with using simple macros: e.g., JEQ is BNE *+5 followed by JMP (yes, code size will increase). Another option is to add JMP hooks where appropriate to minimize the code inflation problem. It wasn't clear if any of this is what you're doing already.

* Perhaps you could post the macros you use. A simple optimization in them could also save significant code. Also, consider using macros that call subroutines whenever there is a size benefit, even if you lose on speed. For example, you could write B[R]SET, B[R]CLR equivalent subroutines but only use the macro to set up the call to these routines, not to do the whole operation inside the macro.
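
The long-branch macros described in the third point can be sketched as follows in CodeWarrior syntax (the macro names are only a suggestion, and ASM8 syntax differs):

```asm
JEQ:    MACRO               ; Target
        BNE   *+5           ; 2-byte BNE + 3-byte JMP, so *+5 is the next instruction
        JMP   \1
        ENDM

JNE:    MACRO               ; Target
        BEQ   *+5
        JMP   \1
        ENDM
```

Neither the inverted branch nor the JMP alters the CCR, so these behave like the original BEQ/BNE apart from the 3 extra bytes per use.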

Hope this helps.
