OK, you had me going. So, I took out my GB60 demo board and another QG8 board we use for a product (both known to work without any problems if loaded with correct code), and ran some tests on both. For both boards, the exact same code was used except for the inevitable differences in MCU-specific initialization. All tests were performed in run mode (no debugger attached) to eliminate possible side effects of using BDM.
I first tested with working Flash code that I've been using in numerous and diverse applications without any problems, just to verify my test code was set up correctly for both MCUs. Everything worked perfectly; I could erase, re-write, etc. The original (working) RAM routine is the familiar one:
;*******************************************************************************
; Purpose: RAM routine to do the job we can't do from Flash
; Input : A = value to program
; Output : None
; Note(s): This routine is modified in RAM by its loader at @2,3 and @5
; : Stack needed: 20 bytes + 2 for JSR/BSR
?RAM_Execute        sta       EEPROM              ;Step 1 - Latch data/address
                                                  ;EEPROM (@2,@3) replaced
                    lda       #mByteProg          ;mByteProg (@5) replaced
                    sta       FCMD                ;Step 2 - Write command to FCMD
                    lda       #FCBEF_
                    sta       FSTAT               ;Step 3 - Write FCBEF_ in FSTAT
                    lsra                          ;min delay before checking FSTAT
                                                  ;(FCBEF -> FCCF for later BIT)
?RAM_Execute.Loop   bit       FSTAT               ;Step 4 - Wait for completion
                    beq       ?RAM_Execute.Loop   ;check FCCF_ for completion
                    rts
                                                  ;after exit, check FSTAT for FPVIOL and FACCERR
?RAM_Needed         equ       *-?RAM_Execute
Next, I moved Step 1, Step 2, and half of Step 3 (loading the FCBEF mask) out of this routine and into the caller, just before the call. The reduced RAM routine comes first, followed by the calling code (at the dotted point, A has the data and HX the address):
;*******************************************************************************
; Purpose: RAM routine to do the job we can't do from Flash
; Input : A = FCBEF bit mask
; Output : None
; Note(s): Unlike the original, nothing in this routine is patched by its loader
; : Stack needed: 10 bytes + 2 for JSR/BSR
?RAM_Execute        sta       FSTAT               ;Step 3 - Write FCBEF_ in FSTAT
                    lsra                          ;min delay before checking FSTAT
                                                  ;(FCBEF -> FCCF for later BIT)
?RAM_Execute.Loop   bit       FSTAT               ;Step 4 - Wait for completion
                    beq       ?RAM_Execute.Loop   ;check FCCF_ for completion
                    rts
                                                  ;after exit, check FSTAT for FPVIOL and FACCERR
?RAM_Needed         equ       *-?RAM_Execute
...
                    sta       ,x                  ;Step 1: Latch the data/address
                    lda       #mByteProg          ;command to use
                    sta       FCMD                ;Step 2 - Write command to FCMD
                    lda       #FCBEF_             ;prepare for Step 3
                    sei                           ;disable interrupts
                    tsx
                    sta       COP                 ;reset COP
                    jsr       ,x                  ;execute RAM routine to perform Flash command
                    ais       #?RAM_Needed        ;de-allocate stacked routine
                    lda       FSTAT
                    bit       #FPVIOL_|FACCERR_   ;check for errors
                    beq       ?Success
?Error              sec
                    rts
Both boards gave consistent errors in all programming (or erasing) attempts. This was not the crash-type error one would expect if Flash were momentarily unavailable: the routine returned with an error condition, and in all cases code continued to run from Flash correctly. I checked the code over and over for even the tiniest possible violation of Freescale's published steps for Flash programming, and couldn't find any.
But, I still didn't give up. I thought if JimDon doesn't see a problem I shouldn't either, so we must be doing something differently. After juggling several things around trying this and that, I had some progress.
The last thing I tried (isn't it always the last place you look?) was to place the COP resetting instruction before the STA ,X that latches the data/address pair (where the dots are in the code above.) I then tried it on the QG8 and, guess what, SUCCESS!
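As a rough sketch of that change (reconstructed from the calling code listed above, not copied from the tested source), the only difference is that the COP write moves ahead of the latch. The value written to COP does not matter, so A still holds the data byte when STA ,X executes:

```
                    sta       COP                 ;reset COP BEFORE latching (moved up)
                    sta       ,x                  ;Step 1: Latch the data/address
                    lda       #mByteProg          ;command to use
                    sta       FCMD                ;Step 2 - Write command to FCMD
                    lda       #FCBEF_             ;prepare for Step 3
                    sei                           ;disable interrupts
                    tsx                           ;HX -> RAM routine on the stack
                    jsr       ,x                  ;execute RAM routine to perform Flash command
                    ais       #?RAM_Needed        ;de-allocate stacked routine
```

The error check on FSTAT after the AIS stays exactly as before.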
I thought I had finally killed the beast that had been hunting me for nearly two years, so I went back and tried it again, this time on the GB60. Unfortunately, the results were not the same there. It kept failing consistently, just as it did before this change.
It seems I had only wounded the beast, and it was still loose.
Now, keep in mind that COP is the SRS register (at $1800 in both chips). It is NOT Flash, so technically there is no violation in writing to COP after latching with STA ,X. But, surprisingly, it made a difference, even if only for the QG8.
Conclusion:
This method is unreliable. Whether it works seems to depend on mask revisions or other hidden differences, and the fact that writing to COP alone makes any difference at all is disturbing. (In some early QTs, writing to COP at $FFFF, which was also part of Flash, caused serious problems, but in this case there is no such excuse.)
Regarding another issue, the 4-cycle delay, and whether we should obey it or not: like I said here (second-to-last paragraph), my experience shows that it does indeed work even without the delay. But unless one knows the exact internal logic, skipping it is unsafe. Isn't it possible, for example, that just one of the error conditions takes four cycles to show up, while all the others take just one cycle? Your code should be able to handle them all. The only reliable source for this is Freescale, unless someone can come up with a test for each and every possible error condition. And given today's results, with their unexpected and undocumented behavior, I would be very hesitant to use code that works only on certain people's birthdays.
I'll stick with what is working 100% for now, until there is brighter light on this issue.