QT4 security bytes being erased

irob · ‎02-16-2006

I'm a FreeGeeks refugee, so I'm here now posting on Freescale's forums. Curiously absent in the forum imports is my lengthy thread on security byte erasure. Have a look at the FreeGeeks archives for reference. There are a bunch of good replies/theories that didn't pan out for me. I'm posting a modified version of that original with all the latest findings:

I'm using both QT2 and QT4 MCUs on one common gerber design. My original firmware was designed for the QT2. Newer firmware utilizes the ROM-resident flash programming routines and as a result only fits in the QT4. The same fab is shared between firmware versions.

Symptom:

Under certain environmental conditions, target board fails a standard test. It was discovered that under these conditions, the jump vectors were being erased. As such, since the reset vector was reset to FF, the MCU was attempting to enter monitor mode and thus never executing user code. Curiously, the user code is NEVER affected, only the jump vectors.

Observations:

1) These failures almost seem ESD related. In isolated tests, we've been able to duplicate the failures by simply walking around and then touching the target.

2) The failures are only resident in my new firmware version (which uses the ROM-resident flash programming routines for memory storage).

3) If I load up these same QT4 targets with old firmware (non ROM-resident utilizing versions), I've never witnessed a failure.

4) Even after programming the flash block protect register (FLBPR @ $FFBE) this problem persists, which should be impossible.

5) There is a mask set errata published for the QY4 in which there's a section headed "Page Erase Can Cause Unexpected Erase of a Another FLASH Page". That's a very close theory, but there are some holes:

a) my mask sets are quite a bit newer than those published in the errata
b) I'm not using interrupts
c) I'm resetting the COP often.

Any thoughts?

Alban · ‎02-16-2006

Hi iRob,

Please don't see anything suspicious here, no newer post from Freegeeks was copied over... Mea Culpa if it's an older one because I was contacted to select quite a few posts as I was Mod on Freegeeks.net and I'm certainly not perfect.

Also when you read Yahoo forums, some contributors are really virulent against Freescale for having copied posts, so maybe they stopped. Only time will tell !

But let's continue on your subject Coz I don't want people to be afraid to post here or think a Mod will cut stuff out. If it happens I'll just give my Mod's cap back to the Admins.

The FLBPR is checked before Flash operation. As far as I know, even on devices touched by the Erratum, you wouldn't see the problem with FLBPR programmed to protect the vector page.

Which masks are you using ?
Do you know if the rest is still stored in memory ?
Do you see that because the application is going stupid OR only when you put the programmer back on ?
Which programmer are you using ?

My idea behind the 3 is that I've seen the newer QY are not behaving the same with security and maybe the programmer doesn't understand. In most of HC08, when the Security Fails, Flash memory will read as 0xAD. However in some QY, it reads 0x00.
I did cause trouble to my Multilink because it was always taking back the 0x000000000 after a failure because it thought it was OK. An update of the drivers cleared it out.

Cheers,
Alban.

Message Edited by Alban on 02-16-2006 09:32 AM

irob · ‎02-16-2006

Which masks are you using ?

Most of my MCUs are RMAE515. The errata mask set is: 3L69J. Not sure how those correlate, but most of my MCUs were produced the 15th week of 2005. Very recent.

Do you know if the rest is still stored in memory ?

Do you mean, is the rest of my program code still intact, even after a failure? Yes, that is always the case. Only security bytes are erased.

Do you see that because the application is going stupid OR only when you put the programmer back on ?

The former. I know this because in my application, it's very obvious when a failure happens. I have an I2C peripheral that suddenly stops getting communication from the MCU. This is the symptom. I then check it on the programmer and verify that the code is intact, but security bytes are all erased to 0xFF.

Which programmer are you using ?

The P&E Cyclone Pro. I already issued a service request to P&E about all this. They basically told me it's Freescale's problem. I've got an open SR with Freescale. They want more data.

In most of HC08, when the Security Fails, Flash memory will read as 0xAD. However in some QY, it reads 0x00.

Understand that when I first inspect a failed board with the Cyclone, security check fails. That's the first confirmation that the symptoms are the same. I then change the security check to all 8 bytes 0xFF (the erased state) and I can enter monitor mode and verify the rest of the user program.

Do you think this is a programming algorithm problem?

Alban · ‎02-17-2006

iRob,

I don't know what this RMAE515 means. Never seen this before. Thanks for mask.
Yes, I meant that
Good, so we can eliminate the programmer as being responsible !
Good stuff, I'm pleased with it, even if I don't use it often now I have the FSICE (new MMDS).

I have two ideas just now coming to my mind... FBPR = Flash Block Protection Register + FPR = Flash Programming Routines and how they're used in the soft
Indeed, it looks like there is something weird with the programming.

Do you erase and then program ?
Yes, it could be from one or the other. If no, which one ?

Can you post the line of your S-Record showing which value is programmed in the FBPR (Flash Block Protection Register) @ 0xFFBE?
Even the built-in flash programming routines are not supposed to override this protection; they are just in ROM to save user Flash space.
By making sure the FBPR is programmed to protect the vector page when using the Cyclone Pro, it will confirm if the FBPR function is OK.
If you program the FBPR in your soft, we can't really believe it as the vector page is erased and the FPR could be to blame. If you see what I mean.

If you can easily and quickly reproduce the fault:
Can you please tell us if the vector erase happens because of the Erase Range or Program Range routine? To learn this, you would need to add a comparison on the reset vector to 0xFFFF just before and after each function with a unique display on some port depending on what you see.
Once this is done and we see which one is causing trouble, we need to analyze how you call these functions as maybe something else is corrupting the stack when you do the function call !!

My goal would be to see where the fault is and if there's any problem with the on-chip FPR, we should be able to protect your application with FBPR .

Cheers,
Alban.

irob · ‎10-24-2006

Sorry I'm responding so late to this. But the problem has flared up again, so I'm needing to address it.

Alban wrote:
Do you erase and then program?

Yes, every time. Our programming cycle consists of:

Erase Device
Blank Check Device
Program Device
Verify Device

Alban wrote:
Can you post the line of your S-Record showing which value is programmed in the FBPR (Flash Block Protection Register) @ 0xFFBE?

Here's that line in my s-record:

S104FFBEF648

As you can see, the FBPR is clearly written to FFBE, and is set to 0xF6. According to the datasheet, this decodes to a starting address in flash as 0xFD80. My software writes user parameters to 0xFD40. This gives me one full page to write to.
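The decoding above can be cross-checked on a PC. The following is a minimal host-side C sketch, not target firmware. It assumes the usual S1 record layout (byte count, 16-bit address, data, one's-complement checksum) and assumes the block-protect value supplies address bits 13:6 with bits 15:14 set and bits 5:0 clear, which is my reading of the datasheet and reproduces the 0xFD80 quoted above. The function names `parse_s1_one_byte` and `flbpr_start` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert two ASCII hex digits to a byte value; returns -1 on a bad digit. */
static int hex_byte(const char *s)
{
    int v = 0;
    for (int i = 0; i < 2; i++) {
        char c = s[i];
        int d;
        if (c >= '0' && c <= '9')      d = c - '0';
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else return -1;
        v = (v << 4) | d;
    }
    return v;
}

/* Parse an S1 record that carries exactly one data byte.
   Returns 0 on success, -1 on a malformed record or checksum mismatch. */
static int parse_s1_one_byte(const char *rec, uint16_t *addr, uint8_t *data)
{
    assert(rec != NULL && addr != NULL && data != NULL);
    if (strlen(rec) != 12 || rec[0] != 'S' || rec[1] != '1')
        return -1;

    int b[5];            /* count, addr high, addr low, data, checksum */
    unsigned sum = 0;
    for (int i = 0; i < 5; i++) {
        b[i] = hex_byte(rec + 2 + 2 * i);
        if (b[i] < 0)
            return -1;
        if (i < 4)
            sum += (unsigned)b[i];   /* checksum covers all but itself */
    }
    if (b[0] != 4)                    /* 2 address + 1 data + 1 checksum */
        return -1;
    if (((~sum) & 0xFFu) != (unsigned)b[4])
        return -1;

    *addr = (uint16_t)((b[1] << 8) | b[2]);
    *data = (uint8_t)b[3];
    return 0;
}

/* Assumed decoding: first protected address = 0xC000 | (value << 6).
   Special register values (for example "nothing protected") are not handled. */
static uint16_t flbpr_start(uint8_t flbpr)
{
    return (uint16_t)(0xC000u | ((unsigned)flbpr << 6));
}
```

Feeding it `"S104FFBEF648"` gives address 0xFFBE, data 0xF6 and a valid checksum, and `flbpr_start(0xF6)` gives 0xFD80, so under these assumptions the page at 0xFD40 is the last unprotected one.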

Alban wrote:
If you can easily and quickly reproduce the fault:
Can you please tell us if the vector erase happens because of the Erase Range or Program Range routine?

No, I've never been able to reliably reproduce the error in the lab. However, I have much evidence that shows this error only started happening after I added EEPROM emulation with the ROM-resident routines. In the meantime, I've removed the erase function on a few sample targets and deployed them in the field. We'll see what happens.

Alban wrote:
...To learn this (Erase Range or Program Range causing error), you would need to add a comparison on the reset vector to 0xFFFF just before and after each function with a unique display on some port depending on what you see.

I'm not clear what you mean by "adding a comparison on the reset vector to 0xFFFF". Realize that when this fault happens, the reset vector is reset. What good would it do to add that comparison if the MCU is lost out of reset?

Message Edited by irob on 2006-10-24 02:34 PM

Alban · ‎10-24-2006

Hi iRob,

That's indeed an old one.

I read the whole thread again and there is something I'm thinking about.

The way the Flash prog/erase works on this is the address on the bus is latched to erase the matching page.

If you kick the COPs' @ss at this point, you will latch the last page address instead of the one of the dummy read you did ! I'm only working from memory here so may say something strange.

To clarify my last comment, I was saying that if before and after each of your prog/erase functions you check that the reset vector is not erased, you will be able to detect which of your functions erases the vector page.

From what I read in your new message, it looks like removing the erase function does solve the problematic behaviour.

This would tend to confirm my bla-bla above.

If you use ROM Flash Erase routine and have no control over the COP within, you would have to create a new clean one in Flash.

Solution would be NOT to kick the COP in the erase step where the address on bus is latched.

If it's too long for your COP, try and see with the LONG COP period.

OK ?

Cheers,

Alban.

bigmac · ‎10-25-2006

Hello iRob,

Further to Alban's suggestion to monitor when the vector erasure occurs in relation to the operation of your program, in order to help diagnose the problem, here is a suggestion -

Should the vectors become erased, normal operation of your program should continue up to the point where an interrupt event occurs, or there is a reset. I am suggesting to allocate an additional byte, within the flash page that holds your non-volatile data, for the purpose of writing an error code. Then implement the following sequence within the portion of code that handles the writing of data -

Disable interrupts

Initialise a test counter

Vector test routine - just in case there is some other cause

Increment test counter

Erase flash page

Vector test routine

Increment test counter

Write flash data

Vector test routine

Repeat for other flash pages to be erased and written.

Re-enable interrupts, as required.

The vector test routine would simply test the byte value at address $FFFF, and if found to contain the value $FF, write the current test count value as an error code, and then cause an immediate reset (perhaps by the illegal instruction method).

Subsequent examination of the flash contents would enable you to identify the position in code where the vector had become corrupt.
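The sequence above can be tried out on a PC before porting it to the target. What follows is a host-side C sketch only: the 64 KB array stands in for the memory map, and `sim_init`, `sim_erase_page`, `inject_fault`, `fault_after_step` and `store_checked` are hypothetical names, not ROM routines. The 64-byte page size and the placeholder reset vector value are assumptions, and on real hardware the vector test would force a reset instead of returning.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  64u       /* assumed flash page size */
#define RESET_LO   0xFFFFu   /* low byte of the reset vector */
#define NO_ERROR   0xFFu     /* erased flash reads 0xFF, so 0xFF = nothing logged */

static uint8_t flash[0x10000];   /* host-side stand-in for the memory map */

/* Hypothetical fault hook: step after which the simulated part wrongly
   wipes the vector page (1 = after erase, 2 = after write, 0 = never). */
static int fault_after_step;

static void sim_init(void)
{
    memset(flash, 0xFF, sizeof flash);
    flash[0xFFFE] = 0xEE;        /* placeholder reset vector, low byte != 0xFF */
    flash[0xFFFF] = 0x00;
}

static void sim_erase_page(uint16_t addr)
{
    memset(&flash[addr & ~(PAGE_SIZE - 1u)], 0xFF, PAGE_SIZE);
}

static void inject_fault(int step)
{
    if (step == fault_after_step)
        sim_erase_page(RESET_LO);
}

/* Vector test: if the reset vector low byte reads erased, log the current
   test count as the error code. Returns 1 when a fault was logged. */
static int vector_test(uint16_t err_addr, uint8_t count)
{
    if (flash[RESET_LO] != 0xFF)
        return 0;
    flash[err_addr] = count;
    return 1;
}

/* Erase one data page and write len bytes, testing the vector around each
   step. Returns the logged error code, or NO_ERROR if the vector survived. */
static uint8_t store_checked(uint16_t page, uint16_t err_addr,
                             const uint8_t *data, unsigned len)
{
    uint8_t count = 0;
    assert(len <= PAGE_SIZE);

    if (vector_test(err_addr, count)) return count;  /* some other cause */
    count++;
    sim_erase_page(page);
    inject_fault(1);
    if (vector_test(err_addr, count)) return count;  /* erase is the culprit */
    count++;
    memcpy(&flash[page], data, len);
    inject_fault(2);
    if (vector_test(err_addr, count)) return count;  /* write is the culprit */
    return NO_ERROR;
}
```

With `fault_after_step` set to 1 the logged code is 1 (erase), with 2 it is 2 (write), and with 0 the error byte stays at 0xFF. Interrupt masking is left out because it has no meaning on the host.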

A while back I did disassemble part of the ROM code of a QY device from my existing stock (of more recent manufacture), and found it correlated to the code listing given in AN1831.

Incidentally, I have been using the ROM routines, without apparent problem, in a mature product of at least three years. I erase the first page of flash, and then program the first two bytes only. However, this will generally occur only a few times during the product lifetime - perhaps I have been lucky. Long COP timeout is enabled, and I reset the COP timer a few instructions prior to the ERARNGE sub-routine being called.

Regards,

Mac

irob · ‎10-25-2006

Thanks for the debug ideas, Big Mac. I like the idea of using a test routine in the field. I also like your idea of resetting the COP right before flash writes/erases.

Incidentally, is the COP an issue with S08 parts which have the radically different flash architecture? I'm wondering if I should implement a similar COP reset right before data storage and erasure. I guess it couldn't hurt, huh?

Alban · ‎10-26-2006

Yo,

The S08 Flash is different in the way to deal with it.
It's a state machine and writing to the COP during erase will not corrupt anything.

Alban.

Alban · ‎10-24-2006

Reading twice is quite good.

I do confirm what I just wrote because of points 2 and 3 of your original post + me reading Erratum

The ROM Erase routine probably refreshes the COP and latches 0xFFFF.

Do not use internal Flash Erase routine on this mask. Create your own in Flash.

Do not refresh Ze Cop in the middle according to Erratum of the 3L69J you mentioned.

This way you will be fine !

Alban.

irob · ‎10-24-2006

Thanks for the replies, Alban, much appreciated!

A couple more questions:

1) I looked in my source code archives and examined all the versions which include ROM-resident EEPROM emulation. They all have the following settings in CONFIG1: 0x11, which decodes to LVI disabled and COP disabled. Looks like I already disabled the COP module, per the advice in the errata.

Now if the ERARNGE() function in ROM attempts to write to 0xFFFF to reset the COP, is it still possible that the address is being latched, even if I've disabled the COP module?

2) Do you have any sample code or perhaps the source for ERARNGE() in the ROM?

Alban · ‎10-25-2006

iRob,

1- Totally. It is not linked to the COP functionality but to the address present on the bus. Even with a COP disabled, if the software writes at 0xFFFF, that address could be latched.
You would have a similar behaviour on another page if the soft was accessing data somewhere else in Flash.
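To see which page a stray latched address would hit, here is a small host-side C sketch of the page arithmetic. It assumes 64-byte erase pages (consistent with the 0xFD40 to 0xFD80 "one full page" mentioned earlier in the thread) and, from my recollection of the HC08 memory map rather than from this thread, security bytes at 0xFFF6 to 0xFFFD. `page_base` and `same_page` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 64u   /* assumed erase page size */

/* First address of the erase page that contains addr. */
static uint16_t page_base(uint16_t addr)
{
    assert((PAGE_SIZE & (PAGE_SIZE - 1u)) == 0);   /* must be a power of two */
    return (uint16_t)(addr & ~(PAGE_SIZE - 1u));
}

/* Nonzero when both addresses would be wiped by the same page erase. */
static int same_page(uint16_t a, uint16_t b)
{
    return page_base(a) == page_base(b);
}
```

Under these assumptions `page_base(0xFFFF)` is 0xFFC0, so an erase latched on 0xFFFF would take out the reset vector and the security bytes together, while the FLBPR at 0xFFBE sits in the page below and the data page at 0xFD40 is untouched. That would match the symptom of vectors and security bytes reading 0xFF with user code intact.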

Alban.

bigmac · ‎10-25-2006

Hello iRob,

Have a look at AN1831 (Rev 3) and AN2635 for the source code for the ROM based programming routines.

Regards,

Mac

irob · ‎10-27-2006

bigmac wrote:
Hello iRob,
Have a look at AN1831 (Rev 3) and AN2635 for the source code for the ROM based programming routines.
Regards,
Mac

Mac, the ERARNGE functions listed in both of those app notes are considerably different from each other. Strangely, there's reference in the AN1831 to "page erase step 1 through step 6". Yet, I can't find these steps in that document. However, these steps are clearly laid out in AN2635's ERARNGE.

irob · ‎10-27-2006

Anyone know what the values of these bytes are:

mERASE
mMASS
mHVEN

They are referenced in both App Notes that Big Mac mentioned. I'm guessing they are the masks of the bits in FLCR, but they're not listed as equates in either pdf.

bigmac · ‎10-29-2006

Hello iRob,

This is a summary of my observations for the ERARNGE function -

AN1831 implies that the QT/QY series does not have a page erase problem with its version of the ERARNGE routine, but many other listed devices do have a problem. The source code included within the note is applicable to the problem devices, and is apparently not a "corrected" version, because the work-around code is then described. The ERARNGE routine clears the COP timer a number of times within a timing loop, and once again when the loop completes. This occurs after the "write to any address within the erase range" has occurred, and the HVEN bit is set.
When I disassembled the code for the ERARNGE routine contained within a more recent QY device, I found exact correlation with the "bad" code of AN1831 (including the COP timer resets).
AN2635 is applicable to the QY4A series devices, and the ERARNGE routine is not unexpectedly slightly different than that for the QY4 series. The source code given is for the LB8 device, with the assumption that it would also apply to the other devices listed in the note. The testing of the bulk erase bit is different, and the Tnvs delay does not call the DELNUS routine, but otherwise the differences would appear to be minor. It is interesting to note that the COP reset occurs in exactly the same manner as for the AN1831 code.

Obviously, the occurrence of COP timer reset within ERARNGE is not necessarily causing a problem, since it occurs after the address information has been latched, and also remains present for the more recent devices. I can only surmise that the issue lies within the operation of the hardware for the problem devices - perhaps the erase address is capable of being "re-latched" for some, but not for others.

This seems to make the cause of the erased vectors for the QT4 device even harder to plausibly explain, assuming any hardware issues have been corrected for later mask sets. There is certainly no explanation why the problem should be an intermittent one.

Anyone know what the values of these bytes are:

mERASE
mMASS
mHVEN

You should find the values for these bit masks within the .inc file for the particular device.

Regards,

Mac

irob · ‎10-30-2006

All true, Mac. Especially in that the QT/QY parts have never been listed by Freescale to have the problematic vector erasure when using ERARNGE.

But given the breadth of the problems I'm seeing in the field and frustration of my customers, I'm forced to at least try patching firmware with the measures recommended to the other MCU families. Short of a recall and hardware patch (adding external EEPROM or updating to newer QG4's), this seemed the speediest solution.

Onto ERARNGE, there seem to be at least two versions between AN1831 and AN2635, as those two copies are quite different from one another. QT/Y's are only mentioned in AN1831, so I guess I'll stick with that version.

Message Edited by irob on 2006-10-30 10:27 AM

Alban · ‎10-31-2006

All true, Mac. Especially in that the QT/QY parts have never been listed by Freescale to have the problematic vector erasure when using ERARNGE.

An Erratum is a defect in the device relative to information specified in the datasheet.

The device datasheet does not mention the use of these functions but AN1831 does.
There is no AppNote Erratum in place. Instead, the App Note is usually corrected.
I'm contacting the Technical Publication.

For ERARNGE, give me your mask set (3L69J ?) and I shall find you some code.

Alban.

Alban · ‎11-01-2006

iRob,

I've been contacted offline by a Freescaler on this.

I made a mistake by generalizing the Erratum from 3L69J.

Generally QY/QT are not 3L69J, therefore AN1831 is correct.

You are likely to only have newer products (like the RMAE515 and others which look like lead free parts) and ERARNGE would be corrected.

Have you tried to implement former propositions ?

Cheers,

Alban.

irob · ‎11-01-2006

I'm working on that right now. With much help, I have my new erase function now compiling.

I haven't been able to see it work however. As a test, I wrote a sample firmware that burns test bytes into my flash "EEPROM" page and then my user code simply calls the new ERARNGE. For whatever reason, nothing gets erased the way it does with the ROM-resident version.

But now I'm going to port this over to my DEMO908QG8 board to trace it.

irob · ‎11-01-2006

I'm so far unable to get my own ERARNGE function to do anything. I've started comparing my code (grabbed right from AN1831 with a few modifications for working with the CW5 include files) to Big Mac's attached sample code.

One noticeable difference is in his line 424 of ROM_CODE.asm:

BRCLR 6,CTRLBYT,*+5 ; Skip next if no mass erase

and my comparable line:

BRCLR mFLCR_MASS,aCTRLBYT,AMBS

The value of mFLCR_MASS is defined in MC68HC908QT4.inc as:

mFLCR_MASS: equ %00000100

That's $04, not 6, as in Big Mac's code. This all boils down to what is meant by "MASSBIT". In the original AN1831 code:

BRCLR MASSBIT,CTRLBYT,AMBS

Freescale uses "MASSBIT" and of course never defines what that is anywhere. But it seems obvious to me that they are referring to the mask for the MASS bit.
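Since the two lines differ only in that first operand, it may help to compare a bit number with a bit mask side by side. This is a host-side C sketch, not HC08 code. It assumes that BRCLR's first operand is a bit number from 0 to 7 (as in the HC08 instruction set) and that CTRLBYT is a separate control byte whose bit 6 flags a mass erase, as Big Mac's `BRCLR 6,CTRLBYT` line suggests. `brclr_taken` is a hypothetical model of the branch condition, not a real routine.

```c
#include <assert.h>
#include <stdint.h>

/* Bit number (0..7) to single-bit mask, e.g. 2 -> 0x04. */
static uint8_t mask_from_bit(unsigned bit)
{
    assert(bit < 8);
    return (uint8_t)(1u << bit);
}

/* Single-bit mask to bit number, e.g. 0x04 -> 2; -1 if not a single bit. */
static int bit_from_mask(uint8_t mask)
{
    if (mask == 0 || (mask & (mask - 1)) != 0)
        return -1;
    int bit = 0;
    while ((mask >>= 1) != 0)
        bit++;
    return bit;
}

/* Model of the BRCLR condition: branch is taken when bit n of value is clear. */
static int brclr_taken(unsigned bit, uint8_t value)
{
    assert(bit < 8);
    return (value & (1u << bit)) == 0;
}
```

Here `bit_from_mask(0x04)` is 2 and `mask_from_bit(6)` is 0x40. If the mask value %00000100 (decimal 4) were taken as a bit number, the test would look at bit 4, so with only bit 6 set in the control byte `brclr_taken(4, 0x40)` branches while `brclr_taken(6, 0x40)` does not. Whether that explains the non-working erase is not established here; it is only something to rule out.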