Flash Corruption

manishsangram · ‎10-07-2019

We are using S12ZVMC128 chip for our motor application.

We are facing a strange issue with the flash getting corrupted. The chip is okay and we do not have any boot loader. We flash the chip using PEMicro multilink programmer.

Furthermore, the application code never writes to the flash after programming.

This is very frustrating because after re-flashing such a chip it works fine, until the issue recurs. That means there is nothing physically wrong with the flash.

Flash is both "Secured" and "Protected" and there is no BackDoor key nor boot loader.

What could cause this issue?

kef2 · ‎02-05-2020

Hi,

ME is not an ISR (interrupt service routine), it's an MER (machine exception routine), which, as was said, can't return normally because no return address is stored on the stack.

Here's how I do it; this has been in production for quite a long time:

Returning from the MER requires either disassembling the code at the address held in the MMCPC register or something else. We need the length of the instruction that MMCPC points to. An S12Z instruction can be anything from 1 to 11 bytes long. Knowing the instruction length, you could skip past it and return, so that the machine exception doesn't fire again and again on the same CPU instruction.

Instead of disassembling or calculating the instruction length, I do something simpler: I put enough NOPs after the C line that accesses the EEPROM and return from the MER to anywhere among those NOPs.

EEPROM read function looks like this

volatile char EeepromECCUncorrectable; // set by the MER handler on an EEPROM ECC fault

static void fixecc(void *addr, int size); // defined below

void ReadEepromByte(void *src, char *dst)
{
  *dst = *(char*)(src); // read data
  // 11 NOPs: landing zone for the return from the MER handler
  asm {
    NOP
    NOP
    NOP
    NOP
    NOP
    NOP
    NOP
    NOP
    NOP
    NOP
    NOP
  }
  fixecc(src, sizeof(char));
}

static void fixecc(void *addr, int size)
{
  if(EeepromECCUncorrectable)
  {
    // get EEPROM sector aligned address (sectors are 4 bytes)
    unsigned long strtaddr = (unsigned long)addr & ~3UL;
    unsigned long endaddr = (unsigned long)addr + size;

    // erase first sector
    EEpromErasesect( (void*)strtaddr );

    // in my case size is up to 8, sizeof(long long)
    // check if size crosses sector boundary and erase 2nd sector
    if((endaddr - strtaddr) > 4)
      EEpromErasesect( (void*)(strtaddr+4) );

    EeepromECCUncorrectable = 0;
  }
}
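
To make the sector arithmetic in fixecc() easier to check, here is a minimal host-side sketch. It assumes the 4-byte sector size used in the code above; the helper name sectors_touched and the addresses used to exercise it are hypothetical and not part of the original code.

```c
#include <stdint.h>

#define EEPROM_SECTOR_SIZE 4u /* sector size used by fixecc() above */

/* Number of 4-byte sectors touched by the byte range [addr, addr + size). */
unsigned sectors_touched(uint32_t addr, uint32_t size)
{
    /* Round the start address down to a sector boundary, as fixecc() does. */
    uint32_t start = addr & ~(uint32_t)(EEPROM_SECTOR_SIZE - 1u);
    uint32_t end = addr + size; /* one past the last byte of the range */

    /* Round the covered length up to whole sectors. */
    return (end - start + EEPROM_SECTOR_SIZE - 1u) / EEPROM_SECTOR_SIZE;
}
```

For example, a 2-byte value starting at offset 3 of a sector touches two sectors, which is the case the second EEpromErasesect() call covers. An unaligned 8-byte value would touch three sectors, so the two erase calls above appear to rely on 8-byte values being stored sector-aligned.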

MER handler:

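#pragma NO_ENTRY
#pragma NO_EXIT
static void MachineException(void)
{
  asm{
    // create RTI stack frame: reserve 3 bytes for the return PC, then push all registers
    LEA S, (-3,S)
    PSH ALL

    // check MMCECH.TGT (low nibble; the byte shift discards the upper nibble)
    LSL.B D0, MMCECH, #4
    CMP  D0, #3 << 4
    BNE  L1
    // ME fault was caused by EEPROM ECC
    ST  D0, EeepromECCUncorrectable // D0 is nonzero here, so this sets the flag
    LD  X, MMCPC
    LEA  X, (11,X) // longest S12Z instruction is 11 bytes long. Please fill the location
                   // after the EEPROM read access with enough NOPs
    ST  X, (26, S) // store RTI return address
    RTI            // return into the NOPs after the EEPROM read
  L1:
  }
  // for everything else go loop forever or until COP timeout
  for(;;)
  {
    static unsigned int a; // static: starts at 0 and keeps its value across iterations

    a++;
    if(!a)
      DEBUGPIN25_PORT ^= DEBUGPIN25_PIN; // toggle the debug pin each time the counter wraps
  }
}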

Edward

manishsangram · ‎02-02-2020

Hello Radek Sesta,Matej Pacha,

Is there a tool or a way to only verify the flash in an S12Z (without overwriting it or attaching to run) against an existing .elf file and highlight the differences? In the context of the original problem, we would like to compare the flash contents of the device with the original .elf to see what has changed.

RadekS · ‎02-03-2020

Hi Manish,

You may use the Flash Programmer task for that.
1. Go to menu->Window->Show View->Target Tasks.
If this option isn't visible in the list, select Other... and search for it in the Debug folder.

2. Create a new target task.

Choose a name, select the appropriate debug configuration and select Flash Programmer for S12Z as the task type.

3. Add your MCU as a device.

4. Add a program/verify action.

Select the path to the elf file and click on add verify action.

5. Select the created target task and run it.

6. Display the Flash Programmer Console for more details.

I hope it helps you.

Have a great day,
Radek

manishsangram · ‎02-04-2020

Hi RadekS‌

I ran the verify as described above; however, it did not detect any error. The console output is given below (I am not sure whether it actually verified anything, because no per-memory details are provided).

fl::target -lc "MyPROG_FLASH_PnE U-MultiLink-128"
fl::target -b 0x1000 0x800
fl::target -v off -l off
cmdwin::fl::device -d "MC9S12ZVMC128_FLASH" -o "128Kx32x1" -a 0xfe0000 0xffffff
cmdwin::fl::image -f "MyPROG.elf" -t "Auto Detect" -re off -oe off
cmdwin::fl::verify
Beginning Operation ...
-------------------------
C:\Code\MyPROG.elf
Performing target initialization ...
Flash Operation. ...
Auto-detection is successful.
File is of type Elf Format.

Device MC9S12ZVMC128_FLASH
Verify Command Succeeded

I even verified the EEPROM and it gave the same Verify Command Succeeded message. That is surprising, because the EEPROM has been written to during program execution, so its contents should not be the same as when they were originally flashed into the MCU!

Next I 'Attached' to debug and found that the MCU goes into VME (machine exception).

The first line where there is a problem is in _EntryPoint in CPU.c, code that was generated long ago and otherwise works perfectly.

setReg8(CPMUSYNR, 0xF1U);
ff0f02: 0CF106C4 MOV.B #-15,1732

If I skip this line (jump over it), then the next error occurs at

EEp_GetWord()

Initially I thought that this function (generated by ProcessorExpert to read from EEPROM) caused the VME because I had skipped the line setting CPMUSYNR above, so that the timing was off. This code, too, otherwise works perfectly.

However...

The VME occurs when RETURNING from the function!!!! It happens in the PE generated code of the GetWord function, at the following lines.

*Data = *Addr; /* Return data from given address */
return ERR_OK; /* OK */

I even checked that the *Data has the valid value from *Addr, so the data copy did not fail but the return ERR_OK failed !!! Just can't understand this!

manishsangram · ‎02-04-2020

Further to the above :

When VME occurs,

FERSTAT = 3

Flash Error Status Register

Bit Field Values:
bits[ 7:7 ] = 0
bits[ 6:6 ] = 0
bits[ 5:5 ] = 0
bits[ 4:4 ] = 0
bits[ 3:3 ] = 0
bits[ 2:2 ] = 0
DFDIF bits[ 1:1 ] = 1 Double bit fault detected or a Flash array read operation returning invalid data was attempted while command running.
SFDIF bits[ 0:0 ] = 1 Single bit fault detected and corrected or an invalid Flash array read operation attempted
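
For logging this from firmware, a minimal sketch of decoding the two flags could look as follows. The bit positions are taken from the dump above; the enum and function names are hypothetical. As the register description says, both flags can also be raised by an invalid read while a flash command is running, so the result is only a hint.

```c
#include <stdint.h>

#define FERSTAT_SFDIF 0x01u /* bit 0: single bit fault detected and corrected */
#define FERSTAT_DFDIF 0x02u /* bit 1: double bit fault detected */

typedef enum {
    ECC_NONE,                /* neither flag set */
    ECC_SINGLE_CORRECTED,    /* only SFDIF set */
    ECC_DOUBLE_UNCORRECTABLE /* DFDIF set, with or without SFDIF */
} ecc_severity;

/* Classify a raw FERSTAT value by its most severe flag. */
ecc_severity ferstat_severity(uint8_t ferstat)
{
    if (ferstat & FERSTAT_DFDIF)
        return ECC_DOUBLE_UNCORRECTABLE;
    if (ferstat & FERSTAT_SFDIF)
        return ECC_SINGLE_CORRECTED;
    return ECC_NONE;
}
```

With FERSTAT = 3 as observed here, the double bit flag is set, so the value is classified as uncorrectable.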

manishsangram · ‎02-04-2020

Assume for a moment that the issue has something to do with EEPROM corruption. Is there any way to protect the EEPROM like the flash, then open a window to write to the EEPROM when required and close the window again?

Also, is there a way to format only the EEPROM area from the debugger shell? I want to format only the EEPROM from outside, as if it were a freshly 'flashed' device, and see if the problem gets resolved. Alternatively, is there a way to write only the EEPROM from the elf file (which in our case is the same as formatting)?

RadekS · ‎02-05-2020

Hi Manish,

I am not aware of such a straightforward EEPROM editing feature. You have to execute a flash algorithm for erasing/programming Flash/EEPROM.
I am also not sure about the debugger shell.

However, you may use a Target Task with the Flash Programmer for such operations.

Create a new task according to the guide above and simply use an erase action followed by a program action instead of a verify action.

Note: as a source, you may even use an S-record file instead of an elf file. You may also restrict erasing/programming to a specific memory range.

I hope it helps you.

kef2 · ‎02-05-2020

When ME happens you should check the MMCECH register. The TGT field says whether it was caused by a problem in EEPROM or not.

EEPROM write/erase protection won't help you, since the issue most likely happens due to an incomplete write/erase on sudden power down/reset. ECC failure due to a double write without erase is easy to handle by proper coding. But backing up power until the EEPROM command completes is trickier to implement. Even if you implement it in HW and SW, you should still have some code to recover from ECC errors in EEPROM. Since there are no registers to tell where an EEPROM ECC error occurs, the best way in my opinion is to register which EEPROM cell you are going to read and unregister it after the read. If ME with MMCECH.TGT==3 happens in between, erase the registered cell (with broken ECC) immediately, then perform the (tricky) return to the EE read routine, or reset if you wish.
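
A minimal sketch of that register/unregister bookkeeping, written so that it also compiles on a host PC. All names here (pending_read_addr, last_erased_sector, eeprom_erase_sector, on_eeprom_ecc_fault, eeprom_read_byte) are hypothetical, the erase routine is only a stand-in, and the return from the machine exception is deliberately left out.

```c
#include <stdint.h>

/* Address of the EEPROM cell being read; 0 means no read is in flight. */
volatile uintptr_t pending_read_addr;
/* Demo only: remembers what the stand-in erase routine was asked to erase. */
uintptr_t last_erased_sector;

/* Hypothetical stand-in for the real EEPROM sector erase command. */
void eeprom_erase_sector(uintptr_t sector_addr)
{
    last_erased_sector = sector_addr;
}

/* To be called from the machine exception routine when MMCECH.TGT == 3.
   Returns 1 if a registered cell was erased, 0 if no read was registered. */
int on_eeprom_ecc_fault(void)
{
    if (pending_read_addr == 0)
        return 0;
    /* Erase the 4-byte sector that contains the registered cell. */
    eeprom_erase_sector(pending_read_addr & ~(uintptr_t)3);
    pending_read_addr = 0;
    return 1;
}

uint8_t eeprom_read_byte(const volatile uint8_t *src)
{
    uint8_t value;

    pending_read_addr = (uintptr_t)src; /* register the cell */
    value = *src;                       /* on the target this read may raise ME */
    pending_read_addr = 0;              /* unregister after a clean read */
    return value;
}
```

Using 0 as the "nothing registered" marker assumes that no EEPROM cell lives at address 0. If a fault arrives while nothing is registered, the handler has no cell to repair and a reset is probably the safer reaction.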

Edward

manishsangram · ‎02-05-2020

Is there any reference or tested code for an ME ISR that recovers? I fully understand that such code will be specific to each application, but it would be great to get hold of such an ISR to get a general sense and some code to start with. Even if it only handles EEPROM, it would be great!

RadekS · ‎02-05-2020

Hi Manish,
Unfortunately, I am not aware of any official code for an ME ISR.
I can imagine simple code that tests the MMCEC register (e.g. switch-case code) and takes appropriate actions for the different error sources.

The actions are fully application dependent, and there may also be specific code snippets just for debugging in special mode (like capturing the CCR and PC values from the MMCCCR and MMCPC registers, ...).

Only the exit from ME is limited, because the core register values and the return address are missing from the stack (nothing is stacked when ME is invoked). So we cannot use the standard RTI instruction as in a normal ISR. An MCU reset or a jump to a specific address is the way to leave the ME ISR safely.
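
As a rough illustration of the switch-case idea, here is a minimal sketch that classifies the error source from the high byte of that register (MMCECH). Only the value TGT == 3 for EEPROM comes from this thread; the other TGT codes are not spelled out here and should be taken from the reference manual. The enum and function names are hypothetical.

```c
#include <stdint.h>

#define MMCECH_TGT_MASK   0x0Fu /* TGT field: low nibble of MMCECH */
#define MMCECH_TGT_EEPROM 3u    /* from this thread: TGT == 3 means EEPROM */

typedef enum {
    ME_ACTION_RESET,        /* unknown or unrecoverable source */
    ME_ACTION_REPAIR_EEPROM /* EEPROM ECC fault: erase the cell, then recover */
} me_action;

/* Decide what to do based on the raw MMCECH value captured in the handler. */
me_action classify_machine_exception(uint8_t mmcech)
{
    switch (mmcech & MMCECH_TGT_MASK) {
    case MMCECH_TGT_EEPROM:
        return ME_ACTION_REPAIR_EEPROM;
    default:
        /* Add further cases for other TGT codes from the reference manual. */
        return ME_ACTION_RESET;
    }
}
```

Treating every unlisted source as a reset keeps the default path safe, in line with the remark that a reset or a jump to a fixed address is the safe way out.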

I hope it helps you.

Best regards

Radek

RadekS · ‎02-05-2020

Hi Ed,

you are right.

just BTW: the simple commented example code for obtaining EEPROM address with corrupted data https://community.nxp.com/docs/DOC-344440

BR

Radek

RadekS · ‎10-11-2019

Hi Manish,

It is hard to say what the root cause of such an issue is.

Could you please describe the main principle of your flash corruption detection?

You mentioned secured MCU (limited debug access) and that MCU works fine after reflashing. Therefore I suppose that the NOK chip is not able to spin the motor. Correct? Any other behavior changes?

Any parameters stored in EEPROM? Any update of these parameters in runtime?

Is there any strong magnetic field near MCU (power lines, magnets,...)?

Best regards

Radek

manishsangram · ‎10-11-2019

Hello @Radek Sestak

When the flash goes 'corrupt' we see the following

1. GPIO (button) interrupts which should cause some actions like motor rotation do not work.

2. GPIO (button) interrupts do not write out logs via UART

3. UART commands (sent via PC USB-UART) do not work and no response is sent back (ack text is expected)

All external voltages, including VDDX, VDDA, VSUP etc., are okay.

Attaching the board to the PEMICRO programmer for re-flashing does not cause any issue. It detects that the FLASH is secured and asks for permission to unsecure it for programming. Once flashed, the system starts working.

EEPROM : We do use EEPROM for storage of parameters and write them at runtime. These parameters can cause faulty application behaviour but not a 'dead device'.

We have at least 1 board with an MCU where the EEPROM double bit ECC fault occurs repeatedly, and that board has been kept for further study. We do not know the root cause that made the EEPROM go bad. It can't be the number of writes, because we do not write more than 10-20 times a day, which should give us many years of life.

There is a small BLDC motor running 2cm away from the chip.

kef2 · ‎02-03-2020

Hi,

I think your problem is not corruption of flash, but an ECC double bit fault in EEPROM, which triggers a machine exception (ME). And if an ECC ME happens, you can't just let it run; the code won't recover without special code to handle the EEPROM ECC ME. And an EEPROM ECC fault is easy to get if the unit can be unpowered while an EEPROM erase/program is in progress...

Edward

danielmartynek · ‎10-08-2019

Hello Manish,

Just to be sure, is the VDDF voltage within specification (Table B-1, #3, S12ZVM RM rev2.13)?

Is the TEST pin connected to the ground?

Thanks,

BR, Daniel

manishsangram · ‎10-09-2019

Hi @Daniel Martynek

I can confirm that the TEST pin is grounded.

I have not checked the VDDF voltage; where would you like me to check it? However, logically, even if VDDF goes out of range, it should be restored on reboot, right? This does not happen.

Since the Protect and Secure bits are on, it should not be possible for the flash to get corrupted due to wrong instructions being read, should it? I mean, it can happen in a freak accident once in a while, but it cannot be repeated many times.

We have to reprogram the flash for it to work again. This means there is nothing wrong with the physical circuit, because after the flash is reprogrammed it works properly again.

Also, we have the temperature being monitored. Externally we are at 50-60 °C max.

Since the protection and security bits are on, we cannot debug after such an event.

danielmartynek · ‎10-10-2019

Hi Manish,

I have seen such unpredictable behavior when the VDDF pin was shorted to VDDX/VDDA.

The MCU becomes secured during the reset sequence either when the SEC bits copied from the Flash to the FSEC register are not 0b10 (SEC != 0b10), or when a double bit fault is detected while reading the P-Flash phrase containing the Flash security byte. The default out-of-reset value of FSTAT_MGSTAT[1:0] is 0b00, unless a double bit fault is detected during the reset sequence. Could you please read the FSTAT register after startup?
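
If it helps, the two checks can be written down as a minimal sketch. The rule SEC != 0b10 and the 0b00 default of MGSTAT come from the paragraph above; the assumption that SEC[1:0] sits in bits 1:0 of FSEC and MGSTAT[1:0] in bits 1:0 of FSTAT should be confirmed against the reference manual, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define FSEC_SEC_MASK      0x03u /* assumed position of SEC[1:0] in FSEC */
#define FSEC_SEC_UNSECURED 0x02u /* only SEC == 0b10 leaves the MCU unsecured */
#define FSTAT_MGSTAT_MASK  0x03u /* assumed position of MGSTAT[1:0] in FSTAT */

/* True if the FSEC value means the MCU is secured. */
bool mcu_is_secured(uint8_t fsec)
{
    return (fsec & FSEC_SEC_MASK) != FSEC_SEC_UNSECURED;
}

/* True if FSTAT, read right after reset, shows a nonzero MGSTAT field,
   which per the text above indicates a fault during the reset sequence. */
bool reset_saw_flash_fault(uint8_t fstat_after_reset)
{
    return (fstat_after_reset & FSTAT_MGSTAT_MASK) != 0u;
}
```

FSTAT has to be read before the application issues any flash command for this check to be meaningful, since later commands can change MGSTAT.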

Thanks,

BR, Daniel