K64 Flash write sometimes fails

larrydemuth · 01-30-2019

I have an application that writes a new image to the "swap1" flash bank. Most of the time it works, but every now and then a write fails with an IO_ERROR_WRITE error. This happens even if I use the same image and try it multiple times.

I am using the K64FN part and MQX 4.1.

I made sure the writes are 8 bytes at a time, phrase aligned, and sequential, so I never write to the same phrase twice, and each phrase is erased before it is written.
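The preconditions listed above can also be checked in software just before each write. The following is a minimal sketch, assuming an 8-byte phrase and a memory-mapped region that can be read through a plain pointer. `PHRASE_SIZE` and `phrase_write_allowed` are hypothetical names used for illustration and are not part of MQX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PHRASE_SIZE 8u  /* assumed program-phrase size in bytes */

/* Hypothetical helper: returns true only if the destination is phrase
   aligned, the length is exactly one phrase, the phrase lies inside the
   region, and every destination byte still reads as erased (0xFF).
   'mem' points at the start of the mapped region and 'offset' is the
   byte offset of the phrase within it. */
static bool phrase_write_allowed(const uint8_t *mem, size_t mem_size,
                                 size_t offset, size_t num_bytes)
{
    assert(mem != NULL);

    if (num_bytes != PHRASE_SIZE) {
        return false;
    }
    if ((offset % PHRASE_SIZE) != 0u) {
        return false;
    }
    if (offset > mem_size || (mem_size - offset) < PHRASE_SIZE) {
        return false;
    }
    for (size_t i = 0; i < PHRASE_SIZE; i++) {
        if (mem[offset + i] != 0xFFu) {
            return false;   /* phrase was already programmed */
        }
    }
    return true;
}
```

Calling this before `_io_write` and logging a failure would show whether an intermittent error lines up with a violated precondition or not.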

I did some testing and set a breakpoint when the write fails (see below) and here is what I found:

The address the write fails on is random.

The address IS on a phrase boundary, and 8 bytes are to be written.

The phrase appears to have been written to flash because I can look at memory when it fails and the written phrase is there.

If I look at the FTFE registers the FSTAT is 0x80 (no errors) and the FCCOB registers contain the proper information for the write. The command is 0x07 (program phrase).
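For reference, an FSTAT value can be decoded bit by bit. This is a minimal sketch, assuming the usual Kinetis FTFE bit layout (CCIF = 0x80, RDCOLERR = 0x40, ACCERR = 0x20, FPVIOL = 0x10, MGSTAT0 = 0x01). The `DEMO_` macros and the helper functions are local stand-ins for illustration, not the vendor header definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the FSTAT bit masks (assumed layout) */
#define DEMO_FSTAT_CCIF     0x80u  /* command complete */
#define DEMO_FSTAT_RDCOLERR 0x40u  /* read collision error */
#define DEMO_FSTAT_ACCERR   0x20u  /* access error */
#define DEMO_FSTAT_FPVIOL   0x10u  /* protection violation */
#define DEMO_FSTAT_MGSTAT0  0x01u  /* command completion status */
#define DEMO_FSTAT_ERRORS   (DEMO_FSTAT_RDCOLERR | DEMO_FSTAT_ACCERR | \
                             DEMO_FSTAT_FPVIOL | DEMO_FSTAT_MGSTAT0)

/* True when no command is in progress */
static bool fstat_command_complete(uint8_t fstat)
{
    return (fstat & DEMO_FSTAT_CCIF) != 0u;
}

/* Returns only the error bits of an FSTAT snapshot */
static uint8_t fstat_error_bits(uint8_t fstat)
{
    return (uint8_t)(fstat & DEMO_FSTAT_ERRORS);
}

/* Self-check using the value reported in the post (0x80) and a
   made-up value with ACCERR set (0xA0) for comparison */
static void fstat_self_check(void)
{
    assert(fstat_command_complete(0x80u));
    assert(fstat_error_bits(0x80u) == 0u);
    assert(fstat_error_bits(0xA0u) == DEMO_FSTAT_ACCERR);
}
```

Under this layout, 0x80 means the command has completed and no error flag is set, which matches the reading described above.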

The file pointer has ERROR set to 0xA03 (IO_ERROR_WRITE).

The file pointer LOCATION is set to the address I wanted to write to.

I tried reading the first byte I was about to write to, to verify it was 0xFF before the write and it was. After the write that fails the byte was not 0xFF but what I was writing (verified by looking into memory). It appears the write actually happened, but the _io_write function returns IO_ERROR anyway for some reason.

Here are parts of my code:

opens the bank and erases:

flash_file = _io_fopen("flashx:swap1", NULL);
if (NULL == flash_file)
{
      return UPDATE_ERROR_OPEN_FILE;
}

/* erases all but the last sector */
retval = Erase_Flash(flash_file, 0, MAX_FLASH);
if (ERASE_SUCCESS == retval)
{
      if (IO_ERROR == _io_fseek(flash_file, (_file_offset)0, IO_SEEK_SET))
      {
            retval = ERASE_SEEK;
      }
}

The write portion:

static uint32_t Flash_Write_and_Verify(uint32_t address, uint8_t *buffer, uint32_t num_bytes)
{
      uint32_t retval = UPDATE_OK;

      /* sometimes fails here */
      if (IO_ERROR == _io_write(flash_file, buffer, (int32_t)num_bytes))
      {
            retval = UPDATE_ERROR_WRITE;      /* breakpoint is set here */
      }
      else
      {
            /* directly reads bytes, compares to buffer */
            if (IO_ERROR == Verify_Written(address, buffer, num_bytes))
            {
                  retval = UPDATE_ERROR_WRITE;
            }
      }
      return retval;
}

I have tried _task_stop_preemption() before the _io_write and _task_start_preemption() after it, in case another task was interfering with the write. It didn't make a difference. The FTFE code appears to disable / enable interrupts itself, so I didn't try that.

larrydemuth · 01-30-2019

OK, I did some more debugging and this is what I see.

In ftfe_flash_write_sector I set a breakpoint if ftfe_flash_command_sequence does not return FTFE_OK.

Looking at the disassembly window and CPU registers it appears it returned 0x4 (FTFE_ERR_ACCERR).

The FTFE registers at this point contain:

Oddly, FSTAT = 0x80, which means no errors are flagged, even though the only way to clear the error flags is to write a 1 to them.

My FTFE Flash Config registers contain:

I don't see why I would get an access error!

Some other debugging I did was to read the 8 bytes where I was about to write to and verified all 8 were 0xFF before writing. I also read the 8 bytes after the "error" and they contained what I wrote even though an error was returned.

I have an idea for a work around since it DID write the bytes, but would like to know if there is a problem with the K64 FTFE where bogus errors are returned, or if I have some strange bug.

mjbcswitzerland · 01-30-2019

Hi Larry

I have used the K64 in many product developments with intensive flash write usage without ever experiencing a write error, so I can only imagine a code error (or pre-emption or an interrupt at an unsuitable moment), or memory corruption, leading to such an effect.

To solve this you need to stop looking at the return value and instead look at the code that does the error checking. This is for two reasons:
- You believe that the write actually worked (because you can verify it is in flash), so the return value may be getting corrupted after the write has otherwise been checked as OK. You cannot trust the return value until you trust that it correctly reflects the check.
- FSTAT flags are set on errors and can be reset by the same code that checked (and found) them. Without knowing whether the code has already reset the error flags, you can't draw any conclusions from the FSTAT value after the routine has returned. Without this information, the evidence you show is not yet reliable enough to solve such a case.
Generally it is best to clear (old) error flags before starting flash operations and leave them set after finding them; this removes the uncertainty because they can be checked again later if desired.

I would do the following to quickly get to know what is actually happening:
1. Remove code that resets flags so that you can verify whether an error was really read or not; this may help distinguish between a HW error really being signaled and a corruption in the return value.
2. Try putting a break point in the code that actually finds the error too. This is usually in SRAM but if you step into the routine in disassembler mode you can easily see which assembler instructions are reading the FSTAT and checking for error flags. This is probably the most reliable since if the break point is hit you see the value that it reads from the register in the register it is checking and you also see the FSTAT register itself. If this is not hit it points to a corruption on returning from the call.
3. If you have difficulties with 2. add a write to a specific address to the code that detects the FSTAT error flag. Then set a HW write break-point on this address so that you can stop the debugger when it happens.
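Step 3 above can be sketched as follows. This is only an illustration under stated assumptions: `g_fstat_trap` and `check_fstat_and_trap` are hypothetical names, the error mask assumes the usual FTFE bit layout, and the FSTAT snapshot is passed in as a plain value so the sketch can run off target.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed FSTAT error bits: RDCOLERR | ACCERR | FPVIOL | MGSTAT0 */
#define DEMO_FSTAT_ERRORS 0x71u

/* Hypothetical debug marker. Set a hardware write breakpoint (data
   watchpoint) on this variable; it is written only on the error path. */
volatile uint32_t g_fstat_trap = 0u;

/* Checks a snapshot of FSTAT taken right after the flash command.
   On error, stores the snapshot in the marker so the debugger stops
   at the moment the error is detected. Returns the error bits. */
static uint8_t check_fstat_and_trap(uint8_t fstat_snapshot)
{
    uint8_t errors = (uint8_t)(fstat_snapshot & DEMO_FSTAT_ERRORS);

    if (errors != 0u) {
        /* tag in the top bits makes the value easy to spot in memory */
        g_fstat_trap = 0xE0000000u | (uint32_t)fstat_snapshot;
    }
    return errors;
}
```

If the watchpoint never triggers while the caller still sees an error code, that points toward the return value being corrupted after the check rather than toward a real flash error.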

Regards

Mark


larrydemuth · 01-31-2019

Mark,

Thanks for the response.

I'm using the FTFE functions in MQX 4.1 to do the actual writes to flash. The ftfe_flash_command_sequence function (see below) resets the FSTAT errors before writing to flash, and after the RAM function completes, checks the errors and returns 0 if no error or the error if there is one. It does not clear the FSTAT register errors if an error is found. This is what confuses me since it is returning an error even though FSTAT doesn't show one.

I tried setting a breakpoint in the below routine where it checks FTFE_FSTAT_ACCERR_MASK and oddly I hit this when erasing. It hits this every time I erase the sectors and my erase routine returns a failure. The FSTAT register reads 0x80 (no errors) at this time, so I don't understand how it could be hitting it. The FSTAT register is read into a variable to determine the errors and the variable also reads 0x80. If I remove the breakpoint the erase routine works every time. That being said, I couldn't set the breakpoint there for when I write since the erase would fail. I'm not doing any optimization so that shouldn't be causing a problem. The code should be compiled as is.

I'm going to adjust the code so the erase happens at a different time so I can set the breakpoint there, and check it while writing.

I have done flash writes on K60s and K66s and have never had this problem, but I was also using a different version of MQX. That's why I question whether the K64 may have a problem. If you say you have never seen a problem with writes on the K64, then I don't think it's a K64 problem. What MQX version do you use (if you use MQX)?

I just had another thought. Maybe my stack isn't large enough for the task doing the writes... I've seen many strange things with a corrupt stack.
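One generic way to test the stack-size theory is to paint the stack with a known pattern and later count how much of it is still untouched. The sketch below is independent of any MQX API; `STACK_FILL`, `stack_paint`, and `stack_unused_words` are hypothetical names, and it assumes a stack that grows downward from the high-address end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u  /* arbitrary fill pattern */

/* Fills the whole stack area with the pattern before the task starts */
static void stack_paint(uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Counts untouched words starting at the low-address end. With a
   downward-growing stack, this is the remaining headroom. A result
   of 0 suggests the stack was used completely and may have overflowed. */
static size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;

    assert(stack != NULL);
    while (n < words && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

A word that legitimately holds the fill value would be miscounted as unused, so the result is an estimate rather than a proof.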

static uint32_t ftfe_flash_command_sequence
(
    /* [IN] Flash specific structure */
    volatile FTFE_FLASH_INTERNAL_STRUCT_PTR dev_spec_ptr,

    /* [IN] Command byte array */
    uint8_t *command_array,

    /* [IN] Number of values in the array */
    uint8_t count,

    /* [IN] The address which will be affected by command */
    void *affected_addr,

    /* [IN] The address which will be affected by command */
    uint32_t affected_size
)
{
    uint8_t fstat;
    uint32_t result;
    void (* RunInRAM)(volatile uint8_t *);
#if PSP_MQX_CPU_IS_KINETIS
    void (* RunInvalidateInRAM)(uint32_t);
#endif
#if PSP_MQX_CPU_IS_COLDFIRE
    uint32_t temp;
#endif
    FTFE_MemMapPtr ftfe_ptr;

    ftfe_ptr = (FTFE_MemMapPtr)dev_spec_ptr->ftfe_ptr;

    /* get pointer to RunInRAM function */
    RunInRAM = (void(*)(volatile uint8_t *))(dev_spec_ptr->flash_execute_code_ptr);

    /* set the default return as FTFE_OK */
    result = FTFE_OK;

    /* check CCIF bit of the flash status register */
    while (0 == (ftfe_ptr->FSTAT & FTFE_FSTAT_CCIF_MASK))
    { };

    /* clear RDCOLERR & ACCERR & FPVIOL error flags in flash status register */
    if (ftfe_ptr->FSTAT & FTFE_FSTAT_RDCOLERR_MASK)
    {
        ftfe_ptr->FSTAT |= FTFE_FSTAT_RDCOLERR_MASK;
    }
    if (ftfe_ptr->FSTAT & FTFE_FSTAT_ACCERR_MASK)
    {
        ftfe_ptr->FSTAT |= FTFE_FSTAT_ACCERR_MASK;
    }
    if (ftfe_ptr->FSTAT & FTFE_FSTAT_FPVIOL_MASK)
    {
        ftfe_ptr->FSTAT |= FTFE_FSTAT_FPVIOL_MASK;
    }

    switch (count)
    {
        case 12: ftfe_ptr->FCCOBB = command_array[--count];
        case 11: ftfe_ptr->FCCOBA = command_array[--count];
        case 10: ftfe_ptr->FCCOB9 = command_array[--count];
        case 9: ftfe_ptr->FCCOB8 = command_array[--count];
        case 8: ftfe_ptr->FCCOB7 = command_array[--count];
        case 7: ftfe_ptr->FCCOB6 = command_array[--count];
        case 6: ftfe_ptr->FCCOB5 = command_array[--count];
        case 5: ftfe_ptr->FCCOB4 = command_array[--count];
        case 4: ftfe_ptr->FCCOB3 = command_array[--count];
        case 3: ftfe_ptr->FCCOB2 = command_array[--count];
        case 2: ftfe_ptr->FCCOB1 = command_array[--count];
        case 1: ftfe_ptr->FCCOB0 = command_array[--count];
        default: break;
    }

#if PSP_MQX_CPU_IS_COLDFIRE
    temp = _psp_get_sr();
    _psp_set_sr(temp | 0x0700);
#elif PSP_MQX_CPU_IS_KINETIS
    __disable_interrupt ();
#endif //PSP_MQX_CPU_IS_KINETIS

    /* run command and wait for it to finish (must execute from RAM) */
    RunInRAM(&ftfe_ptr->FSTAT);

    /* get flash status register value */
    fstat = ftfe_ptr->FSTAT;

#if PSP_MQX_CPU_IS_KINETIS
    RunInvalidateInRAM = (void(*)(uint32_t))(dev_spec_ptr->flash_invalidate_code_ptr);
    RunInvalidateInRAM((uint32_t)FLASHX_INVALIDATE_CACHE_ALL);
#endif

    /*
    invalidate data cache of 'affected_addr' address and 'affected_size' size
    because reading flash through code-bus may show incorrect data
    */
#if defined(_DCACHE_INVALIDATE_MLINES) || defined(_ICACHE_INVALIDATE_MLINES)
    if (affected_size)
    {
#if defined(_DCACHE_INVALIDATE_MLINES)
        _DCACHE_INVALIDATE_MLINES(affected_addr, affected_size);
#endif
#if defined(_ICACHE_INVALIDATE_MLINES)
        _ICACHE_INVALIDATE_MLINES(affected_addr, affected_size);
#endif
    }
#endif

#if PSP_MQX_CPU_IS_COLDFIRE
    _psp_set_sr(temp);
#elif PSP_MQX_CPU_IS_KINETIS
    __enable_interrupt();
#endif //PSP_MQX_CPU_IS_KINETIS

    /* checking access error */
    if (0 != (fstat & FTFE_FSTAT_ACCERR_MASK))
    {
        /* return an error code FTFE_ERR_ACCERR */
        result = FTFE_ERR_ACCERR;
    }
    /* checking protection error */
    else if (0 != (fstat & FTFE_FSTAT_FPVIOL_MASK))
    {
        /* return an error code FTFE_ERR_PVIOL */
        result = FTFE_ERR_PVIOL;
    }
    /* checking MGSTAT0 non-correctable error */
    else if (0 != (fstat & FTFE_FSTAT_MGSTAT0_MASK))
    {
        /* return an error code FTFE_ERR_MGSTAT0 */
        result = FTFE_ERR_MGSTAT0;
    }

    return result;
}

mjbcswitzerland · 01-31-2019

Larry

It is looking more like stack corruption - one of the main complications with using such environments.

I don't use MQX and I didn't think it was used any more after being dropped by NXP.
I use uTasker (sometimes combined with FreeRTOS).

Regards

Mark

larrydemuth · 01-31-2019

NXP may have dropped it, but it still works so I continue to use it. Especially if I need to do anything Ethernet! MQX has very good functionality for that. I do a lot of HTTP stuff so wouldn't know how to do it without MQX.

NXP, Shame on you for not supporting it any more!!

Anyway increasing the stack size did not help.

I normally use another IDE, but for this customer I have to use IAR, and have no control over MQX selection.

I don't know if I can trust what I'm seeing when I debug into MQX since I don't know of a way to have the MQX projects open at the same time. Seeing the breakpoint being hit on erase and the erase failing, but erase not failing without the breakpoint leads me to believe the breakpoint isn't actually where I think it is. Same for the writes.

I'm just going to go to my backup plan and not declare an error just because the write says it fails. I'm going to let my verify code determine if it actually wrote the data.
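The backup plan described above could look roughly like the following sketch, in which the read-back comparison, not the driver's return code, decides whether the write succeeded. All names here (`phrase_write_fn`, `write_outcome`, `write_and_verify`, and the two `demo_` stubs) are hypothetical; the stubs only simulate a driver so the sketch can run off target, and a real implementation would call `_io_write` and read the flash through its mapped address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical writer signature: returns 0 on success, non-zero on error */
typedef int (*phrase_write_fn)(uint8_t *dest, const uint8_t *src, size_t len);

typedef enum {
    WRITE_OK,                /* driver reported success, data matches */
    WRITE_OK_DESPITE_ERROR,  /* driver reported an error, data matches */
    WRITE_FAILED             /* data does not match */
} write_outcome;

/* Performs the write, then lets the read-back comparison decide.
   The driver status is kept only to record spurious errors. */
static write_outcome write_and_verify(phrase_write_fn write_fn, uint8_t *dest,
                                      const uint8_t *src, size_t len)
{
    assert(write_fn != NULL && dest != NULL && src != NULL);

    int status = write_fn(dest, src, len);

    if (memcmp(dest, src, len) != 0) {
        return WRITE_FAILED;
    }
    return (status == 0) ? WRITE_OK : WRITE_OK_DESPITE_ERROR;
}

/* Simulation stub: writes the data but still reports an error */
static int demo_write_spurious_error(uint8_t *dest, const uint8_t *src, size_t len)
{
    memcpy(dest, src, len);
    return -1;
}

/* Simulation stub: reports success but writes nothing */
static int demo_write_silent_failure(uint8_t *dest, const uint8_t *src, size_t len)
{
    (void)dest;
    (void)src;
    (void)len;
    return 0;
}
```

Keeping a separate outcome for "error reported but data correct" makes it possible to count how often the spurious error occurs instead of hiding it completely.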

Thanks for the suggestions anyway!

If someone at NXP may have some insight into the problem, please respond!

danielchen · 01-31-2019

Hi Larry:

Since you are using MQX 4.1, I suggest you update to MQX 4.2.0 with patch 4.2.0.2 and test this there. I noticed this version fixed some Flashx issues, and I remember one of them relates only to the K64/K70. I attached the patch release note.

Regards

Daniel