Handling bit flip in erased page


gnanachandrandh
Contributor II

Hi There,

On our platform, a 2 GB Micron SLC NAND with a 512K block size and a 4K minimum I/O unit is used for software storage. Software components are flashed with the MFG tool (configured for i.MX6D). Right after flashing, random bitflips are seen in some partitions, most of them in the rootfs partition, when the flash is dumped with the "nanddump" utility.

Although the NAND is configured with a 16-bit ECC strength, a bitflip in an erased page of the rootfs partition cannot be corrected because the OOB is erased as well. As a result, the error -EBADMSG shows up in the log during boot.

After studying the problem and searching various forums, the following patch looks like an interesting fix for this issue.

[PATCH v1] mtd: gpmi: Bitflip support in erased regions

This patch suggests doing two things:

          1. Setting the erase threshold to the ECC strength in the BCH register

          2. Manually correcting the bitflips in gpmi_ecc_read_page if an erased page with bitflips is found.
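
Step 2 can be pictured with a small, driver-independent sketch. It is not the code from the patch: the names `count_zero_bits` and `erased_chunk_fixup` and their parameters are made up for illustration. The idea is to count the zero bits in a chunk that the ECC engine reported as uncorrectable and, if there are no more than `threshold` of them, to treat the chunk as erased and restore it to all 0xFF.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the 0 bits in buf; an erased NAND chunk should read back as all 1s. */
static unsigned int count_zero_bits(const uint8_t *buf, size_t len)
{
    unsigned int zeros = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t inverted = (uint8_t)~buf[i];     /* flipped bits become 1s */

        while (inverted) {
            inverted &= (uint8_t)(inverted - 1); /* clear the lowest set bit */
            zeros++;
        }
    }
    return zeros;
}

/*
 * Hypothetical helper (not part of any driver): if the chunk holds at most
 * `threshold` zero bits, treat it as erased, restore it to 0xFF and return
 * the number of bitflips that were repaired. Return -1 if the chunk has
 * more zero bits than that and must still be reported as uncorrectable.
 */
static int erased_chunk_fixup(uint8_t *buf, size_t len, unsigned int threshold)
{
    unsigned int zeros = count_zero_bits(buf, len);

    if (zeros > threshold)
        return -1;

    memset(buf, 0xFF, len);
    return (int)zeros;
}
```

With the ECC strength (16 here) as the threshold, a chunk with two flipped bits would come back as all 0xFF with a bitflip count of 2, while a chunk holding real data would still be reported as uncorrectable. The sketch only looks at the bytes it is given, so flips in the OOB area would need the same treatment separately.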

Now I have three questions for the Freescale community.

        1. Why has this patch not been adopted? Are there any side effects if it is applied?

        2. When an erase threshold is set in the BCH register, are bitflips in the OOB region counted against it as well, or only bitflips in the data region?

        3. Is there any mechanism for detecting bitflips in the OOB region of erased pages other than comparing it with 0xff?

  

We are using the i.MX6D automotive processor with the Yocto BSP 3.10.17 kernel.

Cheers

Gnana

17 Replies

BiyongSUN
NXP Employee

Gnana,

Error -74 is a well-known issue in the MTD (Memory Technology Devices) field.

Could you please check the following link to see whether it describes exactly the issue you are facing?

  1. Did you use ubiformat to flash the image to NAND?
  2. If you did not use ubiformat, did you enable the “--space-fixup” option?

The page below also describes two ways to fix it.

http://www.linux-mtd.infradead.org/faq/ubi.html#L_ecc_error


x10
Contributor V

Hi,

We hit the same issue on i.MX28 with Spansion S34ML01G100/S34ML02G100. The above patch has been applied to the 3.12.10 kernel. The test will run all this week, and the results will be reported in this thread.

hungry_horace
Contributor II

Hi there, we're encountering bitflips in erased pages (i.MX28, Spansion S34ML08G). Typically only one or two pages per device. The patch attached above, applied to the 4.4 kernel, appears to resolve the issue for us. Wondering if there's any reason it hasn't been submitted yet?

aananthcn
Contributor II

We submitted it back to NXP. We assume they will merge it into 'their' kernel first and upstream it to the mainline kernel afterwards.

Note: The above patch passed our extensive testing and has gone to production. No further failures have been reported from the field for the past ~2 years.

devarajgs
NXP Employee

Hi,

When the MFG tool is used, the FCB is protected by ECC: kobs-ng generates the ECC code in software, whereas nanddump relies on the hardware to generate the ECC code, and the two use different algorithms. So the two should not match each other. If the patch fixes the issue, that's a problem.

Thanks and Regards,

-Devaraj

aananthcn
Contributor II

Hello Devaraj,

Thanks. The query above is really about the ERASE_THRESHOLD field in the BCH_MODEn register. This field and the patch mentioned by Gnana seem to be a good solution for bit flips in erased regions, but we would like to know why the patch has not been accepted by Freescale.

Regards,

Aananth C N

devarajgs
NXP Employee

Hi Ananth,

The Apps team has been working on this issue and has observed a similar bit-flip issue on some boards.

The issue has two factors: 1. NAND chip bit-flips 2. Board timing

The Apps team created a separate patch; with that patch applied, the issue was still reproduced after 4-5 days of continuous testing.

So they are investigating this further.

For the patch mentioned by Gnana, the Apps team suggests running a test for 4-5 days with it; if no issues are found, you can use that patch.

Thanks and Regards,

-Devaraj

gnanachandrandh
Contributor II

Devaraj,

Thanks for the reply.

The patch I mentioned doesn't help if there is a bit flip in the OOB region. Even a single bit flip in the OOB region can lead to an error condition that reports one or more uncorrectable bit flips in the data region of the page.

Also, can you share the patch that the Apps team created to replicate the issue? We couldn't replicate the issue on our test bench.

Cheers

Gnana

devarajgs
NXP Employee

Hi Gnana,

You mentioned that you couldn't replicate the issue on the bench.

Does that mean the issue is no longer seen?

Please clarify.

Does UBIFS crash when booting up, with the system able to recover after a reset?

What kind of BBT mechanism is used, RAM based or flash based?

Thanks and Regards,

-Devaraj

gnanachandrandh
Contributor II

Devaraj,

I meant that we have not replicated the "uncorrectable bitflip" error condition on our test bench, even though we ran the development unit for more than a week.

We identified the uncorrectable bitflip issue in units returned from the field. We are doing all our analysis and testing on these field-returned units, which are limited in number.

So if the patch created by your Apps team can replicate the error condition on a test bench, please share it with us; it would help us to replicate the issue on a development unit and to test the patch I mentioned on the development unit itself.

In w207, during boot-up, U-Boot loads the kernel with an initramfs. The initramfs has an early_init script that runs the early video (the Mahindra logo video) and then mounts the UBIFS rootfs partition (/dev/mtd6). Once the rootfs partition is mounted successfully, change root is called.

In the error condition, however, mounting the rootfs partition fails and the system stays in the initramfs context. A health monitor running in the VIP (an external micro) reboots the i.MX6 if it receives no response from it. The reboot happens several times.

Please find the boot log below; the lines of interest are the UBI-related ones.

U-Boot 2013.04-eagle-imx6-gce0ea25-r1 (May 07 2015 - 15:22:30)

I2C:   ready

DRAM:  1 GiB

NAND:  2048 MiB

Using default environment

In:    serial

Out:   serial

Err:   serial

Normal Boot

Hit any key to stop autoboot:  0

NAND read: device 0 offset 0x800000, size 0x1000000

16777216 bytes read: OK

## Booting kernel from Legacy Image at 10007fc0 ...

   Image Name:   eagle-imx6-3.10.17(dt)

   Image Type:   ARM Linux Kernel Image (uncompressed)

   Data Size:    7826477 Bytes = 7.5 MiB

   Load Address: 10008000

   Entry Point:  10008000

   Verifying Checksum ... OK

   XIP Kernel Image ... OK

OK

Total milliseconds boot time: 751

Starting kernel ...

Gating GPMI Clock Source before Initialization

DBG sensor data is at 80dd3998

Failed to execute /init

VMF_SHM_IPC: pid=102, nw_vmf_ipc_open(000000): unknown channel name

../../../../ai.app.2015.mahindra.(NULL device *):  DMA Disable for UART Port: 4

w207l3.infrastructure.host/vip.ipcl.ap/src/ipcl.c:351 uart ipcl port:5

Connetion established curr ign = 0

checking for link up

query message sentmsg received from grp = 92 , event = 14

link up status received via boradcast

get the ACC info from host msg sent

timer create returns 0

Entered a bad state

******************************************** audio config is 1 ******************************

year msb = 7 year lsb = 223

year = 2015

sec 44 min 1 hr 12 day 1 month 0 year 115

new time is 1420113704

ret value of set time = 0 ,errno = 2

TIme and Date set from early_rvc_Service

iopin = 0

Display set to Normal mode successfully

display_mode_set done

in send state msg state = 2

send done with state = 2

Playing WAVE '/opt/boot-splash/MahindraAudio.wav' : Signed 16 bit Little Endian, Rate 16000 Hz, Stereo

UBI error: ubi_io_read: error -74 (ECC error) while reading 4096 bytes from PEB 908:4096, read 4096 bytes

UBI error: ubi_io_read: error -74 (ECC error) while reading 516096 bytes from PEB 908:8192, read 516096 bytes

UBI error: ubi_io_read: error -74 (ECC error) while reading 4096 bytes from PEB 976:4096, read 4096 bytes

UBI error: ubi_io_read: error -74 (ECC error) while reading 516096 bytes from PEB 976:8192, read 516096 bytes

UBI error: ubi_io_read: error -74 (ECC error) while reading 4096 bytes from PEB 978:4096, read 4096 bytes

in state handle

in state handle

UBI error: ubi_io_read: error -74 (ECC error) while reading 516096 bytes from PEB 978:8192, read 516096 bytes

UBI error: ubi_io_read: error -74 (ECC error) while reading 4096 bytes from PEB 1024:4096, read 4096 bytes

UBI error: ubi_io_read: error -74 (ECC error) while reading 516096 bytes from PEB 1024:8192, read 516096 bytes

UBI error: ubi_io_read: error -74 (ECC error) while reading 4096 bytes from PEB 1026:4096, read 4096 bytes

UBI error: ubi_io_read: error -74 (ECC error) while reading 516096 bytes from PEB 1026:8192, read 516096 bytes

in state handle

UBI device number 0, total 2646 LEBs (1365590016 bytes, 1.3 GiB), available 0 LEBs (0 bytes), LEB size 516096 bytes (504.0 KiB)

in state handle

UBI error: ubi_io_read: error -74 (ECC error) while reading 57344 bytes from PEB 1074:466944, read 57344 bytes

UBIFS error (pid 126): ubifs_recover_leb: corrupt empty space LEB 709:458752, corruption starts at 3728

UBIFS error (pid 126): ubifs_scanned_corruption: corruption at LEB 709:462480

UBIFS error (pid 126): ubifs_scanned_corruption: first 8192 bytes from LEB 709:462480

UBIFS error (pid 126): ubifs_recover_leb: LEB 709 scanning failed

mount: mounting /dev/ubi0_0 on /mnt failed: Structure needs cleaning

mount: mounting /dev on /mnt/dev failed: No such file or directory

one or more files missing

one or more files missing

starting failsafe shell

one or more files missing

one or more files missing

one or more files missing

one or more files missing

one or more files missing
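
The offsets in the UBI messages above are consistent with the usual UBI layout on a NAND without sub-page writes, where the erase-counter header occupies the first minimum-I/O unit of each PEB, the volume-ID header the second, and LEB data starts after both. Below is a short sketch of that arithmetic using the 512K block size and 4K minimum I/O given at the start of the thread. The function names are illustrative only, and the layout is an assumption about this particular setup.

```c
#include <stdint.h>

/* Assumed layout: EC header in I/O unit 0, VID header in I/O unit 1. */
static uint32_t ubi_data_offset(uint32_t min_io)
{
    return 2 * min_io;
}

/* Usable bytes per logical eraseblock (LEB). */
static uint32_t ubi_leb_size(uint32_t peb_size, uint32_t min_io)
{
    return peb_size - ubi_data_offset(min_io);
}

/* Translate an offset inside a PEB to the offset inside its LEB. */
static uint32_t ubi_peb_to_leb_offset(uint32_t peb_offset, uint32_t min_io)
{
    return peb_offset - ubi_data_offset(min_io);
}
```

With `peb_size = 524288` and `min_io = 4096` this gives a 516096-byte LEB, matching the "LEB size 516096 bytes" line. Under the same assumption, the 4096-byte reads at offset 4096 are volume-ID header reads, the 516096-byte reads at offset 8192 cover the whole data area, and the failed read at PEB 1074:466944 maps to LEB offset 458752, which is the offset UBIFS reports for LEB 709.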

devarajgs
NXP Employee

Thanks Gnana.

Please also clarify these two things.

Does UBIFS crash when booting up, with the system able to recover after a reset?

What kind of BBT mechanism is used, RAM based or flash based?

aananthcn
Contributor II

Hi Devaraj,

It is a flash-based BBT only. Also, we are sure that the failure mode is not due to a bad block: it is a bit flip in an erased region (we have taken a dump and confirmed this), the driver returns an error to UBIFS, and UBIFS stops mounting the partition.

We have also confirmed that the warranty unit recovers if the free-space-fix-up flag is set. If we don't set that flag, the system never recovers. So it is not a bad block; it is an ECC failure on an erased page.

Gnana's question was more about the hardware, which can ignore X bit flips (based on the ERASE_THRESHOLD value set in the register) in the data region. But what about bit flips in the OOB region (the 224 bytes that accompany each 4096-byte page)? Can the hardware ignore bit flips in that region as well?

devarajgs
NXP Employee

Hi,

Please find the patch related to the NAND bit-flip issue attached to this mail.

Please note that this is not a formal release of the patch; we are sharing this intermediate patch at Visteon's request.

Thanks and Regards,

-Devaraj

gnanachandrandh
Contributor II

Devaraj,

Thanks for the patch.

I have a few questions about this patch:

1. The subject of the patch is "[PATCH 1/2] mtd: gpmi: fix the bitflips for erased page". Is there a second part to this patch?

2. Comment [2] says that bitflips are counted with ECC disabled for the whole page, giving N2. Will N2 < threshold hold?

     The threshold is at most the ECC strength. Will the number of bitflips in the whole page be less than the threshold (the ECC strength of one chunk)?

3. Comment [4]: can we have some more explanation of this comment? It looks similar to comment [3].

4. In the code, threshold = geo->gf_len / 2; can I have an explanation of this initial value of the threshold?

5. The following code contradicts comment [2]. Bitflips are screened in the first 512 bytes of the page (read with ECC disabled), irrespective of which chunk is currently under consideration.

     /* Count the bitflips for the no ECC buffer */
     for (i = 0; i < mtd->writesize / 8; i++) {
          flip_bits_noecc += hweight64(~buf[i]);
          if (flip_bits_noecc > threshold)
               return false;
     }

6. In the following code, the loop is exited without checking the status bytes of the remaining chunks if the current chunk is found to be erased with an acceptable number of bitflips.

     if (*status == STATUS_UNCORRECTABLE) {
          if (gpmi_erased_check(this, payload_virt, i, page, &max_bitflips))
               break;

Cheers

Gnana

devarajgs
NXP Employee

Hi,

As you mentioned earlier, if you would like to use the patch from the MTD community, please use patch v7 from the link below:

[PATCH v7] mtd: gpmi: Deal with bitflips in erased regions

As mentioned earlier, this patch suggests doing two things:

1. Setting the erase threshold to the ECC strength in the BCH register

2. Manually correcting bit-flips in gpmi_ecc_read_page if an erased page with bit-flips is found.

Since the ALLONES bit cannot work properly, software needs to check every possibly erased chunk, and this extra overhead will significantly impact performance.

ERASE_THRESHOLD takes into account the bit-flips in both the data and the ECC of the specified chunk. So all bit-flips should be detected.
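
To get a feel for how many bits a per-chunk erase threshold covers, the sketch below works out the size of one BCH chunk for the 4096 + 224 byte page and 16-bit ECC strength discussed in this thread. It assumes 512-byte ECC chunks and a Galois field of order 13 (`gf_len = 13`), neither of which is stated here, and it uses the general property that a t-bit-correcting BCH code over GF(2^m) needs at most m × t parity bits. All names are illustrative.

```c
#include <stdint.h>

struct bch_chunk_geometry {
    uint32_t data_bits;     /* payload bits in one ECC chunk */
    uint32_t parity_bits;   /* BCH parity bits protecting that chunk */
};

/* Parity is taken as gf_len * ecc_strength bits (the BCH upper bound). */
static struct bch_chunk_geometry chunk_geometry(uint32_t chunk_bytes,
                                                uint32_t gf_len,
                                                uint32_t ecc_strength)
{
    struct bch_chunk_geometry g;

    g.data_bits = chunk_bytes * 8;
    g.parity_bits = gf_len * ecc_strength;
    return g;
}

/* Total parity bytes for one page, rounded up to whole bytes. */
static uint32_t page_parity_bytes(uint32_t page_bytes, uint32_t chunk_bytes,
                                  uint32_t gf_len, uint32_t ecc_strength)
{
    uint32_t chunks = page_bytes / chunk_bytes;
    uint32_t bits = chunks * gf_len * ecc_strength;

    return (bits + 7) / 8;
}
```

Under these assumptions each chunk spans 4096 data bits plus 208 parity bits, and the eight chunks of a page need 208 of the 224 OOB bytes for parity. A threshold of 16 is therefore small compared with the 4304 bits it is applied to.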

The earlier patch that we sent you detects bit-flips in a different way; it doesn't set the ERASE_THRESHOLD field.

So even a single bit-flip causes the status to be returned as UNCORRECTABLE, and the patch then reads the data twice.

- The first read tries to determine whether the UNCORRECTABLE status was returned for an all-0xFF chunk or for an ordinary chunk.

- The second read checks whether it is an erased page or a page written with 0xFF. Only an erased page can be all 0xFF in both data and ECC.

If both of the above tests are true, the page is an erased page; it should be set to all 0xFF, and there is no need to go back to the loop to check the other chunks.
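
The two reads described above can be sketched in plain C as follows. This is not the code from either patch: `zero_bits_within` and `page_looks_erased` are invented for the illustration, and the raw (ECC-disabled) page is assumed to be available as one buffer holding the data followed by the OOB.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Count 0 bits, giving up as soon as the budget is exceeded. */
static bool zero_bits_within(const uint8_t *buf, size_t len, unsigned int budget)
{
    unsigned int zeros = 0;

    for (size_t i = 0; i < len; i++) {
        for (uint8_t inv = (uint8_t)~buf[i]; inv; inv &= (uint8_t)(inv - 1))
            zeros++;
        if (zeros > budget)
            return false;
    }
    return true;
}

/*
 * Pass 1: look only at the payload of the chunk that failed ECC.
 * Pass 2: look at the raw page, data and OOB together. Only a page that
 * was never programmed is close to all-0xFF in both views; a page that
 * was written with 0xFF data carries non-0xFF ECC bytes in its OOB.
 */
static bool page_looks_erased(const uint8_t *chunk, size_t chunk_len,
                              const uint8_t *raw_page, size_t raw_len,
                              unsigned int threshold)
{
    if (!zero_bits_within(chunk, chunk_len, threshold))
        return false;   /* ordinary chunk: a genuine ECC failure */

    return zero_bits_within(raw_page, raw_len, threshold);
}
```

Once the whole raw page has been judged erased in this way, every chunk in it can be returned as 0xFF, which would explain why the remaining chunks do not need to be examined one by one.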

So we feel that both patches should work for the bit-flip issue.

Bit-flips are very rare events. The Apps team is still working on an easy way to reproduce them and to fully test the patches, as this issue is not easily reproducible.

We need your input on how to reproduce this issue quickly, so that we can proceed and verify the patches.

Thanks and Regards,

-Devaraj

gnanachandrandh
Contributor II

Hi Devaraj,

Please find the patch for fixing bit flips in erased pages. It has been in testing for the past two days, and we were able to recover units returned from the field without disturbing their file system partition. The patch has also been tested on simulated erased pages with both fewer and more bitflips than the configured erase threshold.

Kindly share your review comments on the patch.

Cheers

Gnana

devarajgs
NXP Employee

Hi Gnana,

Please also attach the error log from booting.

Thanks and Regards,

-Devaraj
