NAND bad blocks: problems at boot

da1 · 09-02-2020

Hi,

We are experiencing some reliability problems on an i.MX6ULL system with a Micron MLC MT29F32G08CBADAWP NAND, and we have some questions about best practices when using NAND.

We have a few systems that sometimes can't boot due to a CRC error on the initramfs partition. We suppose that the error may sometimes also be present on the kernel and dtb MTD partitions. The affected systems have only been in service for a few months, so the NAND is not very old.

The error we see:

U-Boot 2016.03-nxp/imx_v2016.03_4.1.15_2.0.0_ga+ga57b13b (Aug 04 2020 - 09:37:05 +0000)
CPU:   Freescale i.MX6ULL rev1.1 528 MHz (running at 396 MHz)
CPU:   Industrial temperature grade (-40C to 105C) at 60C
Reset cause: WDOG
Board: MX6ULL 14x14 EVK
I2C:   ready
DRAM:  512 MiB
NAND:  4096 MiB
MMC:   FSL_SDHC: 0, FSL_SDHC: 1
*** Warning - bad CRC, using default environment
Display: TFT43AB (480x272)
Video: 480x272x24
In:    serial
Out:   serial
Err:   serial
Net:   FEC0
Normal Boot
Autoboot in 3 seconds
NAND read: device 0 offset 0x4000000, size 0x800000
 8388608 bytes read: OK
NAND read: device 0 offset 0x5000000, size 0x100000
 1048576 bytes read: OK
NAND read: device 0 offset 0x6000000, size 0x1200000
Skipping bad block 0x06600000
 18874368 bytes read: OK
Kernel image @ 0x80800000 [ 0x000000 - 0x6f9708 ]
## Loading init Ramdisk from Legacy Image at 83800000 ...
   Image Name:   Init Ram Disk
   Image Type:   ARM Linux RAMDisk Image (gzip compressed)
   Data Size:    8897996 Bytes = 8.5 MiB
   Load Address: 00000000
   Entry Point:  00000000
   Verifying Checksum ... Bad Data CRC
Ramdisk image is corrupt or invalid

Our NAND has an erase size of 2MB and is split as follows: uboot (64MB), kernel (16MB), dtb (16MB), system (rest of the 4GB NAND), and there are plenty of spare blocks on each partition.

When the system fails, looking at the Bad Blocks Table from u-boot we see:

=> nand bad

Device 0 bad blocks:
06600000
0b400000
0b600000

The first one is, in fact, inside the initramfs partition. (We see the other two on every system, so we suppose they are present by default on the NAND chip itself.)
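To double-check that statement, here is a small sketch of the arithmetic. It takes the offsets and sizes from the "NAND read" lines of the log and the 2MB erase size, and works out which blocks a read that skips bad blocks would touch. The partition labels and the helper names (`block_index`, `read_span`) are ours, and the skipping behaviour is only modelled from the "Skipping bad block" message, not taken from the u-boot sources.

```python
ERASE_SIZE = 0x200000  # 2 MiB erase block, as stated in the post

# (offset, size) pairs copied from the "NAND read" lines in the boot log.
# The names are our own labels for the three reads.
READS = {
    "kernel": (0x4000000, 0x800000),
    "dtb": (0x5000000, 0x100000),
    "initramfs": (0x6000000, 0x1200000),
}

# Output of "nand bad" on a failing system.
BAD_BLOCKS = [0x06600000, 0x0B400000, 0x0B600000]


def block_index(addr):
    """Index of the erase block that contains addr."""
    return addr // ERASE_SIZE


def read_span(start, size, bad_blocks):
    """Block offsets touched by a read that skips bad blocks (assumed model)."""
    needed = -(-size // ERASE_SIZE)  # ceiling division: blocks of payload
    touched = []
    addr = start
    while len(touched) < needed:
        if addr not in bad_blocks:
            touched.append(addr)
        addr += ERASE_SIZE
    return touched


for name, (start, size) in READS.items():
    span = read_span(start, size, BAD_BLOCKS)
    skipped = [hex(b) for b in BAD_BLOCKS if start <= b <= span[-1]]
    print(name, "first", hex(span[0]), "last", hex(span[-1]), "skipped", skipped)
```

Under this model only the initramfs read crosses a bad block (0x06600000, block 51 of the device), and it ends one block later than it otherwise would. If that block was marked bad only after the image had been written, a skipping read would leave out data that really was stored there, which could be one way to end up with a bad CRC. This is a guess on our side, not something we have confirmed.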

Considering that the u-boot, kernel, dtb and initramfs are only written once in production, and the system may go many months without a reboot, we wonder how those partitions may get corrupted.

We have a few hypotheses:

- over time an excessive number of bitflips happens on a block, making it uncorrectable
- a single bitflip happens on the first byte of the OOB area of a block, changing it from 0xff and thereby marking the block as bad
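The second hypothesis can be illustrated with a minimal sketch. It compares a strict marker check (anything other than 0xff means bad) with a hypothetical tolerant check that only treats a block as bad when several marker bits are cleared. Both functions and the `min_set_bits` parameter are our own illustration; we do not know which rule the drivers on our system actually apply.

```python
GOOD_MARKER = 0xFF  # marker byte of a good block, as described above


def is_bad_strict(marker):
    """Treat any deviation from 0xFF as a bad-block marker."""
    return marker != GOOD_MARKER


def is_bad_tolerant(marker, min_set_bits=7):
    """Hypothetical tolerant rule: bad only if fewer than min_set_bits are set."""
    return bin(marker & 0xFF).count("1") < min_set_bits


flipped = GOOD_MARKER ^ (1 << 3)  # a single bitflip in the marker byte

print("strict:", is_bad_strict(flipped))
print("tolerant:", is_bad_tolerant(flipped))
print("factory-style 0x00 marker, tolerant:", is_bad_tolerant(0x00))
```

With the strict rule, one flipped bit is enough for a good block to look bad, while the tolerant rule would still accept it and still reject a marker of 0x00.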

Do these ideas make sense?

How can it be that, sometimes, the same system is then able to boot again (without any intervention)?

Another strange thing we noticed is that re-flashing only u-boot (and nothing else) made the bad block on the initramfs partition disappear.
To flash partitions we erase them with flash_erase /dev/mtdX and then write them using nandwrite (or kobs-ng for u-boot).

We also have some questions about how we can make the system more reliable: is it a good idea to duplicate the kernel, dtb and initramfs partitions and detect in some way a failure in u-boot, so that a second copy can be loaded if the first fails?

Now, a very wild guess: is it possible that reading these partitions periodically from a live system (using nanddump, for example) could improve their health? The reasoning here is as follows: reading them with nanddump would detect (and correct?) bitflips, preventing them from accumulating over time. Does that make sense? We are asking because we read in the Micron documentation that bad blocks are marked only on erase and write operations, but we assume that they are also marked as bad when reading them if there are too many bitflips to correct the errors. Is that right? In our case we never erase or write them again, we just read them on a reboot, so we do not understand how they can be marked as bad.
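To make the idea more concrete, this is the kind of policy we have in mind, as a rough sketch. As far as we understand, a plain read only corrects the data returned to the caller and does not change what is stored in the flash, so the sketch assumes an explicit rewrite step once the number of corrected bitflips gets close to the ECC limit. The function name and the numbers in the example calls are hypothetical; `ecc_strength` and `threshold` are named after the Linux MTD sysfs attributes `ecc_strength` and `bitflip_threshold`.

```python
def scrub_action(max_bitflips, ecc_strength, threshold):
    """Decide what to do with a block after a periodic read.

    max_bitflips: worst number of bitflips seen in any ECC chunk of the block
    ecc_strength: how many bitflips per chunk the ECC can correct
    threshold:    level at which we would rather rewrite than keep waiting
    """
    if max_bitflips > ecc_strength:
        return "uncorrectable"  # too late, data must come from another copy
    if max_bitflips >= threshold:
        return "rewrite"        # still readable: erase and write it back
    return "ok"


# Hypothetical values, chosen only for illustration.
print(scrub_action(max_bitflips=2, ecc_strength=40, threshold=30))
print(scrub_action(max_bitflips=33, ecc_strength=40, threshold=30))
print(scrub_action(max_bitflips=41, ecc_strength=40, threshold=30))
```

The point of the "rewrite" state is that reading alone would only tell us the block is degrading; the data would have to be written again for the bitflip count to go back down.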

Best regards.

igorpadykov · 09-07-2020

Hi da1

Partitions may also get corrupted when power is suddenly removed from the board.

>how we can make the system more reliable: is it a good idea to duplicate the kernel,
>dtb and initramfs partitions and detect in some way a failure in u-boot, so that a second
>copy can be loaded if the first fails?

Yes, it is a reasonable idea.
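A minimal sketch of the selection logic, assuming each copy is read into memory and checked against a known CRC32 before use. The copy names, the `first_valid_copy` helper and the test data are hypothetical, and `zlib.crc32` only stands in for the check behind the "Bad Data CRC" message; on the target this logic would live in the u-boot boot script rather than in Python.

```python
import zlib


def first_valid_copy(copies):
    """Return the name of the first copy whose CRC32 matches, else None.

    copies: sequence of (name, data, expected_crc) tuples, in boot order.
    """
    for name, data, expected_crc in copies:
        if zlib.crc32(data) & 0xFFFFFFFF == expected_crc:
            return name
    return None


# Illustration with made-up data: the primary copy has one flipped bit.
image = bytes(range(256)) * 16
expected = zlib.crc32(image) & 0xFFFFFFFF
damaged = bytearray(image)
damaged[10] ^= 0x01

copies = [
    ("initramfs_a", bytes(damaged), expected),
    ("initramfs_b", image, expected),
]
print(first_valid_copy(copies))
```

Here the first copy fails the check and the second one is selected. If every copy fails, the function returns None and the boot script would need some last-resort behaviour.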

>we read on Micron documentation that bad blocks are marked only on erase and write
>operations, but we assume that they are also marked as bad when reading them if there
>are too many bitflips to correct the errors. Is it right?

Yes, that is right. The NXP NAND driver supports its own bad block table, different from Micron's.

Best regards
igor

da1 · 09-09-2020

>>we read on Micron documentation that bad blocks are marked only on erase and write
>>operations, but we assume that they are also marked as bad when reading them if there
>>are too many bitflips to correct the errors. Is it right?

>yes right. NXP NAND driver supports own bad block table, different from micron.

Here you are talking about the DBBT table in the BCB, correct?

We notice that it's reset each time we write u-boot.

Thanks for your advice.

igorpadykov · 09-09-2020

Hi da1

You can look at the attached document on NAND bad block handling in i.MX software. It is written for i.MX5x processors, but the idea is the same for i.MX6.

Best regards
igor

da1 · 09-09-2020

That's really helpful, thank you.

Do you know how we can tell if this BBI swap + BBT solution is enabled?

Our system runs the imx_4.1.15_2.0.0_ga, both for the running system and for flashing. Our kobs-ng is the latest version and u-boot is 2016.03.

In our source tree the cited files (drivers/mtd/nand/mxc_nd2.*) are not present; we only have drivers/mtd/nand/mxc_nand.*

Best regards.

igorpadykov · 09-09-2020

Yes, it is enabled. You can look at the NAND driver sources; pay attention to:

/* add our owner bbt descriptor */

static struct nand_bbt_descr gpmi_bbt_descr

https://source.codeaurora.org/external/imx/linux-imx/tree/drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c...

Best regards
igor

da1 · 09-09-2020

I confirm it's enabled on our system (CONFIG_MTD_NAND_MXC=y is also set).

Thanks for your help.