Hi,
we are experiencing some reliability problems on an i.MX6ULL system with a Micron MLC MT29F32G08CBADAWP NAND, and we have some questions about best practices when using NAND.
We have a few systems that sometimes can't boot due to a CRC error on the initramfs partition. We suppose that the error can sometimes also be present on the kernel and dtb MTD partitions. The affected systems have only been in service for a few months, so the NAND is not very old.
The error we see:
U-Boot 2016.03-nxp/imx_v2016.03_4.1.15_2.0.0_ga+ga57b13b (Aug 04 2020 - 09:37:05 +0000)
CPU: Freescale i.MX6ULL rev1.1 528 MHz (running at 396 MHz)
CPU: Industrial temperature grade (-40C to 105C) at 60C
Reset cause: WDOG
Board: MX6ULL 14x14 EVK
I2C: ready
DRAM: 512 MiB
NAND: 4096 MiB
MMC: FSL_SDHC: 0, FSL_SDHC: 1
*** Warning - bad CRC, using default environment
Display: TFT43AB (480x272)
Video: 480x272x24
In: serial
Out: serial
Err: serial
Net: FEC0
Normal Boot
Autoboot in 3 seconds
NAND read: device 0 offset 0x4000000, size 0x800000
8388608 bytes read: OK
NAND read: device 0 offset 0x5000000, size 0x100000
1048576 bytes read: OK
NAND read: device 0 offset 0x6000000, size 0x1200000
Skipping bad block 0x06600000
18874368 bytes read: OK
Kernel image @ 0x80800000 [ 0x000000 - 0x6f9708 ]
## Loading init Ramdisk from Legacy Image at 83800000 ...
Image Name: Init Ram Disk
Image Type: ARM Linux RAMDisk Image (gzip compressed)
Data Size: 8897996 Bytes = 8.5 MiB
Load Address: 00000000
Entry Point: 00000000
Verifying Checksum ... Bad Data CRC
Ramdisk image is corrupt or invalid
Our NAND has an erase size of 2MB and is split as follows: uboot (64MB), kernel (16MB), dtb (16MB), system (rest of the 4GB NAND), and there are plenty of spare blocks on each partition.
When the system fails, looking at the Bad Blocks Table from u-boot we see:
=> nand bad
Device 0 bad blocks:
06600000
0b400000
0b600000
The first one is, in fact, inside the initramfs partition. (We see the other two on every system, so we suppose they are present by default on the NAND chip itself.)
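To double-check which partition each reported offset falls in, the layout can be tabulated. This is a minimal sketch in Python, assuming the partitions are contiguous in the order listed and that the remaining region starts at 0x6000000, as the `NAND read` lines in the log suggest. The partition names and the `locate` helper are illustrative and are not taken from the board's actual MTD table.

```python
MiB = 1024 * 1024
ERASE_SIZE = 2 * MiB  # erase block size stated in the post

# (name, start offset, size). Assumed contiguous layout; "rest" covers the
# remainder of the 4096 MiB device and is where the initramfs is read from.
PARTITIONS = [
    ("uboot", 0x0000000, 64 * MiB),
    ("kernel", 0x4000000, 16 * MiB),
    ("dtb", 0x5000000, 16 * MiB),
    ("rest", 0x6000000, 4096 * MiB - 0x6000000),
]


def locate(offset):
    """Return (partition name, erase block index inside that partition)."""
    for name, start, size in PARTITIONS:
        if start <= offset < start + size:
            return name, (offset - start) // ERASE_SIZE
    raise ValueError(f"offset {offset:#x} is outside the device")


# Offsets reported by "nand bad" in the post.
for off in (0x06600000, 0x0B400000, 0x0B600000):
    name, block = locate(off)
    print(f"{off:#010x}: partition {name}, erase block {block}")
```

Under these assumptions all three offsets land in the region after the dtb partition, and only the first one lies inside the 18 MiB range that U-Boot reads for the initramfs.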
Considering that the u-boot, kernel, dtb and initramfs are only written once in production, and the system may go many months without a reboot, we wonder how those partitions may get corrupted.
We have a few hypotheses:
Do these ideas make sense?
How can it be that, sometimes, the same system is then able to boot again (without any intervention)?
Another strange thing we noticed is that re-flashing only u-boot (and nothing else) made the bad block on the initramfs partition disappear.
To flash partitions we erase them with flash_erase /dev/mtdX and then write them using nandwrite (or kobs-ng for u-boot).
We also have some questions about how we can make the system more reliable: is it a good idea to duplicate the kernel, dtb and initramfs partitions and detect in some way a failure in u-boot, so that a second copy can be loaded if the first fails?
Now, a very wild guess: is it possible that reading these partitions periodically from a live system (using nanddump, for example) could improve their health? The reasoning is as follows: reading them with nanddump would detect (and correct?) bitflips, preventing them from accumulating over time. Does it make sense? We are asking because we read in the Micron documentation that bad blocks are marked only on erase and write operations, but we assume that they are also marked as bad on read if there are too many bitflips to correct. Is that right? In our case we never erase or write these partitions again, we just read them on reboot, so we do not understand how blocks in them can be marked as bad.
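The "Bad Data CRC" message in the log comes from the legacy image checksum, and the same check can be reproduced on a dump of the partition taken from the running system. Below is a minimal sketch in Python, assuming the standard 64-byte legacy uImage header (big-endian fields, CRC32 over the data, header CRC computed with its own field zeroed). The names `check_legacy_image` and `make_legacy_image` are illustrative, and the OS/arch/type/compression codes in the demo helper are example values. The check only reports corruption in what was read; it does not repair or rewrite the flash.

```python
import struct
import zlib

IH_MAGIC = 0x27051956               # legacy uImage magic number
HEADER = struct.Struct(">7I4B32s")  # 64-byte big-endian header


def check_legacy_image(blob):
    """Return 'ok' or a short reason string for a legacy uImage blob."""
    if len(blob) < HEADER.size:
        return "truncated header"
    magic, hcrc, _time, size, _load, _ep, dcrc = HEADER.unpack_from(blob)[:7]
    if magic != IH_MAGIC:
        return "bad magic"
    # The header CRC is computed with the hcrc field (bytes 4..7) zeroed.
    header = bytearray(blob[:HEADER.size])
    header[4:8] = b"\x00" * 4
    if zlib.crc32(bytes(header)) & 0xFFFFFFFF != hcrc:
        return "bad header crc"
    # Trailing bytes after the image (e.g. erased flash) are ignored.
    data = blob[HEADER.size:HEADER.size + size]
    if len(data) != size:
        return "truncated data"
    if zlib.crc32(data) & 0xFFFFFFFF != dcrc:
        return "bad data crc"
    return "ok"


def make_legacy_image(payload, name=b"Init Ram Disk"):
    """Build a minimal legacy image around payload (demo helper only)."""
    fields = [IH_MAGIC, 0, 0, len(payload), 0, 0,
              zlib.crc32(payload) & 0xFFFFFFFF, 5, 2, 3, 1, name]
    fields[1] = zlib.crc32(HEADER.pack(*fields)) & 0xFFFFFFFF
    return HEADER.pack(*fields) + payload


image = make_legacy_image(b"example payload")
damaged = bytearray(image)
damaged[HEADER.size] ^= 0x01  # flip one bit in the data area
print(check_legacy_image(image))
print(check_legacy_image(bytes(damaged)))
```

Run periodically against a dump, a check like this would at least show when an image has already gone bad, before the next reboot depends on it.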
Best regards.
Hi da1
you can look at the attached document on NAND bad block handling in
i.MX software. It is for i.MX5x processors, but the idea is the same for
i.MX6 too.
Best regards
igor
Hi da1
partitions may also get corrupted when power is suddenly removed from the board.
>how we can make the system more reliable: is it a good idea to duplicate the kernel,
>dtb and initramfs partitions and detect in some way a failure in u-boot, so that a second
>copy can be loaded if the first fails?
Yes, it is a reasonable idea.
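The selection logic behind such a fallback can be sketched as follows. This is a Python model of the idea only, not U-Boot code: the function name `pick_first_valid` and the use of a single known-good CRC32 are assumptions made for illustration.

```python
import zlib


def pick_first_valid(copies, expected_crc):
    """Return (index, data) of the first copy whose CRC32 matches, else None."""
    for index, data in enumerate(copies):
        if zlib.crc32(data) & 0xFFFFFFFF == expected_crc:
            return index, data
    return None


good = b"initramfs payload"
corrupted = bytearray(good)
corrupted[3] ^= 0x01  # simulate a single bitflip in the primary copy

# The primary copy fails the check, so the second copy is selected.
result = pick_first_valid([bytes(corrupted), good], zlib.crc32(good) & 0xFFFFFFFF)
print(result[0] if result else "no valid copy")
```

On the target, the equivalent would be to attempt the primary images first and load the second set only when verification of the first fails.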
>we read on Micron documentation that bad blocks are marked only on erase and write
>operations, but we assume that they are also marked as bad when reading them if there
>are too many bitflips to correct the errors. Is it right?
Yes, that is right. The NXP NAND driver maintains its own bad block table, separate from Micron's.
Best regards
igor
>>we read on Micron documentation that bad blocks are marked only on erase and write
>>operations, but we assume that they are also marked as bad when reading them if there
>>are too many bitflips to correct the errors. Is it right?
>yes right. NXP NAND driver supports own bad block table, different from micron.
Here you are talking about the DBBT in the BCB, correct?
We notice that it's reset each time we write u-boot.
Thanks for your advice.
That's really helpful, thank you.
Do you know how we can tell if this BBI swap + BBT solution is enabled?
Our system runs the imx_4.1.15_2.0.0_ga, both for the running system and for flashing. Our kobs-ng is the latest version and u-boot is 2016.03.
In our source the cited files (drivers/mtd/nand/mxc_nd2.*) are not present; we only have drivers/mtd/nand/mxc_nand.*
Best regards.
Yes, it is enabled. You can look at the NAND driver sources; pay attention to
/* add our owner bbt descriptor */
static struct nand_bbt_descr gpmi_bbt_descr
Best regards
igor
I confirm it's enabled on our system (CONFIG_MTD_NAND_MXC=y is also set).
Thanks for your help.