i.MX28: BCH error in L2.6.35_1.1.0_130130


PeterChan
NXP Employee

If you have the NAND driver enabled in Linux BSP release L2.6.35_1.1.0_130130 and experience BCH timeout errors on i.MX28, please try this patch.

--- a/drivers/mtd/nand/gpmi-nfc/gpmi-nfc-hal-v1.c
+++ b/drivers/mtd/nand/gpmi-nfc/gpmi-nfc-hal-v1.c
@@ -46,8 +46,15 @@ static int init(struct gpmi_nfc_data *this)
        clk_enable(resources->clock);
        /* Reset the GPMI block. */
-
-       mxs_reset_block(resources->gpmi_regs + HW_GPMI_CTRL0, false);
+   /*
+    * Reset the BCH block. Notice that we pass in true for the just_enable
+    * flag. This is because the soft reset for the version 0 BCH block
+    * doesn't work and the version 1 BCH block is similar enough that we
+    * suspect the same (though this has not been officially tested). If you
+    * try to soft reset a version 0 BCH block, it becomes unusable until
+    * the next hard reset.
+    */
+       mxs_reset_block(resources->gpmi_regs + HW_GPMI_CTRL0, true);
        /* Choose NAND mode. */
        __raw_writel(BM_GPMI_CTRL1_GPMI_MODE,
@@ -108,7 +115,15 @@ static int set_geometry(struct gpmi_nfc_data *this)
        clk_enable(resources->clock);
        /* reset the BCH */
-       mxs_reset_block(resources->bch_regs, false);
+   /*
+    * Reset the BCH block. Notice that we pass in true for the just_enable
+    * flag. This is because the soft reset for the version 0 BCH block
+    * doesn't work and the version 1 BCH block is similar enough that we
+    * suspect the same (though this has not been officially tested). If you
+    * try to soft reset a version 0 BCH block, it becomes unusable until
+    * the next hard reset.
+    */
+       mxs_reset_block(resources->bch_regs, true);
        /* Configure layout 0. */
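
To make the difference between a full soft reset and the "just enable" path concrete, here is a minimal host-side sketch. It is not the BSP's mxs_reset_block(): struct fake_block and the fake_set()/fake_clr() helpers are hypothetical stand-ins for a block's control register, the real helper also polls with timeouts, and the only hardware detail assumed is that SFTRST and CLKGATE sit in bits 31 and 30 of the control register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions of SFTRST and CLKGATE in a block's control register. */
#define MODULE_SFTRST  (1u << 31)
#define MODULE_CLKGATE (1u << 30)

/* Hypothetical in-memory stand-in for one block's control register. */
struct fake_block {
    uint32_t ctrl;
    unsigned soft_resets;   /* how many times SFTRST was asserted */
};

static void fake_set(struct fake_block *b, uint32_t mask)
{
    b->ctrl |= mask;
    if (mask & MODULE_SFTRST) {
        /* Model only: asserting SFTRST also gates the block's clock. */
        b->ctrl |= MODULE_CLKGATE;
        b->soft_resets++;
    }
}

static void fake_clr(struct fake_block *b, uint32_t mask)
{
    b->ctrl &= ~mask;
}

/* Simplified reset helper; real code polls each step with a timeout. */
static void reset_block_model(struct fake_block *b, bool just_enable)
{
    fake_clr(b, MODULE_SFTRST);      /* take the block out of reset */
    fake_clr(b, MODULE_CLKGATE);     /* ungate its clock */

    if (!just_enable) {
        fake_set(b, MODULE_SFTRST);  /* full soft reset */
        fake_clr(b, MODULE_SFTRST);  /* release reset again */
        fake_clr(b, MODULE_CLKGATE); /* and ungate the clock again */
    }
}
```

In this model both paths leave the block out of reset with its clock running; the only difference is whether SFTRST was ever asserted, which is exactly what the just_enable flag in the patch controls.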

35 Replies

ThomasBandelier
Contributor II

Hi Peter,

We have had several issues with NAND startup. The first was a DMA timeout, which was solved by calling the mxs_reset routine with just_enable set to false.

The second issue was that sometimes (about 1 boot in 100) the NAND was completely unusable (returning only 0x00 or 0xFF), causing UBI to report all PEBs as corrupted.

We have backported an upstream fix (not coming from freescale BSP):

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/mtd/nand/gpmi-nand?id...


So it seems we were not the only ones to have this problem. I can confirm that the second BCH reset introduced in this patch solves our issue.


Is there any official statement from FSL on this issue? Will this fix be part of a future BSP release?



Best Regards


Thomas

PeterChan
NXP Employee

Hello Thomas,

I see that this upstream fix https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/mtd/nand/gpmi-nand?id... can be found in our latest Linux BSP release, L3.0.35_xxx. So this fix will be included if the system is built from Yocto.

My patch here is based on L3.0.35_1.1.0. It does not have the file gpmi-lib.c. Which Linux BSP release do you use?

Which platform are you concerned with, i.MX233 or i.MX28? For i.MX233, erratum #2847 describes this issue.

In summary, may I say that setting "just_enable" to true can resolve all NAND startup issues?

Best Regards,

Peter

ThomasBandelier
Contributor II

Hi Peter,

Do you have any feedback to share on this BCH issue? Is there a root-cause description available that explains why we need the additional BCH reset?

Have there been no other occurrences of this issue with other customers?

Thanks


Thomas

PeterChan
NXP Employee

Hello Thomas,

I am collecting information from our design team.

By the way, could you please confirm that your additional BCH reset is added here?

diff --git a/drivers/mtd/nand/gpmi-nfc/gpmi-nfc-hal-v1.c b/drivers/mtd/nand/gpmi-nfc/gpmi-nfc-hal-v1.c
index 6cc2ca2..ca3f0ed 100644
--- a/drivers/mtd/nand/gpmi-nfc/gpmi-nfc-hal-v1.c
+++ b/drivers/mtd/nand/gpmi-nfc/gpmi-nfc-hal-v1.c
@@ -49,6 +49,8 @@ static int init(struct gpmi_nfc_data *this)
        mxs_reset_block(resources->gpmi_regs + HW_GPMI_CTRL0, false);
+      mxs_reset_block(resources->bch_regs, false);
+
        /* Choose NAND mode. */
        __raw_writel(BM_GPMI_CTRL1_GPMI_MODE,
                                resources->gpmi_regs + HW_GPMI_CTRL1_CLR);

Thanks,

Peter

ThomasBandelier
Contributor II

Hello Peter,

Yes this is correct, the additional BCH reset we introduced is just after the GPMI reset.

It looks like this solved the main BCH timeout issue. However, the customer still reproduced the issue with this patch, although at a lower rate (around 0.2% instead of 1% without the patch).

If there is any other lead to follow in the investigation, let us know.

Waiting for your team's feedback.

BR

Thomas

PeterChan
NXP Employee

Hi Thomas,

On the i.MX28 EVK board, I wrote an init script that runs the reboot command for NAND boot testing, but I was not able to reproduce any hang after 5000 reboots.


May I know your steps to reproduce this problem? Do you have the boot log when the device fails to boot?

Thanks,

Peter

ThomasBandelier
Contributor II

Hi Peter,

The dmesg output of the issue is attached.

The additional mxs_reset is as you stated previously.

A BCH timeout doesn't appear in the log, but the whole NAND returns bad data to UBI.

To reproduce, we just reboot the board.

Thomas

Note: we also hit this BCH timeout issue in u-boot 2009.8. The additional reset has improved the behaviour, i.e. it drastically lowered the number of occurrences, but some of them persist.

Also attaching the u-boot log.

PeterChan
NXP Employee

Hi Thomas,

Do you load the linux kernel by u-boot or run the kernel from boot stream as in default BSP?

In the L2.6.35_130130 u-boot, we intended not to soft-reset the BCH block when its version is lower than CONFIG_GPMI_NFC_V2, but it looks like we made a mistake here.

static int set_geometry(struct mtd_info *mtd)
{
. . .
#if defined(CONFIG_GPMI_NFC_V2)
    gpmi_nfc_reset_block((void *)CONFIG_BCH_REG_BASE + HW_BCH_CTRL, 0);
#else
    gpmi_nfc_reset_block((void *)CONFIG_BCH_REG_BASE + HW_BCH_CTRL, 1);
#endif

where gpmi_nfc_reset_block() actually performs a soft reset of the block when its second parameter is true. Could you please tell me whether your u-boot soft-resets the BCH block?

Thanks,

Peter

ThomasBandelier
Contributor II

Hi Peter,

>Do you load the linux kernel by u-boot or run the kernel from boot stream as in default BSP?

u-boot loads it from NAND in our setup (nboot / bootm commands). u-boot is also stored on NAND.

About the mistake you're highlighting: it is already fixed in our baseline. We reset NFCv1 (i.MX28) in the same way as NFCv2.

To be clear: yes, we really do soft-reset the NFC in the set_geometry routine.

Additionally, I looked into the latest code from the FSL git tree for u-boot 2009.8 and saw that this was added in 2012 to the gpmi_nfc_hal init, regardless of the NFC version:

/* Set the busy_timeout. */
    REG_SET(CONFIG_GPMI_REG_BASE, HW_GPMI_TIMING1,
            BF_GPMI_TIMING1_DEVICE_BUSY_TIMEOUT(0x500));

What is strange is that in the L2.6.35 code for the GPMI NAND driver, this GPMI_TIMING1 register is only set for version 2 of the chip.

Also, the addition of this register write in u-boot appears to be related to ENGR00217505, for which I can't find a description on the web.

The commit says that GPMI can be unstable if we don't set this register, so is it really dependent on the NFC version?

I'm currently testing a new u-boot with this GPMI register write integrated.

Thomas

ThomasBandelier
Contributor II

Hi again Peter,

The BUSY_TIMEOUT setting did not solve my problem in u-boot.

I am now looking at the clock setup. Since this issue is sporadic, it could be linked to the clock not always being set correctly.

In our environment we have:

pll.0 -> ref_gpmi -> |GPMI_FRAC| -> gpmi_clk

- In u-boot we made adaptations to match this scheme (the bypass and frac values have been double-checked and seem OK).

- In the kernel, things look odd in the routine called to set the GPMI clock parent:

static int gpmi_set_parent(struct clk *clk, struct clk *parent)
{
    int ret = -EINVAL;
    if (clk->bypass_reg) {
        if (parent == clk->parent)
            return 0;
        if (parent == &ref_xtal_clk) {
            __raw_writel(1 << clk->bypass_bits,
                clk->bypass_reg + SET_REGISTER);
            ret = 0;
        }
        if (ret && (parent == &ref_gpmi_clk)) {
            __raw_writel(0 << clk->bypass_bits,
                clk->bypass_reg + CLR_REGISTER);
            ret = 0;
        }
        if (!ret)
            clk->parent = parent;
    }
    return ret;
}

"0" is written to the CLR register when setting ref_gpmi as the GPMI clock parent! I think this can have the following impacts:

- the GPMI bypass bit is never cleared by this routine

- if the register has been set correctly by u-boot (our case), there is a good chance that it works

- in some cases, for unknown electrical reasons, the register is not set correctly (i.e. the bit is set to '1') and its value is not changed by the set_parent routine. The GPMI block then keeps the xtal clock as its reference, so it might probe, BUT the data returned by reads will be complete garbage. This is the problem we still see in 0.2% of cases (see log_error.txt above)

This would be fairly similar to the problem we had earlier on SAIF (the clock tree was not always set correctly).

What do you think about this theory? Is it normal that "0" is written into a CLR register?
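
The effect described above can be reproduced with a small host-side model of a register that has SET and CLR aliases. This is only a sketch: struct setclr_reg and its helpers are hypothetical, the bit number used in the usage note is arbitrary, and select_pll_fixed() shows what a correction would presumably look like rather than a fix taken from any BSP release.

```c
#include <stdint.h>

/* Hypothetical model of a register with SET and CLR write aliases. */
struct setclr_reg {
    uint32_t value;
};

/* Writing a mask to the SET alias sets exactly the bits in the mask. */
static void reg_write_set(struct setclr_reg *r, uint32_t mask)
{
    r->value |= mask;
}

/* Writing a mask to the CLR alias clears exactly the bits in the mask. */
static void reg_write_clr(struct setclr_reg *r, uint32_t mask)
{
    r->value &= ~mask;
}

/* Mirrors the questioned kernel line: 0 << bit is always 0, so no bit changes. */
static void select_pll_buggy(struct setclr_reg *r, unsigned bypass_bit)
{
    reg_write_clr(r, 0u << bypass_bit);
}

/* Presumed correction (an assumption): write a 1 in the bypass bit position. */
static void select_pll_fixed(struct setclr_reg *r, unsigned bypass_bit)
{
    reg_write_clr(r, 1u << bypass_bit);
}
```

Starting from a register whose bypass bit is set, select_pll_buggy() leaves the value untouched, while select_pll_fixed() clears the bit. That matches the theory that the kernel routine silently inherits whatever u-boot or the ROM left in the register.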

Thanks

Thomas

PeterChan
NXP Employee

Hi Thomas,

Good catch! Writing "0" to the CLR register has no effect, so the HW_CLKCTRL_CLKSEQ::BYPASS_GPMI bit simply keeps the value left by u-boot or the ROM code.

However, I don't think this kernel bug is the root cause of the DMA timeout issue, because your u-boot code has this bug resolved, yet boot errors are still occasionally seen while u-boot is loading the kernel.

When the i.MX28 exits from ROM code, the GPMI is sourced from the 24 MHz XTAL and its clock rate is also 24 MHz. To switch the GPMI source to the PLL, the PLL must be on and the dividers must be properly set before clearing the HW_CLKCTRL_CLKSEQ::BYPASS_GPMI bit. Below is the pseudo code for this:

            HW_CLKCTRL_FRAC.B.CLKGATEIO = 0;
            HW_CLKCTRL_FRAC.B.IOFRAC = PfdDiv;
            HW_CLKCTRL_GPMI.B.CLKGATE = 0;
            HW_CLKCTRL_GPMI.B.DIV = IntDiv;
            HW_CLKCTRL_CLKSEQ.B.BYPASS_GPMI = 0;

I suggest performing the GPMI frequency change as the first step, followed by the GPMI block soft reset (HW_GPMI_CTRL) using the procedure in "39.5.10 Correct Way to Soft Reset a Block" in the i.MX28 reference manual. Finally, just enable the BCH block (without a soft reset). Could you please try this and let me know the result?
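
The ordering suggested above can be written down as a small state model that flags any step taken out of order. This is a sketch for reasoning about the sequence only: struct gpmi_clk_model, its fields and all function names are hypothetical and do not correspond to a real register layout, and the starting state is assumed except for the bypass bit, which is described above as selecting the 24 MHz XTAL after ROM exit.

```c
#include <stdbool.h>

/* Hypothetical software model; the fields are not a real register layout. */
struct gpmi_clk_model {
    bool frac_gated;       /* fractional divider output gated */
    unsigned frac_div;     /* fractional divider value, 0 = not configured */
    bool gpmi_gated;       /* GPMI clock gated */
    unsigned int_div;      /* integer divider value */
    bool bypass;           /* true: GPMI runs from the 24 MHz crystal */
    bool gpmi_reset_done;  /* GPMI block has been soft-reset */
    bool bch_enabled;      /* BCH block enabled (no soft reset) */
    bool order_ok;         /* cleared as soon as a step runs out of order */
};

/* Assumed state at ROM exit: clock bypassed to the crystal. */
static void model_after_rom(struct gpmi_clk_model *m)
{
    m->frac_gated = true;
    m->frac_div = 0;
    m->gpmi_gated = false;
    m->int_div = 1;
    m->bypass = true;
    m->gpmi_reset_done = false;
    m->bch_enabled = false;
    m->order_ok = true;
}

/* Step 1: configure gates and dividers, then drop the bypass last. */
static void switch_gpmi_to_pll(struct gpmi_clk_model *m,
                               unsigned frac_div, unsigned int_div)
{
    m->frac_gated = false;
    m->frac_div = frac_div;
    m->gpmi_gated = false;
    m->int_div = int_div;
    if (m->frac_div == 0 || m->int_div == 0)
        m->order_ok = false;   /* would select an unconfigured clock */
    m->bypass = false;
}

/* Step 2: soft-reset GPMI only after the clock change. */
static void soft_reset_gpmi(struct gpmi_clk_model *m)
{
    if (m->bypass)
        m->order_ok = false;
    m->gpmi_reset_done = true;
}

/* Step 3: enable BCH without a soft reset, after the GPMI reset. */
static void enable_bch(struct gpmi_clk_model *m)
{
    if (!m->gpmi_reset_done)
        m->order_ok = false;
    m->bch_enabled = true;
}

/* Runs the three steps in the suggested order; returns true if none was out of order. */
static bool gpmi_bringup(struct gpmi_clk_model *m,
                         unsigned frac_div, unsigned int_div)
{
    switch_gpmi_to_pll(m, frac_div, int_div);
    soft_reset_gpmi(m);
    enable_bch(m);
    return m->order_ok;
}
```

Calling soft_reset_gpmi() on a freshly initialised model, before the clock switch, clears order_ok; that is the situation the suggested ordering is meant to rule out.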

Thanks,

Peter

ThomasBandelier
Contributor II

Hi again,

After a second look at the error we have, it seems the first error comes from the DMA. The BCH timeout looks like a consequence of a DMA error.

Is there a possibility that u-boot 2009.8 does not contain the latest DMA driver updates?

PeterChan
NXP Employee

Hi Thomas,

I have verified the GPMI frequency change procedure in u-boot using this patch, and I don't see any NAND read issue in my test. The patch switches the GPMI clock from 24 MHz sourced from the XTAL to 40 MHz (an arbitrary value) sourced from PLL0. So you can try using this code to change the GPMI frequency.

About the u-boot DMA error: mxs_dma_go() only uses an arbitrary fixed value, "timeout = 10000", to poll HW_APBH_CTRL1 10000 times for completion. If for some reason the DMA does not complete in time, the error will occur. You may try using a larger timeout value.
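
A poll-count timeout of this kind can be sketched as follows. This is not the u-boot mxs_dma_go() code: dma_done_fn, poll_dma_done() and the fake device are hypothetical names used for illustration, and the "done" check stands in for reading the real status register.

```c
#include <stdbool.h>

/* Hypothetical status callback: returns true once the DMA channel is done. */
typedef bool (*dma_done_fn)(void *ctx);

/* Poll up to max_polls times; return the number of polls used, or -1 on timeout. */
static long poll_dma_done(dma_done_fn done, void *ctx, long max_polls)
{
    for (long i = 1; i <= max_polls; i++) {
        if (done(ctx))
            return i;
    }
    return -1;
}

/* Fake device for illustration: reports completion after `remaining` polls. */
struct fake_dma {
    long remaining;
};

static bool fake_dma_done(void *ctx)
{
    struct fake_dma *d = ctx;

    if (d->remaining > 0)
        d->remaining--;
    return d->remaining == 0;
}
```

With a fake transfer that needs 12000 polls, a limit of 10000 reports a timeout while a limit of 20000 succeeds. Note that a poll count is not a wall-clock time: how long 10000 iterations last depends on CPU and bus speed, so the same constant gives a different real timeout on different clock setups.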

Thanks,

Peter

ThomasBandelier
Contributor II

Hi Peter,

I tried the patch (modified to get 96 MHz as the GPMI clock) and u-boot does not work anymore. It looks like the GPMI NAND driver is not functional with this approach in our environment.

About the timeout value of mxs_dma_go: I had already tried setting the timeout to a deliberately very high value (10 seconds). This lowered the number of occurrences but did not eliminate the issue (we could still reproduce it in 0.1% to 0.8% of boots).

An interesting statistic:

- if we use the 'reboot' command from Linux, we see the issue in 0.1% of cases (over 4000 iterations)

- if we use the 'poweroff' command (which is wired to a reset of the board in the current kernel), we see the issue in 0.8% of cases (over 7000 iterations)

Do you know of any difference between the poweroff and reboot sequences on i.MX28? I could not find any difference between them.

Thomas

PeterChan
NXP Employee

Hi Thomas,

BTW, if you do not change the GPMI frequency and clock source and use what the boot ROM provides, do you see any boot issue in u-boot or the kernel? This will help to confirm that the boot issue is NOT caused by improper NAND timing parameters due to a high GPMI clock frequency.

Thanks,

Peter

ThomasBandelier
Contributor II

Hi Peter,

I will try to retest this with the default BOOTROM clocking.


Thomas

PeterChan
NXP Employee

Hi Thomas,

With the default BOOTROM clocking, do you still see the BCH error?

Peter

ThomasBandelier
Contributor II

Hi Peter,

I finally had the opportunity to come back to this.

I tried reverting the GPMI clock setting to the u-boot default (24 MHz, ref: pll.0) and the behaviour seems much better, but at the cost of performance.

Our configuration was 96 MHz for the GPMI clock, for performance reasons. I tried going down to 80 MHz, but the issue still persisted.

What is a reasonable tradeoff between speed and stability from FSL's perspective? Do you have feedback from other customers on this?

Thanks

Thomas

PeterChan
NXP Employee

Hi Thomas,

I ran a similar test by changing the GPMI clock to 96 MHz in u-boot on the EVK, but I don't see any BCH error or boot failure after 400 reboots. I will keep it running until tomorrow.

The BSP I am using is L2.6.35_1.1.0_130130, and in this release both GPMI and BCH are soft-reset in u-boot and in the kernel. In my test, the GPMI clock is changed to 96 MHz in u-boot. The uImage is loaded by u-boot from an MTD partition offset. The rootfs is mounted on a UBI volume, and an init script was written to reboot the system.

Since erratum 2847 is only applicable to NFC v0 and the i.MX28 has NFC v1, it looks like my original post here is misleading.

Even though the GPMI runs at 96 MHz, the NAND part I am using is a slow Samsung part, K9F1G08U0A, and its timing is the BSP default. If you use fast timing with a tight margin, could the BCH error or boot failure be caused by inappropriate NAND timing? Do you still see this problem when using relaxed NAND timing (longer setup/hold times)?

Unless inappropriate timing is used or the GPMI clock is changed during a NAND read/write, I don't see any tradeoff between GPMI speed and stability.
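
To see how the GPMI clock rate interacts with NAND setup/hold margins, here is a rough sketch of the arithmetic, assuming the timing fields are programmed in whole GPMI clock cycles. The helper names are hypothetical, and the 12 ns requirement in the usage note is an illustrative value, not a figure from any datasheet mentioned in this thread.

```c
#include <stdint.h>

/* Round a duration in nanoseconds up to whole clock cycles. */
static uint32_t ns_to_cycles(uint32_t ns, uint32_t clk_khz)
{
    /* ns * kHz equals cycles * 1e6, so divide by 1e6 and round up. */
    uint64_t scaled = (uint64_t)ns * clk_khz;

    return (uint32_t)((scaled + 999999u) / 1000000u);
}

/* Actual duration, in picoseconds, of `cycles` clock cycles (rounded down). */
static uint64_t cycles_to_ps(uint32_t cycles, uint32_t clk_khz)
{
    /* One cycle lasts 1e9 / clk_khz picoseconds. */
    return (uint64_t)cycles * 1000000000u / clk_khz;
}
```

For a hypothetical 12 ns requirement, ns_to_cycles(12, 24000) gives 1 cycle, which lasts about 41.7 ns at 24 MHz, while ns_to_cycles(12, 96000) gives 2 cycles, which last about 20.8 ns at 96 MHz. Both satisfy the requirement, but the absolute slack is much smaller at the higher clock, and a cycle count that was tuned at one clock rate and left unchanged at a faster one could fall below the requirement.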

Thanks,

Peter

ThomasBandelier
Contributor II

Hi Peter!

Thank you for the feedback, much appreciated.

We are using the minimum setup and hold times that the NAND supplier publishes. Increasing the margins could indeed be a lead.

Note that the error might only be reproduced after several thousand iterations, depending on the board.
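
As a rough guide to how many reboots a test needs before a clean run means anything, the sketch below computes the chance of seeing no failure at all. It assumes every reboot fails independently with the same probability, which nothing in this thread establishes, and the function names are hypothetical.

```c
/* Probability of zero failures in n independent trials with failure rate p. */
static double p_no_failure(double p, unsigned n)
{
    double q = 1.0;

    for (unsigned i = 0; i < n; i++)
        q *= 1.0 - p;
    return q;
}

/*
 * Smallest number of trials for which the chance of seeing at least one
 * failure reaches `confidence`. Returns 0 if that can never happen
 * (p <= 0) or the request is degenerate (confidence >= 1).
 */
static unsigned reboots_needed(double p, double confidence)
{
    unsigned n = 0;
    double q = 1.0;

    if (p <= 0.0 || confidence >= 1.0)
        return 0;
    while (1.0 - q < confidence) {
        q *= 1.0 - p;
        n++;
    }
    return n;
}
```

Under that assumption, a board failing 0.8% of the time would pass 400 consecutive reboots about 4% of the time, so a clean 400-reboot run is only weak evidence. A board failing 0.2% of the time would pass 5000 consecutive reboots with a probability well under 0.01%, and reboots_needed(0.002, 0.95) comes out at roughly 1500 reboots.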

I think we will continue in this direction on our side (relaxed NAND timing). Let me know if you manage to get the error on your side with a high GPMI clock.

BR,

Thomas
