Problem with EXT4 corruption on SD card on custom MX6Q board

7,661 Views
dcrutchley
Contributor II

Hi,

We are developing a custom board closely based on the MX6Q SabreSD dev board. There are some differences.

We currently do not have any networking capabilities on our rev 1 board but on rev 2 we are adding a Marvell PHY and will enable gigabit networking. We are using microSD instead of full-size SD. We have microSD on SD2 (currently not fitted except for a resistor pull-up), eMMC on SD3 and microSD on SD4. We are handling the power management ourselves so also don't use the PMIC.

We have ported the MX6Q SabreSD u-boot files and kernel device tree to accommodate our differences compared to the SabreSD board. So for u-boot we have our own device with a board.c, board.h, board_common.h, board_defconfig, and mods to the Kconfig files as appropriate. For the kernel we have our own device tree files similar to the Sabre board.dts and board.dtsi.

At this point u-boot works and loads the kernel, and we log into our minimal OS via PuTTY, have access to our root filesystem, and can run various apps in the terminal. We are booting from a microSD card in SD4.

We thought all was going well but we tried to extract the contents of a large tar file (tar is 475MB approx) from the micro SD card which we booted from into a folder on the eMMC and part way through we saw some EXT4 filesystem errors flash by of the type "ext4-fs error (device mmcblk2p1): htree_dirblock_to_tree:1007: inode....." and then we see a message flash by reporting it is remounting our filesystem as read only. From this point on each file that is trying to be extracted from the tar file fails and it complains because now the file system is read only and actually it turns out to be corrupted. We have to reflash the SD card.

Has anyone encountered this issue before? It happens every time we try to do too much I/O from the SD card.

We have also built a SabreSD dev board compatible version of u-boot, the kernel and root filesystem from the same code base, and used the same microSD card, in a micro to full-size SD converter, to run the image on the SabreSD dev board. Performing the same tests repeatedly on the SabreSD board we get no errors; everything works fine. We have also tried several microSD cards and we find that on the SabreSD dev board it all works, but on our board we get EXT4 filesystem corruption occurring randomly.

We are using U-Boot 2018.11 and Linux Kernel 4.19 RT and minimal Debian 9.9 stretch for our minimal OS and root filesystem. We create our image following the steps here: i.MX6q SABRE Board for Smart Devices - Linux on ARM - eewiki. We have tried Yocto in the past but wanted to move to a newer kernel and u-boot version, so I haven't tried a Yocto build for a while to see if we have the same issues with our board.

Any help/suggestions would be very much appreciated.

Kind regards

Duncan

19 Replies

dcrutchley
Contributor II

Thanks for the detailed help and suggestions.

I will sit down when I next get a chance and start going through this. I'll try and get devmem built into our image as you suggest and will go from there.

Best regards,

Duncan

dcrutchley
Contributor II

Hi Tom,

I found time to get memtool (instead of devmem) installed on our board, as we are running Debian stretch 9.9 minimal on top of Linux Kernel 4.19, and was able to dump the registers as you suggested. Comparing our output with that from our SabreSD board, we could see some differences, but they were all what we'd expect; for example, we aren't using any display outputs (HDMI, LVDS etc.) and we have different RAM. Otherwise there weren't any standout differences that were alarming.

I am now going to check dmesg as well as running in 1-bit mode among other things and will post my findings.

Many thanks

Duncan

dcrutchley
Contributor II

I have tried to run in 1-bit mode. It makes it through u-boot and most of the way through loading the kernel, and then gets the following error:

[   12.512700] mmc2: Timeout waiting for hardware interrupt.
[   12.518107] mmc2: sdhci: ============ SDHCI REGISTER DUMP ===========
[   12.524553] mmc2: sdhci: Sys addr:  0x3eb15000 | Version:  0x00000002
[   12.530999] mmc2: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000008
[   12.537446] mmc2: sdhci: Argument:  0x00000000 | Trn mode: 0x0000003b
[   12.543892] mmc2: sdhci: Present:   0x01fd820e | Host ctl: 0x00000013
[   12.550339] mmc2: sdhci: Power:     0x00000002 | Blk gap:  0x00000080
[   12.556783] mmc2: sdhci: Wake-up:   0x00000008 | Clock:    0x0000003f
[   12.563231] mmc2: sdhci: Timeout:   0x0000008f | Int stat: 0x00000000
[   12.569678] mmc2: sdhci: Int enab:  0x117f100b | Sig enab: 0x117f100b
[   12.576122] mmc2: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000003
[   12.582567] mmc2: sdhci: Caps:      0x07eb0000 | Caps_1:   0x0000a000
[   12.589012] mmc2: sdhci: Cmd:       0x0000123a | Max curr: 0x00ffffff
[   12.595458] mmc2: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00ee7c7f
[   12.601903] mmc2: sdhci: Resp[2]:   0x325b5900 | Resp[3]:  0x00000900
[   12.608348] mmc2: sdhci: Host ctl2: 0x00000000
[   12.612796] mmc2: sdhci: ADMA Err:  0x00000003 | ADMA Ptr: 0x4d061204
[   12.619239] mmc2: sdhci: ============================================

To set 1-bit mode I have done the following in our u-boot board.c file:

static iomux_v3_cfg_t const usdhc4_pads[] = {
    IOMUX_PADS(PAD_SD4_CLK__SD4_CLK   | MUX_PAD_CTRL(USDHC_PAD_CTRL)),
    IOMUX_PADS(PAD_SD4_CMD__SD4_CMD   | MUX_PAD_CTRL(USDHC_PAD_CTRL)),
    IOMUX_PADS(PAD_SD4_DAT0__SD4_DATA0 | MUX_PAD_CTRL(USDHC_PAD_CTRL)),
};

static const struct boot_mode board_boot_modes[] = {
    {"sd2",     MAKE_CFGVAL(0x40, 0x28, 0x00, 0x00)},
    {"sd4",     MAKE_CFGVAL(0x40, 0x18, 0x00, 0x00)},
    {"emmc", MAKE_CFGVAL(0x60, 0x50, 0x00, 0x00)},    
    {NULL,     0},
};

In our device tree dtsi file I have done:

pinctrl_usdhc4: usdhc4grp {
            fsl,pins = <
                MX6QDL_PAD_SD4_CMD__SD4_CMD        0x17059
                MX6QDL_PAD_SD4_CLK__SD4_CLK        0x10059
                MX6QDL_PAD_SD4_DAT0__SD4_DATA0        0x17059
            >;
        };

(we also tried values of 0x17019 and 0x10019 in the device tree as well but it didn't help).

We also set the dip switches on our board for 1-bit mode instead of 4-bit mode.

Is there anything else we should do to enable 1-bit mode?

If I revert the changes back to our 4-bit mode setup the board boots again but we still have the EXT4 issue.

igorpadykov
NXP Employee

Hi Duncan

Also set bus-width in the dtsi file, as described in sect. 3.3.6 "Device Tree Binding" of the attached Linux Manual.

Best regards
igor

dcrutchley
Contributor II

Hi Igor,

In our device tree we have:

pinctrl_usdhc4: usdhc4grp {
    fsl,pins = <
        MX6QDL_PAD_SD4_CMD__SD4_CMD        0x17059
        MX6QDL_PAD_SD4_CLK__SD4_CLK        0x10059
        MX6QDL_PAD_SD4_DAT0__SD4_DATA0        0x17059
        >;
    };

and

&usdhc4{
    pinctrl-names = "default";
    pinctrl-0 = <&pinctrl_usdhc4>;
    bus-width = <1>;
    broken-cd;
    disable-wp;
    no-1-8-v;
    keep-power-in-suspend;
    enable-sdio-wakeup;
    status = "okay";
};

Still getting the issue above when Linux is booting in 1-bit mode. Is there anything else we need to do?

Kind regards

Duncan

testbed
Contributor III

Hello Duncan,

So, did this issue get resolved? We have a custom i.MX6S board that is exhibiting the same behavior.

Regards

- Rishi

dcrutchley
Contributor II

Hi Rishi,

We resolved this issue.

After a lot of head-scratching and confusion we realised we had issues with the RAM configuration in our layout, which was causing the RAM calibration values we'd set to be ever so slightly off. They were correct enough to allow the board to boot and be used interactively via PuTTY, but as soon as any intensive operation started that required transferring non-trivial amounts of data to/from RAM, such as to/from a network socket or to/from an SD card, we'd get the issue.

Once we resolved our RAM issues by making a small design change to our board and re-calibrated the RAM we have since never had the issue again.
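
For readers wondering how marginal RAM might show up under load, here is a minimal Python sketch of a copy-and-compare test. It is not the method used in this thread (which is not described); the function name and parameters are hypothetical, and a user-space test like this is far weaker than a dedicated tool such as memtester or a vendor DDR calibration/stress tool, because CPU caches can hide faults.

```python
import os


def ram_pattern_test(size_mb=64, passes=2):
    """Copy random data between two buffers and count bytes that differ.

    Hypothetical helper: a crude stand-in for a real memory tester.
    Returns the total number of mismatching bytes over all passes.
    """
    size = size_mb * 1024 * 1024
    mismatches = 0
    for _ in range(passes):
        source = os.urandom(size)      # random pattern, held in RAM
        copy = bytearray(source)       # forces a full write to a second buffer
        if copy != source:             # whole-buffer compare first
            # Only walk byte by byte when something actually differs.
            mismatches += sum(a != b for a, b in zip(copy, source))
    return mismatches
```

On healthy memory the call `ram_pattern_test()` should return 0; any non-zero result on the target would be a reason to look at the RAM before the SD interface.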

Best regards

Duncan

testbed
Contributor III

Hello Duncan,

Thanks for the information. May I know how you figured out that there was an issue with the RAM?

My issue is during access to the SD card, where a hardware interrupt timeout is followed by filesystem errors as shown below. It does not remount the FS as read-only though. I have tried all acceptable pin configurations for the USDHC interface in 4-bit mode.

devtmpfs: mounted
Freeing unused kernel memory: 1024K
mmc2: Timeout waiting for hardware interrupt.
sdhci: =========== REGISTER DUMP (mmc2)===========
sdhci: Sys addr: 0x8c04b000 | Version:  0x00000002
sdhci: Blk size: 0x00000200 | Blk cnt:  0x00000027
sdhci: Argument: 0x0001af5e | Trn mode: 0x00000033
sdhci: Present:  0x01ed8a8e | Host ctl: 0x00000013
sdhci: Power:    0x00000002 | Blk gap:  0x00000080
sdhci: Wake-up:  0x00000008 | Clock:    0x0000003f
sdhci: Timeout:  0x0000008f | Int stat: 0x00000000
sdhci: Int enab: 0x107f100b | Sig enab: 0x107f100b
sdhci: AC12 err: 0x00000000 | Slot int: 0x00000103
sdhci: Caps:     0x07eb0000 | Caps_1:   0x0000a007
sdhci: Cmd:      0x0000123a | Max curr: 0x00ffffff
sdhci: Host ctl2: 0x00000000
sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x6c048210
sdhci: ===========================================
mmcblk2: error -110 transferring data, sector 110430, nr 224, cmd response 0x900, card status 0xb00
mmc2: Timeout waiting for hardware interrupt.
sdhci: =========== REGISTER DUMP (mmc2)===========
sdhci: Sys addr: 0x8c04b000 | Version:  0x00000002
sdhci: Blk size: 0x00000200 | Blk cnt:  0x00000027
sdhci: Argument: 0x0001af5e | Trn mode: 0x00000033
sdhci: Present:  0x01ed8a8e | Host ctl: 0x00000013
sdhci: Power:    0x00000002 | Blk gap:  0x00000080
sdhci: Wake-up:  0x00000008 | Clock:    0x0000003f
sdhci: Timeout:  0x0000008f | Int stat: 0x00000000
sdhci: Int enab: 0x107f100b | Sig enab: 0x107f100b
sdhci: AC12 err: 0x00000000 | Slot int: 0x00000103
sdhci: Caps:     0x07eb0000 | Caps_1:   0x0000a007
sdhci: Cmd:      0x0000123a | Max curr: 0x00ffffff
sdhci: Host ctl2: 0x00000000
sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x6c048210
sdhci: ===========================================
mmcblk2: error -110 transferring data, sector 110430, nr 224, cmd response 0x900, card status 0xb00
mmcblk2: retrying using single block read
EXT4-fs error (device mmcblk2p2): ext4_iget:4643: inode #53: comm init: bad extended attribute block 446800
EXT4-fs error (device mmcblk2p2): ext4_find_dest_de:1842: inode #2: block 1197: comm systemd: bad entry in directory: directory entry across range - offset=0(0), inode=4158080121, rec_len=61308, name_len=0
systemd[1]: Failed to mount sysfs at /sys: No such file or directory

EXT4-fs error (device mmcblk2p2): ext4_find_dest_de:1842: inode #2: block 1197: comm systemd: bad entry in directory: directory entry across range - offset=0(0), inode=4158080121, rec_len=61308, name_len=0
systemd[1]: Failed to mount proc at /proc: No such file or directory
EXT4-fs error (device mmcblk2p2): ext4_find_dest_de:1842: inode #2: block 1197: comm systemd: bad entry in directory: directory entry across range - offset=0(0), inode=4158080121, rec_len=61308, name_len=0
[!!!!!!] Failed to mount early API filesystems, freezing.
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004

Thanks & Regards

- Rishi

larry2
Contributor I

Ext4 is attempting to protect the filesystem from underlying block storage errors. As suggested look to the underlying driver or hardware. I have seen this many times, but on servers running in distributed block storage systems when there was an underlying failure of the virtual block device. Ext4 would have a similar failure and then remount RO. 

TomE
Specialist II

Larry is right. You're running EXT4, and that means there are a lot of layers between the filesystem and what is going wrong. Have you checked in "dmesg" to see what errors are being logged? Have you checked to see if the system logs errors anywhere else? You may be able to build a version of the OS (or the drivers) that turns logging on (or turns more of it on) so you can see what is going wrong.

Otherwise you want to exercise the "Raw SD Device" so you can see the actual errors. You can't do that if you're booting from it. Can you boot from anything else so you can do this sort of test?
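
As a starting point for the dmesg check suggested above, here is a minimal Python sketch that filters a captured kernel log for MMC/SDHCI and EXT4 messages. The patterns are assumptions based only on the messages quoted in this thread, not a complete list of what the kernel can emit, and the function name is hypothetical.

```python
import re

# Assumed patterns, taken from the log lines quoted in this thread.
STORAGE_PATTERNS = re.compile(
    r"(mmc\d+:|mmcblk\d+|sdhci|EXT4-fs (error|warning))", re.IGNORECASE
)


def storage_messages(log_text):
    """Return the lines of a kernel log that mention MMC/SDHCI or EXT4 problems."""
    return [line for line in log_text.splitlines() if STORAGE_PATTERNS.search(line)]
```

One way to use it is to save the log on the target with `dmesg > dmesg.txt`, copy the file off the board, and pass its contents to `storage_messages()`.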

> we don't have devmem

If your system is running Busybox then in the setup program for that there's an option to build devmem. It would be worth learning how to add that in as there may be other optional items in Busybox that might be useful for you to have.

> I think what I need is in the ref doc IM6DQRM.pdf, chapter 36, page 1901.

> Where it lists the absolute hex addresses of the IOMUX registers. Does this sound correct?

Yes. You haven't read that already? I would recommend reading the whole thing cover to cover, or at least reading the "introduction" sections of every chapter. Then when things go wrong and you get cryptic messages in "dmesg" (or equivalent) you'll know where to look.

> if we build for our custom board (with appropriate changes) that we see the issue.

You may have bugs in that. I would recommend dumping *ALL* of the "36.4 IOMUXC Memory Map/Register Definition" registers - all 596 of them from 0x20e0000 to 0x20e094c. Dump that to a file for the Sabre build and then do it again for your custom build. Then "diff" the files and try to justify every difference. If you find one or more you can't justify then you might have found your problem.

The OS doesn't have any requirement to show the addresses that are in the Reference Manual via devmem or /dev/mem. It is allowed to remap the memory to some other address. So watch out for that.
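
The dump-and-diff step can be sketched in Python as follows. This assumes each dump is plain text with one "address value" pair per line in hex (the format is an assumption; adapt the parser to whatever your dump tool prints), and the function names are hypothetical.

```python
def parse_dump(text):
    """Parse lines of the form '<hex address> <hex value>' into {address: value}."""
    regs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            regs[int(parts[0], 16)] = int(parts[1], 16)
        except ValueError:
            continue  # skip prompts, headers and other non-register lines
    return regs


def diff_dumps(reference_text, custom_text):
    """Return sorted (address, reference value, custom value) for every mismatch.

    A value of None means the address is missing from that dump.
    """
    ref = parse_dump(reference_text)
    cus = parse_dump(custom_text)
    return [
        (addr, ref.get(addr), cus.get(addr))
        for addr in sorted(set(ref) | set(cus))
        if ref.get(addr) != cus.get(addr)
    ]
```

Each tuple returned by `diff_dumps()` is one register that differs between the two builds and therefore one difference to justify.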

Tom

TomE
Specialist II

We're using an i.MX53. Its IOMUX Pad registers start at 0x53fa8348. So...

# for i in 0 1 2 3 4 5 6 7 8 9 a b c d e f; \
>  do for j in 0 4 8 c; \
>  do /oldroot/bin/echo -n 0x53fa83$i$j " "; \
>  devmem 0x53fa83$i$j; \
>  done; done
0x53fa8300  0x00000001
0x53fa8304  0x00000001
0x53fa8308  0x00000001
0x53fa830c  0x00000001
0x53fa8310  0x00000001
0x53fa8314  0x00000001
0x53fa8318  0x00000004
0x53fa831c  0x00000001
0x53fa8320  0x00000012
0x53fa8324  0x00000012
0x53fa8328  0x00000001
0x53fa832c  0x00000001
0x53fa8330  0x00000001
0x53fa8334  0x00000001
0x53fa8338  0x00000001
0x53fa833c  0x00000001
0x53fa8340  0x00000001
0x53fa8344  0x00000001
0x53fa8348  0x000001E4   <---- IOMUXC_SW_PAD_CTL_PAD_GPIO_19
0x53fa834c  0x000001E4
0x53fa8350  0x000001C4
0x53fa8354  0x000001E4
0x53fa8358  0x000001E4
0x53fa835c  0x000001E0
0x53fa8360  0x000001C0
0x53fa8364  0x000001E4
0x53fa8368  0x000001E4
0x53fa836c  0x000001E0
0x53fa8370  0x000001C0
0x53fa8374  0x00020000
0x53fa8378  0x00000401
0x53fa837c  0x00000401
0x53fa8380  0x00000400
0x53fa8384  0x00000400
0x53fa8388  0x000005E5
0x53fa838c  0x00000400
0x53fa8390  0x00000400
0x53fa8394  0x00000400
0x53fa8398  0x00000400
... And so on ...

I tried different variations on using "dd", "od" and "/dev/mem" and couldn't get it to work. So your best option is to enable "devmem" in your busybox configuration and use that like I did above.
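
If Python is available on the target (an assumption), mapping /dev/mem is another possible route. The sketch below is untested on i.MX hardware: it needs root, the kernel may restrict /dev/mem access, and there is no guarantee that the copy Python performs results in a single 32-bit bus access, so devmem remains the more trustworthy tool. The function names are hypothetical.

```python
import mmap
import os
import struct

PAGE_SIZE = mmap.ALLOCATIONGRANULARITY  # mmap offsets must be a multiple of this


def split_address(phys_addr):
    """Split a physical address into (page-aligned base, offset within that page)."""
    base = phys_addr & ~(PAGE_SIZE - 1)
    return base, phys_addr - base


def read_reg32(phys_addr, dev="/dev/mem"):
    """Read one 32-bit little-endian word at phys_addr by mapping /dev/mem."""
    base, offset = split_address(phys_addr)
    fd = os.open(dev, os.O_RDONLY | os.O_SYNC)
    try:
        # Map a single page read-only, starting at the aligned base address.
        with mmap.mmap(fd, PAGE_SIZE, mmap.MAP_SHARED, mmap.PROT_READ, offset=base) as mem:
            return struct.unpack_from("<I", mem, offset)[0]
    finally:
        os.close(fd)
```

For example, `hex(read_reg32(0x53fa8348))` would correspond to the `devmem 0x53fa8348` call in the loop above.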

Tom

TomE
Specialist II

Good suggestions from Igor.

I'd suggest more tests on the hardware. You should be able to find a way to make it run with a slower clock. That should make it work if the hardware (tracks or pin setup) is less than ideal. SD cards and controllers have to support "single wire mode" where they use one data pin instead of four. There should be a way to force it back to this mode and test it like that.

Don't trust the pin setup. Even if you think it is correct and think you're looking at the code that configures the pins, you may be mistaken.

Here are previous examples showing how badly wrong the distributions and the platform layer can actually be set up:

Here's the "top level":

https://community.nxp.com/message/593611

That points to these:

https://community.nxp.com/message/608067
https://community.nxp.com/thread/384340

So how can you tell how the pins are REALLY set up? The only true way is to read the pin setup and mux registers back and decode them yourself. Does your distribution have "devmem"? If not, it probably has /dev/mem, and you should be able to use dd on that to read the registers back:

https://unix.stackexchange.com/questions/464744/how-to-use-dd-if-dev-mem-in-place-of-devmem

Tom

dcrutchley
Contributor II

Hi Tom,

I've read through your posts; they are certainly things we can try. The odd thing is that if we build for our Freescale SabreSD dev board everything works fine; it is only when we build for our custom board (with appropriate changes) that we see the issue. We have tried as much as possible to match the Sabre board's design.

If we were to use dd (we don't have devmem) to inspect what is set (excuse the novice question) how do I find what addresses to look at for the various pin setup and mux registers for our SD card slots?

I think what I need is in the ref doc IM6DQRM.pdf, chapter 36, page 1901, where it lists the absolute hex addresses of the IOMUX registers. Does this sound correct?

dcrutchley
Contributor II

Thanks Tom,


We're still struggling with this so I'll read through those other posts and look into doing deeper tests. We are also waiting for revision 2 of our test board to arrive so when that is available we'll continue to test.

dcrutchley
Contributor II

So I have tried a build with Yocto from code aurora (latest build) and modified for our card and we have the same issue.

I have also, with our previous builds, now tried all combinations of Linux kernel 4.9, 4.14, 4.19 and 5.0 using ext3 and ext4 for the root file system and the problem persists.

Having discussed this with our electronics engineers, they are doing a redesign of the board. They are going to length-tune the SD lines, even though the lines are less than 3 inches long so technically shouldn't need length tuning. They are also moving the SD card to a different position on our board, as it was quite close to a large oscillator.

If it isn't hardware/layout then could there be something we've missed when porting the device tree?

In u-boot in our ported board.c file (adapted from the sabre SD code) we have the same pad control as on the SabreSD, which we think should work with our microSD slot:

#define USDHC_PAD_CTRL (PAD_CTL_PUS_47K_UP |        \
    PAD_CTL_SPEED_LOW | PAD_CTL_DSE_80ohm |            \
    PAD_CTL_SRE_FAST  | PAD_CTL_HYS)

The I/O mux is set as follows:

static iomux_v3_cfg_t const usdhc4_pads[] = {
    IOMUX_PADS(PAD_SD4_CLK__SD4_CLK   | MUX_PAD_CTRL(USDHC_PAD_CTRL)),
    IOMUX_PADS(PAD_SD4_CMD__SD4_CMD   | MUX_PAD_CTRL(USDHC_PAD_CTRL)),
    IOMUX_PADS(PAD_SD4_DAT0__SD4_DATA0 | MUX_PAD_CTRL(USDHC_PAD_CTRL)),
    IOMUX_PADS(PAD_SD4_DAT1__SD4_DATA1 | MUX_PAD_CTRL(USDHC_PAD_CTRL)),
    IOMUX_PADS(PAD_SD4_DAT2__SD4_DATA2 | MUX_PAD_CTRL(USDHC_PAD_CTRL)),
    IOMUX_PADS(PAD_SD4_DAT3__SD4_DATA3 | MUX_PAD_CTRL(USDHC_PAD_CTRL)),
};

In the kernel device tree we have the following (adapted from the sabre SD tree):

pinctrl_usdhc4: usdhc4grp {
            fsl,pins = <
                MX6QDL_PAD_SD4_CMD__SD4_CMD        0x17059
                MX6QDL_PAD_SD4_CLK__SD4_CLK        0x10059
                MX6QDL_PAD_SD4_DAT0__SD4_DATA0        0x17059
                MX6QDL_PAD_SD4_DAT1__SD4_DATA1        0x17059
                MX6QDL_PAD_SD4_DAT2__SD4_DATA2        0x17059
                MX6QDL_PAD_SD4_DAT3__SD4_DATA3        0x17059
            >;
        };

&usdhc4{
    pinctrl-names = "default";
    pinctrl-0 = <&pinctrl_usdhc4>;
    bus-width = <4>;
    non-removable;
    disable-wp;
    status = "okay";
};

One thing we have noticed is that the value 0x17059 in the pinctrl DTSI snippet decodes as follows:

<bit 16> to <bit 0>
1 - Schmitt triggered
01 - 47kOhm pullup
1 - pullup
1 - pullup enabled
0 - CMOS output
000 - ?
01 - medium (100-150MHz)
011 - 50Ohm (80 Ohm if DDR)
00 - ?
1 - fast slew rate

This seems to set the speed to medium (100-150MHz) but in u-boot it is set as PAD_CTL_SPEED_LOW (equivalent to 50MHz). We would have expected the setting to be 50MHz in the device tree also. Why is the SabreSD setup like this (with a difference between its u-boot and kernel settings)?

We have tried changing the device tree values to use low speed but it made no difference. We also have hardware pull-ups of 10 kOhm in parallel with the software pull-ups (those set in u-boot/device tree) so we have also tried removing these from our board but this also didn't help.
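
The manual decode above can be reproduced with a small Python sketch. The bit positions follow the breakdown given in this post; the short field names (HYS, PUS, PUE, PKE, ODE, SPEED, DSE, SRE) are assumed abbreviations, and the sketch only extracts the raw field values without interpreting them, so check the meanings against the reference manual.

```python
# Field layout as listed in the post above: (name, lowest bit, width in bits).
# The names are assumed abbreviations, not taken from this thread.
PAD_FIELDS = [
    ("HYS", 16, 1),    # Schmitt trigger / hysteresis
    ("PUS", 14, 2),    # pull-up/down strength select
    ("PUE", 13, 1),    # pull selected
    ("PKE", 12, 1),    # pull enabled
    ("ODE", 11, 1),    # open drain (0 = CMOS output)
    ("SPEED", 6, 2),   # speed setting
    ("DSE", 3, 3),     # drive strength
    ("SRE", 0, 1),     # slew rate
]


def decode_pad(value):
    """Return the raw value of each field in a pad control word."""
    return {
        name: (value >> shift) & ((1 << width) - 1)
        for name, shift, width in PAD_FIELDS
    }
```

Calling `decode_pad(0x17059)` and `decode_pad(0x10059)` shows that the CMD/DATA and CLK settings differ only in the pull fields (PUS, PUE, PKE), and `decode_pad(0x17019)` differs from `decode_pad(0x17059)` only in SPEED.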

igorpadykov
NXP Employee

For software, one can look at a similar issue described in "eMMC 8GB to 4GB - crash on linux (yocto) boot".

One can also try different SD cards.

Just as a test, one can try a bare-metal USDHC read/write test using the SDK (the zip can be found via the link "SMP Enable in IMX6").

Best regards
igor

igorpadykov
NXP Employee

Hi Duncan

Corruption may be caused by suddenly turning off the board power supply; it is necessary to use the Linux poweroff command (see "shutdown(8) - Linux manual page").

Errors may also be caused by the SD lines layout; see sect. 3.6.8 "High speed signal routing recommendations" in the i.MX6 System Development User’s Guide:

https://www.nxp.com/docs/en/user-guide/IMX6DQ6SDLHDG.pdf

It is recommended to use Linux from the NXP source.codeaurora.org/external/imx/linux-imx repository:
https://source.codeaurora.org/external/imx/linux-imx/tree/?h=imx_4.14.98_2.0.0_ga

Best regards
igor

dcrutchley
Contributor II

Thanks for the response. This isn't caused by powering off. As described, this happens midway through extracting a large tar file. And images built from this source for the sabresd board work fine on the sabresd; we see no corruption when running on there.

I have since rebuilt images using kernel 4.9, 4.14 and 5.0 and we see the same issues on our board with each of these but again not on the sabresd when we use equivalent builds.

On Monday I'll try using that code aurora branch. I have now built it without any of our porting changes for our board so that I could try it on our sabresd board first. However, the build hit an error during the final step, when it builds the SD card image. I did a full clean and rebuilt, and the same error occurred. I think the u-boot, SPL and kernel all built though, so I guess I'll flash them manually.

Kind regards 

Duncan

igorpadykov
NXP Employee

Hi Duncan

Since corruption is observed only on the custom board, this may be due to signal integrity issues caused by the board layout of the SD signals.

Best regards
igor
