First: Can someone at NXP reopen this issue?
I found the root cause of this "intermittent" issue. There appears to be a bug in the eLCDIF controller when it is reset via the SFTRST bit. The register involved is LCDIF_CTRLn (0x021c8000, reference manual section 35.6.1).
The reset sequence should be as follows:
1-Make sure the clock is not gated, i.e. make sure bit 30 (CLKGATE) is cleared.
2-Assert the SFTRST bit (bit 31).
3-The controller now starts its reset sequence and automatically sets the CLKGATE bit.
4-Wait for the CLKGATE bit to be set.
Although it is poorly documented for the eLCDIF, you can find the relevant parts in the documentation of the other control blocks; they all work similarly.
Now the interesting part: the CLKGATE bit is not always asserted! When this happens the controller really is in a defective state; trying to set the bits manually to recover the controller fails.
This issue already shows up in U-Boot. When the controller fails there, Linux fails as well, because the controller is deadlocked.
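To make the four steps concrete, here is a minimal sketch of the sequence in C. It is not the U-Boot implementation: the names reg_io, wait_bits, lcdif_soft_reset and sim_lcdif are made up for illustration, and the register is accessed through small callbacks so the same code can run against a software model. The model's "stuck" flag imitates the failure described above, where CLKGATE never gets set after SFTRST is asserted.

```c
#include <stdint.h>

#define LCDIF_CTRL_SFTRST  (1u << 31)  /* bit 31: soft reset */
#define LCDIF_CTRL_CLKGATE (1u << 30)  /* bit 30: clock gate */

/* Hypothetical register accessors: MMIO on real hardware, a model in tests. */
struct reg_io {
    uint32_t (*read)(void *ctx);
    void (*write)(void *ctx, uint32_t val);
    void *ctx;
};

/* Poll until (reg & mask) == want. Returns 0 on success, -1 on timeout. */
static int wait_bits(const struct reg_io *io, uint32_t mask, uint32_t want,
                     unsigned int max_polls)
{
    while (max_polls--) {
        if ((io->read(io->ctx) & mask) == want)
            return 0;
    }
    return -1;
}

/* Returns 0 on success, -1 if CLKGATE cannot be cleared,
 * -2 if CLKGATE is never set after SFTRST (the failure described above). */
int lcdif_soft_reset(const struct reg_io *io, unsigned int max_polls)
{
    /* Step 1: make sure the clock is not gated. */
    io->write(io->ctx, io->read(io->ctx) & ~LCDIF_CTRL_CLKGATE);
    if (wait_bits(io, LCDIF_CTRL_CLKGATE, 0, max_polls))
        return -1;

    /* Step 2: assert SFTRST. */
    io->write(io->ctx, io->read(io->ctx) | LCDIF_CTRL_SFTRST);

    /* Steps 3 and 4: the block should gate its own clock; wait for it. */
    if (wait_bits(io, LCDIF_CTRL_CLKGATE, LCDIF_CTRL_CLKGATE, max_polls))
        return -2;

    return 0;
}

/* Software model of the control register, for illustration only. */
struct sim_lcdif {
    uint32_t ctrl;
    int stuck;  /* non-zero: never set CLKGATE after SFTRST */
};

static uint32_t sim_read(void *ctx)
{
    return ((struct sim_lcdif *)ctx)->ctrl;
}

static void sim_write(void *ctx, uint32_t val)
{
    struct sim_lcdif *s = ctx;

    /* A healthy block sets CLKGATE by itself once SFTRST is written. */
    if ((val & LCDIF_CTRL_SFTRST) && !s->stuck)
        val |= LCDIF_CTRL_CLKGATE;
    s->ctrl = val;
}
```

With a healthy model lcdif_soft_reset() returns 0; with stuck set it returns -2 after max_polls reads, which corresponds to the timeout I see on the real chip. On hardware the accessors would be plain MMIO reads and writes (or the SET/CLR alias registers).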
The easiest way to show the sequence and to reproduce the problem is to modify the U-Boot sources.
The above reset sequence is implemented in the function "mxs_reset_block", which is located in "arch/arm/imx-common/misc.c".
This part times out:
if (mxs_wait_mask_set(reg, MXS_BLOCK_CLKGATE, RESET_MAX_TIMEOUT)) {
    return 1;
}
The return code is never checked by the calling function, so you won't notice the failure (apart from the splash screen not working) until Linux boots and shows nothing or garbage.
How to reproduce (In U-Boot):
In "drivers/video/mxsfb.c" the function mxs_lcd_init() calls mxs_reset_block(). Check the return code of that call, print a failure message, and patch the bootcmd environment variable with something non-existent so you can be sure Linux does not boot:
ret = mxs_reset_block(&regs->hw_lcdif_ctrl_reg);
if (ret != 0) {
    printf("Video controller reset failed: %d!\n", ret);
    setenv("bootcmd", "reset-failed"); // Just something that never boots.
    return;
}
Now put the image on your sdcard (make sure you have some boot delay) and stop the boot in the console.
Set the bootcmd variable to "reset" and do a save of the environment:
setenv bootcmd reset
saveenv
Now reset the board so it keeps looping until the failure is shown. This may take a few minutes, but it can also take half an hour.
Additional info: In my setup I use a splash screen of 1024x600 pixels with the following timings:
{
    .bus = MX6UL_LCDIF1_BASE_ADDR,
    .addr = 0,
    .pixfmt = 24,
    .detect = NULL,
    .enable = do_enable_parallel_lcd,
    .mode = {
        .name = "T700A04X00",
        .refresh = 60,
        .xres = 1024,
        .yres = 600,
        .pixclock = 20460,
        .left_margin = 144,
        .right_margin = 40,
        .upper_margin = 18,
        .lower_margin = 1,
        .hsync_len = 104,
        .vsync_len = 3,
        .sync = 0,
        .vmode = FB_VMODE_NONINTERLACED
    }
}
I checked all relevant registers to see if anything goes wrong during the reset, such as the PLL5 dividers, num/den, post dividers, clock gating, etc. All are OK as far as I can see, so it really smells like a hardware bug in the chip.
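For reference, the dot clock and refresh rate implied by the timings above can be worked out as follows. This is only a sketch: it assumes .pixclock is the pixel period in picoseconds (the usual fb_videomode convention) and that the line and frame totals are the visible size plus both margins plus the sync length. The struct lcd_timing and the two helper functions are names made up for this example.

```c
#include <stdint.h>

/* Hypothetical container for the fields used in the calculation. */
struct lcd_timing {
    uint32_t pixclock_ps;   /* pixel period in picoseconds (assumed unit) */
    uint32_t xres, yres;
    uint32_t left_margin, right_margin;
    uint32_t upper_margin, lower_margin;
    uint32_t hsync_len, vsync_len;
};

/* Dot clock in kHz: 1e9 / period_in_ps, rounded down. */
uint32_t dotclock_khz(const struct lcd_timing *t)
{
    return (uint32_t)(1000000000ULL / t->pixclock_ps);
}

/* Refresh rate in millihertz: 1e15 / (period_ps * htotal * vtotal). */
uint32_t refresh_mhz(const struct lcd_timing *t)
{
    uint64_t htotal = (uint64_t)t->xres + t->left_margin +
                      t->right_margin + t->hsync_len;
    uint64_t vtotal = (uint64_t)t->yres + t->upper_margin +
                      t->lower_margin + t->vsync_len;

    /* 64-bit math throughout: the intermediate values exceed 32 bits. */
    return (uint32_t)(1000000000000000ULL /
                      ((uint64_t)t->pixclock_ps * htotal * vtotal));
}
```

For my panel this gives 1312 x 622 total pixels per frame, a dot clock of about 48.875 MHz and a refresh rate of about 59.89 Hz, which is consistent with the .refresh = 60 entry.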
P.S. There is also a bug in the PLL5 calculation in "mxs_set_lcdclk()" in "arch/arm/cpu/armv7/mx6/clock.c": an integer overflow occurs there that results in a wrong dot clock (it is off by a few hundred kHz). I fixed that one, but it is not the cause of this issue.
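To illustrate the kind of overflow I mean, here is a generic sketch; it is not the actual U-Boot code. It assumes a fractional PLL whose numerator is computed as remainder * denominator / reference, with all values in kHz. The function names and the example values are hypothetical. With 32-bit arithmetic the intermediate product wraps around; widening it to 64 bits avoids that.

```c
#include <stdint.h>

/* 32-bit version: rem_khz * denom can exceed 2^32 and silently wraps. */
uint32_t frac_num_32(uint32_t rem_khz, uint32_t denom, uint32_t ref_khz)
{
    return rem_khz * denom / ref_khz;
}

/* 64-bit intermediate: the product is computed without wrapping. */
uint32_t frac_num_64(uint32_t rem_khz, uint32_t denom, uint32_t ref_khz)
{
    return (uint32_t)((uint64_t)rem_khz * denom / ref_khz);
}
```

For example, with a remainder of 15000 kHz, a denominator of 1000000 and a 24000 kHz reference, the exact numerator is 625000. The product 15000 * 1000000 does not fit in 32 bits, so frac_num_32() returns a much smaller, wrong value, while frac_num_64() returns 625000.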
So far I have no decent workaround for this issue.
The only workaround I can think of is to reset the board with the watchdog when I detect the clock failure, and hope that it recovers. But that is an extremely dirty one.