On roughly 50% of our custom i.MX6 Solo boards we're seeing a crash when an HDMI monitor is connected. The crash occurs once the system is fully booted and the HDMI cable is inserted. Nothing useful is printed on the console port, and the crash seems to be very low-level, as the user LED (tied to a heartbeat trigger) stops flashing. Additionally, if an HDMI cable/monitor is already connected at boot, the kernel halts during boot with no useful output (see attached boot log).
Furthermore, I've been able to determine that it's not just connecting an HDMI monitor (of which I've tried many different types) that causes the crash. I can cause the crash simply by pulling the HDMI HPD pin to 5 V.
Lastly, and hopefully this is the smoking gun: Normally we use a kernel with a bundled initramfs which mounts/loads a squashfs and an aufs overlay rootfs. The crash does not occur on boards that are known to fail if I use a kernel that does not have a bundled initramfs, even though the version and defconfig are identical (obviously the boot does not complete, but it fails at the init as expected and I get an image on the HDMI).
My bundled uImage is ~9MB in size. The uImage is loaded to 0x12000000, dtb loaded to 0x18000000. Linux is based on the 3.10.17 GA release, as is the rootfs and initramfs (built in Yocto).
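As a quick sanity check on the load addresses above, the following Python sketch tests whether the uImage and dtb regions overlap in RAM. The 9 MiB image size comes from the post; the 64 KiB dtb size is an assumed placeholder. The check covers only the images as loaded, not the decompressed kernel or the unpacked initramfs, so it rules out one simple cause rather than proving the layout is safe.

```python
MIB = 1024 * 1024

def region(start, size):
    """Return a (start, end) tuple; end is exclusive."""
    return (start, start + size)

def overlaps(a, b):
    """True if the two half-open address ranges share any byte."""
    return a[0] < b[1] and b[0] < a[1]

# Addresses and uImage size are taken from the post above.
uimage = region(0x12000000, 9 * MIB)
# The dtb size is an assumption for illustration only.
dtb = region(0x18000000, 64 * 1024)

# Free space between the end of the uImage and the start of the dtb.
gap = dtb[0] - uimage[1]

print(hex(uimage[1]), overlaps(uimage, dtb), gap // MIB)
# prints: 0x12900000 False 87
```

With these numbers the loaded images sit about 87 MiB apart, so a plain load-address collision looks unlikely.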
So, my questions are:
1.) Why would a bundled initramfs cause this failure? When no HDMI is connected it seems to operate just fine.
2.) Why does this only occur on 50% of our boards? Boards from the same batches will work or fail, there does not seem to be a common hardware issue.
Thanks,
-Allan
Allan,
I remember seeing hang issues when connecting HDMI cable with 3.10.17.
Can you try a mainline kernel, such as 3.19? I never saw such an HDMI hang in mainline.
If I recall correctly, the hang did not happen if the IPU was not used in U-Boot. Do you have a splash screen enabled in U-Boot?
Please check this thread:
https://lists.yoctoproject.org/pipermail/meta-freescale/2014-July/009434.html
Hi Fabio
Thanks for bringing in your experience to this and
helping to set the right direction. Allan, I am sorry if I
misled you towards potential memory problems. I
am glad we now have a solution but it would be nice
to know the real reason and the cause of the problem.
Best regards
Sinan Akman
Agreed, I've worked around this issue on some of our boards but it would be nice to know why it doesn't affect all of them.
The root cause appears to be an HDMI PHY Frame Composer Overflow interrupt storm in Linux when HDMI has already been enabled by U-Boot. The rate of the interrupt seems to vary with temperature on some of my boards. If the rate is slow enough, the kernel will recover, fix the interrupt and boot.
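One way to picture why a slower storm is survivable: an interrupt that re-fires at rate r and costs t seconds per service consumes a fraction r*t of the CPU, and once that approaches 1 nothing else gets to run, including the heartbeat LED. The sketch below is only a toy model of that reasoning. The handler cost, the rates, and the 5% threshold are made-up illustrative numbers, not measurements from the i.MX6.

```python
def irq_cpu_fraction(rate_hz, handler_cost_s):
    """Fraction of one CPU consumed by servicing the interrupt, capped at 1.0."""
    return min(1.0, rate_hz * handler_cost_s)

def can_make_progress(rate_hz, handler_cost_s, needed_fraction=0.05):
    """True if enough CPU time is left over for the kernel to reach the
    code that clears the interrupt source (needed_fraction is hypothetical)."""
    return 1.0 - irq_cpu_fraction(rate_hz, handler_cost_s) >= needed_fraction

# Hypothetical per-interrupt service cost in seconds.
HANDLER_COST_S = 5e-6

# Hypothetical interrupt rates, from mild to a full storm.
for rate in (10_000, 150_000, 400_000):
    print(rate,
          irq_cpu_fraction(rate, HANDLER_COST_S),
          can_make_progress(rate, HANDLER_COST_S))
```

In this model a small change in rate near the limit flips the outcome from "boots" to "hangs", which would be consistent with boards from the same batch behaving differently.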
Hi Daniel,
We're facing the same issue. Could you please share the workarounds you had to apply?
Wow, that seemed to do it! Thanks Fabio!
Any word on why disabling the HDMI in u-boot would cause a crash on some boards but not others?
Hi Allan
Now this might seem to be not related but perhaps you
do have a ram issue that shows itself when you stress
it with your ramdisk usage. I never saw what you
are reporting but on the boards that I worked on
which had ddr issues (either setup or signal integrity)
it would only show up during an NFS root as this was
triggering burst mode. So I wonder, instead of your
ramdisk, if you could test those boards (or some others
which look healthier) with an NFS root file system.
If you then see similar results, I would recommend
focusing on a potential memory issue. Again, this is
perhaps not related, but I would recommend giving it
a try.
Hope this helps
Sinan Akman
Hi Sinan-
Thanks for the suggestion. I tested the offending board with the DDR3 Stress Tester using the calibration values I previously obtained and put in U-boot, and the board passed 100% over a couple hours of testing.
Also, mtest in U-boot seemed to pass easily over a range of memory values.
Do you think that's sufficient for testing memory, or do I need to run something like stressapptest? Unfortunately, running from an NFS root is tough as we don't have an Ethernet connection, just WiFi.
-Allan
Hi Allan
There was at least one case on a customer board where
the memory tester didn't catch the problem, but this
was a couple of years ago and some tester programs might
have since improved. What matters there is whether the test triggers
burst mode, and AFAIK U-Boot mtest is a simple pattern write/
read-back test. I don't remember if Stress Tester does anything
in that direction. Perhaps it also only writes/reads circulating
data and address patterns. This is something you can maybe
verify. As for only having WiFi, you probably already thought
of this, but if your WiFi is PCIe based, would it be possible
to use a PCIe Ethernet card instead? Also, if we were to
consider a possible memory issue, I wonder if you could
go over the design layout recommendations and verify them
against your board. Likewise, does running your memory
at a slower speed or relaxing some of the calibration
values make any difference? This would at least confirm
whether this is the right direction to take.
Regards
Sinan Akman
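To make the distinction above concrete, here is a minimal sketch of the kind of pattern write/read-back test described, run against an ordinary Python bytearray rather than real DDR. It is not U-Boot's mtest implementation; the function name and the choice of patterns are illustrative assumptions.

```python
import struct

def pattern_test(mem, patterns=(0x00000000, 0xFFFFFFFF, 0xAAAAAAAA, 0x55555555)):
    """Write each 32-bit pattern to every word of mem, read it back, and
    return a list of (byte_offset, expected, actual) mismatches."""
    errors = []
    words = len(mem) // 4
    for pattern in patterns:
        # Write pass: fill the whole buffer with the current pattern.
        for i in range(words):
            struct.pack_into("<I", mem, i * 4, pattern)
        # Read pass: verify every word still holds the pattern.
        for i in range(words):
            (actual,) = struct.unpack_from("<I", mem, i * 4)
            if actual != pattern:
                errors.append((i * 4, pattern, actual))
    return errors

# Stand-in for a memory region; real tests would target physical RAM.
buf = bytearray(4096)
print(len(pattern_test(buf)))
```

A test of this shape touches one word at a time in a fixed order, so a clean result says the cells hold data but, per the point made above, may say little about behaviour under the burst traffic a root filesystem workload generates.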
Hi Sinan-
I ran memtester and was unable to generate any errors after a few iterations of testing. I'm currently working on getting stressapptest into my build and up and running; hopefully that yields something different.
Regarding the layout, we did have FSL verify it and I know it meets all the design recommendations. In terms of speed, I'm currently running at 400 MHz.
Unfortunately the NFS mount isn't much of an option in the near future, as I'll need to order the adapter. In the meantime, can you think of other ways I might be able to trigger burst mode?
-Allan
Hi Allan
Unfortunately I don't know of any other practical setup
that would trigger burst mode, but if you take a look
at the datasheet of your chip it might explain how and
when this happens. I understand 400 MHz is the lowest
your chip supports? Can you change some of the
timing values for longer delays etc. according to the ranges
defined in the datasheet? FSL definitely did a good job
reviewing your design, but in the past I did see memory
problems even though the design had passed a basic layout review. If you
have time and resources I'd suggest going over the design
considerations once yourself. Between this and perhaps
modifying controller values towards the relaxed side of
the specs, you can identify if this is the right direction to
chase further.
I felt your case might be memory related, but if we step
back a moment, you originally mentioned that the problem
occurs when you unplug the HDMI cable. Does any
other action (plugging or unplugging any other interface) cause
this error? Also, can you scope the HDMI lines to see what
is happening when you pull the HDMI cable? Is there any
unusual spike or any pattern that you don't see on
the working boards? If you do a high-speed capture
you might find a hint of an anomaly. You could
also scope the DDR lines while you are removing the
HDMI cable. When the system freezes, are the memory
lines still at a sane value?
Sorry I couldn't help much but please let me know
if there is anything else I can be of any help.
Regards
Sinan Akman