We're currently working with an i.MX6DP-based device and have encountered a strange yet critical issue.
Some of our devices (i.MX6DP rev1.1) hang when reaching a temperature of ~74-79C and then get restarted by a watchdog. At the moment of the hang, these are the last kernel messages we see:
[ 100.356301] imx-ipuv3 2400000.ipu: IPU Warning - IPU_INT_STAT_10 = 0x00080000
[ 100.623022] imx-ipuv3 2400000.ipu: IPU Warning - IPU_INT_STAT_
On the other hand, the rest of the devices work perfectly at ~90C.
Moreover, i.MX6DP rev1.0 devices didn't have this problem at all.
Could someone shed some light on this, please?
Is there per-revision errata we can check? I found this, but I don't see any revision-specific problems.
Software-wise we're using a customized Linux system based on Linux 4.9.11 from NXP.
Ok, so we managed to find a source of this problem.
Consider the following:
1. Ch. 46.4.8 "MMDC Refresh schemes": the MMDC can use a 32KHz clock to refresh the DDR. This behaviour is controlled by the MMDC_MDREF register. We confirmed with the factory that the XTALOSC 32KHz clock will be used by the MMDC.
The MMDC_MDREF value is hardcoded in the DDR3 Register Programming Aid spreadsheet from NXP and is set to use 32KHz and 8 refresh commands.
2. Ch. 72 "Crystal Oscillator (XTALOSC)" of iMX6DQPRM:
"Supply another ~32 kHz clock source based off an independent internal oscillator if there is no oscillation sensed on the RTC_XTALI bumps(contacts) (32 kHz specific feature). The internal oscillator will provide clocks to the same on-chip modules as the external 32 kHz oscillator. Automatically switch to the external oscillation source when sensed on the RTC_XTAL bumps(contacts) (32 kHz specific feature)."
The above means that XTALOSC will use an external 32KHz clock if it is sensed. Important: “sensed” is not well defined!
It turned out that in our board design we left RTC_XTALI unconnected (with the option of mounting a 0 Ohm resistor in case we wanted to use an external 32KHz clock). This resulted in a floating input on this line, which was wrongly sensed by the i.MX6 ROM as valid.
So in the end the RAM refresh timing was completely random, which could explain the weird temperature dependency.
Connecting an external 32KHz clock solved the issue for us, as did pulling the RTC_XTALI input down.
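For anyone who wants to sanity-check the numbers, here is a rough sketch of the refresh arithmetic. It assumes the MMDC issues the configured number of refresh commands on every tick of the refresh clock, and it uses the usual DDR3 average refresh interval limits (7.8 us up to 85C case temperature, 3.9 us from 85C to 95C), which are taken from the DDR3 spec and not from this thread. The function names are made up for illustration.

```python
# Assumed DDR3 average refresh interval limits (from the DDR3 spec, not from this thread).
TREFI_NORMAL_US = 7.8    # case temperature up to 85 C
TREFI_EXTENDED_US = 3.9  # case temperature 85 C to 95 C


def avg_refresh_interval_us(clock_hz: float, refreshes_per_tick: int) -> float:
    """Average time between refresh commands, in microseconds, when
    `refreshes_per_tick` refreshes are issued on every tick of the clock."""
    if clock_hz <= 0 or refreshes_per_tick <= 0:
        raise ValueError("clock_hz and refreshes_per_tick must be positive")
    return 1e6 / (clock_hz * refreshes_per_tick)


def meets_refresh_limit(clock_hz: float, refreshes_per_tick: int,
                        limit_us: float = TREFI_NORMAL_US) -> bool:
    """True if the average refresh interval is within the given limit."""
    return avg_refresh_interval_us(clock_hz, refreshes_per_tick) <= limit_us


# A healthy 32.768 kHz clock versus hypothetical slower "sensed" clocks.
for label, hz in (("nominal", 32768.0), ("half rate", 16384.0), ("quarter rate", 8192.0)):
    interval = avg_refresh_interval_us(hz, 8)
    print(f"{label}: {interval:.2f} us, "
          f"ok up to 85C: {meets_refresh_limit(hz, 8)}, "
          f"ok up to 95C: {meets_refresh_limit(hz, 8, TREFI_EXTENDED_US)}")
```

If these assumptions hold, 32KHz with 8 refreshes only just fits the high-temperature limit, so a refresh clock that runs slow or erratically would plausibly fail first when the board is hot.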
I guess the NXP Sample spreadsheets and tools used 32kHz because the SABRE board has the external RTC crystal installed by default.
Is there any reason why you can't use "ESDCTL_ESDREF[REF_SEL] = 2" for "fast clock cycles" or set it to zero for "64kHz"? Of course you wouldn't think of that because of the default NXP/Freescale setup, which you'd assume works.
I now find our design has the above field set to zero, for a 64kHz Clock. We didn't set that deliberately either - that's the way it came.
"Table 44-3 MMDC Clocks" lists six clocks used by this module, but the table doesn't include the 32kHz or 64kHz Refresh clocks. You've found out where the 32kHz comes from (by having to ask the factory). The big "master clock table" that is "Table 18-3. System Clocks, Gating and Override" doesn't list "ckil" or "ckil_sync_clk_root" as being an input to the MMDC either.
Do you have any idea where the 64kHz one comes from? There is no mention anywhere in the Reference Manual of anything that could be a 64 kHz clock source. Is it perhaps a frequency-doubled (every edge) version of the 32kHz one? That would have the same problems as using the 32kHz one!
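To check which refresh source a given board is actually configured for, something like the following sketch can decode a raw MMDC_MDREF value. The bit positions (REF_CNT in 31:16, REF_SEL in 15:14, REFR in 13:11, START_REF in bit 0) and the REF_SEL encodings are my reading of the register description and should be treated as assumptions; please verify them against your revision of the reference manual. The helper name is made up.

```python
# Field layout is an assumption based on the MMDC_MDREF register description;
# verify against your reference manual before relying on it.
REF_SEL_NAMES = {
    0: "64 kHz",
    1: "32 kHz",
    2: "REF_CNT (fast clock cycles)",
    3: "reserved / no periodic refresh (check the manual)",
}


def decode_mdref(value: int) -> dict:
    """Split a 32-bit MMDC_MDREF value into its (assumed) fields."""
    ref_sel = (value >> 14) & 0x3
    refr = (value >> 11) & 0x7
    return {
        "ref_cnt": (value >> 16) & 0xFFFF,
        "ref_sel": ref_sel,
        "refresh_clock": REF_SEL_NAMES[ref_sel],
        "refreshes_per_cycle": refr + 1,  # assumed encoding: 0 means 1, 7 means 8
        "start_ref": value & 0x1,
    }


# Illustrative value built from the fields: REF_SEL = 1, REFR = 7.
example = (1 << 14) | (7 << 11)
print(hex(example), decode_mdref(example))
```

On a running board the raw value would come from reading the register (for example with a memory access tool); the value above is constructed only to illustrate the "32KHz and 8 refresh commands" setting described earlier.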
If you search the manual for all references to "CKIL" you'll see it is used absolutely everywhere. It is a major player in the CPU startup and reset, and controls all sorts of things. NXP has comments all through their manuals hinting pretty strongly that "this really isn't optional, please install the crystal"...
We eventually confirmed with NXP that the 64kHz clock is indeed a frequency-doubled version of the 32kHz one. We found the cause by experimenting with all these different REF_SEL settings.
In the end, we settled on fixing the 32kHz clock and didn't explore further options.
Indeed, NXP mentions in the reference manual that "an external 32kHz clock must be used in production", but not in the hardware development guide, which is why we missed it (not the best excuse, I know). Moreover, the i.MX6 has an internal 32kHz oscillator which would do just fine for DRAM refresh (however, it's considered unstable and NXP doesn't recommend using it; see "I.MX6 MMDC 32Khz refresh clock").
You're right that this clock plays a major role sometimes: in our case, it affected the WDOG (which could explain the random reboot timings when the original issue happened) and broke HDMI CEC functionality.
Here's some more information about the problem.
When PRE/PRG are disabled, the above-mentioned issue doesn't happen.
However, we started to see prints like this:
mxc_sdc_fb mxc_sdc_fb.0: timeout when waiting for flip irq
This results in no output on the screen. We've also noticed that these prints start appearing within the same temperature range.
I checked here, and I saw others having this issue, but none of the suggested workarounds helped.
What is "PRE/PRG"? You didn't mention that before. I'm guessing that's the Prefetch engine (which you've said has given problems), and it looks to be related to the graphics.
Whatever it is, it took hundreds and hundreds of lines of changes to the code to "support".
If turning it off fixes your problem, does it fix all of your problems? Do you need it on?
> I checked here, and I saw others having this issue,
Can you paste in links to the other posts? Or at least good search strings to find them? Are the others having the same "1.0 vs 1.1" problems? The same temperature problems? The same "fb_mxc" problems, or something else?
If it really runs properly after being restarted at a high temperature, then that's really suspicious and suggestive.
Check the CPU Core voltages. Check the clocking. Make sure your oscillator isn't playing up.
I'm out of ideas.
Thanks for replying!
Turning PRG/PRE off fixes the hang, yes, but introduces other problems. As I mentioned above, at some point we stop seeing an image on the screen and see prints like these:
"mxc_sdc_fb mxc_sdc_fb.0: timeout when waiting for flip irq"
By "others having the same problems" I was referring to the above-mentioned behavior. This one is the most similar, however, I saw other people complaining about the problem here and there. I don't know how it's related to the original issue, if at all, but there's something weird going on with the IPU part.
We checked the CPU core voltages and they're within limits. Our external oscillator seems to be fine too.
The big problem is that we don't understand why it's happening on some of our boards and not on others. We verified with X-ray that there are no differences, and tried replacing the RAM and the i.MX6 SoC on the faulty ones. No luck.
One thing I found interesting is:
"The ipu_panic signal can be used for indicating about errors that are result of data rate problems. Such problems may be a result of the IPU running in slower clock then required by the use case. This signal can be used in order to indicate the system that the IPU can't handle the desired data rate. In that case the system may need to increase the clock to the IPU or simplify the use case." iMX6DQPRM.pdf p. 2858
And then in EB810 (i.MX 6Dual/6Quad and i.MX 6DualPlus/6QuadPlus Applications Processor Comparison), Chapter 5 "Clock Generation" it says that a faster clock is available (and used by default) for GPU (528MHz). So I was wondering if we could re-use the same clock for IPU since our use case is to render a web page (GPU involved) and then display it?
Anyhow, thanks a bunch for your help!
one can try with nxp demo images from
i.MX6DP rev1.1 fixed only security related errata
Thanks for your reply!
Unfortunately, we can't try the NXP demo images since they're Sabre board-specific.
After debugging, I've got some extra information: if we reset the device immediately after the hang and it boots at a temperature above ~77C (because it had no time to cool down), we don't see the issue anymore, and everything works well even at 95/97C!
I've tried disabling the Thermal Driver completely - it didn't change the situation.
Also, sometimes I see these prints (according to the docs they come from the Prefetch engine):
imx-pre 21c8000.pre: handshake abort
imx-pre 21c8000.pre: handshake abort
imx-pre 21c8000.pre: handshake abort
Any idea what can be wrong? Don't know if it's related...
Is that 79C Ambient, 79C on the top of the CPU or 79C Die Temperature? If you're at 79C Ambient, then the CPU may be running a lot hotter than that, and may be beyond the maximum temperature.
Read the following, which was a similar problem on an i.MX53. The Memory Technology is similar, so it is probably a similar problem.
In our case we found that some boards worked OK over a wide temperature range, while others failed at high and low temperatures. The "Good Ones Versus Bad Ones" depended CYCLICALLY on the Serial Number! This looked like a difference in the internal layer thicknesses in different parts of the large printed circuit Panel that had multiple boards on it. That probably gave different track impedances for the different boards.
That matches your symptoms, of "some boards are OK and others aren't".
We were able to make all boards reliable by changing the Drive Strength and impedance settings from "5" to "6". That wasn't helped by the manual which wrongly stated that "5" and "6" had the same "Drive Strength" due to an uncorrected cut-and-paste bug in one of the tables.
DDR2 and DDR3 Memory Systems are meant to be "recalibrated" at very frequent intervals. The Memory Controller is meant to be told (about once per second!) to measure various impedance and delay parameters and to then correct them.
This is probably critical in something like a Desktop computer where the memory is on multiple DIMMs that are plugged into sockets on a long bus. On an embedded system with one or more memory chips situated right next to the CPU you can probably get away with not doing this at all, or only partially doing it.
That's the "theory", but I couldn't find anything in the Linux Port for our board that actually did that. Even the Bootstrap avoided doing a "Calibration". The boards are set up ONCE by the Bootstrap with fixed parameters, and then they don't change them. In your case (with it working when restarted from hot), it is likely that the Bootstrap runs "one calibration pass" when it starts, and then it stays the same from there. So a "hot start" sets the parameters appropriate to that temperature. But not if the temperature then changes.
The way to investigate this properly is to run the "Memory Stress Test" for your CPU and then graph "Maximum memory Frequency Versus Temperature". You can see one of those in the linked post. In that one the temperature is that measured on the top of the CPU package with a thermocouple. Yes, I went up to 110C!
Without running this sort of test on your design, using multiple boards and at multiple temperatures you have no idea of the "Margin" that you have available in your product. If you're running a 400 MHz Bus and it passes the test at 520 MHz at all temperatures, then you have a good margin. If the maximum frequency drops to 420 MHz (like it did for us) then there's no margin, and if it drops to 350 MHz then it certainly won't work.
The boards that failed the memory test at high temperatures failed at lower temperatures when running Linux. That's why you need a good "margin" on the memory test.
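The margin arithmetic above is simple enough to put in a few lines. This is only a sketch using the example frequencies from this post (400 MHz bus, stress test passing at 520, 420 or 350 MHz); the function name is made up.

```python
def frequency_margin_pct(max_pass_mhz: float, operating_mhz: float) -> float:
    """Margin between the highest frequency that passes the stress test and
    the frequency the product actually runs at, as a percentage of the latter."""
    if operating_mhz <= 0:
        raise ValueError("operating_mhz must be positive")
    return 100.0 * (max_pass_mhz - operating_mhz) / operating_mhz


OPERATING_MHZ = 400.0
for max_pass in (520.0, 420.0, 350.0):
    margin = frequency_margin_pct(max_pass, OPERATING_MHZ)
    verdict = "fails outright" if margin < 0 else "passes"
    print(f"max passing {max_pass:.0f} MHz: margin {margin:+.1f}% ({verdict})")
```

How much positive margin is "enough" is a judgment call; the point is to take the worst value over all boards and all temperatures, not the typical one.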
(Edit) And another complication with your setup is that it looks like something (probably the boot) runs one "calibration pass" on startup. So the memory setup depends on the temperature when that pass was run. So you should test the boards powered on at 20C and then tested at hot and cold temperatures. Then repeat, powering up "hot" and running up and down from there. So instead of graphing "maximum clock speed versus temperature" as a 2D graph you need "maximum clock speed versus starting temperature versus test temperature" as a 3D graph.
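One way to keep track of that "3D" data is a table keyed by (power-on temperature, test temperature). The sketch below only builds the list of combinations to run and picks the worst measured result; the temperatures are placeholders and the function names are made up, so adjust them to your chamber and product range.

```python
from itertools import product


def build_test_plan(start_temps_c, test_temps_c):
    """Every (power-on temperature, test temperature) pair to be measured."""
    return list(product(start_temps_c, test_temps_c))


def worst_case(results):
    """Return ((start_c, test_c), max_pass_mhz) for the lowest passing
    frequency. `results` maps (start_c, test_c) to the maximum passing MHz."""
    if not results:
        raise ValueError("no results recorded")
    pair = min(results, key=results.get)
    return pair, results[pair]


# Placeholder temperatures, purely for illustration.
plan = build_test_plan([20, 85], [-20, 20, 85])
for start_c, test_c in plan:
    print(f"power on at {start_c} C, run stress test at {test_c} C")
```

Each pair means: power the board on (so the boot-time calibration runs) at the first temperature, then move the chamber to the second temperature and run the memory stress test there.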
You really want the system continuously calibrating the memory all the time. Good luck with that, and let us know if you find it.
I think that all the "Thermal Driver" lets you do is to measure the temperature of the CPU and make it available to "User Space" code. I don't think it has anything to do with controlling or running calibration passes on the memory system.
Here's 50 pages on how to calibrate your memory system. Someone had better have gone through all of this for your board.
The following may be of interest (and mentions changing temperatures). It is possible your Boot isn't setting up as advised:
Once the MMDC and DDR device have been initialized, ZQ calibration should be placed in automatic mode to ensure proper operation. If ZQ calibration mode is not placed in automatic mode, silicon temperature changes, such as the initial rise in temperature due to start up, may degrade DDR electrical signals and cause memory errors.
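To check whether a boot setup follows that advice, a sketch like the one below can decode the ZQ mode from the MMDC ZQ hardware control register (MPZQHWCTRL). I'm assuming the ZQ_MODE field sits in bits 1:0 and that the value 3 is the periodic ("automatic") mode; the mode descriptions are paraphrased from memory, so check all of this against the reference manual. The helper names are made up.

```python
# ZQ_MODE position and encodings are assumptions; verify against the manual.
ZQ_MODES = {
    0: "no ZQ calibration issued",
    1: "calibration only when exiting self-refresh",
    2: "periodic calibration of the external DDR device only",
    3: "periodic calibration of both the i.MX ZQ pad and the DDR device",
}


def zq_mode(mpzqhwctrl: int) -> int:
    """Extract the (assumed) ZQ_MODE field from a raw MPZQHWCTRL value."""
    return mpzqhwctrl & 0x3


def zq_is_automatic(mpzqhwctrl: int) -> bool:
    """True if the register value selects the fully periodic mode."""
    return zq_mode(mpzqhwctrl) == 3


# Hypothetical register values, for illustration only.
for raw in (0x00000001, 0x00000003):
    print(hex(raw), "->", ZQ_MODES[zq_mode(raw)],
          "| automatic:", zq_is_automatic(raw))
```

The raw value would come from the DDR init script (the DCD table in the bootloader) or from reading the register on a running board.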