i.MX8MM TMU reports 112C when the board is at -40C, which causes a system reboot

wenfu
Contributor III

The TMU reports 112C when the board is at -40C, which causes a system reboot.

Output of running cat /sys/class/thermal/thermal_zone0/temp periodically:

...

0
0
0
0
112000
[ 9756.219789] System is too hot. GPU3D will work at 1/64 clock.
[ 9756.225583] hantro receive hot notification event: 1
[ 9756.230857] thermal thermal_zone0: critical temperature reached (112 C), shutting down

Message from syslogd@maaxboard at Apr 11 19:11:08 ...
kernel:[ 9756.230857] thermal thermal_zone0: critical temperature reached (112 C), shutting down
112000
[ 9756.499770] thermal thermal_zone0: critical temperature reached (112 C), shutting down

Message from syslogd@maaxboard at Apr 11 19:11:08 ...
kernel:[ 9756.499770] thermal thermal_zone0: critical temperature reached (112 C), shutting down
logout
[ 9756.763855] thermal thermal_zone0: critical temperature reached (112 C), shutting down
[ 9757.031798] thermal thermal_zone0: critical temperature reached (112 C), shutting down
[ 9757.295826] thermal thermal_zone0: critical temperature reached (112 C), shutting down
[ 9757.488292] systemd-shutdow: 55 output lines suppressed due to ratelimiting
[ 9757.560345] thermal thermal_zone0: critical temperature reached (112 C), shutting down
[ 9757.694533] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 9757.801168] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 9757.817292] systemd-journald[1939]: Received SIGTERM from PID 1 (systemd-shutdow).
[ 9757.827773] thermal thermal_zone0: critical temperature reached (112 C), shutting down
[ 9757.868861] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 9757.884837] systemd-shutdown[1]: Unmounting file systems.
[ 9757.893417] [22553]: Remounting '/' read-only in with options 'data=ordered'.
[ 9757.920306] EXT4-fs (mmcblk0p2): re-mounted. Opts: data=ordered
[ 9757.938902] systemd-shutdown[1]: All filesystems unmounted.
[ 9757.944529] systemd-shutdown[1]: Deactivating swaps.
[ 9757.949750] systemd-shutdown[1]: All swaps deactivated.
[ 9757.955012] systemd-shutdown[1]: Detaching loop devices.
[ 9757.964286] systemd-shutdown[1]: All loop devices detached.
[ 9757.969956] systemd-shutdown[1]: Detaching DM devices.
[ 9757.998893] kvm: exiting hardware virtualization
[ 9758.019747] reboot: Power down
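
The polling shown above can also be done from a small C program instead of a shell loop. The following is a minimal sketch only: the helper name read_temp_millic is hypothetical, and the only thing taken from this post is the sysfs path. The thermal zone temp file reports millidegrees Celsius, so 112000 means 112 C.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: read one thermal zone value, in millidegrees
 * Celsius, from a sysfs-style text file.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
int read_temp_millic(const char *path, long *millic)
{
    FILE *f;
    int ok;

    assert(path != NULL && millic != NULL);

    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", millic) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling this about once per second with /sys/class/thermal/thermal_zone0/temp and logging the result together with a timestamp should make it easier to correlate the jump from 0 to 112000 with the incubator temperature.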

igorpadykov
NXP Employee

Hi Wen

According to Table 9 ("Operating ranges") of the datasheet: "Sensing temperature range 10°C to 105°C".

i.MX 8M Mini Applications Processor Datasheet for Industrial Products

Best regards
igor

wenfu
Contributor III

Hi igorpadykov,

Thanks for your reply. I have read the datasheet, and I know the TMU can only report temperatures between 10C and 105C, no matter what the real temperature is.

The test condition is this: we put the board into an incubator at -40C (yes, it really is -40C), and after a while the board suddenly reports 112C. The real temperature is -40C but the board reports 112C, so the kernel thermal driver is fooled and the CPU is powered off.

If I change the kernel driver to ignore the temperature it gets, the board runs fine.

Please take a good look at this; it might come down to a hardware bug in the TMU.

igorpadykov
NXP Employee

Hi Wen

It is not a hardware bug; the TMU works normally only in the temperature range 10°C to 105°C.

Best regards
igor

wenfu
Contributor III

Hi igorpadykov,

The chip is the Industrial silicon version. The datasheet says its temperature range is -40°C to 105°C. Even the consumer version works fine between 0°C and 95°C. The most important thing is that this caused a CPU reset, which is abnormal. If the real temperature is below 0 and the TMU reports 0°C, that makes sense. But if the real temperature is below 0 and the TMU reports a value above 0, it does not make sense.

igorpadykov
NXP Employee

Hi Wen

Software causes the CPU reset, not hardware. Please check the V bit (Valid measured temperature) in software and disregard data that is not valid.

As for hardware, it is recommended to check that the processor voltages and ripple (should be <5%) comply with the datasheet requirements at -40C.

Best regards
igor

wenfu
Contributor III

Hi igorpadykov

My kernel version is linux-4.14.78 (imx_4.14.78_1.0.0_ga, 94da7bdc489ba686d868bcf80678a37cae22673e).

I have dived into the TMU code and found that the erroneous value is read here: val = tmu_read(data, &data->regs->site[data->sensor_id].tritsr), in linux-4.14.78/drivers/thermal/qoriq_thermal.c. The val returned by tmu_read is the content of the hardware register.

So I made a trick to skip this value: report 15C to the upper layer and print val, the TMU register value, to the console. Then we ran the high-low temperature cycle test in the incubator again. This trick keeps the board running when the temperature goes down to -40C. BUT at the same point in time, 112 appears, and it seems the TMU gets stuck: it reports 112 all the time, no matter what real temperature the incubator provides.

If the TMU reported the correct value again once the temperature goes up, it would make sense. But it does not. The TMU makes an irreversible mistake.

So I have to say this is not a software issue that can simply be skipped; otherwise we lose the kernel's thermal protection function and the TMU is useless.

Also, we have checked the power supply. So far, I think it is stable enough.

 

The following diff is what I have done:

diff --git a/drivers/thermal/qoriq_thermal.c b/drivers/thermal/qoriq_thermal.c
index c832ee8..a1aebb5 100644
--- a/drivers/thermal/qoriq_thermal.c
+++ b/drivers/thermal/qoriq_thermal.c
@@ -114,7 +114,9 @@ static int tmu_get_temp(void *p, int *temp)
 	struct qoriq_tmu_data *data = p;
 
 	val = tmu_read(data, &data->regs->site[data->sensor_id].tritsr);
-	*temp = (val & 0xff) * 1000;
+	//*temp = (val & 0xff) * 1000;
+	printk("temp=%d\n",(val & 0xff));
+	*temp = (15 & 0xff) * 1000;
 
 	return 0;
 }

The console output with the diff above applied; we were playing an mp4 at the same time.

[ 9372.691651] temp=0
15000
0:00:39.3 / 0:00:39.3
Reached end of play list.
Total showed frames (943), playing for (0:00:39.312561419), fps (23.987).

[ 9373.736590] temp=0
15000
[ 9374.517565] temp=0
[ 9374.767031] temp=0
15000
[ 9375.807120] temp=0
15000
[ 9376.533605] temp=0
[ 9376.851154] temp=0
15000
Press 'k' to see a list of keyboard shortcuts.
Now playing /root/4ktest.mp4
Prerolling...
====== AIUR: 4.4.4 build on Feb 7 2019 14:09:23. ======
Core: MPEG4PARSER_06.15.02 build on Nov 2 2018 10:35:58
file: /usr/lib/aarch64-linux-gnu/imx-mm/parser/lib_mp4_parser_arm_elinux.so.3.2
------------------------
Track 00 [video_0] Enabled
Duration: 0:00:39.288333000
Language: und
Mime:
video/x-h264, parsed=(boolean)true, alignment=(string)au, stream-format=(string)avc, width=(int)4096, height=(int)2304, framerate=(fraction)24000/1001, codec_data=(buffer)01640033ffe1001767640033acb400800090d0800001f480005dc0078c195001000468ce0bcb
------------------------
====== VPUDEC: 4.4.4 build on Feb 7 2019 14:09:23. ======
wrapper: 3.0.0 (VPUWRAPPER_ARM64_LINUX Build on Feb 7 2019 14:09:23[ 9377.435601] alloc_contig_range: 892 callbacks suppressed
)
vpulib: 1.1.1
firmware: 1.1.1.0
[ 9377.435607] alloc_contig_range: [54a00, 55a01) PFNs busy
[ 9377.464016] alloc_contig_range: [54b00, 55b01) PFNs busy
[ 9377.502412] alloc_contig_range: [54c00, 55c01) PFNs busy
[ 9377.508853] alloc_contig_range: [54d00, 55d01) PFNs busy
[ 9377.541848] alloc_contig_range: [54e00, 55e01) PFNs busy
[ 9377.548337] alloc_contig_range: [54f00, 55f01) PFNs busy
------------------------
Track 01 [audio_0] Enabled
Duration: 0:00:39.310000000
Language: und
Mime:
audio/mpeg, mpegversion=(int)4, channels=(int)2, rate=(int)44100, bitrate=(int)191992, stream-format=(string)raw, codec_data=(buffer)12100000000000000000000000000000
------------------------
[ 9377.769700] fsl-sai 30050000.sai: ASoC: can't open platform 30050000.sai: -6
[ 9377.881055] fsl-sai 30050000.sai: ASoC: can't open platform 30050000.sai: -6
AL lib: (EE) ALCplaybackAlsa_open: Could not open playback device 'default': No such device or address
[ 9377.898990] temp=112
15000

====== BEEP: 4.4.4 build on Feb 7 2019 14:09:23. ======
Core: AAC decoder Wrapper build on Dec 7 2017 18:13:51
file: /usr/lib/aarch64-linux-gnu/imx-mm/audio-codec/wrap/lib_aacd_wrap_arm_elinux.so.3
CODEC: BLN_MAD-MMCODECS_AACD_ARM_03.09.00_ARMV8 build on Sep 20 2017 15:02:50.
[ 9378.357132] fsl-sai 30050000.sai: ASoC: can't open platform 30050000.sai: -6
Cannot connect to server socket err = No such file or directory
Cannot connect to server request[ 9378.396762] fsl-sai 30050000.sai: ASoC: can't open platform 30050000.sai: -6
channel
jack server is not running or cannot be started
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
AL lib: (EE) ALCplaybackAlsa_open: Could not open playback device 'default': No such device or address
[ 9378.477464] alloc_contig_range: [56100, 570c1) PFNs busy
[ 9378.483800] alloc_contig_range: [56200, 571c1) PFNs busy
[ 9378.514882] alloc_contig_range: [56200, 572c1) PFNs busy
[ 9378.538579] alloc_contig_range: [56400, 573c1) PFNs busy
[ 9378.549634] temp=112
[ 9378.943250] temp=112
15000
[ 9379.979379] temp=112
15000
[ 9380.565671] temp=112
[ 9381.031229] temp=112
15000
[ 9382.059449] temp=112
15000
[ 9382.581749] temp=112
[ 9383.095829] temp=112
15000
[ 9384.162507] temp=112
15000
[ 9384.601780] temp=112
[ 9385.203396] temp=112
15000
[ 9386.239800] temp=112
1500004.2 / 0:00:39.3
[ 9386.613783] temp=112
[ 9387.279760] temp=112
15000
[ 9388.346325] temp=112
15000
[ 9388.629812] temp=112
[ 9389.403998] temp=112
15000
[ 9390.445685] temp=112
15000
[ 9390.645849] temp=112
[ 9391.508864] temp=112
15000
[ 9392.547563] temp=112
15000
[ 9392.661883] temp=112
[ 9393.603900] temp=112
15000
[ 9394.639702] temp=112
15000
[ 9394.677912] temp=112
[ 9395.675488] temp=112
15000
[ 9396.693923] temp=112
[ 9396.727687] temp=112
15000
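
Hard-coding 15C, as in the diff in this post, gives up thermal protection. A less drastic experiment is to reject only implausible samples, such as the jump from 0 to 112 between two consecutive readings in the log above, and keep reporting the last accepted value. The sketch below is plain C for illustration only; the structure, the function name, and the 20C per-sample limit are all hypothetical and are not part of qoriq_thermal.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical limit: largest believable change between two samples. */
#define MAX_STEP_C 20

struct temp_filter {
    bool have_last;   /* true once one sample has been accepted */
    int  last_c;      /* last accepted temperature, degrees C */
};

/*
 * Feed one raw reading (degrees C) and get back the value to report:
 * the new reading if it is within MAX_STEP_C of the last accepted
 * one, otherwise the last accepted value. The first sample is always
 * accepted because there is nothing to compare it with.
 */
int temp_filter_update(struct temp_filter *f, int raw_c)
{
    assert(f != NULL);

    if (!f->have_last || abs(raw_c - f->last_c) <= MAX_STEP_C) {
        f->last_c = raw_c;
        f->have_last = true;
    }
    return f->last_c;
}
```

Fed with the sequence 0, 0, 112, 112 from the log, this keeps returning 0. That also shows its limitation: a sensor that stays stuck at 112 is masked indefinitely, so this only hides the symptom, and a real design would need a timeout or a second temperature source.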

igorpadykov
NXP Employee

Hi Wen

When the temperature goes down to -40C, the TMU should not be used; you should disregard its data.

Best regards
igor

wenfu
Contributor III

Hi igorpadykov,

Can you run an experiment to confirm this issue and check whether this problem has been fixed in the new silicon version?

B.R.

jack_mao
NXP Employee

@wenfu ,

I'm curious: in this thread, the driver code you pasted from 4.14.78 is for 8M scale, not for 8MM. Have you tested this issue with newer code, such as 4.14.98 or a later version? What is the result at -40C? Thanks!

 

best regards

Jack
