Is that 79C Ambient, 79C on the top of the CPU or 79C Die Temperature? If you're at 79C Ambient, then the CPU may be running a lot hotter than that, and may be beyond the maximum temperature.
Read the following thread, which describes a similar problem on an i.MX53. The Memory Technology is similar, so the cause is probably similar too.
https://community.nxp.com/thread/355451
In our case we found that some boards worked OK over a wide temperature range, while others failed at high and low temperatures. The "Good Ones Versus Bad Ones" depended CYCLICALLY on the Serial Number! This looked like a difference in the internal layer thicknesses in different parts of the large printed circuit Panel that had multiple boards on it. That probably gave different track impedances for the different boards.
That matches your symptoms, of "some boards are OK and others aren't".
We were able to make all boards reliable by changing the Drive Strength and impedance settings from "5" to "6". That wasn't helped by the manual which wrongly stated that "5" and "6" had the same "Drive Strength" due to an uncorrected cut-and-paste bug in one of the tables.
DDR2 and DDR3 Memory Systems are meant to be "recalibrated" at very frequent intervals. The Memory Controller is meant to be told (about once per second!) to measure various impedance and delay parameters and to then correct them.
This is probably critical in something like a Desktop computer where the memory is on multiple DIMMs that are plugged into sockets on a long bus. On an embedded system with one or more memory chips situated right next to the CPU you can probably get away with not doing this at all, or only partially doing it.
That's the "theory", but I couldn't find anything in the Linux Port for our board that actually did that. Even the Bootstrap avoided doing a "Calibration". Our boards are set up ONCE by the Bootstrap with fixed parameters, and nothing changes them after that. In your case (with it working when restarted from hot), it is likely that the Bootstrap runs "one calibration pass" when it starts, and then the settings stay the same from there. So a "hot start" sets the parameters appropriate to that temperature, but they are no longer appropriate if the temperature then changes.
The way to investigate this properly is to run the "Memory Stress Test" for your CPU and then graph "Maximum memory Frequency Versus Temperature". You can see one of those in the linked post. In that one the temperature is that measured on the top of the CPU package with a thermocouple. Yes, I went up to 110C!
Without running this sort of test on your design, using multiple boards and at multiple temperatures, you have no idea of the "Margin" that you have available in your product. If you're running a 400 MHz Bus and it passes the test at 520 MHz at all temperatures, then you have a good margin. If the maximum frequency drops to 420 MHz (like it did for us) then there's effectively no margin, and if it drops to 350 MHz then it certainly won't work.
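To put a number on that "Margin", here is a minimal Python sketch. The function name and the table of measurements are made up for illustration (they are NOT our real data); the idea is just to take the lowest passing frequency across all temperatures and compare it with the bus frequency you actually run at.

```python
def frequency_margin(max_pass_mhz, operating_mhz):
    """Fractional margin of the worst-case passing frequency over the bus frequency.

    max_pass_mhz maps temperature (deg C) -> highest frequency (MHz) at which
    the memory stress test still passed at that temperature.
    """
    worst = min(max_pass_mhz.values())
    return (worst - operating_mhz) / operating_mhz


# Hypothetical measurements for one board, illustrative values only.
measurements = {-20: 520.0, 25: 540.0, 85: 420.0}
margin = frequency_margin(measurements, operating_mhz=400.0)
print(f"worst-case margin: {margin:.1%}")
```

A negative result means the board fails the stress test at its normal operating frequency at some temperature. Do this per board, since the whole point is that boards differ.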
The boards that failed the memory test at high temperatures failed at lower temperatures when running Linux. That's why you need a good "margin" on the memory test.
(Edit) Another complication with your setup is that it looks like something (probably the boot) runs one "calibration pass" on startup, so the memory setup depends on the temperature when that pass was run. So you should power the boards on at 20C and then test them at hot and cold temperatures. Then repeat, powering up "hot" and running up and down from there. So instead of graphing "maximum clock speed versus temperature" as a 2D graph you need "maximum clock speed versus starting temperature versus test temperature" as a 3D graph.
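As a rough sketch of how that test plan could be organised (Python, with hypothetical function names and made-up temperatures and results), you enumerate every "power-on temperature, test temperature" pair and then look for the worst cell in the resulting grid:

```python
from itertools import product


def build_test_plan(start_temps_c, test_temps_c):
    """Every (power-on temperature, test temperature) pair, grouped by power-on temperature."""
    return list(product(start_temps_c, test_temps_c))


def worst_case(results):
    """results maps (start_c, test_c) -> max passing MHz; return the lowest entry."""
    return min(results.items(), key=lambda item: item[1])


# Hypothetical temperatures; use whatever range your product is specified for.
plan = build_test_plan(start_temps_c=[20, 85], test_temps_c=[-20, 20, 85])
for start_c, test_c in plan:
    print(f"power on at {start_c}C, run stress test at {test_c}C")

# Hypothetical results, as they would be filled in by hand from the stress test.
results = {(20, 85): 420.0, (85, -20): 390.0, (85, 85): 510.0}
print("worst case:", worst_case(results))
```

The worst cell of that grid, not the average, is what sets your margin.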
You really want the system continuously calibrating the memory all the time. Good luck with that, and let us know if you find it.
I think that all the "Thermal Driver" lets you do is to measure the temperature of the CPU and make it available to "User Space" code. I don't think it has anything to do with controlling or running calibration passes on the memory system.
https://www.kernel.org/doc/Documentation/thermal/sysfs-api.txt
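For what it's worth, reading that temperature from User Space looks something like the following Python sketch. The sysfs "temp" attribute is reported in millidegrees Celsius. The zone number (thermal_zone0) is an assumption; check which zone is the CPU on your board. Note that this only READS the temperature and does nothing to the memory controller.

```python
from pathlib import Path


def read_zone_temp_c(zone_dir="/sys/class/thermal/thermal_zone0"):
    """Read a thermal zone's temperature and return it in degrees C.

    The sysfs 'temp' attribute holds an integer in millidegrees Celsius.
    The default zone directory is an assumption, not specific to any board.
    """
    raw = Path(zone_dir, "temp").read_text().strip()
    return int(raw) / 1000.0


try:
    print(f"CPU temperature: {read_zone_temp_c():.1f} C")
except OSError as err:
    # No such thermal zone on this machine, or not running on Linux.
    print(f"could not read thermal zone: {err}")
```

Logging this alongside the memory test results tells you the die temperature at the point of failure, rather than just the chamber temperature.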
Here's 50 pages on how to calibrate your memory system. Someone had better have gone through all of this for your board.
https://www.nxp.com/docs/en/application-note/AN4467.pdf
The following may be of interest (it mentions changing temperatures). It is possible your Boot isn't setting this up as advised:
"Once the MMDC and DDR device have been initialized, ZQ calibration should be placed in automatic mode to ensure proper operation. If ZQ calibration mode is not placed in automatic mode, silicon temperature changes, such as the initial rise in temperature due to start up, may degrade DDR electrical signals and cause memory errors."
Tom