We are using the MCF54417, a 250 MHz ColdFire CPU. We have been running its system clock at the maximum rate, 250 MHz, using a 50 MHz external clock. Using the internal PLL, this was multiplied up to 500 MHz, then divided down to 250 MHz.
A certain percentage of the boards we've built have been "unstable," ESPECIALLY when we put them in our (45 °C ambient) burn-in chamber. (We run all of our products 24 hours in this environment to try to shake out problem boards before we ship them.) We have not been able to identify exactly what goes wrong in the system, but the symptom is as if an exception occurred, which triggers a reboot. When (possibly) similar exceptions have happened on my desk, my In-Circuit Debugger has shed NO light on the problem--my interrupt handler for spurious interrupts issues a "halt" instruction which *should* give me an opportunity to inspect the stack, etc., but instead the JTAG/BDM interface seems to go wonky until I issue a reset.
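In case it matters, that catch-all handler is essentially the following (sketched from memory; the m68k-gcc attribute syntax and the names are approximate, not our exact source):

/*
 * Minimal sketch of our catch-all spurious/unused interrupt handler
 * (approximate; the m68k-gcc "interrupt_handler" attribute and the
 * handler name are illustrative).
 */
__attribute__((interrupt_handler))
void spurious_isr(void)
{
    /* HALT is a supervisor-only ColdFire instruction that drops the core
     * into BDM halt mode, which *should* let the attached debugger stop
     * here and inspect the stack frame. */
    __asm__ volatile ("halt");

    /* If execution ever resumes, spin so the fault isn't silently lost. */
    for (;;)
        ;
}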
This is the first PCB we have manufactured using BGA packaged chips. Only the CPU and DDR2 RAM are in BGA packages. Since some boards exhibited problems more than others, we focused on our reflow parameters in manufacturing, guessing that we might have some poor solder connections on "extra" Ground or Vdd connections under the CPU. However, we recently did destructive testing on some of these troublesome boards and, excluding the first run of boards, their solder looked very good. We no longer believe that the issues we're seeing are related to the soldering of these BGA parts.
Given that we had a really tough time getting a working configuration for the DDR2 RAM, we wondered if the RAM's configuration, routing, or solder could be an issue. However, I believe we have effectively ruled it out. As with the CPU, we have examined its solder and it looks good. Our PCB layout software does NOT have trace-length analysis, but our hardware engineer has put a fair amount of effort into keeping the traces equal. Most importantly, though: we can run the system in a mode that DOES NOT USE external RAM, and it still fails in that mode. This is our "boot loader" code, and it uses the internal 64K SRAM for stack and variables. While the DDR2 RAM is *initialized* and tested once in this boot code (roughly as sketched below), it is never accessed afterward. Units running only boot code (and thus not accessing external RAM) HAVE rebooted in our burn-in chamber.
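The once-only DDR2 check in the boot code is roughly along these lines (heavily simplified, with placeholder addresses rather than our real memory map):

#include <stdint.h>

/* Very simplified version of the one-shot DDR2 check in the boot code.
 * The base address and size are placeholders; the code and its stack
 * stay in the internal SRAM while this runs.
 */
#define DDR_BASE  ((volatile uint32_t *)0x40000000u)   /* placeholder */
#define DDR_WORDS (1024u * 1024u)                      /* test 4 MB    */

static int ddr2_quick_test(void)
{
    volatile uint32_t *p = DDR_BASE;
    uint32_t i;

    /* Write an address-in-address pattern, then read it back. */
    for (i = 0; i < DDR_WORDS; i++)
        p[i] = (uint32_t)(uintptr_t)&p[i];

    for (i = 0; i < DDR_WORDS; i++)
        if (p[i] != (uint32_t)(uintptr_t)&p[i])
            return -1;                                 /* fail */

    return 0;                                          /* pass */
}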
We have several units built up that have shown issues in the burn-in chamber. For kicks, this week we loaded code into a couple of the most troublesome units to run the entire system at half speed (125 MHz) and found that they were now stable! We then changed the PLL registers so that the system runs at 95% speed (changing FBKDIV from 20 to 19, for a 237.5 MHz system clock), and a dozen previously troublesome units have now run for over 16 hours with none of them showing issues.
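For anyone checking the arithmetic, the clock numbers work out like this; the divider chain below is inferred from our own figures, not taken from the reference manual:

#include <stdio.h>

/* Back-of-the-envelope check of the clock figures above.  Assumed chain
 * (inferred from a 50 MHz reference, 500 MHz VCO, and /2 to the system
 * clock): fsys = 50 MHz / 2 * FBKDIV / 2 = 12.5 MHz * FBKDIV.
 */
static double fsys_mhz(unsigned fbkdiv)
{
    return 12.5 * (double)fbkdiv;
}

int main(void)
{
    printf("FBKDIV=20 -> %.1f MHz\n", fsys_mhz(20));   /* 250.0 (full speed) */
    printf("FBKDIV=19 -> %.1f MHz\n", fsys_mhz(19));   /* 237.5 (95% speed)  */
    printf("FBKDIV=10 -> %.1f MHz\n", fsys_mhz(10));   /* 125.0 (half speed) */
    return 0;
}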
Naturally, we are concerned that there is some terrible underlying problem and are not comfortable with the "solution" of running the system at just under its rated speed.
While in the burn-in chamber, we have monitored the temperatures inside the case of our product. The highest temperature we've ever measured was 83 °C, and that was with the thermocouple taped to the CPU package. These CPUs are rated as being able to run at 85 °C (ambient).
These boards are using the latest die revision of the CPU ("0N51E").
Yes, we have read the errata on this CPU.
We have (a few of) these products in field tests right now, so this is an urgent issue for us.
Oops! I just read your post properly after writing the following. So you're not having DDR2 problems then. I'll leave this in in case it helps anyone else.
We had a similar problem with the i.MX53 chip. Some boards wouldn't work at temperature extremes. Other ones would:
The graph above plots the maximum memory bus frequency at which the Freescale Memory Stress Test would run, versus temperature, for two different boards. The rated DDR3 frequency for the i.MX53 is 400 MHz.
The red "A6 MHz DSE5" line is for a board with a partial hex serial number of "A6", showing the temperature sensitivity. The purple "A2 MHz DSE5" line is the same test on a different board.
After testing a lot of boards we found the temperature sensitivity seemed to vary in a CYCLIC MANNER, directly and linearly, with the board serial number.
Huh?
The boards are made on large PCB panels, and are serialised (something like) left to right, top to bottom. The panels are "impedance controlled", which means they have an impedance test area in one or two places. But the actual impedance and (more importantly) the thickness of the inner layers probably vary across the panel. If we knew what the variation was and knew how to measure it, we could probably draw a "contour map" on the panels that corresponded to the measured sensitivity above.
You'll notice that boards "A2" and "A6" both work fine at "DSE7". That's the DRIVE STRENGTH setting of the DDR Address and Control Lines.
The Reference Manual documents that DSE5 and DSE6 are identical. It is wrong. We ended up changing the Drive Strength for these signals to "6" and it fixed the problems.
On your chip, what have you got "MSCR_SDRAMC" set to? I'd suggest not believing the manual about the "reserved" values and trying them anyway.
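Something along these lines; the register address is a placeholder (check the MCF5441x reference manual), the field width is a guess, and run_mem_stress() stands in for whatever stress test you already use:

#include <stdint.h>

/* Sketch of the "try every drive-strength code" experiment.  The
 * MSCR_SDRAMC address is a placeholder and the number of codes is a
 * guess; run_mem_stress() is whatever memory stress test you already run.
 */
#define MSCR_SDRAMC (*(volatile uint8_t *)0xEC094060u)  /* placeholder address */

extern int run_mem_stress(void);        /* 0 = pass, nonzero = fail */

static int results[8];                  /* pass/fail result per setting */

void sweep_sdram_pad_settings(void)
{
    uint8_t setting;

    /* Walk the candidate codes, including ones the manual calls "reserved". */
    for (setting = 0; setting < 8; setting++) {
        MSCR_SDRAMC = setting;
        results[setting] = run_mem_stress();
    }
}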
Of course that has nothing to do with it at all.
I'd suggest you get a "particularly sensitive" board and replace the crystal with an oscillator, or (better) feed it from a signal generator. That should eliminate any possible crystal problems. 50MHz for a crystal is "pushing it" and may be causing the problems. Then I'd suggest using a different crystal (like 25MHz) and multiplying that up instead to see if that might be causing it. Have you performed a standard "Crystal Margin Test" over Temperature?
Some Freescale chips have errata warning that the input voltage from an external oscillator has to be reduced to less than the full 0 V to 3.3 V swing.
But we have had another similar problem with temperature and the bus. Again it was the Drive Strength of the Data Bus to the FLASH and RAM chips that was the problem. Dropping the Drive Strength (to reduce the overshoots and undershoots) fixed this problem.
https://community.freescale.com/message/66171#66171
> In-Circuit Debugger has shed NO light on the problem
It would be a good idea to deliberately cause every possible trap and make sure the debugger can catch them all. Make sure you have the Vectors and Stack in SRAM first. Put the same vector into every unused interrupt, especially the "unused" or "spurious" ones for your interrupt controller (a sketch follows below). Make sure you trap A-Line and F-Line exceptions. We missed those and had all sorts of problems as a result when the CPU hit a null function pointer and tried to execute the reset vector, which starts with 0xF.
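Something like this for the vector fill; __vector_ram is an assumed linker symbol for a RAM-resident copy of the 256-entry table (with VBR already pointing at it), and unexpected_isr is the catch-all handler sketched further down:

#include <stdint.h>

/* Sketch of the "same catch-all in every unused vector" idea.
 * __vector_ram and unexpected_isr are assumed names, not library code.
 */
extern uint32_t __vector_ram[256];      /* assumed linker symbol            */
extern void unexpected_isr(void);       /* catch-all trap/interrupt handler */

void install_catchall_vectors(void)
{
    unsigned v;

    /* Leave vector 0 (initial SP) and 1 (initial PC) alone.  Fill everything
     * else, including the A-Line (10) and F-Line (11) traps and the spurious
     * interrupt (24), then let the normal init code overwrite the handful of
     * vectors it really uses afterwards. */
    for (v = 2; v < 256; v++)
        __vector_ram[v] = (uint32_t)(uintptr_t)unexpected_isr;
}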
If the debugger won't catch the supposed exception, write a very simple "blinking light" or "dumb serial port" trap handler and have the fault trigger it (see the sketch below). Have it disable MMU, interrupts, DMA and everything else you can think of to maximise the chances of it reporting back to you. Are you limiting the JTAG Clock to 3 MHz or less?
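A minimal version of that blinking-light handler might look like this; the GPIO register and LED bit are placeholders for whatever spare output your board has:

#include <stdint.h>

/* Rough sketch of the "dumb blinking light" trap handler.  The idea is to
 * depend on as little working hardware as possible: no DDR, no interrupt
 * controller, no DMA.  Register address and LED bit are placeholders.
 */
#define LED_PORT (*(volatile uint8_t *)0xEC094008u)    /* placeholder GPIO reg */
#define LED_BIT  0x01u

__attribute__((interrupt_handler))
void unexpected_isr(void)
{
    volatile uint32_t i;

    /* Mask all interrupts: raise the ColdFire SR interrupt level to 7. */
    __asm__ volatile ("move.w #0x2700,%sr");

    /* Never return; just blink, so a failure in the burn-in chamber is
     * visible even with no debugger attached. */
    for (;;) {
        LED_PORT ^= LED_BIT;
        for (i = 0; i < 500000u; i++)
            ;                                          /* crude delay */
    }
}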
Tom