We are using the MCF54417, a 250MHz ColdFire CPU. We have been running its system clock at the maximum rate, 250MHz, from a 50MHz external clock: the internal PLL multiplies this up to 500MHz, which is then divided down to 250MHz.
A certain percentage of the boards we've built have been "unstable," ESPECIALLY when we put them in our (45 degree C ambient) burn-in chamber. (We run all of our products for 24 hours in this environment to try to shake out problem boards before we ship them.) We have not been able to identify exactly what goes wrong in the system, but the symptom looks like an exception, which triggers a reboot. When (possibly) similar exceptions have happened on my desk, my In-Circuit Debugger has shed NO light on the problem. My interrupt handler for spurious interrupts issues a "halt" instruction, which *should* give me an opportunity to inspect the stack, etc., but instead the JTAG/BDM interface seems to go wonky until I issue a reset.
This is the first PCB we have manufactured using BGA packaged chips. Only the CPU and DDR2 RAM are in BGA packages. Since some boards exhibited problems more than others, we focused on our reflow parameters in manufacturing, guessing that we might have some poor solder connections on "extra" Ground or Vdd connections under the CPU. However, we recently did destructive testing on some of these troublesome boards and, excluding the first run of boards, their solder looked very good. We no longer believe that the issues we're seeing are related to the soldering of these BGA parts.
Given that we had a really tough time getting a working configuration for the DDR2 RAM, we wondered if the RAM's configuration, routing, or solder could be an issue. However, I believe we have effectively ruled it out: As with the CPU, we have examined its solder and it looks good. Our PCB layout software does NOT have trace-length analysis, but our hardware engineer has put fair effort into keeping the trace lengths equal. But, most importantly: We can run the system in a mode that DOES NOT USE external RAM and it still fails in this mode. This is our "boot loader" code, and it uses the internal 64K SRAM for stack and variables. While the DDR2 RAM is *initialized* and tested once in this boot code, it is never accessed afterward. Units running boot code (thus not accessing external RAM) HAVE rebooted in our burn-in chamber.
We have several units built up that have shown issues in the burn-in chamber. For kicks, this week we loaded code into a couple of the most troublesome units, to run the entire system at half speed (125MHz) and found that they were now stable! We then changed the PLL registers so that the system is running at 95% speed (changing FBKDIV from 20 to 19, for 237.5MHz system clock), and a dozen previously troublesome units have now run over 16 hours with none of them showing issues.
Naturally, we are concerned that there is some terrible underlying problem and are not comfortable with the "solution" of running the system at just under its rated speed.
While in the "burn-in chamber" we have monitored the temperatures inside the case of our product. The highest temperature we've ever measured was 83C, and that was with the thermocouple taped to the CPU package. These CPUs are rated as being able to run at 85C (ambient).
These boards are using the latest die of the CPU ("0N51E").
Yes, we have read the errata on this CPU.
We have (a few of) these products in field tests right now, so this is an urgent issue for us.