SCC UART Only Works Below 50 Degrees C

nessies2003 · ‎07-14-2012

I am running SCC UART code on an MPC8270. Our UART TX code works fine until we heat the processor up to about 50 degrees C. After that, the SCC2 BD status bits seem to get corrupted. We check the Ready bit of the BD before we overwrite it, so when it gets corrupted, the code thinks that the buffer hasn't been sent yet. Sometimes it really has not been sent. It seems that some messages are skipped over at this temperature and then sent (out of order) once the transmit buffer pointer wraps around.

Has anyone seen issues like this and/or does anyone have any ideas where we can look to start investigating this?

Thanks.

genuap · ‎07-15-2012

Where's your BD? In external SDRAM? I'd guess an SDRAM timing issue?

Have you tried to just run an extensive memory test (either through code, or an emulator) on uncached DRAM at low temp?

Try moving your BD to DPRAM? See if that makes a difference - if so, it points to DRAM.

Also, make sure you have an eieio or sync between checking the ready bit and overwriting - without the explicit synchronization those could happen out of order. Writes happen quicker when cold - so that could be exacerbating it.
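A minimal sketch of that check-then-overwrite sequence with explicit barriers. The `tx_bd_t` layout, the `BD_TX_READY` value and the `tx_bd_queue` name are assumptions for illustration, not taken from the poster's code; check the reference manual for the real TxBD definition. The `eieio` is only expected to order the read against the later writes when the BD memory is cache-inhibited and guarded, and the non-PowerPC fallback exists only so the sketch compiles on a host.

```c
#include <stdint.h>

/* Hypothetical Ready flag and BD layout, for illustration only. */
#define BD_TX_READY 0x8000u

typedef struct {
    volatile uint16_t status;  /* status and control bits */
    volatile uint16_t length;  /* number of bytes to send */
    volatile uint32_t buffer;  /* address of the data buffer */
} tx_bd_t;

#if defined(__powerpc__) || defined(__PPC__)
#define IO_BARRIER() __asm__ volatile("eieio" ::: "memory")
#else
/* Host fallback so the sketch builds off-target. */
#define IO_BARRIER() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

/* Returns 0 if the BD was queued, -1 if it is still marked Ready. */
int tx_bd_queue(tx_bd_t *bd, uint32_t buf_addr, uint16_t len)
{
    if (bd->status & BD_TX_READY)
        return -1;              /* not sent yet, do not overwrite */

    IO_BARRIER();               /* keep the status read ahead of the writes */
    bd->buffer = buf_addr;
    bd->length = len;

    IO_BARRIER();               /* pointer and length written before Ready */
    bd->status = (uint16_t)(bd->status | BD_TX_READY);
    return 0;
}
```

A caller would retry or move on when this returns -1 rather than overwriting the descriptor.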

... Paul

nessies2003 · ‎07-20-2012

Our BD is in dual-port internal RAM. The buffers themselves are in external RAM.

We have run extensive memory testing of the external RAM at low and high temperature and did not see any problems.

I tried the eieio assembly command between the read and the set and that does not seem to help. The status bits are still getting corrupted.

TomE · ‎07-23-2012

Where is the UART being clocked from, internal or external?

Try dropping the clock rate to see if it is related to a "corner" condition.

Check your power supplies at that temperature.

If you're using crystals then they may be glitching at that temperature (technical term "spurs"). Try changing the loading caps slightly and see if the problem changes.

Read the text describing Figure 22 on Page 677 of the following:

http://www.ieee-uffc.org/frequency_control/teaching/pdf/fcdevices.pdf

If you read through all of the above you'll never take a crystal for granted ever again!

Does this happen on one hardware unit only or all of them? Or do you only have one of them?

You're "heating the processor up to 50C" but do you know how hot the CPU really is?

Tom

nessies2003 · ‎07-23-2012

Looking at the registers, it appears we are clocking internally using BRG1. I changed the settings of BRGC1 to get a slower baud rate, but I'm still seeing status bits become corrupted in the BDs. Furthermore, the FCC Ethernet (which also seems to have the same issue) uses external clocks.

The MPC8270 clock itself seems to be generated by an oscillator chip which gets fed into a PLD and then into our MPC8270.

The person who originally discovered the issue was using the temperature chamber and noticed that we lost RS232 output when the chamber was at 50 deg C. I have been using a heat gun on the processor itself, so as you suspected, I cannot confirm exactly how hot the device is, but we do have the data indicating that it failed in the chamber at 50 deg C.

We have tried this on 2 of our boards with the same results.

Looking at the 3.3V and 1.5V power supplies, I don't see any changes when I heat up the processor.

TomE · ‎07-24-2012

> I cannot confirme exactly how hot the device is

You may be exceeding the maximum allowed junction temperature. Is that 50C for a blank board or one in a box of some sort?

Looking at "MPC8280 PowerQUICC II Family Hardware Specifications", the absolute maximum junction temperature is 120C and the recommended maximum is 105C.

Junction to Ambient without forced cooling for the TBGA is 12C/W. The PBGA is 19 C/W.

If you're running the CPU in PBGA at 300MHz it dissipates up to 1.65W without considering I/O power.

5 Power Dissipation
This table provides preliminary, estimated power dissipation for various configurations. Note that suitable thermal management is required to ensure the junction temperature does not exceed the maximum specified value. Also note that the I/O power should be included when determining whether to use a heat sink.

Are you using a heat sink? I would suggest you follow the procedure in Section 4.5 of the Hardware Specification and measure how hot the CPU is getting.
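As a rough cross-check of the figures above, the usual first-order estimate is junction temperature = ambient + (junction-to-ambient resistance x power). The sketch below is only that estimate, not the Section 4.5 measurement procedure, and the function name `junction_temp_c` is made up for illustration. It ignores I/O power and assumes still air at the stated ambient.

```c
/* First-order estimate: Tj = Ta + theta_JA * P.
   All inputs come from the caller; nothing is measured here. */
double junction_temp_c(double ambient_c, double theta_ja_c_per_w, double power_w)
{
    return ambient_c + theta_ja_c_per_w * power_w;
}
```

With a 50C ambient and 1.65W, this gives about 81.4C for the PBGA (19 C/W) and about 69.8C for the TBGA (12 C/W). A heat gun aimed at the package can raise the local ambient well above what the chamber reading suggests, so the estimate says little about that case.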

What is the maximum ambient temperature you want your product to be able to run at?

Here's another one I've heard of before. Make sure the clock you're feeding into the CPU doesn't have any overshoots or undershoots that exceed the maximum voltage ratings (-0.3V to 3.6V). If you don't have a series resistor (50 - 100 ohms) on the clock pin, it is very likely to have overshoots and undershoots. When the CPU temperature goes up it gets more sensitive to this.

Tom

TomE · ‎07-25-2012

Check the external data buses for overshoots and undershoots too. Can you change the code to stop it exercising individual external buses to see if the problem goes away (and so can be isolated to one bus)?

Tom

nessies2003 · ‎07-27-2012

Update: We cut the input clock in half and don't have the problem. Next I will try scoping the two clocks for comparison...

TomE · ‎07-28-2012

> The MPC8270 clock itself seems to be generated by an oscillator chip which gets fed into a PLD and then into our MPC8270.

Watch out for the duty cycle or mark/space ratio, and also the jitter then. The PLD may be distorting the clock. The distortion may change with temperature, or the MPC8270's sensitivity may be changing:

From the MPC8280EC:

NOTE: CLKIN Jitter and Duty Cycle
The CLKIN input to the SoC should not exceed +/– 150 psec of jitter
(peak-to-peak). This represents total input jitter—the combination of short
term (cycle-to-cycle) and long term (cumulative). The duty cycle of CLKIN
should not exceed the ratio of 40:60.

Check the specs for all relevant clock sources.
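If you capture the clock on a scope, the two limits in that note can be checked against the measurements with something like the sketch below. The function names are hypothetical, and two readings of the note are assumptions: "40:60" is taken to mean a high time between 40% and 60% of the period, and the jitter limit is taken as a deviation of at most 150 ps either side of the nominal edge.

```c
/* Returns 1 if the high time is between 40% and 60% of the period. */
int clkin_duty_ok(double high_ns, double period_ns)
{
    double duty;

    if (period_ns <= 0.0)
        return 0;               /* no valid period measured */
    duty = high_ns / period_ns;
    return duty >= 0.40 && duty <= 0.60;
}

/* Returns 1 if the edge deviation is within +/- 150 ps of nominal. */
int clkin_jitter_ok(double deviation_ps)
{
    double magnitude = deviation_ps < 0.0 ? -deviation_ps : deviation_ps;
    return magnitude <= 150.0;
}
```

Measuring both before and after the PLD, cold and hot, would show whether the PLD is the part that drifts.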

Tom

nessies2003 · ‎07-28-2012

Here's something else interesting. When we change the PLLMF multiplier so that the CPM multiplication factor is half what it was before, but the CPU is the same, and keep our original clock, the problem seems to go away.

TomE · ‎07-29-2012

> When we change the PLLMF multiplier so that the CPM multiplication factor is half what it was before

From your original symptoms it was always a CPM problem and not a CPU problem.

You haven't said what your input clock frequency is, what mode it is in (Local, PCI Host, PCI Agent), what the speed rating of the CPU is (66/83/100), the speed you're running the bus at and the multiplication factors for the CPM and CPU. You may be running outside of the specifications. Providing these numbers might help with spotting something.

Have you checked the chip errata? Have you checked the PLL supply?

From the tables it looks like the CPU always runs at a faster clock than the CPM, but I couldn't find anything in the manual advising of any limits on the allowable ratios. Have you found anything on this?

Tom

nessies2003 · ‎07-30-2012

We are using a 100 MHz input clock in local mode. The CPM and CPU originally both had multiplication factors of 4x. Looking at the chip, the CPM max frequency is 300 MHz, so it seems we were in fact overclocking the CPM....
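The arithmetic behind that conclusion, as a small sketch. It uses only the numbers in this post (100 MHz input, 4x factor, 300 MHz CPM limit) and treats the derived clock as input clock times factor, which is how the figures are described here; the function names are made up for illustration.

```c
/* Derived clock in MHz from the input clock and a multiplication factor. */
double derived_mhz(double clkin_mhz, double factor)
{
    return clkin_mhz * factor;
}

/* Returns 1 if the frequency is at or below the rated maximum. */
int within_rating(double freq_mhz, double max_mhz)
{
    return freq_mhz <= max_mhz;
}
```

With a 4x factor the CPM clock works out to 400 MHz, which is over the 300 MHz limit. With 2x it is 200 MHz, which is within it.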

nessies2003 · ‎07-30-2012

Still having some issues setting the HRCW to make the CPM x2 and CPU x4. For some reason when I set this to 0001_000 I get 5x and 5x. Any ideas?

TomE · ‎07-31-2012

HRCW??

Starting up these things is really complicated, isn't it?

"As described in Section 10.6, “Clock Configuration Modes,” the
main PLL locks according to MODCK[1–3], which are sampled, and to MODCK_HI (MODCK[4–7]) taken from the reset configuration word."

And the "reset configuration word" gets read from an EEPROM. So I guess you're trying to reprogram that EEPROM to load the HRCW with the right values that then make it into HRC, SCMR[PLLMF] and SCMR[CORECNF] to see what you ended up with.

Bytes 0x00, 0x08, 0x10 and 0x18 are read and loaded in big-endian order into HRCW. The lower four bits of the last byte should end up in MODCK_H.

You might be suffering from one of the Power architecture's standard confusing problems. The Bit and Byte order. Because this derives from 1950's IBM bit numbering standards, the most significant bit is bit zero.

Make sure you haven't set the MODCK_H bits to "1000" instead of "0001".

"MODCK_H = hard reset configuration word [28–31].", so "0001" needs an EEPROM byte of 0x01 and not 0x08.

You could probably get by with one of the default modes (without an EEPROM), and this might be worth trying to debug this problem.
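To make the bit-numbering point concrete, here is a hedged sketch that assembles a 32-bit word from four configuration bytes (first byte most significant) and pulls out a field using IBM numbering, where bit 0 is the most significant bit. The helper names `hrcw_from_bytes` and `ibm_field` are made up for illustration. The only field position used is MODCK_H at bits 28 to 31, as quoted above.

```c
#include <stdint.h>

/* Build the 32-bit word from four bytes, first byte most significant. */
uint32_t hrcw_from_bytes(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Extract bits first..last in IBM numbering (bit 0 = MSB, bit 31 = LSB). */
uint32_t ibm_field(uint32_t word, unsigned first, unsigned last)
{
    unsigned width = last - first + 1u;
    unsigned shift = 31u - last;
    uint32_t mask = (width >= 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);

    return (word >> shift) & mask;
}
```

With a last byte of 0x01, `ibm_field(word, 28, 31)` returns 1 (binary 0001). With a last byte of 0x08 it returns 8 (binary 1000), which is the mirrored pattern.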

Tom

nessies2003 · ‎07-31-2012

Sadly I have tried 0x01 and 0x08 and neither seems to work properly. The default mode 0000_000 does work properly, but I need the equivalent configurations to work WITH an EEPROM.

We may just have to stick with 0110_000 which seems to work.

TomE · ‎08-02-2012

There's obviously something wrong with your EEPROM programming.

This should be very easy to reverse-engineer.

Try different combinations of clock-selection data - maybe all sixteen values from 0000 to 1111, and for those where the CPU manages to run, examine the hardware configuration register and the resulting clock dividers and see if the patterns in the data show you the mistake.

You may be programming the wrong bytes/words in the EEPROM, so try programming some of the other bits and see if they end up where you expect them to be.

Tom

nessies2003 · ‎08-02-2012

I actually did try that (0x0 to 0xF). The thing is, I only found 3 combinations that worked. I could not generate the other 3 configurations with xxxx_000 that are listed in Table 16. The other 13 combinations of HRCW_H gave me either duplicates of the configurations that did work, or undefined configurations (such as 1.5x CPM).

Furthermore, I must be using the correct bits in the HRCW because the SCMR registers do change. Additionally, the HRCW_H of 0x00 should work for sure, since there are no 1's and therefore the position of the bits doesn't matter. However, this doesn't work.

Regardless, we have come to a configuration we are satisfied with, although I do wish I understood what was going on here.