We are seeing random illegal-instruction, line-1010 and line-1111 exceptions on what I expect to be perfectly legal instructions on a 68EN360.
The code is running out of FLASH (29F800 or similar).
In our exception handler I read the instruction at the faulted program counter so I can compare. It always matches.
I considered a wrong jump to the middle of a 3-word instruction, but I have also seen it in the middle of a series of several simple 1-word instructions (Simple register moves, register arithmetic etc.). I have also seen F-line emulation exceptions on instructions that don't begin with Fxxx.
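The check described above (read the opcode at the faulted PC and see whether it can explain the exception) could be automated along these lines. This is only a sketch: it assumes the handler has already fetched the 16-bit opcode word at the stacked PC, and the function name is made up for illustration. The vector numbers are the standard 68000-family ones.

```c
#include <stdint.h>

/* Exception vector numbers as defined for the 68000 family. */
enum {
    VEC_ILLEGAL = 4,   /* illegal instruction */
    VEC_LINE_A  = 10,  /* unimplemented opcode, top nibble 1010 */
    VEC_LINE_F  = 11   /* unimplemented opcode, top nibble 1111 */
};

/* Hypothetical helper: returns 1 if the opcode word found at the faulted
 * PC could have raised the given line-A or line-F exception, 0 if it
 * could not, and -1 for vectors this simple check cannot judge (an
 * illegal-instruction exception would need a full opcode decode). */
int opcode_explains_exception(unsigned vector, uint16_t opcode)
{
    unsigned top = opcode >> 12;   /* top nibble selects line A / line F */

    switch (vector) {
    case VEC_LINE_A: return top == 0xA;
    case VEC_LINE_F: return top == 0xF;
    default:         return -1;
    }
}
```

A result of 0 means the word now readable at that address is not what the CPU decoded, which points at the fetch rather than at the code.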
Additionally oddly, the problem appeared after a SW update, disappeared after a later update, and then came back again with the update after.
We have had the occasional bus error due to bad pointers, and very obvious illegal instructions due to a return from a corrupted stack; in those cases we mostly identified the problem more or less immediately.
This one really has us stumped, and we are under strong pressure to fix it in our software.
I wonder if it is related to https://community.freescale.com/message/87784#87784
Any ideas please?
regards,
L di/dt
Going through my old mails.
Update:
The FLASH manufacturer and DIMM socket changed. All within spec but just enough to expose a PCB design/layout problem:
There was ringing on the address lines, the small difference in FLASH manufacturer and DIMM socket change influenced the ringing the tiniest bit.
Just by bad luck, certain address sequences with certain address lines changing '1' to '0' resulted in the FLASH latching a wrong address (Setup/stabilization time violation)
PCB designer added series impedance matching resistors, just a couple ohms and fine-tuned them for our layout to make the signals look nice. Cleaned up some dodgy routing, a couple lines were really long compared to the others. The actual layout change was pretty small.
I have no idea how much this cost us in the end.
Lesson learned: Even for a CPU as slow as 33MHz, careful layout and impedance matching are important. Making a layout and simply checking that 'it works' isn't enough.
This problem is now closed.
My normal sig is oddly appropriate.
regards
- L di/dt.
Rob, how is it going?
Keep us posted please! :)
Best regards!
> the code is running out of FLASH
Do you have enough RAM to compile the code to run at the RAM's address, and then at startup, copy from FLASH to RAM and execute from there?
This should isolate the problem to being either a "Read from FLASH" problem or a "Read from anywhere over that bus" problem.
If it fixes the problem and you do have enough RAM, you may be able to apply this as a semi-permanent solution.
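A rough sketch of the copy step, with a cheap read-back check added, might look like the following. The function name and parameters are made up for illustration, and it only shows the copy itself: actually executing from RAM also needs the image linked for the RAM address, which depends on your toolchain and linker script.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies `words` 16-bit words from src (e.g. the
 * FLASH image) to dst (e.g. RAM), then reads the source a second time
 * and counts the words that differ from the copy. A non-zero result
 * means two reads of the same source location disagreed. */
size_t copy_and_verify(uint16_t *dst, const volatile uint16_t *src,
                       size_t words)
{
    size_t i;
    size_t mismatches = 0;

    for (i = 0; i < words; i++)
        dst[i] = src[i];

    for (i = 0; i < words; i++)
        if (dst[i] != src[i])
            mismatches++;

    return mismatches;
}
```

Note that the access pattern here is purely sequential, so it may not provoke a fault that depends on particular address sequences; a clean result does not prove the FLASH reads are good.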
Still, given the 68360 is ... OMG! 20 years old! The Data Sheet is dated 1993. This product must have worked at some time in the past. What has CHANGED recently?
Ah yes, you said:
> Additionally oddly, the problem appeared after a SW update,
So do a "binary chop" between the old and new versions and try to narrow down the minimum change that causes the problem. If that doesn't show anything obvious, start relinking the old and new code to (say) increasing 4k or 16k address boundaries and see if any of the shifted "good" ones go bad or the "bad" ones go good.
Print and compare the MAP files between the two versions and see if functions moved around, or something important crossed an address boundary it didn't before. DO consider the address ranges and address patterns where the code and data are. Profile the code and see where the "hot spot loops" are, and see if moving these "hot-spots" around (like linking them all at the start or end of FLASH) changes the problems.
I assume this thing is on Ethernet. Load up the network. Basically ping the **** out of it. Run multiple Linux shells and then "ping -f" it from 5 or 20 shells simultaneously. If that activity makes it fail more often you've got a better test.
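When going through the MAP files, a small helper can flag functions that straddle an address boundary. This is a sketch only: the name is invented, and the start address and size are assumed to come from whatever your linker's MAP file reports.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if the byte range [start, start + size)
 * spans more than one aligned block of 2^boundary_bits bytes, else 0.
 * Assumes size > 0 and boundary_bits < 32. */
int crosses_boundary(uint32_t start, uint32_t size, unsigned boundary_bits)
{
    uint32_t last = start + size - 1;   /* address of the final byte */

    return (start >> boundary_bits) != (last >> boundary_bits);
}
```

For example, with boundary_bits = 12 a 0x20-byte function at 0x0000FFF0 crosses a 4k boundary, while the same function linked at 0x00010000 does not. Running this over the "good" and "bad" builds shows which hot functions changed sides.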
Tom
I wrote:
> Still, given the 68360 is ... OMG! 20 years old! The Data Sheet is dated 1993. This product
> must have worked at some time in the past. What has CHANGED recently?
That also means the Product must be pretty old, and the boards in use must be pretty old.
Has anybody checked to see if some of the boards were made in the dark days of 2001 and 2002 when the "Capacitor Plague" was in full swing?
http://en.wikipedia.org/wiki/Capacitor_plague
If you read the above excellent article, it gives details of "normal" capacitor lifetimes (as well as the ones that failed early due to the "Plague"). If some of your failing boards are 10-15 years old, the capacitors may have dried out and may not be working any more. Different versions of software may generate different patterns of operating current, and that together with the old capacitors may be causing the observed problems.
Tom
Read all the Errata for the CPU. Which mask are you using? A quick read of the B0 errata shows all sorts of corruption problems.
I've had similar problems like this with the MPC860, which had bugs with the cache and branch prediction that would corrupt branches once a week or so, but the 68000 shouldn't be complicated enough to have these sort of problems (it doesn't have a cache for starters).
That problem came and went with different software versions due to the code ALIGNMENT in memory. When we had a build that failed, adding one instruction (that moved all the rest of them down) would usually make the problem go away.
So try adding "NOPs" (or equivalent) at different parts of the code to "move it around" to see if that makes it go away. You should be able to do a "binary chop" to zero in on any sensitive area.
Try adding a wait-state to the FLASH cycles, and see if it makes any difference.
Heat, cold? High and low voltage levels?
Do you have any DMA devices that might be interfering with a bus cycle? Do you have any gate-array chips on the bus? Is it "burst mode" Flash?
Try to find something that makes it fail more often. Is it related to Ethernet traffic for instance? CPM load?
When you get the "crash", examine all the CPU registers and then use that to prove that "how you thought the CPU got there" is how it did get there. Look back on the stack and see what function it was in previously, and what previous functions have saved on the stack. Then see what that function does to registers and see if you can work out how far through that function it got before going off the rails. You might see a pattern in this.
Can you hang a logic analyser on the bus? That may be the only way to trace it to where it is failing. You don't need everything, just capturing the bottom 12 address bits might be enough for you to backtrack what happened.
Tom
Thanks, that's helpful.
The only DMA is handled by the on-chip peripherals.
The problem is observed on systems both with and without Ethernet, not really a lot of traffic.
Flash is standard, big name manufacturers, no burst.
A test version with max waitstates, timing relax set, delay wait between transfers at the max gave the same problem.
Problem observed also at normal room temperature.
Interestingly, it seems to like "F"s
I found a couple cases with absolute address in the instruction, something like "MOVE.B D0,(00012345)"
that gave a buserror with the fault address FFFF2345
(Not the real address)
The problems tend to be in the most heavily executed parts of SW
Changing the order of modules in the linker seems to affect the frequency of the problem and which module it appears in.
We did once have a CPM load problem: with several simultaneous interrupts, SPI interrupts were set in the module interrupt register but never asserted to the CPU. That's solved though (additional check of the interrupt pending register after the last serviced interrupt).
> I found a couple cases with absolute address in the instruction, something like
>"MOVE.B D0,(00012345)" that gave a buserror with the fault address FFFF2345
Do you have a 16-bit Flash bus or a 32-bit Flash bus? If you're using 16 bit, then the addresses are being fetched in two cycles, high word first, and that one went wrong somehow.
If you have a 32-bit bus, then the upper half got read badly.
The fact that it read all-ones means either an open bus with pullups (do you have pullups?) or that it has read an erased part of the FLASH. I'm guessing you don't have pullups, so I'm suspecting the latter.
The way that might happen is if one of the upper address bits is being sampled as "high" by one of the Flash chips, and it is reading a high (erased) memory location instead of the one you intended it to read. Strangely the other Flash got the right address.
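To apply the same reasoning to every crash record, something like this sketch could split the expected and observed fault addresses into 16-bit words and report which half went wrong and whether it read as all-ones. The struct and function names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical record of how an observed address differs from the one
 * encoded in the instruction. */
typedef struct {
    uint32_t flipped;           /* bits that differ                       */
    int      upper_word_bad;    /* 1 if bits 31..16 differ                */
    int      lower_word_bad;    /* 1 if bits 15..0 differ                 */
    int      bad_word_all_ones; /* 1 if every bad word reads as 0xFFFF    */
} addr_diff_t;

addr_diff_t diff_fault_address(uint32_t expected, uint32_t observed)
{
    addr_diff_t d;
    uint16_t obs_hi = (uint16_t)(observed >> 16);
    uint16_t obs_lo = (uint16_t)(observed & 0xFFFFu);

    d.flipped        = expected ^ observed;
    d.upper_word_bad = (d.flipped >> 16) != 0;
    d.lower_word_bad = (d.flipped & 0xFFFFu) != 0;

    /* All-ones suggests an erased FLASH location or a pulled-up bus. */
    d.bad_word_all_ones =
        (d.upper_word_bad || d.lower_word_bad) &&
        (!d.upper_word_bad || obs_hi == 0xFFFFu) &&
        (!d.lower_word_bad || obs_lo == 0xFFFFu);

    return d;
}
```

With the (not real) addresses quoted above, expected 0x00012345 and observed 0xFFFF2345, this reports only the upper word as bad and that it read as all-ones.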
Which seems to indicate you've got marginal levels and/or timing. The worst case is where all of the address bits have changed from all-zeros to all-ones, with the exception of one poor address bit that is trying to stay low. The combination of capacitive coupling (from all the other tracks) and inductive coupling, together with "ground bounce" on the driving chip (the CPU) can lift that low address above the threshold when the Flash chip latches that address.
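If you can capture consecutive addresses (from a logic analyser, or from the disassembly of a failing sequence), a sketch like this can rank transitions by how "bad" they are in the sense just described: many lines switching at once while a few try to stay low. All names here are invented, and `mask` is assumed to select the address lines actually wired to the Flash.

```c
#include <stdint.h>

/* Counts the set bits in v. */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;

    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical summary of one address-bus transition. */
typedef struct {
    unsigned rising;    /* lines going 0 -> 1                        */
    unsigned falling;   /* lines going 1 -> 0                        */
    unsigned held_low;  /* lines staying 0 (potential victim lines)  */
} addr_edge_t;

addr_edge_t classify_edges(uint32_t prev, uint32_t next, uint32_t mask)
{
    addr_edge_t e;

    e.rising   = popcount32(~prev &  next & mask);
    e.falling  = popcount32( prev & ~next & mask);
    e.held_low = popcount32(~prev & ~next & mask);
    return e;
}
```

A transition with a large `rising` count and a `held_low` of 1 or 2 is the kind of pattern worth putting on the oscilloscope first.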
So you want an increased timing margin between when the addresses transition and when the signal is sent to the Flash chips to start a new cycle.
You should also be looking for this sort of thing on the address bus with a good oscilloscope. You should also look for evidence of undershoots and overshoots. They can cause RAM and FLASH chips to go slightly insane, especially as the frequency of the "shoots", the number of pins they're happening to and the temperature increase.
If your code has changed so that it is getting "busy" across an address boundary (0x07fffff0 to 0x08000000) and the addresses on the data bus are also "busy", it can cross a threshold and trigger this.
Do you have SERIES resistors in the address bus? Do you have them in the data bus?
The data bus transitions (remember the CPU, RAM and ROM alternate driving this bus) can also couple through to the address bus. What are the theoretical and measured/observed timing relationships between data bus transitions and address bus transitions? And the Control bus signals?
This chip has variable drive strength on some clock lines, but not on the data, address or control lines. Newer chips have this capability and it is very useful. I had bad RAM corruptions on an MCF5329 which only went away when I turned the drive strength down as the undershoots and overshoots were causing the SDRAM to go crazy.
Is your chip in BGA, PGA or QFP? What's your grounding and bypassing like?
Does this happen on multiple boards? Has it started happening with newly built hardware, or is new software triggering it on all boards?
Tom