LPC1549 - debugging memory problem

dennis_frie · ‎03-10-2021

I've recently been struggling with an MCUXpresso project that has started to fail randomly.
I suspect it's a memory problem, but I'm having a hard time narrowing it down.

As an example, one of the problems I've noticed is that variables are not being initialized correctly in the ResetISR.

double testFreq = 1000; // Variable used in actual firmware
volatile int32_t debugVar1 = 1234; // Debug purpose
volatile double debugVar2 = 1234; // Debug purpose

Without the debug variables being used, I get the following readout when I break just after the ResetISR routine that copies variables from flash to RAM (note that testFreq is used in the code and should have been initialized):

testFreq = -nan(0xfffffffffffff)
debugVar1 = 33570816
debugVar2 = 9.2732874683871246e-312
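For readers unfamiliar with this step: the flash-to-RAM copy mentioned above can be sketched roughly as follows. This is a minimal, host-runnable illustration and not the actual MCUXpresso startup code; the function names copy_data_words and first_mismatch are hypothetical, and on target the load and run addresses would come from the linker script rather than from arrays.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy initialised data from its load address (flash image) to its run
 * address (RAM), one 32-bit word at a time. len_bytes is assumed to be a
 * multiple of 4, as linker scripts normally guarantee for .data. */
static void copy_data_words(const uint32_t *load, uint32_t *run, size_t len_bytes)
{
    for (size_t i = 0; i < len_bytes / 4; i++) {
        run[i] = load[i];
    }
}

/* Compare the RAM copy against the flash image.
 * Returns the index of the first differing word, or -1 if they match. */
static long first_mismatch(const uint32_t *load, const volatile uint32_t *run,
                           size_t len_bytes)
{
    for (size_t i = 0; i < len_bytes / 4; i++) {
        if (run[i] != load[i]) {
            return (long)i;
        }
    }
    return -1;
}
```

Running a comparison like first_mismatch at a breakpoint right after the copy could help tell apart a wrong flash image from a RAM word that does not hold what was written to it.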

Adding the following code, which actually uses the debug variables, makes testFreq initialize correctly as well:

volatile int32_t dummy = (int32_t)debugVar1;
dummy = (int32_t)debugVar2;

This gives the expected initialization:

testFreq = 1000
debugVar1 = 1234
debugVar2 = 1234

It should be noted that any random change to the code will cause the testFreq variable to be initialized correctly. However, other random errors or crashes will then show up.

So, the big question. I suspected the problems were caused by a pointer error or something similar that would corrupt random data. However, as this seems to be something that can already be seen in the ResetISR routine, I'm starting to wonder where to look. Any good ideas?

dennis_frie · ‎03-12-2021

As the problem seems to be isolated to the specific hardware prototype (the firmware has now been tested on a handful of other boards without problems), I've currently just "left it in a corner". We've used the LPC1549 for years and I've never seen this behavior before, so I'm fine with accepting that this is just a device that has been damaged "somehow" after months of use in the R&D lab. Unless we see this again, I don't think we will spend much time trying to understand what exactly has been damaged.


frank_m · ‎03-11-2021

The first post sounded much like a stack overflow, an out-of-bounds access, or a dangling pointer. The SCB fault registers would give you further information.
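As a rough illustration of reading the SCB fault registers, the sketch below decodes a CFSR (Configurable Fault Status Register) value into flag names. The bit positions follow the ARMv7-M architecture used by the Cortex-M3; the table and the decode_cfsr helper are hypothetical names for this example, and the value is passed in as a parameter so the code also runs on a host.

```c
#include <stddef.h>
#include <stdint.h>

/* On target the value would be read from the SCB, for example:
 *   uint32_t cfsr = *(volatile uint32_t *)0xE000ED28u;
 * (CFSR address as defined by ARMv7-M). */
struct cfsr_flag {
    uint32_t mask;
    const char *name;
};

static const struct cfsr_flag cfsr_flags[] = {
    { 1u << 0,  "IACCVIOL"    },  /* MemManage: instruction access violation */
    { 1u << 1,  "DACCVIOL"    },  /* MemManage: data access violation */
    { 1u << 3,  "MUNSTKERR"   },  /* MemManage: fault on exception return unstacking */
    { 1u << 4,  "MSTKERR"     },  /* MemManage: fault on exception entry stacking */
    { 1u << 7,  "MMARVALID"   },  /* MMFAR holds a valid address */
    { 1u << 8,  "IBUSERR"     },  /* BusFault: instruction bus error */
    { 1u << 9,  "PRECISERR"   },  /* BusFault: precise data bus error */
    { 1u << 10, "IMPRECISERR" },  /* BusFault: imprecise data bus error */
    { 1u << 11, "UNSTKERR"    },  /* BusFault: fault on unstacking */
    { 1u << 12, "STKERR"      },  /* BusFault: fault on stacking */
    { 1u << 15, "BFARVALID"   },  /* BFAR holds a valid address */
    { 1u << 16, "UNDEFINSTR"  },  /* UsageFault: undefined instruction */
    { 1u << 17, "INVSTATE"    },  /* UsageFault: invalid execution state */
    { 1u << 18, "INVPC"       },  /* UsageFault: invalid PC load */
    { 1u << 19, "NOCP"        },  /* UsageFault: no coprocessor */
    { 1u << 24, "UNALIGNED"   },  /* UsageFault: unaligned access */
    { 1u << 25, "DIVBYZERO"   },  /* UsageFault: divide by zero */
};

/* Store the names of all set flags in out[] (at most max entries),
 * in ascending bit order, and return how many were stored. */
static size_t decode_cfsr(uint32_t cfsr, const char *out[], size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0] && n < max; i++) {
        if (cfsr & cfsr_flags[i].mask) {
            out[n++] = cfsr_flags[i].name;
        }
    }
    return n;
}
```

A HardFault handler could capture the register value and pass it to a decoder like this, or the value can simply be inspected in the debugger's peripheral view.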

The second post sounds more like a setup problem. Perhaps the clock/Flash settings are not correct, and you operate the device beyond spec. That would explain the stochastic nature, i.e. some parts fail, some not.

Or, you have some hardware issue, probably related to the power supply.

dennis_frie · ‎03-12-2021

Well, that was exactly what I expected (a problem with a pointer or a stack overflow). However, the stack seemed to be well within its limit, and RAM usage shouldn't get close. A bad pointer could have explained it, but as the problem seems to happen even with just the ResetISR routine being executed, that leaves pretty much nothing to debug.

I guess a bad power-supply could explain it. It's supplied by a simple 3.3V LDO. It does output 3.3V, but I haven't checked with a scope to ensure everything looks good during its boot sequence.

frank_m · ‎03-12-2021

> I guess a bad power-supply could explain it. It's supplied by a simple 3.3V LDO. It does output 3.3V, but I haven't checked with a scope to ensure everything looks good during its boot sequence.

Proper buffering capacitors near the MCU are another requirement related to power supply.

However, I would not rule out setup issues. Very few manufacturers can run the Flash on their MCUs at full speed, so you will need to set up wait states depending on the clock frequency. Exceeding the limits here can create spurious behaviour like you observed, which differs between individual boards. I've got some experience with another vendor's Cortex-M MCUs in this regard.

dennis_frie · ‎03-12-2021

True, but as this error can be seen just after the ResetISR routine, the MCU should be running at a low clock speed, with flash wait states well within the accepted range. With a Microchip MCU I've seen problems with a slowly ramping supply voltage and the MCU clock settings being applied before the supply was within spec to run at max frequency, but in this case it already seems to happen before the clock change.

dennis_frie · ‎03-11-2021

Well, after days of debugging I finally tried another hardware prototype, and everything just works. I've never seen an MCU fail like this, but I guess there's a chance some memory parts have been ESD damaged or something similar? I can't quite explain it, but so far the problem seems to be related to a defective MCU.

Live and learn..

diego_charles · ‎03-11-2021

Hi @dennis_frie

This sounds interesting. As far as I understand, your variables reside in the .data section.

In the MCU where you had those errors, what happens with the variables placed in the .bss? Are they initialized with zeros as expected?
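One way to run the .bss check is sketched below, as a minimal host-runnable illustration. The helper name count_nonzero_words is hypothetical; on target the start address and length would come from the linker-provided .bss boundaries, and the check would have to run before any code writes to .bss (for example at a breakpoint on entry to main).

```c
#include <stddef.h>
#include <stdint.h>

/* Count the 32-bit words in a region that are not zero.
 * A correctly initialised .bss should give 0. len_bytes is assumed
 * to be a multiple of 4. */
static size_t count_nonzero_words(const volatile uint32_t *start, size_t len_bytes)
{
    size_t bad = 0;
    for (size_t i = 0; i < len_bytes / 4; i++) {
        if (start[i] != 0u) {
            bad++;
        }
    }
    return bad;
}
```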

Or, what if you place your variables in a custom area and test whether they are initialized properly? Here is a tutorial with examples on how to do this using MCUXpresso.

https://community.nxp.com/t5/Kinetis-Design-Studio-Knowledge/Relocating-Code-and-Data-Using-the-MCUX...

Diego.
