S32K324 lockup when accessing Flash bank 1

codefather · ‎01-31-2024

Hi,

I am running an S32K324 with a different application on each CM7 core. Recently I ran into a problem where, seemingly at random, both CM7 cores hang at the same time and I am unable to attach a debugger. If a debugger is already attached when the problem occurs, I simply lose the connection.

The problem appears only on a few software builds, and it takes anywhere from under an hour to several days to occur. It looks to me like a race condition or some other timing-dependent scenario. After a lot of troubleshooting I have concluded that the CM7_0 application is the likely cause; at least CM7_1 is not causing it.

I also managed to keep CM7_1 running after CM7_0 causes the crash by moving all the code and data used by CM7_1 into Program Flash block 0 (0x00400000-0x004FFFFF). I did this because I noticed that whenever the problem occurs I lose access to Flash block 1.
Accessing any address in Flash block 1 (0x00500000-0x005FFFFF) is fine before the problem happens. If I try to access Flash block 1 after CM7_0 has crashed, CM7_1 also crashes and loses its debugger connection. I can't regain the connection until I reset the MCU.
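
To check which program flash block a given symbol or access lands in, a small helper along these lines may be useful. It is only a sketch: the block base and the 1 MB block size come from the ranges quoted above, while the block count of 4 and the name `pflash_block_of` are assumptions made for illustration.

```c
#include <stdint.h>

/* Values taken from the address ranges quoted in the post. */
#define PFLASH_BASE        0x00400000u
#define PFLASH_BLOCK_SIZE  0x00100000u  /* 1 MB per block */
/* Assumption: four consecutive 1 MB blocks (0x00400000-0x007FFFFF). */
#define PFLASH_BLOCK_COUNT 4u

/* Hypothetical helper: returns the program flash block index for an
 * address, or -1 if the address is outside the assumed flash range. */
static int pflash_block_of(uint32_t addr)
{
    if (addr < PFLASH_BASE) {
        return -1;
    }
    uint32_t idx = (addr - PFLASH_BASE) / PFLASH_BLOCK_SIZE;
    return (idx < PFLASH_BLOCK_COUNT) ? (int)idx : -1;
}
```

Running the addresses from the linker map file through such a helper is one way to confirm that nothing used by CM7_1 still lives in block 1.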

When the crash happens I have tried to read out (from the CM7_1) various chip-level status registers in DCM, XBIC, PFlash etc but no luck so far, everything "seems" fine.

As I understood it AXBS has 3 slave ports (S0, S1, S4) that go to the PFlash controller, which in turn has 3 ports to access the memory blocks. Based on the information I could find in the reference manual I don't understand how I can lose access to Flash block 1 specifically. It seems to me like there is some deadlock either between AXBS and PFlash or between PFlash and the memory banks.

TLDR:
S32K324 CM7_0 randomly crashes, and after the crash CM7_1 is unable to access Flash block 1 (0x00500000-0x005FFFFF) without also crashing. The "crash" causes CM7_x to lose its debugger connection if one is already attached, and it is not possible to attach without resetting the chip.

Questions:

Has anyone else faced any similar issue or has any idea what may cause the strange behaviour of losing access to only Flash block 1? (and any attempt to access it causes a deadlock/lockup of the core)

If anyone has an idea of things to test or check for further narrowing down the problem I would appreciate it a lot!

Best regards

codefather · ‎02-08-2024

Time for some updates:

I managed to reproduce the issue several times with ETM tracing on, but unfortunately it didn't give me much to go on. The trace more or less "times out" as the MCU crashes.
I can see that the timestamp between each trace increases significantly just before it stops, not sure if this timestamp is relevant at all or just a result of the lost connection and reading out the last things that might've been buffered in the Lauterbach µTrace HW.

The last trace entry is usually around the same code line (±50 lines), so it is somewhat consistent, but there seems to be no actual issue in the code where the trace stops. Sometimes the trace ends by reporting that an IRQ was taken, for example IRQ108, which is EMAC_3_IRQn (doesn't make sense to me), or IRQ172, which doesn't map to anything on this chip (S32K324).

I did some other tests with interesting results. By performing an A/B swap I can alter the behaviour in the following way:
Originally I observed that after the crash occurs I can't access flash bank 1 (0x500000-0x5FFFFF) from CM7_1 without also causing that core to crash and locking out the debugger.
If I perform a bank swap, the same behaviour remains but shifts from bank 1 to bank 3:
I crash and lose the debugger connection when trying to access flash bank 3 (0x700000-0x7FFFFF).

I have also tried to reproduce the issue with different binaries and MPU settings. I managed to reproduce the bug reliably for a specific firmware, and then I found a specific memory range which seemed to make the issue go away. After some manual work with padding here and there in the code and modifying the binary file directly, I managed to produce two binaries: one always crashes within a few minutes, whereas the other never crashes (I have tried on multiple boards and for several days).
Now the interesting part: the only difference between the two binary files (the one that reliably crashes and the one that doesn't) is
the value of 1 byte at a specific memory address. Furthermore, this memory address is never used by the code; it holds code that sits in memory but is never executed.
I figured that something accidentally uses this memory, so I put an MPU region over this address, and of course the error goes away (but I don't get any memory faults).
If I change the MPU region to protect the area just before this memory address (which is also not used), then the code crashes again.

I really don't understand what is going on anymore, but I guess something more complex is happening which I'm unable to grasp (some weird result of speculative access, or just pipeline magic).

I'm almost at a dead end of troubleshooting this further as I have already exercised the "brute force" approach in several ways.

Would appreciate if anyone has more ideas or some valuable insight.

lukaszadrapa · ‎02-13-2024

Hi @codefather

I was thinking about this a lot but it's really not easy to track this down.

I would try:

- disable optimizations

- disable flash prefetching

- disable cache

...

Of course, if some of this "solves" the problem, it can be side effect only. But maybe it could move you forward. Maybe the trace will show better result etc...

Regards,

Lukas

codefather · ‎02-14-2024

Hi again Lukas,

I found some more things which point to speculative access to unmapped memory causing this behaviour.

Here are my observations:

In my previous post I mentioned that I found a memory address which alters the behaviour of the crash.
As I previously mentioned this address is not used by the code and just sits in memory. I found that configuring the memory address as device memory (not allowing speculative read) makes the crashing disappear.
Changing the memory attribute to normal (allowed speculative reads) causes the crash to appear again.

Based on this, my theory was that allowing speculative access to this memory area alters the speculative access pattern of the ARM core and causes the CPU to make another speculative access, to some other memory address, which is what actually causes the problem.

So I tried to add small MPU entries throughout the whole "unmapped" memory map until I found one that caused the issue to go away.
I went through the whole unmapped memory space and the only memory range which causes the crash to disappear and reappear is the region 0x1B100000-0x1B102000

If I configure the memory region above as device memory then the crash disappears, if I configure it as normal memory (allowing speculative access) then the crash reappears.
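
For reference, the register values for such a region can be sketched as below. This is not the code from my project: it only computes MPU_RBAR/MPU_RASR values following the ARMv7-M MPU bit layout, and the helper names, the region number 7 and the choice of "no access" permissions are assumptions for illustration. The range 0x1B100000-0x1B102000 is 8 KB, i.e. 2^13 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-M MPU_RASR bit positions. */
#define RASR_ENABLE      (1u << 0)
#define RASR_SIZE_SHIFT  1u
#define RASR_B           (1u << 16)
#define RASR_C           (1u << 17)
#define RASR_S           (1u << 18)
#define RASR_TEX_SHIFT   19u
#define RASR_AP_SHIFT    24u
#define RASR_XN          (1u << 28)

/* MPU_RBAR: VALID (bit 4) makes the REGION field (bits 3:0) take effect. */
#define RBAR_VALID       (1u << 4)

/* Hypothetical helper: SIZE field for a region of 2^log2_bytes bytes.
 * The architecture defines the region size as 2^(SIZE+1), minimum 32 bytes. */
static uint32_t rasr_size_field(uint32_t log2_bytes)
{
    assert(log2_bytes >= 5u && log2_bytes <= 32u);
    return (log2_bytes - 1u) << RASR_SIZE_SHIFT;
}

/* Hypothetical helper: RASR value for shareable Device memory
 * (TEX=0b000, C=0, B=1), execute-never, AP=0b000 (no access).
 * The access permission is an assumption; pick what your system needs. */
static uint32_t rasr_device_no_access(uint32_t log2_bytes)
{
    return RASR_XN
         | (0u << RASR_AP_SHIFT)
         | (0u << RASR_TEX_SHIFT)
         | RASR_S
         | RASR_B
         | rasr_size_field(log2_bytes)
         | RASR_ENABLE;
}

/* Hypothetical helper: RBAR value; the base must be aligned to the size. */
static uint32_t rbar_value(uint32_t base, uint32_t log2_bytes, uint32_t region)
{
    uint32_t mask = (log2_bytes >= 32u) ? 0xFFFFFFFFu
                                        : ((1u << log2_bytes) - 1u);
    assert((base & mask) == 0u);   /* region base must be size-aligned */
    assert(region < 16u);          /* REGION field is 4 bits wide */
    return base | RBAR_VALID | region;
}
```

On target, `rbar_value(0x1B100000u, 13u, 7u)` and `rasr_device_no_access(13u)` would be written to MPU_RBAR and MPU_RASR with the MPU disabled, followed by DSB/ISB after re-enabling it. Switching the region back to Normal memory for the comparison test means setting TEX/C/B accordingly instead of B alone.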

My conclusion is that (likely) a speculative access to somewhere in 0x1B100000-0x1B102000 causes the chip to misbehave and ultimately leads to a lock-up/deadlock of access to flash block 1 (0x500000-0x5FFFFF, or 0x700000-0x7FFFFF if A/B swapped), as we have observed previously.
Note that the range 0x1B100000-0x1B102000 is where it seems like the speculative read occurs in my case and maybe the "illegal" region is larger than that.

Looking at the MPU setup generated by S32DS, speculative access to this memory range is prevented by region 0, which covers the whole memory map, including the unmapped space. We did not set up this region in our own project. After adding it I have not been able to reproduce the issue.

Region: 0
Description: Whole memory map
Start: 0x0
End: 0xFFFFFFFF
Size[KB]: 4194304
Type: Strongly Ordered
Inner Cache Policy: None
Outer Cache Policy: None
Shareable: Yes
Executable: No
Privileged Access: No Access
Unprivileged Access: No Access
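
Why this background region matters can be illustrated with a small host-side model. It is a simplified sketch, not the S32DS-generated code: it assumes the ARMv7-M rule that the highest-numbered enabled region matching an address wins, that the default memory map applies when no region matches, and that speculative accesses are only made to Normal memory. The region table, the flash region at 0x00400000-0x007FFFFF and all names are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MEM_STRONGLY_ORDERED, MEM_DEVICE, MEM_NORMAL } mem_type_t;

typedef struct {
    uint32_t   base;     /* first address of the region */
    uint32_t   last;     /* last address of the region (inclusive) */
    mem_type_t type;
    bool       enabled;
} mpu_region_t;

/* Simplified ARMv7-M default memory map, selected by address bits [31:29].
 * Code and SRAM/RAM areas are Normal; peripheral/device areas are Device;
 * the system area is treated as Strongly Ordered here. */
static mem_type_t default_type(uint32_t addr)
{
    switch (addr >> 29) {
    case 2u: case 5u: case 6u: return MEM_DEVICE;
    case 7u:                   return MEM_STRONGLY_ORDERED;
    default:                   return MEM_NORMAL;
    }
}

/* The highest-numbered enabled region containing addr takes priority. */
static mem_type_t effective_type(const mpu_region_t *r, size_t n, uint32_t addr)
{
    for (size_t i = n; i-- > 0; ) {
        if (r[i].enabled && addr >= r[i].base && addr <= r[i].last) {
            return r[i].type;
        }
    }
    return default_type(addr);  /* no region hit: default map applies */
}

/* In this model only Normal memory may be accessed speculatively. */
static bool may_speculate(const mpu_region_t *r, size_t n, uint32_t addr)
{
    return effective_type(r, n, addr) == MEM_NORMAL;
}

/* Hypothetical two-region setup: region 0 is the background region from
 * the listing above, region 1 opens up program flash as Normal memory. */
static bool demo_may_speculate(bool background_enabled, uint32_t addr)
{
    const mpu_region_t table[] = {
        { 0x00000000u, 0xFFFFFFFFu, MEM_STRONGLY_ORDERED, background_enabled },
        { 0x00400000u, 0x007FFFFFu, MEM_NORMAL,           true },
    };
    return may_speculate(table, sizeof table / sizeof table[0], addr);
}
```

Without region 0, 0x1B100000 falls through to the default map, where everything below 0x20000000 is Normal memory, so the core is free to speculate there. With region 0 enabled, only addresses explicitly opened by a higher-numbered region remain Normal.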

Reading about speculative accesses it would kind of make sense to me based on the following statement:

"Addresses used by speculative accesses are not validated against the memory map of the device, and may attempt to also access non-existing memory regions or hardware elements having side effects."

Finally I tried to read from this memory range intentionally and in fact it causes the exact same behaviour as we have seen with the crashing. The core hangs and debugger connection is lost. Reading from other illegal memory ranges like 0xFFFFFFFF etc does not cause debugger disconnection, but instead results in usagefault or some other exception which seems more reasonable.
I narrowed it down and seems like the memory range that causes this behaviour is reading from addresses 0x1B100000-0x1B102000.

Do you see any reason for the unmapped memory space 0x1B100000-0x1B102000 to behave in this strange way?

Looking forward to hearing your feedback and thoughts.

Best regards

lukaszadrapa · ‎02-14-2024

Hi Simon,

very well done! I went through the ARM documentation and I found this in the Cortex M7 technical reference manual:

This confirms that MPU is the right solution for this. We should never touch reserved areas in the memory map, so default configuration in S32DS project makes sense.

Regards,

Lukas

codefather · ‎02-14-2024

Thanks for the support, glad we finally feel confident in the solution. Just out of curiosity: since the range that causes the issue is so small, 0x1B100000-0x1B102000 (just 8 KB), could it be that this range is used by SBAF or something else internally?

lukaszadrapa · ‎02-14-2024

I'm sorry but I can't really say, I do not have this information. I would have to ask design team for details and I don't think they would be willing to share it...

lukaszadrapa · ‎02-02-2024

Hi @codefather

not sure what's going on, I have never met anything similar.

Could you check if the flash read wait states are configured according to your core clock? It's in the CTL register in flash.

Do you program the flash at runtime?

Regards,

Lukas

codefather · ‎02-08-2024

Hi again Lukas,

In response to your previous message, I tried to check what happens if I write/erase in the same flash bank as I'm executing from.
As expected, it does not lock out the debugger the way the issue I'm seeing does.
If I do a read-while-write on the same flash memory I end up stuck in the HardFault handler; I guess if I somehow managed to erase the HardFault handler I would end up in the NMI handler.
If I also managed to break the NMI handler somehow, the core should go to the lockup state, if I understood it correctly. I really doubt this is the case.
There are also some registers in the S32K324 DCM to check if any CM7_x core is in lockup state and it is not the case for me.

Best regards

codefather · ‎02-02-2024

Hi Lukas, thanks for your response.

I have checked that RWSC = 4 both during startup and after the crash occurs (from CM7_1 core).

We only program the dataflash but I could try to read while writing to the same bank, however I don't expect it would cause a crash and lockout of debugger.

I am trying to get more info with ETM trace now but no luck yet in reproducing the issue while the trace is active.