During repetitive power cycling tests, we observed intermittent boot failures on the device. The issue has been preliminarily traced to a synchronous exception triggered during OS initialization, causing the system to hang. Details are as follows:
Failure Context:
The CPU is stuck at the synchronous exception vector entry, and the PC no longer advances.
PC, ELR_EL1, and FAR_EL1 all point to the exception entry address 0xFFFFFFFF802E6200 (VBAR_EL1 = 0xFFFFFFFF802E6000).
ESR_EL1 = 0x86000004 (EC=0x21: Instruction Abort taken without a change in Exception level; IFSC=0x04: Translation fault, level 0).
DDR memory content is unreadable via the Lauterbach debugger.
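For readers who want to double-check the ESR_EL1 reading above, here is a minimal host-side sketch that splits the value into its fields. The field positions (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, fault status code in the low 6 bits of ISS) follow the AArch64 ESR_ELx layout. The function name `decode_esr` and the lookup tables are only illustrative, and the tables cover just the few encodings relevant here.

```python
# Illustrative decoder for an AArch64 ESR_EL1 value (helper names are hypothetical).
# Layout: EC = bits [31:26], IL = bit [25], ISS = bits [24:0].

EC_NAMES = {
    0x20: "Instruction Abort from a lower Exception level",
    0x21: "Instruction Abort taken without a change in Exception level",
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}

# Upper four bits of the 6-bit fault status code; the low two bits give the level.
FAULT_KINDS = {
    0b0000: "Address size fault",
    0b0001: "Translation fault",
    0b0010: "Access flag fault",
    0b0011: "Permission fault",
}


def decode_esr(esr):
    """Split ESR_EL1 into EC, IL, ISS and a readable fault status string."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    fsc = iss & 0x3F  # only meaningful for instruction/data abort classes
    kind = FAULT_KINDS.get(fsc >> 2)
    if kind is not None:
        fsc_name = f"{kind}, level {fsc & 0x3}"
    else:
        fsc_name = f"other status code 0x{fsc:02X}"
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, f"other class 0x{ec:02X}"),
        "il": il,
        "iss": iss,
        "fsc": fsc,
        "fsc_name": fsc_name,
    }


if __name__ == "__main__":
    fields = decode_esr(0x86000004)  # value quoted in the report
    for key, value in fields.items():
        print(f"{key}: {value}")
```

Running this on the reported value reproduces the interpretation given in the post: an instruction abort taken at the current Exception level, caused by a level 0 translation fault.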
Hypothesis:
The CPU enters a deadlock when attempting to jump to the synchronous exception handler (VBAR_EL1 + 0x200), likely due to invalid page table mappings for this address, causing recursive exceptions.
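The hypothesis can be cross-checked against the register dump with a small script. The sketch below maps a PC to its slot in the AArch64 vector table (four groups of four 0x80-byte entries starting at VBAR_EL1) and then tests whether the dump matches the signature of a fault on the vector entry itself: an instruction abort at the current EL with ELR_EL1 and FAR_EL1 both equal to the vector address. The function names are hypothetical, and the check is a heuristic rather than proof. Note that if this pattern holds, ELR_EL1, ESR_EL1, and FAR_EL1 describe the latest (nested) exception, so the address of the first faulting instruction has most likely been overwritten.

```python
# Heuristic check for a fault on the exception vector itself (names are illustrative).

VECTOR_GROUPS = [
    "Current EL with SP_EL0",
    "Current EL with SP_ELx",
    "Lower EL using AArch64",
    "Lower EL using AArch32",
]
VECTOR_KINDS = ["Synchronous", "IRQ/vIRQ", "FIQ/vFIQ", "SError/vSError"]


def classify_vector(pc, vbar):
    """Return (offset, group, kind) if pc lies inside the vector table, else None."""
    offset = pc - vbar
    if offset < 0 or offset >= 0x800:
        return None
    group = VECTOR_GROUPS[offset // 0x200]          # each group spans 0x200 bytes
    kind = VECTOR_KINDS[(offset % 0x200) // 0x80]   # each entry spans 0x80 bytes
    return offset, group, kind


def looks_like_vector_fetch_fault(pc, elr, far, vbar, esr):
    """True if the dump is consistent with faulting while fetching the sync vector."""
    ec = (esr >> 26) & 0x3F
    slot = classify_vector(pc, vbar)
    return (
        slot is not None
        and slot[2] == "Synchronous"
        and ec == 0x21            # instruction abort, same Exception level
        and elr == pc             # return address is the vector entry itself
        and far == pc             # faulting address is the vector entry itself
    )


if __name__ == "__main__":
    # Values quoted in the report.
    vbar = 0xFFFFFFFF802E6000
    pc = elr = far = 0xFFFFFFFF802E6200
    esr = 0x86000004
    print(classify_vector(pc, vbar))
    print(looks_like_vector_fetch_fault(pc, elr, far, vbar, esr))
```

With the reported values, the PC falls in the "Current EL with SP_ELx, Synchronous" slot and the heuristic returns true, which is consistent with the recursive-exception hypothesis.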
Open Questions:
How can we identify the original exception trigger point (the initial faulting instruction)?
Why is DDR memory inaccessible via the debugger during this state?
Hello, @zyz
Thanks for your reply.
From your description, the issue does not seem to be related to the BSP; it looks more like a debugging-phase problem in your own OS. I am sorry, but it is difficult for us to analyze it without the code or a reproducible setup on our end. Since it was found while testing with VxWorks, I suggest also consulting Wind River for tips on the analysis.
I apologize for the inconvenience.
BR
Chenyin
Hello, @zyz
Thanks for your post.
Would you mind sharing more details about the background and the steps to trigger the issue?
Is the test done on a custom board or an RDB/EVB? With S32G2 or S32G3?
The test seems to be done on the A53 side. Is it based on the BSP? If so, which version?
You mentioned a synchronous exception triggered during OS initialization. Is the OS here Linux from the BSP, or something else?
If the test is based on the BSP, have you made any modifications on your side?
BR
Chenyin
Hi chenyin_h,
Thank you for your reply.
We tested this on our self-developed board with our own OS (VxWorks).
At this preliminary stage, it appears that the issue lies within our initialization code.
Currently, we are narrowing down the problem by adding debug prints. However, since the issue has a low reproduction rate and involves extensive code, we would like to ask how to pinpoint the source of such anomalies more efficiently.