Unexpected CSU-Reset on i.MX 93

cstoidner · ‎06-19-2024

Hello,

we are using the i.MX93, mounted on our own custom base board. Everything works fine, but on some boards we see unexpected reboots, as described below:

   - when powering up the board, everything boots (u-boot -> kernel -> systemd)
   - shortly before the boot process is finished (some milliseconds before the linux user-login appears),
      the system reboots
   - the reboot occurs directly without any delay (seems not to be the watchdog)
   - during this unexpected reboot the u-boot shows "Reset cause: CSU (0x200)" and continues
   - after u-boot the kernel and systemd boot successfully and the system is running fine

That unexpected reboot happens exactly once after every power-up.
After that unexpected reboot the system seems to be healed and works fine and stable.
Also a software reboot ("reboot" command on the linux bash) works as expected.

As far as I can see in the i.MX 93 reference manual, CSU means "Central Security Unit", and the "Reset cause: CSU" shown by u-boot comes from the SoC's SRSR register. For the CSU bit of the SRSR register, the reference manual states: "Indicates whether the reset was the result of the csu_reset_b input.".
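As a small illustration of how the reported value maps to a register bit: 0x200 has only bit 9 set. The sketch below assumes, based solely on the u-boot message quoted above, that bit 9 of SRSR is the CSU flag. The constant name is made up for this example, and all other bit assignments would have to be taken from the reference manual.

```python
# Assumption: bit 9 is the CSU flag, inferred only from the u-boot line
# "Reset cause: CSU (0x200)". The name below is hypothetical, not an NXP define.
SRSR_CSU_BIT = 9


def set_bits(value: int) -> list[int]:
    """Return the positions of all bits set in a 32-bit register value."""
    return [bit for bit in range(32) if value & (1 << bit)]


def is_csu_reset(srsr: int) -> bool:
    """True if the assumed CSU flag is set in the given SRSR value."""
    return bool(srsr & (1 << SRSR_CSU_BIT))


print(set_bits(0x200))      # [9]
print(is_csu_reset(0x200))  # True
```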

However, I cannot find more information about that "csu_reset_b input".

So my question is: What can lead to that kind of reset? What is that "csu_reset_b input"?

Just for note: We are using NXP BSP "Linux 6.1.55_2.2.0".

Thanks in advance for any help or information!

Regards,
Christoph

jameswalmsley-cpi · ‎08-12-2024

Hi @cstoidner

Which revision or type of i.MX93 are you using?
I've also encountered the CSU error, and am trying to debug and understand what is happening.

I've just tried the exact same bit-for-bit image, kernel dtb etc on the dual-core variant and the CSU goes away. On the single-core variant the CSU error gets triggered.

I've also tested multiple versions of imx Yocto with the same result.

Did you get any further with your issue?

Kind regards,

James

AldoG · ‎06-20-2024

Hello,

As the reference manual specifies, and as you have already read, csu_reset_b asserts when a security violation happens.

It is named as an input, but that is just a way of saying that it is triggered internally by a flag that is set when a security violation happens. Also, please note that if CSU_TRIG_MODE=1, the system resets once, even if the reset source remains asserted.

Best regards/Saludos,
Aldo.

cstoidner · ‎06-21-2024

Hi Aldo,

thanks for your fast response!

That "CSU_TRIG_MODE=1" explains the behavior we see.

However, it is not clear to me what kind of "security violation" could lead to the "csu_reset_b". For usual software violations (e.g. accessing illegal memory regions) I would expect a kernel-level exception that kills just the offending process. A similar illegal access inside the kernel code would lead to a kernel panic.

Do you have any more hints or information that would help us identify the cause of the "security violation"?

Thanks and regards,

Christoph

AldoG · ‎06-24-2024

Hello,

These violations are usually handled by the secure system. For example, the ELE-owned TRDCs are protected and owned by ELE, and any access to them will trigger a security violation.

Please refer to the security reference manual.

Best regards/Saludos,
Aldo.

cstoidner · ‎07-01-2024

Hello Aldo,

following the documentation in the security reference manual, the cause of the violation must be the Cortex-A55, the Cortex-M33, or any other bus master (e.g. a DMA controller) that accesses an illegal memory/peripheral area. Did I get that right?

In our case the M33 has no SW-image and is not started. So I assume it cannot be the cause for the violation.

Then all that's left are the A55 and DMA.

Is there any chance to gather any more information about the violation that leads to the reboot? E.g.

   - who caused the violation (e.g. which bus master)
   - what kind of violation occurred (e.g. which address was accessed? read or write? ...)

Or do you have another hint on how we can get to the bottom of the problem?

Thanks in Advance for your support,
Christoph

PS: Last but not least, below you can find some information on how we configure the i.MX EdgeLock Enclave:

In our device tree we have the mailbox definition "s4muap" for the "i.MX Messaging Unit" from the standard "imx93.dtsi":

 s4muap: mailbox@47520000 {
         compatible = "fsl,imx93-mu-s4";
         reg = <0x47520000 0x10000>;
         interrupts = <GIC_SPI 31 IRQ_TYPE_LEVEL_HIGH>,
                      <GIC_SPI 30 IRQ_TYPE_LEVEL_HIGH>;
         interrupt-names = "tx", "rx";
         #mbox-cells = <2>;
 };

...

 ele_mu: ele-mu {
         compatible = "fsl,imx93-ele";
         mboxes = <&s4muap 0 0 &s4muap 1 0>;
         mbox-names = "tx", "rx";
         fsl,ele_mu_did = <3>;
         fsl,ele_mu_id = <2>;
         fsl,ele_mu_max_users = <4>;
         status = "okay";
 };

And the "ele_reserved" memory area from NXP commit
"1cef27794d6d LF-8071-1 arm64: dts: imx93: use a reserved mem-ranges to constrain ele-mu dma-range":


  reserved_memory: reserved-memory {
         ranges;
         #address-cells = <2>;
         #size-cells = <2>;

         ...

         ele_reserved: ele-reserved@a4120000 {
                 compatible = "shared-dma-pool";
                 reg = <0 0xa4120000 0 0x100000>;
                 no-map;
         };
  };

&ele_mu {
        memory-region = <&ele_reserved>;
};

AldoG · ‎07-05-2024

Hello,

The TRDC's Memory Region Controller (MRC) provides domain-based hardware access control for all system bus references targeted at non-peripheral memory spaces.

For each region descriptor hit, the MRC logic evaluates the access rights defined by the MRCm_DOMd_RGDr_Ww registers.
Specifically, the domainID attribute selects the appropriate MRCm_DOMd_RGDr_Ww register to use in the access evaluation.

There is an access error for three conditions:
1. If the access does not hit in any region descriptor, an access error is reported.

2. If the access hits in a single region descriptor and that region signals a domain violation, then an access error is reported.

3. If the access hits in multiple (overlapping) regions and one region signals a violation, then an access error is reported.

The third condition reflects that priority is given to access denying over access allowing for overlapping regions. Unimplemented domain IDs (DIDs) do not have any associated region descriptors and therefore have no access rights.

All this information is available in the i.MX93 Reference Manual, Chapter 22
"Trusted Resource Domain Controller (TRDC)", more specifically in section 22.3.6.2 "Memory region access evaluation".
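To make the three error conditions concrete, here is a minimal Python model of the evaluation rule. It is only a sketch: it reduces each region descriptor to an address range plus a set of allowed domain IDs, whereas the real MRCm_DOMd_RGDr_Ww registers encode finer-grained access rights. The `Region` type, the function name, and the example addresses are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Simplified region descriptor (hypothetical model, not the register layout)."""
    start: int            # first address covered by the descriptor
    end: int              # last address covered (inclusive)
    allowed_domains: set  # domain IDs that are allowed to access the region


def access_error(regions: list, addr: int, domain_id: int) -> bool:
    """Return True if the access would be reported as an access error."""
    hits = [r for r in regions if r.start <= addr <= r.end]
    if not hits:
        return True  # condition 1: no region descriptor is hit
    # Conditions 2 and 3: a single denying region is enough, so denial
    # takes priority over permission when regions overlap.
    return any(domain_id not in r.allowed_domains for r in hits)


regions = [
    Region(0x1000, 0x1FFF, {0, 3}),
    Region(0x1800, 0x27FF, {3}),
]
print(access_error(regions, 0x0800, 3))  # True:  no descriptor hit
print(access_error(regions, 0x1000, 0))  # False: single hit, domain allowed
print(access_error(regions, 0x1900, 0))  # True:  overlap, one region denies
print(access_error(regions, 0x1900, 3))  # False: overlap, both regions allow
```

A domain ID that appears in no descriptor (an unimplemented DID in this simplified model) is denied on every hit, which matches the last sentence of the description above.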

Best regards/Saludos,
Aldo.

cstoidner · ‎07-11-2024

Hello,

Ok, I can see in the reference manual there is the "domain error capture management", in section "22.6.3".
And as far as I understand, it captures for each violation the triggering address and some attribute information.
It seems to me that this is what I was asking for in my questions above.

However, I couldn't find any software for that feature.
Is there already a software implementation for this "domain error capture management"? And is that the way to read out all information about a violating access after a violation has been detected?

Recall, my goal was to determine (1) who did the violating access and (2) what exactly was the violating access. All this is to identify and solve the reason for the "CSU reset" we see on some of our boards.

Thanks,
Christoph

cstoidner · ‎07-11-2024

> Recall, my goal was to determine (1) who did the violating access and
> (2) what exactly was the violating access. All this is to identify and solve
> the reason for the "CSU reset" we see on some of our boards.

To be more precise, I want to know: who accessed what, such that it led to the CSU reboot we see on our boards?

AldoG · ‎07-18-2024

Hello @cstoidner,

Please help me with the following information:
What is the interval between board power-up and the unexpected reboot?
Could you share the log of the unexpected reboot?

Best regards/Saludos,
Aldo.

cstoidner · ‎07-26-2024

Hello Aldo,

I now have log files with timestamps that show the interval between board power-up and the unexpected reboot.

To make sure that we did not accidentally look at a rare outlier, we captured four boot-ups. I attached the logs of the four boot-ups to this message.

To calculate the interval times, I looked at three points in the log:
1) The 1st line of the SPL after power-up
2) The last line of the userland before the unexpected reboot occurred
3) The 1st line of the SPL after the unexpected reboot

Here are the results:

1. boot log ("nash-bootlog0.txt")

1st line of SPL:              [2024-07-24 07:26:28.948] SOC: 0xa1009300
last line of userland:        [2024-07-24 07:26:38.425] [ 2.403166] mmcblk1: p1 p2
1st line of SPL after reboot: [2024-07-24 07:26:40.209] U-Boot SPL 2023.04 (May 29 2024 - 06:11:51 +0000)
=================================================
=> "1st SPL" to "last userland": 9s 477ms
=> "1st SPL" to "1st SPL after reboot": 11s 261ms

2. boot log ("nash-bootlog1.txt")

1st line of SPL:              [2024-07-24 07:29:09.805] SOC: 0xa1009300
last line of userland:        [2024-07-24 07:29:19.249] [ 2.395341] mmcblk1: p1 p2
1st line of SPL after reboot: [2024-07-24 07:29:19.443] U-Boot SPL 2023.04 (May 29 2024 - 06:11:51 +0000)
=================================================
=> "1st SPL" to "last userland": 9s 444ms 
=> "1st SPL" to "1st SPL after reboot": 9s 638ms

3. boot log ("nash-bootlog2.txt")

1st line of SPL:              [2024-07-24 07:30:50.419] SOC: 0xa1009300
last line of userland:        [2024-07-24 07:31:11.814] phyboard-nash-imx93-1 login: 
1st line of SPL after reboot: [2024-07-24 07:31:12.160] U-Boot SPL 2023.04 (May 29 2024 - 06:11:51 +0000)
=================================================
=> "1st SPL" to "last userland": 21s 395ms 
=> "1st SPL" to "1st SPL after reboot": 21s 741ms

4. boot log ("nash-bootlog3.txt")

1st line of SPL:              [2024-07-24 07:33:23.550] SOC: 0xa1009300
last line of userland:        [2024-07-24 07:33:40.306] Starting Weston, a Wayland compositor, as a system service...
1st line of SPL after reboot: [2024-07-24 07:33:40.545] U-Boot SPL 2023.04 (May 29 2024 - 06:11:51 +0000)
=================================================
=> "1st SPL" to "last userland": 16s 756ms 
=> "1st SPL" to "1st SPL after reboot": 16s 995ms
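For anyone who wants to recompute these intervals from the raw timestamps, a short Python sketch follows. It assumes the bracketed host-side timestamp format shown in the logs above; the function name is made up for this example.

```python
from datetime import datetime, timedelta

# Format of the host-side timestamps in the captured logs.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def interval(start: str, end: str) -> str:
    """Format the time between two log timestamps as 'Ns Mms'."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_ms = delta // timedelta(milliseconds=1)  # whole milliseconds
    return f"{total_ms // 1000}s {total_ms % 1000}ms"


# Boot log 1: first SPL line, last userland line, first SPL line after reboot.
first_spl = "2024-07-24 07:26:28.948"
last_userland = "2024-07-24 07:26:38.425"
spl_after_reboot = "2024-07-24 07:26:40.209"

print(interval(first_spl, last_userland))     # 9s 477ms
print(interval(first_spl, spl_after_reboot))  # 11s 261ms
```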

Does that information help you to isolate the problem?
If you would like to see more detail, please have a look at the attached log files.

And if you need any more information, do not hesitate to contact me.

Thanks and regards,
Christoph

AldoG · ‎07-29-2024

Hello,

One more thing I would like you to confirm: did the issue happen on a custom board or on the EVK?
From the logs you shared, it seems that you're using your own design. Is this correct?

To confirm the CSU reset, can you try to set CSU_MASK in the SRMASK register (address 0x44460018) and see if the board still resets? The register can only be accessed from the secure world, so it needs to be set in ATF or via JTAG.

To narrow down the cause of the CSU reset, can you additionally try masking the ELE reset sources in the ELE_RESET_REQ_MASK register (address 0x444F0130) and see which masked bit prevents the board from resetting? This register also has to be accessed from the secure world.
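The bit-by-bit search could be organised as in the Python sketch below. It only plans and records the experiment; the register writes themselves still have to be done from the secure world (ATF or JTAG). Two points are assumptions and should be checked against the reference manual: that the register is 32 bits wide, and that a set bit masks the corresponding reset source. The function names and the callback are hypothetical.

```python
REG_WIDTH = 32  # assumption: ELE_RESET_REQ_MASK is a 32-bit register


def single_bit_masks(width: int = REG_WIDTH) -> list[int]:
    """One candidate register value per bit, each masking exactly one source."""
    return [1 << bit for bit in range(width)]


def find_masking_bits(still_resets, width: int = REG_WIDTH) -> list[int]:
    """Return the bit positions whose mask value prevents the reset.

    still_resets(value) is a user-supplied callback: write `value` to the
    mask register from the secure world, power-cycle the board, and return
    True if the unexpected reset still occurs.
    """
    return [bit for bit, value in enumerate(single_bit_masks(width))
            if not still_resets(value)]


def fake_board(value: int) -> bool:
    """Stand-in for real hardware: pretends source bit 5 causes the reset."""
    return not (value & (1 << 5))


print(find_masking_bits(fake_board))  # [5]
```

On real hardware the callback would be a manual step (set the value, power-cycle, note the result), and bit 5 here is arbitrary and chosen only to make the demo run.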

Thank you for your patience,
Best regards/Saludos,
Aldo.

cstoidner · ‎08-01-2024

Hello,

> One more thing I would like you to confirm: did the issue happen
> on a custom board or on the EVK?
> From the logs you shared, it seems that you're using your own design. Is this correct?

You are right, we are using our own custom-designed board. I did not see the issue on the NXP EVK. But it doesn't appear on all of our custom boards either. We have one custom board design, and the issue appears on about 5 to 10 out of 100 produced boards. On the boards where it appears, it is not sporadic but occurs absolutely reliably (on each first boot after power-up). Just to note: as you mentioned above, CSU_TRIG_MODE=1 probably prevents the reset on further boots.

> To confirm the CSU reset, can you try to set CSU_MASK in the SRMASK register [...]
> To narrow down the cause of the CSU reset, can you additionally try masking the ELE reset sources in the ELE_RESET_REQ_MASK register [...]

I will do that. Just one more question about that: if I set the registers via JTAG, is it enough to set them in the context of u-boot? I mean: (1) stop u-boot with JTAG, (2) set the registers with JTAG, (3) continue u-boot execution with JTAG. Or is it too late once u-boot has started, since ATF may have already run?

Thanks and regards,
Christoph

AldoG · ‎08-13-2024

Hello,

Sorry for the delayed response. The team suspects a possible hardware issue, so please create a support case so we can review your schematic, or, if it's public, please share it here.

Regarding setting the registers in the context of u-boot: that should be fine. Please try it and access the registers via JTAG in secure mode.

Best regards/Saludos,
Aldo.

Feldmann

Hi @AldoG,

sorry for the delayed response too. I have further analysed the CSU reboot problem and investigated it on the hardware side.

We can now narrow down the problem. The CSU reboot only occurs on modules in the test adapter and under very specific circumstances. The CSU reboot behaviour has something to do with the connection of the test adapter to the XTAL_IN and XTAL_OUT signals of the PMIC PCA9451A.
We have not connected an external crystal to these two pins on our module. For this reason, as described in the data sheet, we connect XTAL_IN to GND via a 100 kOhm resistor and leave XTAL_OUT floating.
These signals are also routed to the test adapter to check the connection. For test purposes, we disconnected the connection from the test adapter to XTAL_IN and did not see a reboot.
I would be happy to send you extracts of our circuit diagrams via a private channel or e-mail.
Do you have an explanation as to why this CSU reboot occurs in connection with the XTAL_IN and XTAL_OUT signals of the PMIC? Is such a problem known?
In the data sheet, the description of XTAL_IN states: "32.768 kHz crystal oscillator input, tie to GND if Xtal is not used."
What does "tie to GND" mean in this case? Does this signal have to be connected directly (hard) to GND, or should this be done via a pull-down resistor?

Best regards

Benedikt

cstoidner · ‎07-19-2024

Hi Aldo,

> What is the interval between board power-up and the unexpected reboot?
> Could you share the log of the unexpected reboot?

I asked my colleague who can reproduce the problem to capture a log with timestamps. As soon as I have the log available, I will give you the interval and share the log with you.

Regards,
Christoph

AldoG · ‎07-16-2024

Hello,

Indeed, I see your point. I'm checking internally whether there is something that is not documented, or whether some part of the code may be causing the reset on your design.

Best regards/Saludos,
Aldo.