Unexpected CSU-Reset on i.MX 93

cstoidner

Hello,

We are using the i.MX 93, mounted on our own custom base board. Everything works fine, but on some boards we see unexpected reboots, as described below:

   - when powering up the board, everything boots (U-Boot -> kernel -> systemd)
   - shortly before the boot process finishes (a few milliseconds before the Linux login prompt appears),
      the system reboots
   - the reboot occurs immediately, without any delay (so it does not seem to be the watchdog)
   - during this unexpected reboot, U-Boot shows "Reset cause: CSU (0x200)" and continues
   - after U-Boot, the kernel and systemd boot successfully and the system runs fine

This unexpected reboot happens exactly once after every power-up.
After that reboot the system seems to be healed and runs stably.
A software reboot (the "reboot" command in the Linux shell) also works as expected.

As I can see in the i.MX 93 reference manual, CSU means "Central Security Unit", and the "Reset cause: CSU" shown by U-Boot comes from the SoC's SRSR register. For the CSU bit of the SRSR register, the reference manual states: "Indicates whether the reset was the result of the csu_reset_b input."
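
To illustrate how the value printed by U-Boot maps to a single status bit, here is a minimal Python sketch. Only the CSU position is taken from this thread (0x200 is bit 9). The table and the function name are hypothetical, and the names of the other SRSR bits would have to be filled in from the reference manual.

```python
# Hypothetical lookup table: only bit 9 -> "CSU" is grounded in the
# "Reset cause: CSU (0x200)" message quoted above (0x200 == 1 << 9).
KNOWN_BITS = {9: "CSU"}


def decode_reset_cause(srsr):
    """Return a label for every bit that is set in a 32-bit SRSR value."""
    labels = []
    for bit in range(32):
        if srsr & (1 << bit):
            # Bits not listed in KNOWN_BITS are reported by position only.
            labels.append(KNOWN_BITS.get(bit, "bit%d" % bit))
    return labels


print(decode_reset_cause(0x200))  # ['CSU']
```

If several bits are set at once, every one of them is listed, which helps to see whether the CSU reset is really the only recorded cause.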

However, I cannot find more information about that "csu_reset_b input".

So my question is: What can lead to that kind of reset? What is that "csu_reset_b input"?

For reference: We are using NXP BSP "Linux 6.1.55_2.2.0".

Thanks in advance for any help or information!

Regards,
Christoph

AldoG

Hello,

It is as specified in the reference manual, as you have already read: csu_reset_b asserts when a security violation happens.

It is named as an input, but that is just a way of saying that it is triggered internally by a flag that is set when a security violation happens. Also, please note that when CSU_TRIG_MODE=1, the system resets only once, even if the reset source remains asserted.

Best regards/Saludos,
Aldo.

cstoidner

Hi Aldo,

thanks for your fast response!

That "CSU_TRIG_MODE=1" explains the behavior we see.

However, it is not clear to me what kind of "security violation" could lead to the "csu_reset_b". For usual software violations (e.g. accessing illegal memory regions) I would expect an exception at kernel level that kills only the offending process. A similar illegal access inside the kernel code would lead to a kernel panic.

Do you have any more hints or information that would help us identify the reason for the "security violation"?

Thanks and regards,

Christoph

AldoG

Hello,

These violations are usually handled by the secure subsystem. For example, the ELE-owned TRDCs are protected by ELE, and any access to them will trigger a security violation.

Please refer to the security reference manual.

Best regards/Saludos,
Aldo.

cstoidner

Hello Aldo,

Following the documentation in the security reference manual, the cause of the violation must be the Cortex-A55, the Cortex-M33, or any other bus master (e.g. a DMA controller) that accesses an illegal memory or peripheral area. Did I get that right?

In our case the M33 has no software image and is not started, so I assume it cannot be the cause of the violation.

Then all that's left are the A55 and DMA.

Is there any chance to gather more information about the violation that leads to the reboot? E.g.

   - who caused the violation (e.g. which bus master)
   - what kind of violation occurred (e.g. which address was accessed, read or write, ...)

Or do you have another hint on how we can get to the bottom of the problem?

Thanks in advance for your support,
Christoph

PS: Last but not least, below you can find some information on how we configure the i.MX EdgeLock Enclave:

In our device tree we have the mailbox definition "s4muap" for the "i.MX Messaging Unit" from the standard "imx93.dtsi":

 s4muap: mailbox@47520000 {
         compatible = "fsl,imx93-mu-s4";
         reg = <0x47520000 0x10000>;
         interrupts = <GIC_SPI 31 IRQ_TYPE_LEVEL_HIGH>,
                      <GIC_SPI 30 IRQ_TYPE_LEVEL_HIGH>;
         interrupt-names = "tx", "rx";
         #mbox-cells = <2>;
 };

...

 ele_mu: ele-mu {
         compatible = "fsl,imx93-ele";
         mboxes = <&s4muap 0 0 &s4muap 1 0>;
         mbox-names = "tx", "rx";
         fsl,ele_mu_did = <3>;
         fsl,ele_mu_id = <2>;
         fsl,ele_mu_max_users = <4>;
         status = "okay";
 };

And the "ele_reserved" memory area from NXP commit
"1cef27794d6d LF-8071-1 arm64: dts: imx93: use a reserved mem-ranges to constrain ele-mu dma-range":


  reserved_memory: reserved-memory {
         ranges;
         #address-cells = <2>;
         #size-cells = <2>;

         ...

         ele_reserved: ele-reserved@a4120000 {
                 compatible = "shared-dma-pool";
                 reg = <0 0xa4120000 0 0x100000>;
                 no-map;
         };
  };

 &ele_mu {
         memory-region = <&ele_reserved>;
 };
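
For readers less familiar with the cell encoding, the following sketch shows how the "reg" property of "ele_reserved" above turns into a base address and a size, assuming 32-bit cells combined most-significant first, as set by #address-cells and #size-cells. The helper name decode_reg is made up for this illustration.

```python
def decode_reg(cells, address_cells=2, size_cells=2):
    """Combine 32-bit device-tree cells into a (base, size) pair."""

    def join(words):
        # Cells are concatenated most-significant word first.
        value = 0
        for word in words:
            value = (value << 32) | word
        return value

    base = join(cells[:address_cells])
    size = join(cells[address_cells:address_cells + size_cells])
    return base, size


# reg = <0 0xa4120000 0 0x100000> from the ele_reserved node
base, size = decode_reg([0, 0xA4120000, 0, 0x100000])
print(hex(base), hex(base + size - 1))  # 0xa4120000 0xa421ffff
```

So the reserved pool spans 1 MiB, from 0xa4120000 up to and including 0xa421ffff.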

AldoG

Hello,

The TRDC's Memory Region Controller (MRC) provides domain-based hardware access control for all system bus references targeted at non-peripheral memory spaces.

For each region descriptor hit, the MRC logic evaluates the access rights defined by the MRCm_DOMd_RGDr_Ww registers.
Specifically, the domainID attribute selects the appropriate MRCm_DOMd_RGDr_Ww register to use in the access evaluation.

An access error is reported under three conditions:
1. If the access does not hit in any region descriptor, an access error is reported.
2. If the access hits in a single region descriptor and that region signals a domain violation, then an access error is reported.
3. If the access hits in multiple (overlapping) regions and one region signals a violation, then an access error is reported.

The third condition reflects that priority is given to access denying over access allowing for overlapping regions. Unimplemented domain IDs (DIDs) do not have any associated region descriptors and therefore have no access rights.

All this information is available in the i.MX 93 Reference Manual, Chapter 22,
Trusted Resource Domain Controller (TRDC), more specifically in 22.3.6.2, Memory region access evaluation.
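
The three conditions can be sketched as a small software model. This is only an illustration of the evaluation rule described above, not the hardware register layout: the Region class, its "allowed" map from domain ID to permitted operations, and the example addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int     # first address covered (inclusive)
    end: int       # last address covered (inclusive)
    allowed: dict  # hypothetical: domain ID -> set of permitted ops ("r", "w")


def access_error(regions, did, addr, op):
    """Return True if the access would be reported as an access error."""
    hits = [r for r in regions if r.start <= addr <= r.end]
    if not hits:
        return True  # condition 1: no region descriptor hit
    # Conditions 2 and 3: denial wins, so one violating region is enough.
    # A domain ID without an entry has no access rights at all.
    return any(op not in r.allowed.get(did, set()) for r in hits)


regions = [
    Region(0x1000, 0x1FFF, {3: {"r", "w"}}),
    Region(0x1800, 0x2FFF, {3: {"r"}}),
]
print(access_error(regions, 3, 0x1900, "w"))  # True: overlapping region denies the write
```

In this model a write from domain 3 to 0x1900 fails even though the first region would allow it, which mirrors the priority of denying over allowing for overlapping regions.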

Best regards/Saludos,
Aldo.

cstoidner

Hello,

OK, I can see that the reference manual describes the "domain error capture management" in section "22.6.3".
As far as I understand, it captures the triggering address and some attribute information for each violation.
It seems to me that this is what I was looking for in my questions above.

However, I couldn't find any software for that feature.
Is there already a software implementation for this "domain error capture management"? And is that the way to read out all information about a violating access after a violation has been detected?

Recall, my goal was to determine (1) who did the violating access and (2) what exactly was the violating access. All this is to identify and solve the reason for the "CSU reset" we see on some of our boards.

Thanks,
Christoph

cstoidner

> Recall, my goal was to determine (1) who did the violating access and
> (2) what exactly was the violating access. All this is to identify and solve
> the reason for the "CSU reset" we see on some of our boards.

To be more precise, I want to know: who accessed what, such that it led to the CSU reboot we see on our boards?

AldoG

Hello @cstoidner,

Please help me with the following information:
What is the time interval between the board power-up and the unexpected reboot?
Could you share the log from the unexpected reboot?

Best regards/Saludos,
Aldo.

cstoidner

Hi Aldo,

> What is the time interval between the board power-up and the unexpected reboot?
> Could you share the log from the unexpected reboot?

I asked my colleague who can reproduce the problem to capture a log with timestamps. As soon as I have the log, I will give you the time interval and share the log with you.

Regards,
Christoph

AldoG

Hello,

Indeed, I see your point. I'm checking internally whether there is something that is not documented, or whether some part of the code may be causing the reset on your design.

Best regards/Saludos,
Aldo.