SCHECK/BIST

HeebeomPark

I have a question regarding the following software assumption from the Safety Manual:

Software Assumption:
During boot, if the master safety core determines that a detected fault is permanent, the application software either initiates a reset or stops using the functional region with the permanent fault and enters a degraded mode.

Rationale:
Increases application availability in the case of a permanent fault.

We are trying to interpret the rationale:
Does "increases application availability in the case of a permanent fault" imply that the faulty region should be disabled after reset, allowing the system to continue in degraded mode?

In our implementation, we initially used a counter to track transient faults after executing SCheck or BIST. However, based on feedback from your application engineer, it seems that once a BIST detects a fault, it is considered permanent, and therefore a counter may not be necessary.

Could you please confirm:

Is it correct that faults detected by BIST are always considered permanent?
In such cases, is it unnecessary to use a counter to track fault occurrences?
Is the intended behavior to disable the faulty region and continue in degraded mode without further fault tracking?

Also, we noted the recommendation to use the S32 Safety Software Framework (ModeSelector module). Could you clarify how this module supports the handling of permanent faults and degraded mode operation?

Thank you for your support.

RadoslavB

Hello @HeebeomPark ,

BIST detects permanent faults in the HW, so yes: on the first failure of an LBIST or MBIST instance, we consider that the HW covered by that BIST instance can no longer be relied on => no need for a fault counter.
There is one exception for MBIST failures (depending on which S32 platform you are using): MBIST doesn't distinguish whether the bit flips in the memory are single-bit or multi-bit. For single-bit flips, the ECC mechanism can correct the corrupted value, so your application can continue in that case. To determine whether it is a single-bit or multi-bit flip, please check the latest SAF release for the function Bist_MtrDiagWorkaround().

The same applies to the sCheck tests: they also detect permanent faults in the HW, so there is no need for a counter. The first failure already indicates a permanent fault.
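
As a rough illustration of the classification described above (this is not SAF code), the policy could be sketched like this. All type and function names are hypothetical, and the `single_bit_only` flag stands in for the outcome of a platform-specific diagnosis such as the one available around Bist_MtrDiagWorkaround(); how that function actually reports its result is not shown in this thread and is an assumption here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration only; not the SAF API. */
typedef enum { TEST_PASS, TEST_FAIL } test_result_t;
typedef enum { SRC_LBIST, SRC_MBIST, SRC_SCHECK } test_source_t;

typedef enum {
    FAULT_NONE,
    FAULT_CORRECTABLE, /* MBIST failure diagnosed as single-bit; ECC can correct it */
    FAULT_PERMANENT    /* first failure is already treated as permanent */
} fault_class_t;

/*
 * Classify a boot-time test result without any occurrence counter.
 * single_bit_only is only meaningful for MBIST failures (assumed to come
 * from a separate diagnosis step).
 */
fault_class_t classify_fault(test_source_t src, test_result_t res,
                             bool single_bit_only)
{
    assert(src == SRC_LBIST || src == SRC_MBIST || src == SRC_SCHECK);

    if (res == TEST_PASS) {
        return FAULT_NONE;
    }
    if (src == SRC_MBIST && single_bit_only) {
        return FAULT_CORRECTABLE;
    }
    return FAULT_PERMANENT;
}
```

Note that no counter appears anywhere: a single failing LBIST or sCheck result maps directly to a permanent fault, and only the MBIST single-bit case is treated differently.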

Degraded modes are an optional feature that the customer application can use or not; they are not mandatory from an ISO 26262 perspective.
I'll try to explain with an example:
Let's say your S32 device has 3 SRAM controllers, each tested separately by MBIST or by the sCheck SRAM ECC tests.
In "Normal Mode", your safety application uses all of them for safety-critical processing.
MBIST or sCheck then finds a failure at SRAM controller 1, so you can no longer rely on the ECC mechanism for this controller, and therefore you can't use the part of SRAM under its control.
Now it is a SW architecture decision. Either you design the application so that it requires all SRAM controllers to be error free => a failing SRAM controller 1 is a showstopper.
Or you partially reduce the functionality of your application: you stop executing some core, some tasks, or some safety features that use data in SRAM 1, but let the other functionality continue => that's what we call Degraded Mode.
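
To make the SRAM example a bit more concrete, here is a minimal sketch of how an application might decide which tasks may still run once a region is marked as failed. The bitmask layout, `task_desc_t`, and the function names are invented for this illustration and are not part of the SAF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bitmask: one bit per SRAM controller region. */
#define SRAM_CTRL_0 (1u << 0)
#define SRAM_CTRL_1 (1u << 1)
#define SRAM_CTRL_2 (1u << 2)

/* Hypothetical task descriptor: which SRAM regions the task depends on. */
typedef struct {
    const char *name;
    uint32_t    required_regions;
} task_desc_t;

/* A task may run only if none of its required regions has failed. */
static bool task_may_run(const task_desc_t *task, uint32_t failed_regions)
{
    return (task->required_regions & failed_regions) == 0u;
}

/* Count the tasks that can still be scheduled with the given failed regions. */
size_t count_runnable_tasks(const task_desc_t *tasks, size_t count,
                            uint32_t failed_regions)
{
    assert(tasks != NULL);

    size_t runnable = 0;
    for (size_t i = 0; i < count; ++i) {
        if (task_may_run(&tasks[i], failed_regions)) {
            ++runnable;
        }
    }
    return runnable;
}
```

With `failed_regions == SRAM_CTRL_1`, every task that lists SRAM_CTRL_1 in `required_regions` is skipped and the rest keep running, which is the Degraded Mode idea. If the application instead treats any nonzero `failed_regions` as fatal, that is the "showstopper" design.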

ModeSelector is there to help you define the Degraded Modes: you select which fault sources must be fault free for Normal Mode and which for Degraded Mode (the fault sources come from eMcem-FCCU, BIST, and sCheck).
So if Normal Mode requires all SRAM controllers to be fault free, all MBIST runs and all sCheck SRAM ECC tests must be OK.
For Degraded Mode, however, you can remove a specific MBIST partition and specific SRAM ECC tests from the fault sources. ModeSelector will then report that Normal Mode could not be selected but Degraded Mode could, and you can run your application in a reduced-functionality mode that doesn't use SRAM controller 1 for the safety-related application context.
Please see the mSel UM for more details.
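
The selection idea can be sketched as follows. This is only a conceptual model and NOT the mSel API or its configuration format: the fault-source bits, `mode_cfg_t`, and `select_mode()` are hypothetical names, and the real module is configured as described in the mSel UM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fault-source bits (BIST and sCheck results per SRAM controller). */
#define FS_MBIST_SRAM0      (1u << 0)
#define FS_MBIST_SRAM1      (1u << 1)
#define FS_MBIST_SRAM2      (1u << 2)
#define FS_SCHECK_ECC_SRAM0 (1u << 3)
#define FS_SCHECK_ECC_SRAM1 (1u << 4)
#define FS_SCHECK_ECC_SRAM2 (1u << 5)
#define FS_ALL              0x3Fu

typedef enum { MODE_NORMAL, MODE_DEGRADED, MODE_NONE } app_mode_t;

/* Each mode lists the fault sources that must be fault free to enter it. */
typedef struct {
    app_mode_t mode;
    uint32_t   mandatory_fault_free;
} mode_cfg_t;

/* Modes in priority order: Normal first, then Degraded. */
static const mode_cfg_t mode_table[] = {
    { MODE_NORMAL,   FS_ALL },
    /* Degraded Mode tolerates faults reported for SRAM controller 1. */
    { MODE_DEGRADED, FS_ALL & ~(FS_MBIST_SRAM1 | FS_SCHECK_ECC_SRAM1) },
};

/* Return the first mode whose mandatory sources are all fault free. */
app_mode_t select_mode(uint32_t active_faults)
{
    assert((active_faults & ~FS_ALL) == 0u);

    for (size_t i = 0; i < sizeof mode_table / sizeof mode_table[0]; ++i) {
        if ((mode_table[i].mandatory_fault_free & active_faults) == 0u) {
            return mode_table[i].mode;
        }
    }
    return MODE_NONE;
}
```

In this model, a fault on SRAM controller 1 makes `select_mode()` skip Normal Mode and return Degraded Mode, while a fault on controller 0 or 2 leaves no selectable mode, so the application would have to fall back to its safe-state handling.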

Kind Regards,
Radoslav

john_floros

Hello,

I can confirm that BIST faults are permanent faults.

The rest of your questions seem more like system-level questions that are best answered by what your application needs and safety goals are.

Regards,

John
