> but assuming if there is a checksum code that runs and if it is corrupted than it won't be corrected into the next cycle right?
No, we are talking about different things here.
What you mean is ECC, which many MCUs implement nowadays. This feature can detect and correct certain memory errors (typically single-bit errors) at runtime, without the core itself noticing.
What I meant is ROM-based code that runs a checksum calculation over the whole user flash to detect corruption or modification. Not many MCUs implement this feature, but it is useful in environments where safety or security requirements must be met.
As I said, I don't know whether your MCU implements this feature; you need to check the datasheet.
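To illustrate the idea only: such a check boils down to computing a checksum over the flash image and comparing it with a reference value stored at build time. The following is a minimal sketch in plain C, assuming a standard CRC-32 (the zlib/Ethernet variant). The function names are made up for this example, and a real ROM routine may use a different algorithm, a different memory range, and a different place to store the reference value.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib and
 * Ethernet). Slow but table-free, so it needs no extra memory.
 * Hypothetical helper, not taken from any vendor ROM. */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return ~crc;
}

/* Returns 1 if the image matches the reference CRC, 0 otherwise.
 * On a real MCU, 'image' would point to the start of user flash and
 * 'expected_crc' would be read from a location written at build time
 * (both are assumptions for this sketch). */
int flash_image_is_intact(const uint8_t *image, size_t len,
                          uint32_t expected_crc)
{
    return crc32_compute(image, len) == expected_crc;
}
```

Note that a check like this only detects a corrupted image. Unlike ECC, it cannot correct anything, so the usual reaction is to refuse to start the application.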
But there are other possible reasons for occasional failures.
If one of the other mode pins is used for interface functionality in your application, external signals (like ongoing serial communication on a bus system) could have an effect during startup.
Electrically improper connections to other systems can cause shifts in electrical potentials, and thus create stray signals. This is a common problem I have had with CAN bus systems.
Your PCB / hardware design might not be fully stable. Tests at elevated temperature should reveal such issues.
The same goes for internal instabilities. The clock tree setup can fail occasionally if you are operating at the limit, and memory accesses can fail as well. Elevated temperatures should increase the error probability, while reducing the core speed and/or relaxing the memory wait state settings (i.e. adding wait states) should make the errors disappear.
Problems with the crystal, the MCU clock input wiring, or the PCB layout might also be the cause, but I am no hardware expert.
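Regarding the wait state point above, here is a rough sketch of how the minimum number of flash wait states relates to the core clock. It assumes a simple model where a flash access takes (wait states + 1) core cycles and the flash has a maximum access frequency. The function name and the 24 MHz figure in the usage note are invented for illustration; the real limits come from your datasheet and often depend on the supply voltage as well.

```c
#include <stdint.h>

/* Minimum number of flash wait states so that one flash access
 * (wait_states + 1 core cycles) is not faster than the flash allows.
 * Computes ceil(core_hz / flash_max_hz) - 1 without overflow.
 * Returns UINT32_MAX for an invalid flash_max_hz of 0.
 * Hypothetical helper; check the datasheet for the real rule. */
uint32_t flash_wait_states(uint32_t core_hz, uint32_t flash_max_hz)
{
    if (flash_max_hz == 0u) {
        return UINT32_MAX;
    }

    uint32_t cycles = core_hz / flash_max_hz;
    if (core_hz % flash_max_hz != 0u) {
        cycles++;               /* round up */
    }
    return (cycles > 0u) ? (cycles - 1u) : 0u;
}
```

With an assumed flash limit of 24 MHz, `flash_wait_states(72000000u, 24000000u)` gives 2. If you suspect a marginal setup, configuring one wait state more than this minimum for a test run is a cheap way to see whether the sporadic failures go away.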
By the way, I would check whether the MCU is really starting in ISP mode, or ending up in a hard fault or lockup.
In the latter case, no ISP access (over the serial interface?) is possible without a power cycle.