rt1064 computation errors in mbedtls_rsa_public, fixed by cooling the board

dmckeever · ‎09-28-2023

We have a custom board design based on the MIMXRT1064-EVK.

We are encountering errors in the execution of the mbedtls_rsa_public function from the MBEDTLS library when used to verify a certificate.

Calls to mbedtls_rsa_public( ctx, sig, encoded ) with identical input values of ctx and sig produce different values of encoded. In some cases the values are correct and match encoded_expected; in other cases they are incorrect and do not match encoded_expected.

Some of the incorrect values are identical.
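A minimal sketch of a harness for quantifying this kind of intermittent result, assuming the computation is wrapped in a callback. The names `compute_fn`, `repeat_stats`, `repeat_and_compare` and `flaky_demo` are hypothetical and not part of mbedtls; on the target the callback would wrap the mbedtls_rsa_public( ctx, sig, encoded ) call, while `flaky_demo` is only a host-side stand-in so the sketch runs without hardware.

```c
#include <stddef.h>
#include <string.h>

#define REPEAT_MAX_LEN 512u /* large enough for an RSA-4096 block */

/* Hypothetical callback: writes `len` bytes to `out`, returns 0 on success.
 * On the target this would wrap mbedtls_rsa_public(ctx, sig, out). */
typedef int (*compute_fn)(void *user, unsigned char *out, size_t len);

typedef struct {
    unsigned runs;        /* total calls made */
    unsigned call_errors; /* calls that returned non-zero */
    unsigned mismatches;  /* outputs that differed from the expected value */
    unsigned repeats;     /* mismatches identical to the previous wrong output */
} repeat_stats;

/* Calls fn `runs` times and compares each output with `expected`.
 * Returns 0 on success, -1 on bad arguments. */
int repeat_and_compare(compute_fn fn, void *user,
                       const unsigned char *expected, size_t len,
                       unsigned runs, repeat_stats *stats)
{
    /* Static buffers keep stack usage small. */
    static unsigned char out[REPEAT_MAX_LEN];
    static unsigned char last_bad[REPEAT_MAX_LEN];
    int have_last_bad = 0;

    if (fn == NULL || expected == NULL || stats == NULL ||
        len == 0 || len > REPEAT_MAX_LEN) {
        return -1;
    }
    memset(stats, 0, sizeof *stats);

    for (unsigned i = 0; i < runs; i++) {
        memset(out, 0, len);
        stats->runs++;
        if (fn(user, out, len) != 0) {
            stats->call_errors++;
            continue;
        }
        if (memcmp(out, expected, len) == 0) {
            continue; /* correct result */
        }
        stats->mismatches++;
        if (have_last_bad && memcmp(out, last_bad, len) == 0) {
            stats->repeats++;
        }
        memcpy(last_bad, out, len);
        have_last_bad = 1;
    }
    return 0;
}

/* Host-side stand-in: produces 1, 2, 3, ... and corrupts the first byte
 * on every `period`-th call. Purely illustrative. */
typedef struct {
    unsigned calls;
    unsigned period;
} flaky_state;

int flaky_demo(void *user, unsigned char *out, size_t len)
{
    flaky_state *st = (flaky_state *)user;
    for (size_t i = 0; i < len; i++) {
        out[i] = (unsigned char)(i + 1);
    }
    st->calls++;
    if (st->period != 0 && st->calls % st->period == 0) {
        out[0] = (unsigned char)(out[0] ^ 0x80u);
    }
    return 0;
}
```

Logging `mismatches` and `repeats` against board temperature or time since power-up would give a number to compare between "good" and "bad" boards.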

The error arises in the modular exponentiation function mbedtls_mpi_exp_mod.

Some of our board instances behave perfectly with no errors; other boards produce errors, either all the time or intermittently.

We noticed in testing that boards are typically error free for a short time after a "cold restart", i.e. after they have been rested. Subsequently we found that applying a refrigerant spray to a "bad" board (to the area around the processor) causes the errors to disappear until the board warms up again.

Tests using the temperaturemonitor project from the SDK show that the processor is not running unduly hot (approx 40 C); this is confirmed with an IR temperature sensor.

I would not have expected a temperature issue to cause incorrect computation results.

All suggestions welcome.

dmckeever · ‎10-03-2023

We have carried out further tests and the results are pointing to the SDRAM chip as the source of the problem. This is a summary:

“Bad” boards typically work OK once just after a “cold” reboot: TLS handshaking completes without errors and mails are sent successfully.
“Bad” boards work if refrigerant spray is applied to the SDRAM chip alone.
When stack and heap are moved from SDRAM to on-chip RAM, “bad” boards work successfully.

In addition, we have observed some random corruption of other settings values stored on the SDRAM.

We built a separate RAM test project to perform a write-then-read check of all locations. This has shown no failures on “bad” boards.
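One possible reason a simple check passes (an assumption, not something established in this thread): if each location is read back immediately after being written, the read may be served from the data cache when it is enabled, and the bus sees little simultaneous switching. A minimal sketch of a stricter check that fills the whole region with a pseudo-random pattern before verifying it, then repeats with the inverted pattern, is shown here. The names `prng_next` and `mem_pattern_test` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* xorshift32 pseudo-random generator; state must be non-zero. */
static uint32_t prng_next(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fills `words` 32-bit words at `base` with a pseudo-random pattern,
 * then verifies the whole region; repeats with the inverted pattern.
 * Returns the number of mismatching words over both passes. */
size_t mem_pattern_test(volatile uint32_t *base, size_t words, uint32_t seed)
{
    size_t errors = 0;

    if (seed == 0) {
        seed = 1; /* xorshift32 would stay at zero forever */
    }

    for (int pass = 0; pass < 2; pass++) {
        uint32_t invert = (pass == 0) ? 0u : 0xFFFFFFFFu;
        uint32_t s = seed;

        /* Write phase: touch every word before reading anything back. */
        for (size_t i = 0; i < words; i++) {
            base[i] = prng_next(&s) ^ invert;
        }

        /* On the target with the D-cache enabled, clean and invalidate
         * the cache here (for example with CMSIS SCB_CleanInvalidateDCache)
         * so that the reads below really come from the SDRAM. */

        /* Verify phase: regenerate the same sequence and compare. */
        s = seed;
        for (size_t i = 0; i < words; i++) {
            if (base[i] != (prng_next(&s) ^ invert)) {
                errors++;
            }
        }
    }
    return errors;
}
```

Running this repeatedly with different seeds while the board warms up would show whether a plain memory test can reproduce the failures at all.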

So we are now seeking possible reasons for failures in SDRAM access. Were it not for the refrigerant cure and the different behaviour across board instances, we would be looking at the possibility of some inadvertent corruption of SDRAM during execution of the application.

We found this question: https://community.nxp.com/t5/i-MX-RT/i-MXRT1060-SEMC-SDRAM-Data-Corruption/m-p/1172919#M10887

We have tried the fix suggested but it has not improved things.

dmckeever · ‎10-04-2023

We have found the reason for SDRAM access issues.
Some time ago we reduced drive strengths on pins from SEMC to SDRAM in order to reduce emissions.
Increasing the drive strengths has fixed the access issues.
Our memory tests at the time did not reveal any problems. The TLS handshaking appears to be a more demanding test.

Bio_TICFSL · ‎09-29-2023

Hello,

This is an issue with the mbed code; you have to install the latest image:

https://github.com/Mbed-TLS/mbedtls

Regards

dayson · ‎10-23-2023

I am also developing on a custom RT1064 board (with SDRAM) and am experiencing invalid responses when running mbedtls_rsa_public() using mbedtls. We have not yet seen a valid response from the function (even on cold-start), nor have we had any memory corruption issues (as @dmckeever speaks of). Our SDRAM drive strengths are all set as default.

I am exploring updating to the latest version of mbedtls as you listed above, but I am having a lot of trouble porting the library into the MCUXpresso SDK. In particular, the newer version of mbedtls doesn't seem to drop into the ksdk hardware acceleration port provided in the SDK. Do you have any advice that could help? Also, can I ask why you believe upgrading the mbedtls library will resolve the issue? Was it a known bug in the previous version?

dmckeever · ‎10-23-2023

Before discovering the memory errors issue, we considered upgrading to the latest mbedtls library because we had run out of things to try - there was no known bug reported. However, the difference in code and data structures made the port extremely awkward so we abandoned it (at least for the present); we may look at it again because we have to attempt to support STARTTLS for SMTP.

Our first implementation had placed both stack and heap in SDRAM. We discovered first of all that moving the heap to on-chip RAM solved the problem of wrong computation results, then later discovered that also moving stack to on-chip RAM solved other issues of incomplete handshaking. So the SDRAM was the clear culprit and we then found the drive strengths cause.
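When moving stack and heap between memories it can help to confirm at run time where they actually ended up. A minimal sketch, assuming an EVK-like memory map with SDRAM at 0x80000000 and 32 MB in size (check these values against your own linker script); `addr_in_region` and `ptr_in_sdram` are hypothetical helper names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for an EVK-like memory map; adjust to your board. */
#define SDRAM_BASE 0x80000000u
#define SDRAM_SIZE 0x02000000u /* 32 MB */

/* True if addr lies in [base, base + size). */
bool addr_in_region(uintptr_t addr, uintptr_t base, size_t size)
{
    return addr >= base && (addr - base) < size;
}

/* Convenience wrapper for pointers. */
bool ptr_in_sdram(const volatile void *p)
{
    return addr_in_region((uintptr_t)p, SDRAM_BASE, SDRAM_SIZE);
}
```

Calling `ptr_in_sdram` on the address of a local variable checks the stack, and calling it on a pointer returned by the allocator checks the heap.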

dayson · ‎10-23-2023

@dmckeever that's very helpful. Thanks for taking the time to respond. Oh yes, tonight I started diving into the mbedtls upgrade and know exactly what you mean. Seems like a huge job.

Tomorrow I'll switch the heap & stack to the on-chip memory and see if that resolves the issue for me also. Thanks again for the info.

dmckeever · ‎10-23-2023

@dayson. Good luck with this. It will be interesting to find out if the SDRAM is also an issue with you. Presumably your custom board is based on the MIMXRT1064-EVK board design? It appears that SDRAM address and data track lengths are critical.

(I am currently struggling with another baffling issue: changing the start address in XIP flash where code is located changes the behaviour of the program and causes failure ...)

dayson · ‎11-05-2023

Hi @dmckeever! I'm glad to report that we got it working with the SDRAM. After a long journey exploring updating the mbedtls library and migrating the mbedtls functionality to the internal RAM, we realised that the primary issue was that we hadn't called CRYPTO_InitHardware(). Sooooooo silly looking back now haha so glad we've got it working now though. It's a shame that rather than seeing an error message notifying us that the hardware crypto hadn't been initialised, we just got garbage responses from the crypto APIs

I was very concerned that it was a hardware issue after our conversations because we were very careful to route our SDRAM traces with <100ps delay length matching etc. Due to space constraints on our board I wasn't sure how we were going to improve the routing.

Considering you were able to get intermittent success, I'm guessing that your team wasn't so silly as to forget to init the crypto hardware and perhaps you will need to re-route your SDRAM chip.
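For the missing-initialisation case described above, a small guard can turn silent garbage into an explicit error. This is only a sketch: `crypto_guard_init`, `crypto_guard_check` and the error codes are hypothetical names, and the `hw_init` callback is assumed to be a thin wrapper that calls CRYPTO_InitHardware() and returns 0 on success. `fake_hw_init` is a host-side stand-in.

```c
#include <stdbool.h>
#include <stddef.h>

#define CRYPTO_GUARD_OK              0
#define CRYPTO_GUARD_ERR_NOT_INIT    (-1)
#define CRYPTO_GUARD_ERR_INIT_FAILED (-2)

/* Set only after the hardware init callback has succeeded. */
static bool g_crypto_ready = false;

/* Runs the supplied init callback once at start-up and records the result.
 * On the target, hw_init would wrap CRYPTO_InitHardware(). */
int crypto_guard_init(int (*hw_init)(void))
{
    if (hw_init == NULL || hw_init() != 0) {
        g_crypto_ready = false;
        return CRYPTO_GUARD_ERR_INIT_FAILED;
    }
    g_crypto_ready = true;
    return CRYPTO_GUARD_OK;
}

/* Call before any crypto operation; fails loudly if init was skipped. */
int crypto_guard_check(void)
{
    return g_crypto_ready ? CRYPTO_GUARD_OK : CRYPTO_GUARD_ERR_NOT_INIT;
}

/* Host-side stand-in for the hardware init wrapper. */
int fake_hw_init(void)
{
    return 0;
}
```

Checking `crypto_guard_check()` at the top of the certificate verification path would report the missing init instead of returning garbage.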

dmckeever · ‎12-01-2023

Hi @dayson
Sorry for the delay in responding.
Glad to hear that you got things working.
As I reported elsewhere, it was an issue with our SDRAM performance. Earlier, we had to reduce pin drive strengths to achieve a pass on radiated emissions tests.
Preliminary tests appeared to show that SDRAM was working fine at the lower drive strengths. However, the crypto calculations appear to stress the SDRAM and were failing. (I do not quite understand why.)
Restoring drive strengths achieved perfect behaviour of the crypto calculations.

Now I have to devise a test program to find a calculation that will stress the SDRAM and provide a way to step through drive strengths: try to find a setting that will pass emissions tests without crippling the SDRAM.
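A rough sketch of how such a sweep might be organised, assuming the drive strength is the 3-bit DSE field in bits 5:3 of each pad control register (verify this against the RT1064 reference manual). All names here are hypothetical: `apply` is expected to write the DSE value to every SEMC pad, and `stress` is whatever test reproduces the failure, returning 0 on a pass. `sweep_demo` and its callbacks are host-side stand-ins.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed pad control layout: DSE occupies bits 5:3. */
#define PAD_DSE_SHIFT 3u
#define PAD_DSE_MASK  (0x7u << PAD_DSE_SHIFT)

/* Returns pad_cfg with its DSE field replaced by `dse` (0..7). */
uint32_t pad_with_dse(uint32_t pad_cfg, unsigned dse)
{
    return (pad_cfg & ~PAD_DSE_MASK) |
           (((uint32_t)dse << PAD_DSE_SHIFT) & PAD_DSE_MASK);
}

typedef void (*apply_dse_fn)(unsigned dse, void *user);
typedef int (*stress_fn)(void *user); /* returns 0 on pass */

/* Steps from the strongest setting (7) down to the weakest (1) and stops
 * at the first failure. Returns the weakest setting that still passed,
 * or 0 if even the strongest failed. The best setting is re-applied
 * before returning. */
unsigned find_min_passing_dse(apply_dse_fn apply, stress_fn stress, void *user)
{
    unsigned best = 0;

    for (unsigned dse = 7; dse >= 1; dse--) {
        apply(dse, user);
        if (stress(user) != 0) {
            break;
        }
        best = dse;
    }
    if (best != 0) {
        apply(best, user);
    }
    return best;
}

/* Host-side stand-in: pretends the memory works at or above min_ok. */
typedef struct {
    unsigned current;
    unsigned min_ok;
} sweep_demo;

void demo_apply(unsigned dse, void *user)
{
    ((sweep_demo *)user)->current = dse;
}

int demo_stress(void *user)
{
    const sweep_demo *d = (const sweep_demo *)user;
    return (d->current >= d->min_ok) ? 0 : -1;
}
```

In practice some margin above the weakest passing setting is probably wise, since the failures in this thread depended on temperature and varied between boards.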

dmckeever · ‎10-03-2023

Thank you for the suggestion. However, our subsequent experiments are pointing to the SDRAM as the source of the problem.