Hello,
We recently updated to SDK 25.12 and noticed that our TLS decryption rate has been cut in half.
mbedTLS v3.x is no longer accelerated using the fsl_hashcrypt hardware features.
Here is the callstack for mbedtls_ssl_read with the previous SDK 25.09. As you can see, HASHCRYPT_AES_EncryptEcb is eventually used.
hashcrypt_aes_one_block_aligned() at fsl_hashcrypt.c:437
hashcrypt_aes_one_block() at fsl_hashcrypt.c:581
HASHCRYPT_AES_EncryptEcb() at fsl_hashcrypt.c:1,284
mbedtls_internal_aes_encrypt() at aes_alt.c:1,959
mbedtls_aes_crypt_ecb() at aes_alt.c:1,323
aes_crypt_ecb_wrap() at cipher_wrap.c:114
mbedtls_cipher_update() at cipher.c:521
mbedtls_gcm_update() at gcm.c:358
mbedtls_gcm_crypt_and_tag() at gcm.c:456
mbedtls_gcm_auth_decrypt() at gcm.c:491
mbedtls_cipher_aead_decrypt() at cipher.c:1,407
mbedtls_cipher_auth_decrypt_ext() at cipher.c:1,613
mbedtls_ssl_decrypt_buf() at ssl_msg.c:1,242
ssl_prepare_record_content() at ssl_msg.c:3,667
ssl_get_next_record() at ssl_msg.c:4,551
mbedtls_ssl_read_record() at ssl_msg.c:3,817
mbedtls_ssl_read() at ssl_msg.c:5,237
<...more frames...>
Here is the callstack of SDK 25.12 with MBEDTLS_USE_PSA_CRYPTO defined. In this version, mbedtls_internal_aes_encrypt is all C code with no HW acceleration.
mbedtls_internal_aes_encrypt() at aes.c:894
mbedtls_aes_crypt_ecb() at aes.c:1,062
aes_crypt_ecb_wrap() at cipher_wrap.c:166
mbedtls_cipher_update() at cipher.c:611
gcm_mask() at gcm.c:546
mbedtls_gcm_update() at gcm.c:641
mbedtls_gcm_crypt_and_tag() at gcm.c:726
mbedtls_gcm_auth_decrypt() at gcm.c:753
mbedtls_psa_aead_decrypt() at psa_crypto_aead.c:270
psa_driver_wrapper_aead_decrypt() at psa_crypto_driver_wrappers.h:4,114
psa_aead_decrypt() at psa_crypto.c:5,023
mbedtls_ssl_decrypt_buf() at ssl_msg.c:1,625
ssl_prepare_record_content() at ssl_msg.c:4,093
ssl_get_next_record() at ssl_msg.c:5,068
mbedtls_ssl_read_record() at ssl_msg.c:4,323
mbedtls_ssl_read() at ssl_msg.c:5,983
<...more frames...>
And here is the callstack of SDK 25.12 without MBEDTLS_USE_PSA_CRYPTO defined. In this version, mbedtls_internal_aes_encrypt is all C code with no HW acceleration and PSA is not involved.
mbedtls_internal_aes_encrypt() at aes.c:899
mbedtls_aes_crypt_ecb() at aes.c:1,062
aes_crypt_ecb_wrap() at cipher_wrap.c:166
mbedtls_cipher_update() at cipher.c:611
gcm_mask() at gcm.c:546
mbedtls_gcm_update() at gcm.c:628
mbedtls_gcm_crypt_and_tag() at gcm.c:726
mbedtls_gcm_auth_decrypt() at gcm.c:753
mbedtls_cipher_aead_decrypt() at cipher.c:1,528
mbedtls_cipher_auth_decrypt_ext() at cipher.c:1,674
mbedtls_ssl_decrypt_buf() at ssl_msg.c:1,639
ssl_prepare_record_content() at ssl_msg.c:4,093
ssl_get_next_record() at ssl_msg.c:5,068
mbedtls_ssl_read_record() at ssl_msg.c:4,323
mbedtls_ssl_read() at ssl_msg.c:5,983
<...more frames...>
Are there plans to restore RT685 HASHCRYPT hardware acceleration to mbedTLS? It seems certain PSA Crypto drivers are not implemented.
Thank you.
Hi Edwin,
I reproduced the issue on the EVK. I modified two samples by adding a loop of 200 iterations of mbedtls_gcm_self_test and timed the overall execution using the RTC clock.
evkmimxrt685_mbedtls_selftest_cm33 from SDK 25.09 executed the tests in 1087ms with this callstack:
HASHCRYPT_AES_EncryptEcb() at fsl_hashcrypt.c:1,260
mbedtls_internal_aes_encrypt() at aes_alt.c:1,959
mbedtls_aes_crypt_ecb() at aes_alt.c:1,323
aes_crypt_ecb_wrap() at cipher_wrap.c:114
mbedtls_cipher_update() at cipher.c:521
mbedtls_gcm_starts() at gcm.c:294
mbedtls_gcm_crypt_and_tag() at gcm.c:452
mbedtls_gcm_self_test() at gcm.c:826
evkmimxrt685_mbedtls3x_psatest_cm33 from SDK 25.12 executed the test in 8990ms with this callstack:
mbedtls_internal_aes_encrypt() at aes.c:896
mbedtls_aes_crypt_ecb() at aes.c:1,062
aes_crypt_ecb_wrap() at cipher_wrap.c:166
mbedtls_cipher_update() at cipher.c:611
mbedtls_gcm_starts() at gcm.c:441
mbedtls_gcm_crypt_and_tag() at gcm.c:718
mbedtls_gcm_self_test() at gcm.c:1,075
evkmimxrt685_mbedtls3x_psatest_cm33 from SDK 25.12 with MBEDTLS_PSA_ACCEL_KEY_TYPE_AES defined executed the test in 8744ms with this callstack:
HASHCRYPT_AES_EncryptEcb() at fsl_hashcrypt.c:1,255
hashcrypt_cipher_encrypt() at mcux_psa_hashcrypt_common_cipher.c:187
psa_driver_wrapper_cipher_encrypt() at psa_crypto_driver_wrappers.h:2,353
psa_cipher_encrypt() at psa_crypto.c:4,766
mbedtls_block_cipher_encrypt() at block_cipher.c:177
mbedtls_gcm_starts() at gcm.c:439
mbedtls_gcm_crypt_and_tag() at gcm.c:718
mbedtls_gcm_self_test() at gcm.c:1,075
The gist of the modifications to the examples is the following:
BOARD_InitHardware();
test_rtc_init();
psa_crypto_init();
uint64_t ms_start = test_rtc_get_msecs();
for (int i = 0; i < 200; ++i)
{
    PRINTF("test iteration %d\r\n", i+1);
    mbedtls_gcm_self_test(0);
}
uint64_t ms_end = test_rtc_get_msecs();
PRINTF("test time = %ums\r\n", (unsigned)(ms_end - ms_start));
... where test_rtc_get_msecs returns the current RTC time using subsecond accuracy.
As you can see, there is an ~8x slowdown for GCM/AES encryption with the new SDK.
I can attach the modified examples if you'd like.
Regards,
Amilcar
Hi @hrc-amilcar,
Thanks for your patience with this question. I just received the response from the internal team, please see below.
From the call stack I can see you are using the legacy mbedtls_xxx crypto API. Indeed, it is not HW accelerated. MbedTLS 3.x introduced a new crypto API, PSA (see mbedtls/docs/psa-transition.md at v3.6.5 · Mbed-TLS/mbedtls · GitHub), and it is accelerated. The legacy crypto API is removed in MbedTLS 4.x.
I checked our psa_crypto_examples in the SDK for RT600, and HASHCRYPT HW acceleration is enabled by default: PSA_CRYPTO_DRIVER_HASHCRYPT is defined, which enables the crypto driver wrappers to offload crypto computation to HW. On the other hand, MbedTLS 3.x+ is more complex and aligned with the PSA API specification, so some use cases may indeed show lower performance. For TLS, I would expect asymmetric cryptography (CASPER HW IP) to be the performance bottleneck, since this IP supports only bignum acceleration and is not well suited to the PSA API, which expects the HW IP to implement the whole algorithm. We did our best to accelerate at least some ECC operations (sign, verify), but other operations, such as keygen during ECDHE key exchange, may be slower.
It would now be helpful if you could use the PSA API for the performance measurement and confirm that PSA_CRYPTO_DRIVER_HASHCRYPT is defined and that the call stack uses it. FYI: HASHCRYPT does not offer AES-GCM acceleration natively, so it is better to benchmark AES-CBC or AES-CTR to see the real gain from the HW IP.
BR,
Edwin.
Hi @hrc-amilcar,
Migrating from mbedTLS 2.x (without PSA) to mbedTLS 3.x (with PSA) is bound to have performance drops, considering that:
"The PSA Driver Interface has only been partially implemented. As a result, the deliverables for writing a driver and the method for integrating a driver with Mbed TLS will vary depending on the operation being accelerated." (https://mcuxpresso.nxp.com/mcuxsdk/latest/html/middleware/mbedtls3x/docs/psa-driver-example-and-guid...
At the moment, the best I can recommend is to follow the guidelines on how to properly migrate from 2.x to 3.x: Migrating from Mbed TLS 2.x to Mbed TLS 3.0 — MCUXpresso SDK Documentation
As well as the guide to properly transition to the PSA API: Transitioning to the PSA API — MCUXpresso SDK Documentation
I apologize for the inconvenience.
BR,
Edwin.
Hi @hrc-amilcar,
Did you make any changes after updating the SDK? Do you also see this behavior with the SDK's example code? Are you using the standalone IDE or the VS Code extension?
BR,
Edwin.
HASHCRYPT_AES_EncryptEcb() at fsl_hashcrypt.c:1,255
hashcrypt_cipher_encrypt() at mcux_psa_hashcrypt_common_cipher.c:203
psa_driver_wrapper_cipher_encrypt() at psa_crypto_driver_wrappers.h:2,353
psa_cipher_encrypt() at psa_crypto.c:4,766
mbedtls_block_cipher_encrypt() at block_cipher.c:177
gcm_mask() at gcm.c:543
mbedtls_gcm_update() at gcm.c:628
mbedtls_gcm_crypt_and_tag() at gcm.c:726
mbedtls_gcm_auth_decrypt() at gcm.c:753
mbedtls_psa_aead_decrypt() at psa_crypto_aead.c:270
psa_driver_wrapper_aead_decrypt() at psa_crypto_driver_wrappers.h:4,114
psa_aead_decrypt() at psa_crypto.c:5,023
mbedtls_ssl_decrypt_buf() at ssl_msg.c:1,625
ssl_prepare_record_content() at ssl_msg.c:4,093
ssl_get_next_record() at ssl_msg.c:5,068
mbedtls_ssl_read_record() at ssl_msg.c:4,323
mbedtls_ssl_read() at ssl_msg.c:5,983
<...more frames...>
Hi @EdwinHz,
As I was stepping through the code, I noticed that perhaps MBEDTLS_BLOCK_CIPHER_C needs to be defined so that the GCM operation can be accelerated.
It looks like the header mbedtls3x/include/mbedtls/config_adjust_legacy_crypto.h is responsible for this but doesn't end up defining that macro for some reason.
For others' benefit...
It appears that mbedtls_xor in mbedTLS 3.x loops over the data 4 bytes at a time, calling mbedtls_get_unaligned_uint32 and mbedtls_put_unaligned_uint32, which both use memcpy for a single uint32.
I recognize that the authors of mbedTLS are trying to improve performance by XORing 4 bytes at a time (with a remainder loop), but full calls to memcpy actually make the code slower.
After some investigation into the disassembly, it turns out that our project was being compiled with -fno-builtin, which prevents the compiler from inlining small memcpys.
Removing that option recovered much of the performance lost from HASHCRYPT hardware not being used for AES-GCM operations. The example I posted went from executing in 8600ms down to 2100ms. That is not quite at the level of mbedTLS 2.x + ksdk alt (1087ms), but the restored performance is good enough.
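For anyone who wants to check this in their own disassembly, here is a minimal sketch of the XOR pattern described above. It is not the actual mbedTLS source: load_u32_unaligned, store_u32_unaligned, xor_words and xor_bytes are hypothetical names made up for illustration. The point is only that each 4-byte load/store goes through memcpy, which a compiler normally turns into a single load or store instruction, but which becomes a real function call when builtins are disabled (e.g. with -fno-builtin).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 4 bytes from a possibly unaligned address.
 * With builtins enabled this memcpy usually compiles to one load;
 * with -fno-builtin it stays an actual call to memcpy. */
static inline uint32_t load_u32_unaligned(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Hypothetical helper: write 4 bytes to a possibly unaligned address. */
static inline void store_u32_unaligned(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof(v));
}

/* XOR n bytes of a and b into r, 4 bytes at a time, then a byte-wise
 * remainder loop for the last 0..3 bytes. */
void xor_words(unsigned char *r, const unsigned char *a,
               const unsigned char *b, size_t n)
{
    assert(n == 0 || (r != NULL && a != NULL && b != NULL));

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t x = load_u32_unaligned(a + i) ^ load_u32_unaligned(b + i);
        store_u32_unaligned(r + i, x);
    }
    for (; i < n; i++) {
        r[i] = (unsigned char)(a[i] ^ b[i]);
    }
}

/* Plain byte-wise reference, useful for comparing results and timings. */
void xor_bytes(unsigned char *r, const unsigned char *a,
               const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        r[i] = (unsigned char)(a[i] ^ b[i]);
    }
}
```

Building this once with and once without -fno-builtin and comparing the disassembly of xor_words should show whether memcpy is being called in the inner loop on your toolchain.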
-Amilcar