The SHA engine in the LPC55S28 produces erroneous results with the provided example code if the optimization level is increased to -O2
I will go into a lot of related detail and example code here in the hope of getting a fix. However, please note that your own SDK example code in lpcxpresso55s28_mbedtls_benchmark fails unmodified, apart from changing the compiler optimization to -O2, so there is a real issue.
Tools used
The SDK example lpcxpresso55s28_mbedtls_benchmark runs through various crypto operations and times them. However, it does not check the results! If you run the test as far as the SHA256 test and use the debugger to actually check the result, you will find a correct SHA256 result, as expected. So far so good.
However, if you make the one simple change of switching the compiler optimization level from -O0 to -O2 in the project settings and run the same test, you will now find that the SHA256 result is in fact incorrect! (Incidentally, I chose -O2 rather than -O3 as I have previously found issues with the USB examples at -O3.)
In order to figure out what is happening, I created a much simpler test case. I took the lpcxpresso55s28_mbedtls_benchmark test and replaced the contents of benchmark.c with the following code...
#include <string.h> /* for memset() */
#include MBEDTLS_CONFIG_FILE
#include "mbedtls/version.h"
#include "fsl_debug_console.h"
#include "fsl_clock.h"
#include "pin_mux.h"
#include "clock_config.h"
#include "board.h"
#include "ksdk_mbedtls.h"
#include "mbedtls/sha256.h"
#include "fsl_power.h"

// correct SHA256 hash:
// 68 93 a0 8c 9c 4c c9 7c 22 79 9a 6a a7 af 6f 8b
// 8b c2 2d 54 1a 5e 42 53 b1 80 bb c9 74 c9 d4 97
unsigned char buf[8];
uint8_t test_sha_buf[32];

int main(int argc, char *argv[])
{
    POWER_SetBodVbatLevel(kPOWER_BodVbatLevel1650mv, kPOWER_BodHystLevel50mv, false);
    CLOCK_AttachClk(BOARD_DEBUG_UART_CLK_ATTACH);
    BOARD_InitBootPins();
    BOARD_InitBootClocks();
    BOARD_InitDebugConsole();
    CRYPTO_InitHardware();

    memset(buf, 0xAA, 8);
    mbedtls_sha256_ret(buf, 8, test_sha_buf, 0);

    for (int i = 0; i < 32; i++)
    {
        PRINTF("%02x ", test_sha_buf[i]);
        if ((i & 7) == 7) PRINTF(" ");
        if ((i & 15) == 15) PRINTF("\n");
    }

    while (1)
    {
    }
}
As you can see, this code computes the SHA256 of 8 bytes of 0xAA. Sure enough, if you run this at -O0 it works; if, however, you run it at -O2 it fails.
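Since the benchmark example never checks its results, it is worth checking against the known-good digest listed at the top of the file. A minimal self-check (my addition, placed after the PRINTF loop; not part of the original test) could be:

static const uint8_t expected[32] = {
    0x68, 0x93, 0xa0, 0x8c, 0x9c, 0x4c, 0xc9, 0x7c,
    0x22, 0x79, 0x9a, 0x6a, 0xa7, 0xaf, 0x6f, 0x8b,
    0x8b, 0xc2, 0x2d, 0x54, 0x1a, 0x5e, 0x42, 0x53,
    0xb1, 0x80, 0xbb, 0xc9, 0x74, 0xc9, 0xd4, 0x97
};
PRINTF(memcmp(test_sha_buf, expected, 32) == 0 ? "SHA256 OK\n" : "SHA256 WRONG\n");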
I have spent considerable time looking into this and have eventually found a way to localise the issue BUT NOT PROPERLY FIX IT. It increasingly looks to me as though there is an issue with data sitting in a processor (or bus) write buffer and not being flushed to RAM before the SHA engine starts running.
Firstly, by single-stepping at assembly level, I checked that all the reads and writes to RAM and the SHA engine control registers happen in the same order and with the same data regardless of the optimization level. So it seems not to be a logical problem.
Moving on from this, I created two test cases differing only in a small number of (seemingly insignificant) assembly instructions added to the code at a strategic place. These assembly instructions cause the SHA to work properly at -O2. The two examples were carefully crafted and linked to be identical apart from the inserted assembly code (I explain how to recreate my findings below). This allows a side-by-side diff of the trace files, highlighting any differences between the traces. The trace files are run_fail.txt and run_working.txt; they can be compared side by side with a tool such as 'meld'. The files run_fail_annotated.txt and run_working_annotated.txt are the same files with some hand-added descriptions.
I include instruction trace dumps for the working and failing examples, both created at -O2. There is a lot of preamble before getting to the SHA, but once the trace reaches pc = 0x370c the fun starts.
Looking at the failing example:
0x78038 -> 0x780e0  Entering HASHCRYPT_SHA_Finish function
0x2fc2              Entering memcpy (copy SHA data into place)
0x370c -> 0x372e    The data to be sent for SHA is copied into RAM @ 0x2002ff08 & 0x2002ff0c
                    (no fix code here)
0x780e4 -> 0x780e8  Wait for SHA engine to be free
0x7806e -> 0x78076  Configure and start the SHA engine
0x78078 -> 0x7807c  Wait for SHA to finish
0x7807e -> 0x78088  Do the output size checks
0x7808c -> 0x78090  Wait for SHA to finish (again!)
0x78092 -> 0x7809c  Read out first word of SHA --- it is incorrect: 0x323b4b63
Looking at the working example:
0x78038 -> 0x780e0  Entering HASHCRYPT_SHA_Finish function
0x2fc2              Entering memcpy (copy SHA data into place)
0x370c -> 0x372e    The data to be sent for SHA is copied into RAM @ 0x2002ff08 & 0x2002ff0c
0x780e4 -> 0x780f8  Inserted fix code
0x780fa -> 0x780fe  Wait for SHA engine to be free
0x7806e -> 0x78076  Configure and start the SHA engine
0x78078 -> 0x7807c  Wait for SHA to finish
0x7807e -> 0x78088  Do the output size checks
0x7808c -> 0x78090  Wait for SHA to finish (again!)
0x78092 -> 0x7809c  Read out first word of SHA --- it is correct: 0x6893a08c
A side-by-side diff shows that, up until the added asm code, the traces are identical. After the "fix", things are the same until the readout of data.
Note the "fix" does corrupt the stack but not in any region that is then used before the SHA readout.
The exact instructions in the "fix" are important! I spent a good deal of time finding this minimal example. Originally I tried inserting nop instructions and found that in places this could make things work. However, closer examination showed the nop instructions were simply forcing the C compiler to produce different code in this vicinity (mostly preventing inlining). It was one of these "differing" sequences that gave me the inspiration for the eventual minimal "fix".
SO IN CONCLUSION...
Clearly, this "fix" is not a real solution, as I have no real idea why this change makes things work. However, having spent most of my life in ASIC design, my hunch is that you have a data-flushing bug: the data destined for the SHA hardware assist has not in fact reached system memory by the time the HW assist engine starts its work.
Note the "fix" writes to memory regions at addresses close to, but not the same as, the real data. My hunch would be that this forces a flush.
The original -O0 code does many more memory and code accesses between the write to memory and the trigger of the SHA engine, and so causes a flush by accident.
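If that hunch is right, the cure I would expect is a data synchronization barrier between the last write of the message data and the register write that starts the engine. A minimal sketch of what I mean, assuming the CMSIS __DSB() intrinsic and the driver's existing MEMADDR/MEMCTRL start sequence (my guess, not a verified fix):

/* ...CPU has just finished writing the message block to RAM... */
__DSB(); /* stall until all buffered writes have actually reached memory */
base->MEMADDR = HASHCRYPT_MEMADDR_BASE(src);                              /* hand the buffer to the SHA engine */
base->MEMCTRL = HASHCRYPT_MEMCTRL_MASTER(1) | HASHCRYPT_MEMCTRL_COUNT(1); /* start the transfer */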
Of course, there is no way to confirm this for certain from where I stand. However, it is obvious, even from your own example code, that you have a problem, and I hope this bug report helps find it and create a real workaround.
Notes on creating the trace dumps.
In order to be able to modify the one C function of interest without finding that the linker had placed many functions in different places (making side-by-side comparison extremely difficult), I remapped HASHCRYPT_SHA_Finish in drivers/fsl_hashcrypt.c to a location higher in the FLASH memory.
Firstly, I created a modified address map by shrinking the standard PROGRAM_FLASH region to size 0x78000 and creating a new region at 0x78000 of size 0x8000.
Secondly, I decorated the HASHCRYPT_SHA_Finish function with
__attribute__ ((section(".text_Flash2")))
status_t HASHCRYPT_SHA_Finish(HASHCRYPT_Type *base, hashcrypt_hash_ctx_t *ctx, uint8_t *output, size_t *outputSize)
Now I can change HASHCRYPT_SHA_Finish with minimal impact on the rest of the code.
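For reference, the memory-map change expressed as a GNU linker-script fragment would look roughly like this (an illustrative sketch; the region and section names are mine):

MEMORY
{
  PROGRAM_FLASH (rx) : ORIGIN = 0x00000000, LENGTH = 0x78000 /* standard region, shrunk */
  Flash2        (rx) : ORIGIN = 0x00078000, LENGTH = 0x8000  /* new region for the relocated function */
}

SECTIONS
{
  .text_Flash2 : { *(.text_Flash2) } > Flash2
}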
As the LPC55S28 does not have full instruction trace, and has no data trace at all, I had to use other methods. I scripted the GDB debugger to single-step the code one instruction at a time and dump all CPU registers after each step. The GDB script looks like this:
print "AJF Starting"
# -o2
while ( $r15 != 0xe9a)
x/i $pc
i r $r0 $r1 $r2 $r3 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $r12 $r13 $r14 $r15
si
end
print "AJF Done!"
It is placed in a file (go.gdb) and run by opening the Debugger Console and entering
source FULL_PATH_TO_ABOVE_FILE/go.gdb
When the trace finishes, cut and paste the Debugger Console output into a separate file and parse it with the python program (gdb_parse.py) to create the trace dumps.
It may be necessary to increase the number of buffered lines in the debugger console:
Preferences -> [ search for 'console' ]
C/C++ -> Debug -> GDB -> Console -> Console buffer lines:
Note the registers on each line of the trace are the dump BEFORE the instruction runs.
The "FIX" code is insered in the hashcrypt_sha_finalize function thus
static status_t hashcrypt_sha_finalize(HASHCRYPT_Type *base, hashcrypt_sha_ctx_internal_t *ctxInternal)
{
    hashcrypt_sha_block_t lastBlock;

    (void)memset(&lastBlock, 0, sizeof(hashcrypt_sha_block_t));

    /* this is last call, so need to flush buffered message bytes along with padding */
    if (ctxInternal->blksz <= 55u)
    {
        /* last data is 440 bits (55 bytes) or less: the 64-byte block must
           still hold the 0x80 padding byte and the 8-byte bit-length,
           64 - 1 - 8 = 55. */
        (void)hashcrypt_memcpy(&lastBlock.b[0], &ctxInternal->blk.b[0], ctxInternal->blksz);
        /* Inserted "fix". As far as I can tell it mirrors the two C
           statements below (the 0x80 padding byte and the byte-swapped
           bit-length word), writing to nearby stack addresses; my reading
           is that these extra writes force the write buffer to drain. */
        __asm volatile(
            "ldr r3, [r5, #72]\n"  /* appears to load fullMessageSize */
            "lsls r3, r3, #3\n"    /* x8: bytes -> bits               */
            "rev r3, r3\n"         /* byte swap                       */
            "str r3, [sp, #60]\n"  /* store to the stack              */
            "ldr r3, [r5, #64]\n"  /* appears to load blksz           */
            "add r3, sp\n"
            "push {r1}\n"
            "movs r1, #128\n"      /* 0x80 padding byte               */
            "strb.w r1, [r3]\n"
            "pop {r1}\n");
        lastBlock.b[ctxInternal->blksz] = (uint8_t)0x80U;
        lastBlock.w[SHA_BLOCK_SIZE / 4U - 1U] = swap_bytes(8u * ctxInternal->fullMessageSize);
        hashcrypt_sha_one_block(base, &lastBlock.b[0]);
    }
    else
    {
        if (ctxInternal->blksz < SHA_BLOCK_SIZE)
        {
            ctxInternal->blk.b[ctxInternal->blksz] = (uint8_t)0x80U;
            for (uint32_t i = ctxInternal->blksz + 1u; i < SHA_BLOCK_SIZE; i++)
            {
                ctxInternal->blk.b[i] = 0;
            }
        }
        else
        {
            lastBlock.b[0] = (uint8_t)0x80U;
        }
        hashcrypt_sha_one_block(base, &ctxInternal->blk.b[0]);
        lastBlock.w[SHA_BLOCK_SIZE / 4U - 1U] = swap_bytes(8u * ctxInternal->fullMessageSize);
        hashcrypt_sha_one_block(base, &lastBlock.b[0]);
    }

    /* poll wait for final digest */
    while (0U == (base->STATUS & HASHCRYPT_STATUS_DIGEST_MASK))
    {
    }

    return kStatus_Success;
}
Please advise on this subject, as without a solution to this problem my product cannot be completed and may have to be ported to a different device!
I attach a project archive and all the files and scripts mentioned above.
Thank you,
I have done tests and looked at diffs, and yes, this problem has been fixed. However, the comment associated with the fix is definitely not correct! The fix inserts ARM 'dsb' instructions to flush out writes, which is exactly the sort of thing I was hoping to see. The comment, however, is this...
/* Data Synchronization Barrier prevent compiler from reordering memory write when -O2 or higher is used. */
/* The address is passed to the crypto engine for hashing below, therefore out */
/* of order memory write due to compiler optimization must be prevented. */
__DSB();
base->MEMADDR = HASHCRYPT_MEMADDR_BASE(src);
base->MEMCTRL = HASHCRYPT_MEMCTRL_MASTER(1) | HASHCRYPT_MEMCTRL_COUNT(1);
__DSB();
'dsb' has nothing to do with preventing the compiler from reordering writes. 'dsb' will not tell the compiler to do anything - it is an instruction to the hardware. Preventing write reordering by the compiler is done with the volatile directives in the code.
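To illustrate the distinction (my own sketch, not SDK code):

/* Compiler barrier only: forbids the compiler from moving memory accesses
   across this point, but emits no instruction at all. */
__asm volatile("" ::: "memory");

/* Hardware barrier: the dsb instruction stalls the CPU until all
   outstanding memory writes have completed. CMSIS defines __DSB()
   roughly like this, and it is the "memory" clobber in the intrinsic's
   definition - not the dsb instruction itself - that constrains the
   compiler. */
__asm volatile("dsb" ::: "memory");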
It's only a comment, but it should really be fixed.
I also expect that these instructions, in the way they have been added, will prevent inlining, which makes the code less efficient - but IMPORTANTLY it will at least work.
Thanks for verifying that SDK2.10.1 has resolved the issue.
And thanks for your comment. I will feed it back.
Hello again,
I noticed that there is also an SDK example that checks results. It is called lpcxpresso55s28_mbedtls_selftest. Sure enough, when I run it out of the box all tests pass. However, when I change to -O2 it then fails for SHA!
SHA-1 test #1: failed
SHA-256 test #1: failed
SHA-384 test #1: passed
SHA-384 test #2: passed
SHA-384 test #3: passed
Note that SHA sizes above 256 are done in software (no hardware assist), which is why the SHA-384 tests still pass!
SDK2.9.0 is old.
Could you please test if SDK2.10.1 has the same issue?
https://mcuxpresso.nxp.com/en/select
Thanks,
Jun Zhang