Hello! My coworker and I encountered some weird behavior when attempting to implement accelerated CRC-32 using the MIMXRT1062's Data Co-Processor. We believe that we've localized the issue to a bug related to how the DCP driver handles cache alignment, but we are still a little confused by what we saw.
The issue originally manifested as, we would get CRC failures in our production application under specific conditions when it attempted to CRC a large block of memory. The CRC hashing always worked fine under unit tests, but if we reproduce a specific sequence of operations we can trigger it to fail reliably in our application. Currently I'm unsure if it was failing with an error code or if it was returning the wrong CRC value, because the way it occurred made it hard to debug. We will continue looking into that this week.
Thankfully, my coworker managed to find a band-aid solution to the problem: attaching an alignas(32) declaration to our dcp_hash_ctx_t instance (which is stored in DTCM). With this in place, CRCs worked reliably and we could no longer reproduce the problem.
To understand the issue better, we took a look inside the HAL's DCP driver and tried to understand what was going on with the alignment stuff in there. What we found was... issues. Many issues. I will list out what we found in order of least to most serious.
This function is designed to look forward from an address to find the next cache line. It works, but it's terribly inefficient, especially for a function called EVERY time a DCP_HASH function is used -- it uses a loop to do what should be a constant-time operation.
I wrote a new version for you which uses multiplication & division operations to accomplish the same result. (bitwise operations could also be used but are less clear, and these are multiplies/divides by powers of 2 so the compiler will optimize into bitwise ops).
/**
* @brief Advances \c dcpWorkExt forward until a cache line boundary is reached.
* @return Cache-aligned pointer between 0 and FSL_FEATURE_L1DCACHE_LINESIZE_BYTE-1 bytes ahead of \c dcpWorkExt
*/
static inline uint32_t *DCP_FindCacheLine(uint8_t *dcpWorkExt)
{
// Use integer division to divide the address down to the cache line size, which
// rounds to the cache line before the given address.
// So that we always go one cache line back even if the given address is on a cache line,
// subtract 1.
ptrdiff_t prevCacheLine = ((ptrdiff_t)(dcpWorkExt - 1)) / FSL_FEATURE_L1DCACHE_LINESIZE_BYTE;
// Now we just have to multiply up again to get an address (adding 1 to go forward by 1 cache line)
return (uint32_t *)((prevCacheLine + 1) * FSL_FEATURE_L1DCACHE_LINESIZE_BYTE);
}
By all means, please take this one and replace the current implementation with it!
The code in DCP_HASH_Init() contains the following assertion:
BUILD_ASSURE(sizeof(dcp_hash_ctx_t) >= sizeof(dcp_hash_ctx_internal_t), dcp_hash_ctx_t_size);
However, this is incorrect for the case where DCACHE is enabled, as the FindCacheLine function could potentially advance the context pointer by up to 31 bytes. Honestly, we all got kinda lucky that dcp_hash_ctx_t (256 bytes) is >32 bytes larger than dcp_hash_ctx_internal_t (188 bytes); if that wasn't the case then you could potentially get out-of-bounds writes depending on how far it has to advance to find a cache line -- not a Heisenbug I want to deal with!
To fix the issue, I believe that this assertion should be used instead:
#if defined(__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) && defined(DCP_USE_DCACHE) && (DCP_USE_DCACHE == 1U)
BUILD_ASSURE(sizeof(dcp_hash_ctx_t) >= (sizeof(dcp_hash_ctx_internal_t) + FSL_FEATURE_L1DCACHE_LINESIZE_BYTE - 1), dcp_hash_ctx_t_size_too_small);
#else
BUILD_ASSURE(sizeof(dcp_hash_ctx_t) >= sizeof(dcp_hash_ctx_internal_t), dcp_hash_ctx_t_size_too_small);
#endif
The most serious issue we found affects the DCP_HASH_Init(), DCP_HASH_Update(), and DCP_HASH_Finish() functions. All of those functions contain the following code which seems to want to align the dcp_hash_ctx_internal_t:
/* Align structure on DCACHE line*/
#if defined(__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) && defined(DCP_USE_DCACHE) && (DCP_USE_DCACHE == 1U)
ctxInternal = (dcp_hash_ctx_internal_t *)(uint32_t)((uint8_t *)ctx + FSL_FEATURE_L1DCACHE_LINESIZE_BYTE);
#else
ctxInternal = (dcp_hash_ctx_internal_t *)(uint32_t)ctx;
#endif
Unfortunately this just... doesn't work. At all. It just increments the passed pointer by 1 cache line size, rather than actually doing anything to align it. So, for all this time, the hash functions have not actually been working with cache aligned buffers unless the dcp_hash_ctx_t structure itself is cache aligned. This seems to partially explain why my coworker's fix worked, as it means that the structure wasn't actually aligned until we did so manually.
We believe that the correct code would likely be:
/* Align structure on DCACHE line*/
#if defined(__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) && defined(DCP_USE_DCACHE) && (DCP_USE_DCACHE == 1U)
ctxInternal = (dcp_hash_ctx_internal_t *)DCP_FindCacheLine((uint8_t *)ctx);
#else
ctxInternal = (dcp_hash_ctx_internal_t *)(uint32_t)ctx;
#endif
While we seem to have worked around the issue in the short term, I feel like we still need more information before we can consider this closed. CRCs using the DCP will be a key part of our application, and it would be very difficult to de-couple them this later if we discover that CRCs are not being done reliably. So, we need to be absolutely sure that the issue we saw is fixed by these changes before we can move forward (which is difficult because we can't reproduce the problem in tests).
So, I have two questions which I'd really appreciate answered:
Thanks for reading through all this and and I hope that you can help us with these questions! Thank you so much!
I am sorry for not reaching you out sooner.
This issue requieres further investigation from our end. So, I still can not provide an answer to your questions at this moment. I am glad to know that you are reporting code improvements to prevent part of this issue for happening.
May I ask for more details to replicate this issue? How can we modify the DCP demo from the SDK to get this replicated with our evk? Asking this just to facilitate the diagnosis of the error and report this to the error team.
All the best,
Diego
Hi Diego,
Thanks for getting back to us, I know it's a lot to go through! Unfortunately, after further testing, we weren't able to reliably reproduce the issue. And once we merged in the fixes in this post, we have not seen it happen again. I know that isn't proof positive that they work, but I do think that these changes ought to be made in the driver.
Anyway, thanks for passing on these fixes!
Jamie