DCP Hashing Cache Alignment Issues + Potential HAL Bugs

MultipleMonomials · ‎09-18-2023

Hello! My coworker and I encountered some weird behavior when attempting to implement accelerated CRC-32 using the MIMXRT1062's Data Co-Processor. We believe that we've localized the issue to a bug related to how the DCP driver handles cache alignment, but we are still a little confused by what we saw.

The issue originally manifested as, we would get CRC failures in our production application under specific conditions when it attempted to CRC a large block of memory. The CRC hashing always worked fine under unit tests, but if we reproduce a specific sequence of operations we can trigger it to fail reliably in our application. Currently I'm unsure if it was failing with an error code or if it was returning the wrong CRC value, because the way it occurred made it hard to debug. We will continue looking into that this week.

Thankfully, my coworker managed to find a band-aid solution to the problem: attaching an alignas(32) declaration to our dcp_hash_ctx_t instance (which is stored in DTCM). With this in place, CRCs worked reliably and we could no longer reproduce the problem.

To understand the issue better, we took a look inside the HAL's DCP driver and tried to understand what was going on with the alignment stuff in there. What we found was... issues. Many issues. I will list out what we found in order of least to most serious.

Inefficient DCP_FindCacheLine() Implementation

This function is designed to look forward from an address to find the next cache line. It works, but it's terribly inefficient, especially for a function called EVERY time a DCP_HASH function is used -- it uses a loop to do what should be a constant-time operation.

I wrote a new version for you which uses multiplication & division operations to accomplish the same result. (bitwise operations could also be used but are less clear, and these are multiplies/divides by powers of 2 so the compiler will optimize into bitwise ops).

/**
 * @brief Advances \c dcpWorkExt forward until a cache line boundary is reached.
 * @return Cache-aligned pointer between 0 and FSL_FEATURE_L1DCACHE_LINESIZE_BYTE-1 bytes ahead of \c dcpWorkExt
 */
static inline uint32_t *DCP_FindCacheLine(uint8_t *dcpWorkExt)
{
    // Use integer division to divide the address down to the cache line size, which
    // rounds to the cache line before the given address.
    // So that we always go one cache line back even if the given address is on a cache line,
    // subtract 1.
    ptrdiff_t prevCacheLine = ((ptrdiff_t)(dcpWorkExt - 1)) / FSL_FEATURE_L1DCACHE_LINESIZE_BYTE;

    // Now we just have to multiply up again to get an address (adding 1 to go forward by 1 cache line)
    return (uint32_t *)((prevCacheLine + 1) * FSL_FEATURE_L1DCACHE_LINESIZE_BYTE);
}

By all means, please take this one and replace the current implementation with it!

Incorrect assertion in DCP_HASH_Init()

The code in DCP_HASH_Init() contains the following assertion:

BUILD_ASSURE(sizeof(dcp_hash_ctx_t) >= sizeof(dcp_hash_ctx_internal_t), dcp_hash_ctx_t_size);

However, this is incorrect for the case where DCACHE is enabled, as the FindCacheLine function could potentially advance the context pointer by up to 31 bytes. Honestly, we all got kinda lucky that dcp_hash_ctx_t (256 bytes) is >32 bytes larger than dcp_hash_ctx_internal_t (188 bytes); if that wasn't the case then you could potentially get out-of-bounds writes depending on how far it has to advance to find a cache line -- not a Heisenbug I want to deal with!

To fix the issue, I believe that this assertion should be used instead:

#if defined(__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) && defined(DCP_USE_DCACHE) && (DCP_USE_DCACHE == 1U)
    BUILD_ASSURE(sizeof(dcp_hash_ctx_t) >= (sizeof(dcp_hash_ctx_internal_t) + FSL_FEATURE_L1DCACHE_LINESIZE_BYTE - 1), dcp_hash_ctx_t_size_too_small);
#else
    BUILD_ASSURE(sizeof(dcp_hash_ctx_t) >= sizeof(dcp_hash_ctx_internal_t), dcp_hash_ctx_t_size_too_small);
#endif

Incorrect cache alignment method in DCP_HASH functions

The most serious issue we found affects the DCP_HASH_Init(), DCP_HASH_Update(), and DCP_HASH_Finish() functions. All of those functions contain the following code which seems to want to align the dcp_hash_ctx_internal_t:

    /* Align structure on DCACHE line*/
#if defined(__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) && defined(DCP_USE_DCACHE) && (DCP_USE_DCACHE == 1U)
    ctxInternal = (dcp_hash_ctx_internal_t *)(uint32_t)((uint8_t *)ctx + FSL_FEATURE_L1DCACHE_LINESIZE_BYTE);
#else
    ctxInternal                     = (dcp_hash_ctx_internal_t *)(uint32_t)ctx;
#endif

Unfortunately this just... doesn't work. At all. It just increments the passed pointer by 1 cache line size, rather than actually doing anything to align it. So, for all this time, the hash functions have not actually been working with cache aligned buffers unless the dcp_hash_ctx_t structure itself is cache aligned. This seems to partially explain why my coworker's fix worked, as it means that the structure wasn't actually aligned until we did so manually.

We believe that the correct code would likely be:

    /* Align structure on DCACHE line*/
#if defined(__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U) && defined(DCP_USE_DCACHE) && (DCP_USE_DCACHE == 1U)
    ctxInternal = (dcp_hash_ctx_internal_t *)DCP_FindCacheLine((uint8_t *)ctx);
#else
    ctxInternal                     = (dcp_hash_ctx_internal_t *)(uint32_t)ctx;
#endif

Questions Remaining

While we seem to have worked around the issue in the short term, I feel like we still need more information before we can consider this closed. CRCs using the DCP will be a key part of our application, and it would be very difficult to de-couple them this later if we discover that CRCs are not being done reliably. So, we need to be absolutely sure that the issue we saw is fixed by these changes before we can move forward (which is difficult because we can't reproduce the problem in tests).

So, I have two questions which I'd really appreciate answered:

Could someone confirm whether the three things we posted are actually issues in the DCP HAL driver and, if they are, that they will be fixed in a future release?
Could someone comment on whether the issue we saw (failed CRC operations) could actually be caused by the effect of these bugs (internal hash context is not aligned)? I wasn't able to find any obvious reason *why* those two things would be connected, despite comments and another forum post which indicate that they are. It's especially puzzling because the DCP context is located in DTCM and therefore shouldn't need explicit cache management (right?). Does the DCP just assume things are cache aligned without stating that anywhere?

Thanks for reading through all this and and I hope that you can help us with these questions! Thank you so much!

diego_charles · ‎10-02-2023

Hi @MultipleMonomials

I am sorry for not reaching you out sooner.

This issue requieres further investigation from our end. So, I still can not provide an answer to your questions at this moment. I am glad to know that you are reporting code improvements to prevent part of this issue for happening.

May I ask for more details to replicate this issue? How can we modify the DCP demo from the SDK to get this replicated with our evk? Asking this just to facilitate the diagnosis of the error and report this to the error team.

All the best,

Diego

MultipleMonomials · ‎10-04-2023

Hi Diego,

Thanks for getting back to us, I know it's a lot to go through! Unfortunately, after further testing, we weren't able to reliably reproduce the issue. And once we merged in the fixes in this post, we have not seen it happen again. I know that isn't proof positive that they work, but I do think that these changes ought to be made in the driver.

Anyway, thanks for passing on these fixes!

Jamie

DCP Hashing Cache Alignment Issues + Potential HAL Bugs