i.MX7D: atomic compare and swap instructions don't work with cache


ryanschaefer
Contributor IV

Hi,

 

Recently I've encountered an issue with atomics and the cache. When data is in a cacheable region (tested with both DDR and OCRAM), any atomic read-modify-write operation (atomic_compare_exchange_strong, fetch_add, etc.) doesn't operate as expected. I've attached an example that illustrates the problem (as well as a patch that applies to the NXP provided FreeRTOS BSP 1.0.1). The inconsistency appears only with read-modify-write operations, which use Load-Link and Store Conditional instructions. If only atomic loads and stores are used, the program functions as expected.
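For reference, a minimal sketch of the kind of test in the attached main.c (names and printed values here follow the output below, but this is illustrative, not the exact attached code):

#include <stdatomic.h>
#include <stdio.h>

/* 'state' is linked into a cacheable region (DDR or OCRAM); see the attached
 * patch/linker setup for how that is done in the real test case. */
static atomic_uint state = ATOMIC_VAR_INIT(3);

void cas_demo(void)
{
    unsigned int expected = 1;

    printf("state initial value: %u\r\n", atomic_load(&state));

    atomic_store(&state, 1);                      /* plain atomic store: works */
    printf("state after store 1: %u\r\n", atomic_load(&state));

    /* LDREX/STREX-based read-modify-write: fails while 'state' is cacheable */
    if (!atomic_compare_exchange_strong(&state, &expected, 2))
        printf("CAS failed: Value is %u, expected 1\r\n", expected);
}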

 

Here is the output from the attached example, tested on an i.MX7D SabreSD Rev. D board:

state intial value: 3
state after store 1: 1
CAS failed: Value is 3741011900, expected 1
Pushing Cache
CAS Passed
state after CAS (should = 2): 1
Invalidating Cache
state after invalidate (should = 2): 2

 

In the output above, you can see that the CAS doesn't see the initial value `3` or the stored value `1` (it instead sees 3741011900). After pushing the cache, the next CAS sees the correct value of `1` and supposedly stores a `2`. The next load, though, doesn't see a `2`, but a stale value of `1`. If the cache is invalidated, so that the value is re-read from DDR or OCRAM, the correct value of `2` is read.
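A sketch of the clean/invalidate steps in that sequence, continuing the fragment above. The cache-maintenance helpers are placeholders for whatever clean ("push") and invalidate routines the BSP's LMEM cache driver provides; they are not actual BSP function names:

#include <stdatomic.h>
#include <stdio.h>

extern void cache_clean_all(void);        /* placeholder: write dirty lines back to memory */
extern void cache_invalidate_all(void);   /* placeholder: discard all cached lines         */

void cas_after_maintenance(atomic_uint *state)
{
    unsigned int expected = 1;

    printf("Pushing Cache\r\n");
    cache_clean_all();
    if (atomic_compare_exchange_strong(state, &expected, 2))
        printf("CAS Passed\r\n");

    /* The STREX result went straight to DDR/OCRAM, so a cached load still
     * returns the stale 1 here... */
    printf("state after CAS (should = 2): %u\r\n", atomic_load(state));

    printf("Invalidating Cache\r\n");
    cache_invalidate_all();
    /* ...and only after invalidating does the load fetch the 2 from memory. */
    printf("state after invalidate (should = 2): %u\r\n", atomic_load(state));
}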

 

Is there some configuration that needs to be done in order for caching to work with atomics? It seems as if the Load-Link and Store-Conditional instructions are going around the cache, straight to DDR or OCRAM.

 

Thanks,

Ryan

Original Attachment has been moved to: 0001-Example-of-Atomic-CAS-not-functioning-with-cache.patch.zip

Original Attachment has been moved to: main.c.zip

20 Replies

CarlosCasillas
NXP Employee

Hi Ryan,

This behavior is happening on the M4 core, isn't it?

Could you please describe a recreation scenario?

Best regards!

/Carlos

ryanschaefer
Contributor IV

Hi Carlos, 

In the original post, the attached main.c gives a concise recreation of the issue. You are correct, it is on the M4.

Thanks,

Ryan

CarlosCasillas
NXP Employee

Hi Ryan,

Sorry for the missing details; I had internally escalated your question before my first reply, and the AE team still requires additional details. Could you please specify whether you are running under ARMGCC or DS5?

Could you please also list the command lines that you are using to load the binary into DDR and OCRAM?

I will be waiting for your reply.

Best regards!

/Carlos

ryanschaefer
Contributor IV

Carlos,

I am using ARMGCC with gcc 5.4.1. I am loading the binary into DDR/OCRAM from U-Boot following the directions in the Getting_Started_with_FreeRTOS_BSP_for_i.MX_7Dual manual. For DDR, I use the cacheable region at 0x80000000 as both the link address and the load address. Loading the binary is accomplished by running `run m4boot` at the U-Boot prompt.

OCRAM:

=> print m4boot
m4boot=run loadm4image; dcache flush; bootaux ${m4addr}
=> print loadm4image
loadm4image=fatload mmc ${mmcdev}:${mmcpart} ${m4addr} ${m4image}
=> print m4image
m4image=hello_world_ocram.bin
=> print m4addr
m4addr=0x00910000

DDR:

=> print m4boot
m4boot=run loadm4image; dcache flush; bootaux ${m4addr}
=> print loadm4image
loadm4image=fatload mmc ${mmcdev}:${mmcpart} ${m4addr} ${m4image}
=> print m4image
m4image=hello_world_ddr.bin
=> print m4addr
m4addr=0x80000000

Thanks,

Ryan

CarlosCasillas
NXP Employee

Hi Ryan,

I have received an update from AE team. Below you can check the details:

 

It seems that the cache is not being enabled for these memories; at least, it is not enabled in the patch you sent.

 

Please see the i.MX7D RM:

"ARM Platform and Debug" -> "ARM Cortex M4 Platform (CM4)" ->  "LMEM Function" -> 4.2.9.3.5 "Cache Function":

To use cache, user needs to configure MPU to set those memories as cacheable and all the other memories set as non-cacheable.
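For reference, a minimal sketch of the kind of MPU setup that passage describes, written against the standard ARMv7-M MPU registers from CMSIS. The region number, base address and attribute encoding below are illustrative assumptions, not values taken from the BSP:

#include "MCIMX7D_M4.h"   /* assumed device header name; it only needs to pull in core_cm4.h */

/* Illustrative only: mark the first 2 MB of DDR at 0x80000000 as normal,
 * write-back cacheable memory in MPU region 0, and leave everything else on
 * the default (non-cacheable) background memory map. */
static void mpu_make_ddr_cacheable(void)
{
    MPU->CTRL = 0;                                /* disable the MPU while configuring */

    MPU->RNR  = 0;                                /* select region 0                   */
    MPU->RBAR = 0x80000000u;                      /* region base, 2 MB aligned         */
    MPU->RASR = (20u << MPU_RASR_SIZE_Pos)        /* SIZE=20 -> 2^(20+1) = 2 MB        */
              | (3u  << MPU_RASR_AP_Pos)          /* full read/write access            */
              | MPU_RASR_C_Msk | MPU_RASR_B_Msk   /* TEX=0, C=1, B=1: write-back       */
              | MPU_RASR_ENABLE_Msk;

    MPU->CTRL = MPU_CTRL_PRIVDEFENA_Msk           /* default map as background         */
              | MPU_CTRL_ENABLE_Msk;              /* enable the MPU                    */
    __DSB();
    __ISB();
}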

 

So, could you please verify if the cache is being enabled as described above?

Also, could you please specify which silicon revision you are using?

The silicon revision can be seen in the U-Boot log:

>>CPU:   Freescale i.MX7D rev1.2 1000 MHz (running at 792 MHz)

 

Hope this will be useful for you.

Best regards!

/Carlos

ryanschaefer
Contributor IV

The caches are enabled by the board support package. The entry point is not main(), but actually platform/devices/MCIMX7D/startup/gcc/startup_MCIMX7D_M4.s. This calls SystemInit() in platform/devices/MCIMX7D/startup/system_MCIMX7D_M4.c, which enables the caches, before entering main().

We are using rev 1.2 silicon.

Thanks,

Ryan

CarlosCasillas
NXP Employee

Hi Ryan,

The following tests were performed by AE team:

  • disabling cache for the unused regions and enabling cache only for the desired region (DDR OR OCRAM);
  • moving "m_interrupts", "m_text" to uncacheable regions;
  • moving only "m_data" to the cacheable region;
  • enabling the system cache and disabling the code cache (and vice versa).

 

The LL/SC operations seem to work only when "m_data" is in an uncacheable region.

Considering this, it seems that LL/SC operations are indeed going around the cache to DDR and OCRAM, as the customer described. LL/SC operations are not atomic from this point of view (they are not executed as a single instruction), so maybe the customer should stick with the atomic loads and stores.

 

Some information was also taken from this thread: IMX7 M4 caching and execution speed

"As it turns out, the M4 cache has been optimized for qspi operation and does not have a performance effect on ddr memory accesses. Basically the cache-able memory does not include the ddr. And therefor there will be no difference in applications operating from ddr with and without the caches turned on."

So maybe caching is not the way to go.

 

It is also suggested to post this on the Arm Community, as this is more related to the ARM architecture than to an i.MX or NXP BSP feature, so it may be answered faster there. This is really low-level programming, and they may have a better understanding of it.


Hope this will be useful for you.
Best regards!
/Carlos

ryanschaefer
Contributor IV

> LL/SC operations are not atomic from this point of view (they are not executed as a single instruction), so maybe the customer should stick with the atomic loads and stores.

I'm not quite sure I understand this sentence. Load-Link and Store-Conditional are necessary building blocks for any non-trivial concurrent program. Yes, they are two instructions, so they aren't atomic in that sense. But when used together, they are the mechanism that enables an atomic increment. Without LL/SC, an atomic increment is provably impossible with only atomic loads and stores. See Read-modify-write - Wikipedia.
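To illustrate, this is roughly the retry loop a compiler emits for an atomic increment on the Cortex-M4, written here as GCC inline assembly. It is a sketch of the mechanism, not the exact code generated from the attached example:

#include <stdint.h>

/* LDREX/STREX-based atomic increment, roughly what atomic_fetch_add() turns
 * into on Cortex-M4. The loop retries until the store-exclusive succeeds,
 * which is what makes the whole read-modify-write atomic. */
static inline uint32_t atomic_inc(volatile uint32_t *addr)
{
    uint32_t val, fail;

    do {
        __asm volatile (
            "ldrex  %0, [%2]      \n"   /* load-link the current value        */
            "add    %0, %0, #1    \n"   /* modify                             */
            "strex  %1, %0, [%2]  \n"   /* store-conditional; 0 means success */
            : "=&r" (val), "=&r" (fail)
            : "r" (addr)
            : "memory");
    } while (fail != 0);

    return val;   /* the incremented value */
}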

> Some information was also taken from this thread: IMX7 M4 caching and execution speed
>
> "As it turns out, the M4 cache has been optimized for qspi operation and does not have a performance effect on ddr memory accesses. Basically the cache-able memory does not include the ddr. And therefor there will be no difference in applications operating from ddr with and without the caches turned on."
>
> So maybe caching is not the way to go.

The i.MX7D Reference Manual is in conflict with this post, and states that the first 2 MB of DDR is supported (pg. 265). Also, it has been shown that enabling the cache for DDR greatly improves performance. See i.MX 7 Cortex-M4 memory locations and performance. That blog shows a 20x performance increase with the DDR cache enabled versus disabled. We have reproduced these results.

> It is also suggested to post this on the Arm Community, as this is more related to the ARM architecture than to an i.MX or NXP BSP

It looks like someone with the same issue has already posted to the Arm Community: Cortex M4: Atomic and Cache - Arm Community. From my understanding, the cache and memory controller logic is external to the M4 IP and is part of NXP's domain, but hopefully the Arm Community has some insights to help us work toward a solution.

Thanks,

Ryan

CarlosCasillas
NXP Employee

Hi Ryan,

Below you can find the response from R&D:

After reproducing the issue and performing some tests, it was found that the issue occurs because the “LDREX” and “STREX” instructions bypass the LMEM cache. That means those instructions always access external memory directly, which leads to data inconsistency.

There’s no SW configuration to make the cacheable data consistent with those atomic instructions, and design team will fix it in later CM4 integration.
 
A possible SW workaround is to add a section in TCM (in the linker file) and define atomic variables in that section. Then “LDREX” and “STREX” will always access non-cacheable memory.
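For anyone applying that workaround with GCC, a sketch of what it can look like. The section name, memory region name and variable below are made up for illustration; adjust them to the names already used in the BSP linker script:

/* In the linker script, add an output section placed in the (non-cacheable)
 * TCM data region, for example:
 *
 *   .tcm_atomics :
 *   {
 *     *(.tcm_atomics)
 *   } > m_data_tcm        (whatever MEMORY region covers TCM in your script)
 */
#include <stdatomic.h>

/* Atomic variables placed in TCM so that LDREX/STREX and ordinary cached
 * loads/stores all see the same, uncached memory. */
__attribute__((section(".tcm_atomics")))
static atomic_uint shared_state = ATOMIC_VAR_INIT(0);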


Hope this will be useful for you.
Best regards!
/Carlos

ryanschaefer
Contributor IV

Hi Carlos,

Can you make sure that this gets included in the Errata?

Thanks,

Ryan

ryanschaefer
Contributor IV

CarlosCasillas As others have pointed out, it has been a year since NXP was notified of this bug. This needs to make it into the errata. Who do I need to reach out to to make sure that it is properly documented, if no one in this thread can escalate it?

TomE
Specialist II

> Who do I need to reach out to to make sure that it is properly documented, if no one in this thread can escalate this?

I suspect it would take a large manufacturer with a very large order book with NXP for them to assign new resources (people) to update old documents for current products.


From their perspective, maybe this problem isn't affecting anyone "important enough". It didn't get into whatever ticketing system they're using to track these issues.

Tom

CarlosCasillas
NXP Employee

Hi,

The issue is still being addressed internally, and a document is being prepared for publication soon.

Best regards!

/Carlos

nicotruc
Contributor I

Hi,

CarlosCasillas, has the document been published yet? Where can we find it?

Best Regards,
Nicolas

CarlosCasillas
NXP Employee

For anyone asking about the availability of the document, it is not available yet, and there is no ETA. If you require early access to the preliminary document, you should contact your local Sales/FAE to request access to the following space:
https://community.nxp.com/docs/DOC-340245 

Hope this will be useful for you.

Best regards!

/Carlos

dry
Senior Contributor I

Was this noted in any errata, and is this fixed now?

TomE
Specialist II

D. Ry asked:

> Was this noted in any errata, and is this fixed now?

Why don't you look?

Happy Birthday! It has been exactly 12 months since this was reported.

There are two Errata documents for this CPU on NXP's site:

- Mask Set Errata for Mask 3N09P: 8 Apr 2018

- Mask Set Errata for Mask 2N09P: 8 Jun 2017

The later one lists changes to e11166, e10574 and e7805.

e11166: OCRAM: The first 4K of OCRAM (0x910000 - 0x910fff) is not available during boot time

e10574: Watchdog: A watchdog timeout or software trigger will not reset the SOC

e7805: I2C: When the I2C clock speed is configured for 400 kHz, the SCL low period violates the I2C spec of 1.3 uS min

So there's still no mention of this problem in either of them. No match on "atomic" or "swap", and nothing relevant that I can find matching "cache".

There's something that makes this harder to prove. The two Errata documents are for two different mask revisions of the chip. There's nothing in either of them that lists which errata items have been FIXED in the later mask. You have to manually compare the two documents to see what might have gone away. For instance e10728 isn't listed in the later mask, so I guess that has been fixed. That also means that the 2N09P document would need a new revision to list this problem if it is present in that version.

It seems that "fixing it in later CM4 integration" didn't include putting a fix into 3N09P.

And a "request" to add it to the Errata wasn't enough.

Tom

dry
Senior Contributor I

Hi Tom,


In fact I did look, and as I didn't see any mention of this (atomic, cache, ...), I thought maybe the information was elsewhere.

Hence the question here; I thought this was a good place to double-check.

Hope that makes sense :)

> ... You have to manually compare the two documents to see what might have gone away. ... And a "request" to add it to the Errata wasn't enough.

 

Yep, this is nasty

ryanschaefer
Contributor IV

Hi Carlos and R&D,

Thank you for helping to identify the root cause.

Will the fix be implemented in the next revision of the i.MX7D? Is there a timeline for when the next revision will be available?

Thanks,

Ryan

TomE
Specialist II

If they were going to fix it in the next mask version (and if there is going to be a new mask version) then I'd expect the design team to have said "we'll fix it in the next mask version".

Instead they said:

> design team will fix it in later CM4 integration

That may be a strange choice of words, but without clarification I'd read that to mean "in the next product", as in "when we integrate the CM4 (ARM M4 module) into a different product".

Looking to history for answers, the memory throughput on the i.MX6D and i.MX6Q is half of what it should be due to "ERR003740", as detailed here:

https://community.nxp.com/thread/329671

That was never fixed in those chips. It was fixed in the i.MX6DualPlus and i.MX6QuadPlus chips. The same may happen with this problem. There may need to be a "plus" version that has fewer "minuses" than the previous one.

The people that write the errata have an interesting view of what their chips are used for, which may affect the importance given to particular errata. Check out

https://www.nxp.com/docs/en/errata/IMX7D_2N09P.pdf

e6939: Core: Interrupted loads to SP can cause erroneous behavior
Description: ARM Errata 752770: Interrupted loads to SP can cause erroneous behavior

If the Stack Pointer is being written to and an interrupt happens, the stack pointer can get corrupted. But the "Workaround" says:

Most compilers are not affected by this, so a workaround is not required.

Compilers may not mess with the stack pointer, but all threaded and multitasking operating systems do! Likewise "most compilers" don't generate those lock instructions you're having trouble with, but smarter multitasking systems, threading systems and operating systems do.

Tom
