i.MXRT1060 SEMC SDRAM Data Corruption

TomE · ‎10-19-2020

We are having apparent SDRAM Corruption problems. These are very intermittent, usually once every few HOURS of running. We are wondering if this is happening to anyone else, or if it matches any known problems.

We have this happening on our boards, but have been able to get it to fail on the NXP MIMXRT1060-EVK board as well.

We have an 8MiB DRAM at 0x80000000-0x80800000 and have configured the MPU to override the default WT cache attribute with WBWA to avoid ARM erratum #1259864. So we're using "Writeback" mode rather than the default "Writethrough".
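For readers who want to see what such a region setting looks like, here is a minimal sketch of building an ARMv7-M MPU_RASR value for a Write-Back Write-Allocate region. The bit positions are the architectural ones; the helper name `mpu_rasr_wbwa`, the choice of full-access permissions and the non-shareable setting are assumptions for illustration, not the poster's actual configuration.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-M MPU_RASR bit positions (architectural, not NXP-specific). */
#define RASR_ENABLE      (1u << 0)
#define RASR_SIZE_SHIFT  1u
#define RASR_B           (1u << 16)
#define RASR_C           (1u << 17)
#define RASR_TEX_SHIFT   19u
#define RASR_AP_SHIFT    24u
#define RASR_AP_FULL     3u  /* read/write for privileged and unprivileged code (assumed) */

/* Hypothetical helper: build MPU_RASR for a Normal, Write-Back
 * Write-Allocate, non-shareable region.
 * size_bytes must be a power of two and at least 32. */
static uint32_t mpu_rasr_wbwa(uint32_t size_bytes)
{
    assert(size_bytes >= 32u && (size_bytes & (size_bytes - 1u)) == 0u);

    /* The SIZE field encodes log2(region size) - 1. */
    uint32_t log2_size = 0u;
    while ((UINT32_C(1) << log2_size) < size_bytes) {
        log2_size++;
    }

    return (RASR_AP_FULL << RASR_AP_SHIFT)
         | (1u << RASR_TEX_SHIFT)   /* TEX=0b001 with C=1 and B=1 selects WBWA */
         | RASR_C
         | RASR_B
         | ((log2_size - 1u) << RASR_SIZE_SHIFT)
         | RASR_ENABLE;
}
```

On the target, the result for the 8MiB region would be written to the MPU region attribute register for a region based at 0x80000000, with the MPU disabled while it is being reprogrammed. The region number to use depends on the rest of the MPU setup.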

The problem seems to be that sometimes data fails to be written to SDRAM. We have full traces (using a high speed debug pod with instruction and data trace) showing "0" being written to and read from an SDRAM location and 20,800 trace lines later, a different value is written, but then the location is read back as "0". This tracing is from the CPU's view, and can't show if and how that data got flushed from the cache back to the SDRAM, or if it was written back at all.

We have checked and changed all the SEMC SDRAM timing parameters, making them tighter, and then more lenient. We've used the NXP SDK DCD SEMC settings. We've changed the port settings to higher and lower impedances, and with different slew rates. None of those changes affect this corruption.

We are not running the SDK sample code because it can't support the simultaneous operations we require. We are transferring 5MB/second over USB and writing to an EMMC at the same time, as well as having the CPU about 70% busy. The code is in SDRAM as is most of the data. We are using the I-TCM, D-TCM and OCRAM2 as well.

The USB and EMMC are using their own DMA to and from the SDRAM, overlapping with the CPU. Our drivers are performing all the Cache Flush and Invalidation operations required for this.
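As an illustration of one detail that cache maintenance around DMA usually needs, here is a minimal sketch that rounds a buffer out to whole 32-byte Cortex-M7 data cache lines before it is cleaned or invalidated. The type and function names are hypothetical, and this is not the driver code referred to in the post.

```c
#include <assert.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 data cache line size in bytes */

/* Hypothetical result type: a range covering whole cache lines. */
typedef struct {
    uintptr_t start;   /* first byte, rounded down to a line boundary */
    uint32_t  length;  /* byte count, a multiple of the line size */
} cache_range_t;

/* Expand [addr, addr + len) so that it starts and ends on cache line
 * boundaries. Maintenance by address works on whole lines, so a DMA
 * buffer that shares a line with other data needs extra care. */
static cache_range_t cache_align_range(uintptr_t addr, uint32_t len)
{
    assert(len > 0u);

    const uintptr_t mask = ~(uintptr_t)(DCACHE_LINE_SIZE - 1u);
    uintptr_t start = addr & mask;
    uintptr_t end   = (addr + len + DCACHE_LINE_SIZE - 1u) & mask;

    cache_range_t range = { start, (uint32_t)(end - start) };
    return range;
}
```

On the target, the aligned range would typically be passed to the CMSIS routines SCB_CleanDCache_by_Addr (before a DMA engine reads the buffer) or SCB_InvalidateDCache_by_Addr (after a DMA engine has written it). Their exact parameter types differ between CMSIS versions, so check the core_cm7.h in use.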

This failure can be made to happen more often by increasing the SEMC SDRAM refresh rate to very high rates. It fails every few minutes when configured like that.

We have found two ways to stop the corruption.

Setting the fn_mod register of the m_b_0 (Cortex-M7) port of the NIC-301 interconnect to limit the write issuing capability to one (write '2' to address 0x41442108) appears to stop it from happening.

Setting the "DISRAMODE: Disables dynamic read allocate mode for Write-Back Write-Allocate memory regions" in the ARM CPU also stops it. This is achieved by setting 0xe000e008 to 0x00001800.
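The two register writes described above can be sketched as follows. The addresses and values are the ones quoted in the post; the function and macro names are hypothetical, and the registers are passed in as pointers so the same code can be exercised off-target. Note that the second write replaces the whole register value, as the post describes, rather than doing a read-modify-write of a single bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Addresses and values as quoted in the post. */
#define NIC301_M_B_0_FN_MOD_ADDR  0x41442108u
#define NIC301_FN_MOD_VALUE       0x00000002u  /* limits write issuing to one */
#define CM7_ACTLR_ADDR            0xE000E008u
#define CM7_ACTLR_VALUE           0x00001800u  /* value that sets DISRAMODE per the post */

/* Hypothetical helper: a single 32-bit register write. */
static void write_reg32(volatile uint32_t *reg, uint32_t value)
{
    assert(reg != NULL);
    *reg = value;
}

/* Workaround 1: limit the Cortex-M7 port's write issuing capability. */
static void limit_cm7_write_issuing(volatile uint32_t *fn_mod)
{
    write_reg32(fn_mod, NIC301_FN_MOD_VALUE);
}

/* Workaround 2: disable dynamic read allocate mode. */
static void disable_dynamic_read_allocate(volatile uint32_t *actlr)
{
    write_reg32(actlr, CM7_ACTLR_VALUE);
}
```

On the target the calls would look like `limit_cm7_write_issuing((volatile uint32_t *)NIC301_M_B_0_FN_MOD_ADDR)` and `disable_dynamic_read_allocate((volatile uint32_t *)CM7_ACTLR_ADDR)`, run from privileged code. Only one of the two was needed to stop the corruption according to the post.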

Both of these affect the maximum rate that the CPU can burst to the SDRAM, and it is this high burst rate (from the CPU through the caches, the AXI, NIC, SEMC and to the SDRAM) that seems to be triggering this problem. Increasing the SDRAM Refresh rate probably applies "backpressure" on the memory system and the write pipeline.

We have this failing on the EVK, but there's no way we could generate a program that demonstrates this based solely on the SDK code, so please don't ask for that.

Tom

Hui_Ma · ‎10-26-2020

Hi Tom,

Thanks for the patience.

I think you have already received the attached explanation from our local i.MX RT product team.

The SEMC module has a re-order feature, which can cause issues when multiple AXI masters access the SDRAM with large/burst data operations (back-to-back operations).

Please check the attached PDF file for detailed information.

About the BMCRx register values: we have submitted a request to the SDK software team to change both BMCRx registers to 0x81 to avoid similar issues.

Thanks for the attention.

Mike

iblackfin · ‎03-27-2024

I've spent 2 weeks fighting intermittent faults, brushing up code, etc. Everything got sorted once I found this thread. It is not great that NXP has still not published this issue in an Errata ....

davisf · ‎03-27-2024

@iblackfin honestly it's pretty awful that NXP has not released an official errata and updated the example code here.

This has cost probably hundreds of thousands of dollars of effort across just the people commenting in this thread, to say nothing of everyone who found this but didn't bother to make an account and comment.

TomE · ‎10-20-2020

That App Note doesn't mention DMA at all.

The latest Errata (Chip Errata for the i.MX RT1060 (REV 1.1)) doesn't detail any Core/DMA conflict. It only mentions SEMC/NAND problems.

Is there Errata or a document detailing this problem?

Is this a problem in the CPU, the NIC-301 or the SEMC?

The DMA and M7 access looks to be mediated by the NIC-301. Are there any register changes in there that might help? I've already swapped the CPU and DMA priority, and it didn't seem to help.

There are also registers in the SEMC (BMCR0, BMCR1) that look to control "8 entries command reordering in Queue B; QoS priority, latency and efficiency adjustable arbitration scheme." There's nothing in the Reference Manual detailing what they do or how to use them. There's nothing in the App Note you referenced that does either. Is anything available detailing what these do and how to set them? Could they change this problem?

Disabling data cache on the SDRAM would slow the system down terribly. We've already found two different ways to "reduce the bandwidth" that makes the problem go away, but we'd prefer a fix that doesn't slow the system down.

Tom

Hui_Ma · ‎10-20-2020

Hi Tom,

I double checked with local i.MX RT product team about this issue.

They have the following suggestion for you to try first:

Please try changing the SEMC registers BMCR0 and BMCR1 to 0x81.

Please let us know the test result. Thanks.

best regards,

Mike

TomE · ‎10-21-2020

> Please try to change SEMC registers register BMCR0 and BMCR1 to 0x81.

We tried that last night with three of our units and with the NXP Evaluation board.

All ran reliably with this modification, so that change fixes this problem.
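A minimal sketch of the change being tested here: both BMCR registers are set to 0x81. The register order at the start of the SEMC block (MCR, IOCR, BMCR0, BMCR1) and the base address 0x402F0000 are assumptions taken from the SDK device header as best remembered, and the struct and function names are hypothetical. Verify them against the header and reference manual for your part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the first four SEMC registers. */
typedef struct {
    volatile uint32_t MCR;    /* offset 0x00 */
    volatile uint32_t IOCR;   /* offset 0x04 */
    volatile uint32_t BMCR0;  /* offset 0x08 */
    volatile uint32_t BMCR1;  /* offset 0x0C */
} semc_head_regs_t;

#define SEMC_BASE_ASSUMED     0x402F0000u  /* assumed i.MX RT1060 SEMC base */
#define SEMC_BMCR_WORKAROUND  0x00000081u  /* value recommended in this thread */

/* Apply the workaround value to both bus master control registers. */
static void semc_apply_bmcr_workaround(semc_head_regs_t *semc)
{
    assert(semc != NULL);
    semc->BMCR0 = SEMC_BMCR_WORKAROUND;
    semc->BMCR1 = SEMC_BMCR_WORKAROUND;
}
```

On the target this would be called as `semc_apply_bmcr_workaround((semc_head_regs_t *)SEMC_BASE_ASSUMED)`. If the SDRAM is brought up by a DCD block or a debugger script instead of C code, the same two values would need to be changed there as well.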

What did that value change actually do? We would like to have some understanding of the fix.

I notice that these register values have been changed in the past. This document details a previous problem, without saying what it was and what the fix was:

https://mcuxpresso.nxp.com/api_doc/dev/1891_doc/MCUXpresso%20SDK%20Release%20Notes%20for%20EVK-MIMXR...

2.0.4
Bug Fixes
* Fixed the SEMC queueA and queueB weight configuration issue

One of the difficulties we had was that the boards could run for many hours before this problem showed up with something that we noticed, usually a Crash through one of the Exceptions, or with the "Asserts" we have in the code. We have no idea how many "undetected corruptions" we were getting, if any. We had to find ways to make these errors more frequent so we could characterize them, and test changes (like the BMCRn change).

Anyone else having intermittent problems that look like ours might like to know how to make them more frequent to help with their tests.

We found that making the SDRAM Refresh extremely frequent made the SDRAM corrupt more often. Slowing the SEMC clock down also helped make it fail. We changed CCM_CBCDR[SEMC_PODF] from "2" to "7" (166MHz down to 62MHz) and changed SDRAM_CR3 (Refresh) from the usual "0x3c1e0b09" to "0x0a09010f". That is trying to trigger an 8-burst refresh every 9 clocks. That usually gets us a failure within a minute, but with the BMCRn changes it ran all night.
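To make the two stress settings easier to compare, here is a minimal sketch that splits an SDRAM_CR3 value into fields and computes the SEMC clock from the divider. The field layout (REN in bit 0, REBL in bits 3:1, PRESCALE in bits 15:8, RT in bits 23:16, UT in bits 31:24) is an assumption based on the SDK header as best remembered. The root clock of about 498MHz is inferred from the quoted 166MHz at a divider setting of 2 and is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed SDRAM_CR3 field layout; check against the reference manual. */
typedef struct {
    uint32_t ren;       /* bit 0: refresh enable */
    uint32_t rebl;      /* bits 3:1: refresh burst length minus one */
    uint32_t prescale;  /* bits 15:8: prescaler period */
    uint32_t rt;        /* bits 23:16: refresh timer period */
    uint32_t ut;        /* bits 31:24: refresh urgent threshold */
} sdramcr3_fields_t;

static sdramcr3_fields_t sdramcr3_decode(uint32_t value)
{
    sdramcr3_fields_t f;
    f.ren      = value & 0x1u;
    f.rebl     = (value >> 1) & 0x7u;
    f.prescale = (value >> 8) & 0xFFu;
    f.rt       = (value >> 16) & 0xFFu;
    f.ut       = (value >> 24) & 0xFFu;
    return f;
}

/* SEMC clock for a given root clock and 3-bit SEMC_PODF value.
 * The divider is PODF + 1. */
static uint32_t semc_clock_hz(uint32_t root_hz, uint32_t podf)
{
    assert(podf <= 7u);
    return root_hz / (podf + 1u);
}
```

Under these assumptions the usual value 0x3c1e0b09 decodes to a burst of 5 refreshes (REBL = 4) with RT = 30 and PRESCALE = 11, while the stress value 0x0a09010f decodes to a burst of 8 refreshes (REBL = 7) with RT = 9 and PRESCALE = 1.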

Tom

Hui_Ma · ‎10-21-2020

Hi Tom,

Glad to know the issue was fixed.

I am checking with the i.MX RT product team about the explanation (why setting BMCR0 and BMCR1 to 0x81 fixes the issue).

I will update here when there is any feedback.

Thanks for the patience.

Mike

TomE · ‎10-22-2020

Could you also please advise the MCUXpresso Team to look at the values that they recommend in the SDK and to make any required changes.

I find it a little confusing as there are three very different sets of values for these registers in the SDK. To that we can add the fourth set that you have just relayed to us. The field values range widely and are very different when compared to each other.

To detail these, in the "SDK_2.8.2_MIMXRT1062xxxxA.zip" file I retrieved today, there are:

(1) 213 "dcd.c" files: BMCR0 = 0x00030524, BMCR1 = 0x06030524
(2) 160 ".jlinkscript" files: BMCR0 = 0x00030524, BMCR1 = 0x06030524
(3) 10 ".mex" files: BMCR0 = 0x00030524, BMCR1 = 0x06030524
(4) bl_semc.c: BMCR0 = 0x00404085, BMCR1 = 0x00400085
(5) fsl_semc.c: BMCR0 = 0x00104085, BMCR1 = 0x40246085
(6) Today's advice: BMCR0 = 0x00000081, BMCR1 = 0x00000081

(4) is middleware/mcu-boot/src/drivers/semc/bl_semc.c
(5) is devices/MIMXRT1062/drivers/fsl_semc.c
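To compare these value sets field by field, a small decoder helps. The sketch below assumes the BMCR layout from the SDK header as best remembered: WQOS in bits 3:0, WAGE in bits 7:4, WSH (BMCR0) or WPH (BMCR1) in bits 15:8, WRWS in bits 23:16, and WBR in bits 31:24 (BMCR1 only). Treat that layout and all the names in the code as assumptions to be checked against the header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed BMCR0/BMCR1 weight fields. */
typedef struct {
    uint32_t wqos;        /* bits 3:0: weight of QoS */
    uint32_t wage;        /* bits 7:4: weight of aging */
    uint32_t wsh_or_wph;  /* bits 15:8: WSH in BMCR0, WPH in BMCR1 */
    uint32_t wrws;        /* bits 23:16: weight of read/write switch */
    uint32_t wbr;         /* bits 31:24: weight of bank rotation, BMCR1 only */
} bmcr_fields_t;

/* Extract `width` bits starting at bit `shift`. */
static uint32_t bmcr_bits(uint32_t value, uint32_t shift, uint32_t width)
{
    assert(width >= 1u && width < 32u && shift + width <= 32u);
    return (value >> shift) & ((UINT32_C(1) << width) - 1u);
}

static bmcr_fields_t bmcr_decode(uint32_t value)
{
    bmcr_fields_t f;
    f.wqos       = bmcr_bits(value, 0u, 4u);
    f.wage       = bmcr_bits(value, 4u, 4u);
    f.wsh_or_wph = bmcr_bits(value, 8u, 8u);
    f.wrws       = bmcr_bits(value, 16u, 8u);
    f.wbr        = bmcr_bits(value, 24u, 8u);
    return f;
}
```

Under this layout, 0x00000081 keeps only WQOS = 1 and WAGE = 8 and sets every other weight to zero, whereas the dcd.c value 0x06030524 for BMCR1 has WBR = 6, WRWS = 3, WPH = 5, WAGE = 2 and WQOS = 4.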

Tom

davisf · ‎03-11-2024

@Hui_Ma we ran into this issue as well and it cost us about 300+ engineer-hours, almost delayed our product release and forced our team to work weekends to debug this issue. Also cc @brian14 , @JorgeCas , @Alejandro_Salas

Can NXP please, please issue an official errata for this issue, and update the example code and defaults to something that works?

This is an unbelievably hard issue to debug and the surface-level failures look like memory management issues that could be happening in other graphics/GUI-related libraries, which leads to massive distractions and non-productive investigations of a bottomless pit. We got very lucky that we began investigating the SDRAM itself and stumbled into this thread.

Thanks for surfacing this and continuing to push on it @TomE, we owe you one.

TomE · ‎03-11-2024

This bug lost us about two months back in 2020, with two engineers working on it for most of that time. When we finally had it characterised well enough (seeing memory not getting written) we got a reply from NXP about this apparently known problem.

Then nearly 3 years later, euginjeyapradee reported they'd wasted 6 months on this before finding this Community Post.

Then you've just wasted over 300 hours on the same thing before finding this post.

I don't know what it would take to get an official Errata written on this. A lawsuit? A hundred million dollar customer getting annoyed?

I've been trying to get errata for problems Motorola caused in 2001 that caused me grief in 2010 with no luck.

Tom

TomE · ‎10-26-2020

Thank you.

That explanation matches the corruption we were seeing.

That and the explanation of how the workaround functions gives us confidence that this problem won't come back.

Tom

TomE · ‎10-28-2020

When the SEMC is used as intended (programming the queues for "best operation"), it performs the operations in the wrong order and corrupts the memory. It fails like this when used as intended.

It fails like this when programmed by the SDK, or using the SDK as an example. All of the example code that is "out there" has it enabled.

The workaround effectively disables the function of the queue.

This matches the criteria for documenting this as an Errata Item. Detail the problem, give the workaround and document which version of the SDK has the workaround applied.

I would hope to see an updated Errata item soon.

Tom

TomE · ‎03-30-2021

It is now about 5 months later, and there's still no Errata addressing this issue. The latest Errata for this part was issued on 3 November 2020 (OK, so only a few weeks after this was documented here), but there hasn't been another one yet documenting this problem.

Tom

TomE · ‎09-14-2021

It is now 11 months later and there's still no published Errata for this problem.

Does anyone know if the sample code has been fixed yet?

Tom

TomE · ‎11-27-2022

25 months now. I've read through "Chip Errata for i.MX RT1060_A" and "Chip Errata for i.MX RT1060_B" and there's no mention of this problem.

Does anyone know if the sample code has been changed? Are there any App Notes detailing SEMC programming?

Tom

euginjeyapradee · ‎06-09-2023

Tom,

We have faced the same problem with the i.MX RT1050 and have been chasing the issue for more than 6 months; we were supposed to launch the product by December 2022. The software suddenly crashes after 7 days, and the crash happens in a Crank GUI framework library. We spent quite a lot of time narrowing the problem down to the library and thought that the library was corrupting the SDRAM. We also tried reducing the SDRAM clock and altering the SDRAM timing to provide a wider clock window. The issue seems to occur more quickly when we do all that. So we even wrote SDRAM test software to test the SDRAM, and we weren't getting any problems when running it. Now I understand why the SDRAM test was not failing: the stress would be much lower compared to the actual software, which does many more things.

I recently came across your post, which looked promising. We tried the same solution that was proposed in your post and were able to resolve the issue.

Changing the AXI bus Queuing/reordering scheme should not corrupt the SDRAM. The intention of having those settings is for optimizing the performance. So if some re-ordering scheme is causing a memory corruption, then there is a problem in the logic that handles the queuing scheme. So this calls for assigning an Errata to this issue, and having that would have helped us a lot in resolving the issue. We got to the point of scrapping this project, and the timely help from your data in the forum helped us a lot in resolving this issue. We were chasing down the problem using the dcd configuration files available at several places in the SDK and we didn't know which one was correct.

I would still suggest that this should become part of the Errata, as it is a chip-related issue in the queuing mechanism between the AXI bus and the SEMC controller, so that anyone facing the problem can quickly refer to the Errata and the proposed solution in it.

TomE · ‎06-09-2023

I'm glad my posts have been able to help you.

You wrote:

Changing the AXI bus Queuing/reordering scheme should not corrupt the SDRAM. The intention of having those settings is for optimizing the performance. So if some re-ordering scheme is causing a memory corruption, then there is a problem in the logic that handles the queuing scheme.

I previously wrote:

When the SEMC is used as intended (programming the queues for "best operation"), it performs the operations in the wrong order and corrupts the memory. It fails like this when used as intended.

There's no "logic fault". The SEMC read and write operations come through different "pipes" into the SEMC. When programmed "by default" and as all the SDK code does, the READ pipe seems to have a higher priority than the WRITE pipe. So this has the CPU reading stale data, and crashing your complicated (and busy) system and our complicated (and busy) system. But not any of the simple demo programs that NXP publish and run on this chip. Technically they just have to fix their SDK. Fixing the manual to say how this device works would help too. But since this has been "wrong everywhere", I agree that the only way to stop us CUSTOMERS from finding this product just doesn't work is to document it somewhere we are likely to find it, meaning as an Errata. Otherwise, as you nearly found, we stop using that chip and anything else from NXP on the back of that bad experience.

You'll notice I've been asking for an Errata on this for years. My current record in not getting a fix relates to a pair of problems with the "PIT" peripherals in the Coldfire range where the "bug" (bad documentation, bad sample code) goes back to last millennium with the PIT in the MMC2107. I've been trying to get an Errata on this since 2010:

https://community.nxp.com/t5/ColdFire-68K-Microcontrollers/PIT-hw-boo-boo-Read-if-you-need-accurate-...

I've had other problems with this chip and the example software. We had to rewrite the USB drivers from scratch. The Quad Timer code doesn't work very well, and the module dates back to 2007 or 2012:

https://community.nxp.com/t5/i-MX-RT/i-MXRT-Quad-Timer-QTMR-Are-There-Any-Working-Code-Examples/m-p/...

https://community.nxp.com/t5/i-MX-RT/Has-anybody-used-Timer-Overflow-Flag-with-the-QTMR/m-p/920458

The Manuals document registers in the Flexcan modules that they don't have (this should have been fixed by now, but I haven't checked). Basically the FlexCAN chapters documented a completely different model of the hardware than is in the chip:

https://community.nxp.com/t5/i-MX-RT/IMXRT1062-Hardfault-Reading-CAN3-ERFCR-Register/m-p/906147#M313...

LPSPI sample code takes 96us to set the baud rate!

https://community.nxp.com/t5/i-MX-RT/MIMXRT1051CVL5B-manual-for-LPSPI/m-p/844349#M1703

Then there's the ongoing minor annoyance of all of the spelling errors that their editing system should be catching, but doesn't: I've spell-checked the IMXRT manuals and they have the same mistakes as the i.MX6 ones do, like "writting", "wrotten" (but not "wrutten" :-). That sort of thing makes me suspicious of the technical content:

https://community.nxp.com/t5/i-MX-Processors/i-MX53-i-MX6-Reference-Manual-Spelling-Problems/m-p/341...

Tom

euginjeyapradee · ‎06-10-2023

Yes. Now I understand. I have read through the attached "RT106x SEMC BMCR register issue" document. It gives a better explanation of the issue. If I understand the fix correctly, they are basically disabling re-ordering, which is what these registers were meant to control. So, with re-ordering enabled through WRWS, when a write and a read request for the same address are placed in the queue by the d-cache, re-ordering could happen due to several other pending requests, and there is a possibility of the d-cache read happening before the write. It then crashes, as the write should have happened before the read. I believe this is a flaw and should have been handled appropriately when the d-cache is involved and re-ordering is enabled. This warrants an Errata.

I had a problem where USB was failing in the middle when we used the USB host mass storage class drivers to write data to a USB flash drive. But the USB buffers were using a non-cacheable section of memory in the SDRAM, and there should not be any d-cache involvement when writing to or reading from the buffers. The only way I could fix the problem was to move the USB buffers to the internal RAM (DTCM) and allocate the memory properly with proper alignment through heap4.
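The ordering hazard described here can be illustrated with a toy model. It is only a sketch of the idea (a read to an address being serviced before an earlier queued write to the same address) and makes no claim about how the SEMC arbitration actually works. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { OP_WRITE, OP_READ } op_kind_t;

typedef struct {
    op_kind_t kind;
    uint32_t  value;  /* data for writes, ignored for reads */
} mem_op_t;

/* Apply queued operations to a single memory cell and return the value
 * seen by the last read. With reads_first set, every read is serviced
 * before any write, which models reads overtaking older writes. */
static uint32_t toy_last_read(uint32_t initial, const mem_op_t *ops,
                              size_t count, bool reads_first)
{
    assert(ops != NULL);

    uint32_t cell = initial;
    uint32_t seen = initial;

    if (reads_first) {
        for (size_t i = 0; i < count; i++) {
            if (ops[i].kind == OP_READ) {
                seen = cell;  /* read completes before the older write */
            }
        }
        for (size_t i = 0; i < count; i++) {
            if (ops[i].kind == OP_WRITE) {
                cell = ops[i].value;
            }
        }
    } else {
        for (size_t i = 0; i < count; i++) {
            if (ops[i].kind == OP_WRITE) {
                cell = ops[i].value;
            } else {
                seen = cell;
            }
        }
    }
    return seen;
}
```

With a queue holding a write of 0x1234 followed by a read, in-order servicing returns 0x1234 while the reordered case returns the old contents. That is the same symptom as in the original trace, where a location was written with a new value and then read back as "0".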

TomE · ‎06-18-2023

> I had a problem where USB

Make sure that wasn't due to this at least 15 year old documentation problem with all of the USB modules:

https://community.nxp.com/t5/i-MX-RT/ATDTW-Bit-in-USB-Command-Register-again/td-p/798908

Note that it was wrong before 2008, corrected for some devices, then somehow wrong again in all new manual revisions for a very long time from 2010, 2012 and later. It is correct in the current (and 2018) IMXRT1060 Reference Manual, but since it was wrong for so long, software has been written based on the wrong definitions.

Tom

Hui_Ma · ‎10-20-2020

Hi Tom,

There is AN12437 about i.MX RT series performance optimization.

The SDRAM data corruption during write operations was caused by the ARM Cortex-M7 core and DMA conflicting when writing to SDRAM at the same time. The customer slowed down the ARM Cortex-M7 core write bandwidth by setting the fn_mod register of the m_b_0 (Cortex-M7) port of the NIC-301 interconnect to limit the write issuing capability. That could reduce the possibility of conflicts.

Could you try disabling the DCache for the SDRAM memory range and check whether that fixes the issue?

Thanks for the attention.

Mike