Using DDR Octal PSRAM with the MCXN947


Introduction

 

The MCXN947 is the flagship part in the MCX N family of microcontrollers. The superset part includes 2MB of internal flash memory and 512KB of internal SRAM. Additional memory can be added to an MCX N system design using the FlexSPI controller. FlexSPI is a peripheral that enables access to SPI (Serial Peripheral Interface) memories via the internal AHB bus. Though the term SPI often implies a single-bit, synchronous serial data link, the concept has been extended to quad and octal data paths.

Figure: The MCX FlexSPI Peripheral

A key point about FlexSPI is that it enables SPI connected memories to appear in the system memory map. The details of the SPI transactions are managed by the FlexSPI controller. This feature enables Execute-in-Place (XIP) capability, where the CPU executes directly from the external memory.

The FlexSPI controller has two ports, each of which can be further subdivided into two separate interfaces, allowing a maximum of 4 devices if required. SPI based memories are typically accessed in a burst mode, where a controller reads/writes data in 512-byte blocks. The block access nature of these memories can present a challenge for typical MCU use cases. The MCX N integrates a 16KB cache with a 64-bit data path, known as CACHE64, in front of the FlexSPI controller.

Figure: The MCX CACHE64 Module

The CACHE64 provides spatial and temporal locality of FlexSPI data to the system, smoothing the block transactional nature of the external SPI memories. The cache makes SPI based memories suitable for direct execution or data storage without necessarily incurring a large performance penalty.

Many FlexSPI use cases focus on extending flash memory for non-volatile storage of code and data. There are, however, SPI based PSRAM (Pseudo-Static RAM) devices in the marketplace that allow volatile RAM to be added as well. PSRAM is an interesting blend of technologies. The core memory technology is dynamic RAM with built-in control/refresh logic. The external interface looks like static RAM with Quad or Octal SPI data lines.

For the most timing critical operations, low-latency internal memory is the preferred storage method. However, using external PSRAM on the FlexSPI port can enable a large degree of application flexibility. Possible use cases for PSRAM on the MCX include:

  • Large model storage for the eIQ® Neutron NPU
  • Long delay lines for digital audio algorithms such as reverbs.
  • Implementing digital audio effects such as a looper
  • Implementing large applications which can be loaded from commodity storage such as SD cards or a USB Drive
  • Storage of dynamic graphics assets such as animations and bitmaps
  • Large audio buffers for a digital sampler/synthesizer using the onboard 14-bit DAC with oversampling to 16-bit
  • Long chains of image frame buffers for the SmartDMA/EZH camera interface
  • Deep capture buffers with the onboard ADC

 

Connecting an 8MB DDR Octal PSRAM to the MCX N

 

The device used for this paper was the 8MB AP Memory APS6408L-3OBM-BA.  

 

Figure: The APS6408L PSRAM in a 6mmx8mm BGA Package

The APS6408L-3OBM-BA is packaged in a small 6mmx8mm, 24-ball, 1mm pitch BGA. This footprint is commonly used for Octal SPI memories and there are several vendors who produce pin compatible devices.

Octal SPI memories are DDR capable, supporting data transfers on both edges of the clock. Higher speed octal devices can support up to 400MB/s transfers with a 200MHz serial clock but generally use a lower voltage (1.8v) interface. 3.3v Quad/Octal devices are available but are generally limited to 133MHz serial clock rates. For this paper, we will use the APS6408L-3OBM-BA variant as it supports a 3.3v interface with a maximum clock speed of 133MHz, equating to a maximum data transfer rate of 266MB/s. Note that the MCX N supports 1.8v IO for the higher speed devices.

The FRDM-MCXN947 development board comes with a W25Q64 Quad SPI flash memory part installed. This part is packaged in a wide SOIC8 package.

Figure: The FRDM-MCXN947 Quad SPI Flash Configuration Default

However, the PCB design has pads for attaching either QuadSPI devices in a wide SOIC8 form factor or Octal SPI devices in the 6mmx8mm BGA24 form factor.

Figure: The FRDM-MCXN947 Supports Wide SOIC8 and 6mmx8mm BGA24 Device Packages

I chose the 3.3v variant of the AP Memory APS6408L as it was the simplest configuration to use with the FRDM-MCXN947. Removing the W25Q64 is straightforward with a hot air rework tool, exposing the BGA24 pads underneath.

Figure: Revealing the BGA24 pads on the FRDM-MCXN947

The APS6408L-3OBM-BA can be soldered with a hot air rework tool after adding some solder paste to the exposed pads.

Figure: The APS6408L-3OBM-BA soldered to the FRDM-MCXN947

 

Octal PSRAM Configuration and Test

 

There is currently no sample in the MCUXpresso SDK for using octal PSRAM with the MCXN947. However, the FlexSPI controller is very similar to that in the MIMXRT685.  Using a PSRAM sample from the i.MXRT685 SDK, I developed an example for the MCXN947.

https://github.com/wavenumber-eng/mcxn947_octal_psram

Inside the repository is a project named “bunny_octal_psram_test” which can be imported into MCUXpresso IDE v11.9.0 [Build 2144] [2024-01-05] or later. This sample performs a basic memory test of the entire array and implements some basic PSRAM transfer tests. The “bunny” naming convention comes from its origin in the bring-up code for the MCXN947 “BunnyBoard”, which also uses the APS6408L-3OBM-BA.

Figure: The MCXN947 BunnyBoard is an M.2 22mm x 22mm form factor module

The FlexSPI memory controller was designed to be a future proof interface that enables the MCX N to interface with virtually any external SPI based memory. Using a programmable look-up table (LUT) approach, the FlexSPI controller can be adapted to single bit, dual, quad, or octal/DDR memories as needed. The sample includes a simple LUT to configure for the APS6408L-3OBM-BA.

Figure: LUT configuration for the APS6408L-3OBM-BA

PSRAM devices typically require additional configuration for parameters such as read latencies and burst sizes. This sample provides some defaults that will work with the APS6408L-3OBM-BA. Control register access is performed with the FlexSPI SDK API and the custom LUT.

Figure: Accessing PSRAM internal registers over FlexSPI.

The APS6408L-3OBM-BA is specified for 3.3v operation and a maximum 133MHz clock. PLL1 was used to generate the 133MHz clock for the FlexSPI controller.

Figure: Configuring PLL1 to generate a 133MHz clock for the FlexSPI interface

Once the PSRAM is configured, it can be accessed using normal memory access patterns, such as through a pointer.

volatile uint32_t *psram = (volatile uint32_t *)(BUNNY_FLEXSPI_BASE_ADDRESS);

psram[0] = 0xAA551122;

It is possible to configure the build system such that the PSRAM can be used by the linker. Care must be taken to ensure that the PSRAM/FlexSPI is initialized before the standard C initialization and copy down routines. This is beyond the scope of this paper but will be a topic for a future paper.

 

PSRAM Performance Tests and Considerations

 

When using FlexSPI connected PSRAM, it is important to understand how the cache and the burst access nature of the PSRAM impact system performance. Real world scenarios often have a variety of memory access patterns, making it difficult to develop a single test to characterize performance. However, I typically run tests that operate the memories at various limits, with the understanding that real world performance will fall somewhere in between.

For this work, there were two limiting cases that were evaluated:

  • Large Block transactions
  • Random access transactions over a wide address space

 For the block transaction test, the code would read/write block sizes from 1KB to 32KB using both DMA and memcpy. The intent of this test case was to show the limiting behavior of the CPU interacting with the CACHE64 (best case scenario).

To achieve good memcpy performance, I linked this application against the newlib library. The newlib build used with MCUXpresso has a hand tuned, assembly language implementation of memcpy that performs better than redlib or newlib-nano. You can learn more about the newlib implementation via memfault’s excellent analysis.

The random access test performed 32-bit reads and writes to a predetermined range of randomized addresses. This test evaluates the limiting case where the transactions would almost always fall outside of the cache. The randomized access test cases included read-only, write-only and write-then-read transactions.

The CPU cycle counts for the core code paths performing the transfers were measured using the ARM SysTick timer. Tests were iterated 256 times and transfer rates were computed from the average cycle count over the iterations and the system clock rate. The clock cycle counting method was calibrated and the results are within a reasonable margin of error (+/-2 cycles).

Before measuring timings on the PSRAM, a control test was executed using a block of internal memory as a reference. Since the code paths for operating the DMA transfer, performing memcpy and implementing the random reads/writes have their own performance characteristics, the control test establishes a baseline for comparison.

As stated previously, the memcpy implementation uses hand tuned assembly from the newlib library. The DMA transfer code has minimal overhead: the timed code path initiates a DMA transfer using the SDK API and polls for completion. The random read/write code is a straightforward C for loop with double indexed array access. The test code was compiled with the -O2 optimization flag. The code used for the memory transfers does not represent any particular optimization or use case but is typical of what might be found in a real-world application.

 

Test Results

 

The results shown are copied from a serial debug terminal. Data was recorded, formatted, and printed by the MCXN947 PSRAM test firmware.

Figure: The memory control test using internal SRAM transfers

This test represents a control case: the source and destination buffers are both in internal SRAM, in RAM banks on different AHB ports. There are many interesting features in the control data, but for now we will consider this a baseline for how the test algorithms perform when using PSRAM.

Figure: Test Run #1 : 133MHz FlexSPI Octal PSRAM – No Cache

There were a few notable features in test run #1, which has the CACHE64 disabled.

  • The memcpy reads were in some cases faster than DMA. Some of this was to be expected from the control run data, but it was still surprising and warrants additional investigation into how the SDK uses the DMA controller.
  • PSRAM reads were generally much faster than writes in the block transfers. Some of this was to be expected based upon the published timing diagrams in the APS6408L-3OBM-BA datasheet, but the difference was quite remarkable and would warrant further study.
  • The random-access tests were quite slow, which was to be expected. Accessing random words will frequently trigger FlexSPI page transactions with the PSRAM.

Figure: Test Run #2 : 150MHz (overclocked) FlexSPI Octal PSRAM – No Cache

Test run #2 was identical to #1 except that the FlexSPI clock rate was increased to 150MHz. This overclocks the APS6408L-3OBM-BA PSRAM, which is not recommended in a production use case over the published temperature range. As expected, there was a slight increase in performance due to the faster clock.

Figure: Test Run #3 : 133MHz FlexSPI Octal PSRAM – Cache Enabled

Test Run #3 returns to the 133MHz clock rate and enables the CACHE64 module.

A few notable features in this dataset:

  • The DMA read/write characteristics match the control test. This is an indication that the CPU is primarily interacting with the cache, not the PSRAM.
  • Once the block size is larger than 16KB, we can see the read and write rates fall significantly. This is to be expected as this is the size of the CACHE64. When the block access is larger than 16KB, the FlexSPI peripheral needs to perform external accesses to fetch 512-byte pages (cache miss).
  • The random-access tests show that using the cache when reads/writes constantly miss incurs a strong performance penalty. When there is a cache miss, the FlexSPI fetches an entire 512-byte block from the PSRAM. It is important to consider the use case to avoid this penalty.
  • Inside of the 16KB cache boundary, random access performance is improved.

Figure: Test Run #4 : 150MHz (Overclocked) FlexSPI Octal PSRAM – Cache Enabled

Test Run #4 is a repeat of #3 with the FlexSPI running at 150MHz.

 

Final Thoughts

 

From this initial data we can observe the behavior of the FlexSPI controller coupled to an Octal PSRAM through the CACHE64 using some limiting test cases. Real world performance will vary, but this dataset and code can provide a starting point to assess suitability for a specific requirement. These test cases show some of the performance boundaries, so it is to be expected that real world performance will fall between these limits.

While it was out of scope for this paper, it is possible to execute code from FlexSPI/PSRAM. There is some precedent available with the LPC5536 microcontroller. It uses the same FlexSPI controller and a smaller 8KB CACHE64 module.

 NXP Application Note AN13591 provides data on XIP performance as compared to code executing from internal Flash:

https://www.nxp.com/docs/en/application-note/AN13591.pdf

Interestingly, CoreMark scores for the LPC5536 are nearly identical when running from internal flash, Octal SPI flash and Octal SPI HyperRAM/PSRAM.

Figure: 150MHz LPC5536 w/ MX25UM51345GXDI00 Octal Flash CoreMark Score vs Internal Flash (From AN13591)

 

Figure: 150MHz LPC5536 w/ W956D8MBYA5I Octal HyperRAM/PSRAM CoreMark Score vs Internal Flash (From AN13591)

For the most timing critical operations, low-latency internal memory is the preferred storage method. However, using external PSRAM on the FlexSPI interface can enable a large degree of flexibility in potential applications. Adding a large amount of volatile memory is simple from the PCB design point of view and does not add significantly to the system BOM.

Using the FRDM-MCXN947 is a simple way to evaluate a FlexSPI/PSRAM based design at a low cost. You can find more information about the FRDM-MCXN947 and the MCXN947 microcontroller at the following links.

https://www.nxp.com/design/design-center/development-boards-and-designs/general-purpose-mcus/frdm-de...

https://www.nxp.com/products/processors-and-microcontrollers/arm-microcontrollers/general-purpose-mc...

 

References:

 

Code reference used for tests in this paper

https://github.com/wavenumber-eng/mcxn947_octal_psram.git

Understanding newlib memcpy performance

https://interrupt.memfault.com/blog/memcpy-newlib-nano

FlexSPI CoreMark Performance on LPC553x/LPC55S3x

https://www.nxp.com/docs/en/application-note/AN13591.pdf

FRDM-MCXN947 Product Page

https://www.nxp.com/design/design-center/development-boards-and-designs/general-purpose-mcus/frdm-de...

MCXN947 Product Page

https://www.nxp.com/products/processors-and-microcontrollers/arm-microcontrollers/general-purpose-mc...

AP Memory APS6408L-3OBM-BA PSRAM Datasheet

https://www.apmemory.com/products/psram-iot-ram/

 

Last update: 07-19-2024 05:34 AM