How to get rid of CMA

What is CMA

The Contiguous Memory Allocator (CMA) is a framework that allows setting up a machine-specific configuration for physically contiguous memory management. Memory for devices is then allocated according to that configuration. The main role of the framework is not to allocate memory, but to parse and manage memory configurations and to act as an intermediary between device drivers and pluggable allocators. It is thus not tied to any memory allocation method or strategy.

Various devices on embedded systems have no scatter-gather and/or I/O map support and therefore require contiguous blocks of memory to operate. They include cameras, hardware video decoders and encoders, etc. Such devices often require big memory buffers (a full HD frame, for instance, is more than 2 megapixels, i.e. more than 6 MB of memory), which makes mechanisms such as kmalloc() ineffective. Some embedded devices impose additional requirements on the buffers, e.g. they can operate only on buffers allocated in a particular location or memory bank (if the system has more than one memory bank), or on buffers aligned to a particular memory boundary. Development of embedded devices has seen a big rise recently (especially in the V4L area), and many such drivers include their own memory allocation code. Most of them use bootmem-based methods. The CMA framework is an attempt to unify contiguous memory allocation mechanisms and provide a simple API for device drivers, while staying as customisable and modular as possible.
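
The frame-size figure quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes a 1920x1080 frame stored at 3 bytes per pixel (for example RGB888), and the helper name frame_bytes is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: bytes needed for one uncompressed frame. */
static size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}
```

Under those assumptions, frame_bytes(1920, 1080, 3) is 6220800 bytes, i.e. about 2.07 megapixels and a little over 6 MB that must be physically contiguous for a device without scatter-gather support.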

Why it is used in the default release

Most i.MX SoCs do not have an IOMMU for the IP blocks that require large contiguous memory for their operation, such as VPU/GPU/ISI/CSI. Where an IOMMU is present, its performance is not that good. In the default i.MX BSP, we therefore still allocate physically contiguous memory for those IP drivers for DMA transfers.

In the arm64 kernel, the DMA allocation API allocates memory in various ways, depending on the device configuration (in the dts) and the GFP flags. The table below shows how the DMA allocation API (for a device without an IOMMU enabled) finds a source of pages, trying in order: coherent pool -> CMA -> Buddy -> SWIOTLB:

Allocator (by order), with its configurations (w/o IOMMU), comments and mapping:

1. Coherent pool
  • Configurations: device DMA is not coherent, and the GFP flags do not allow blocking
  • Comments: by __alloc_from_pool()
  • Mapping: already mapped at boot, when the coherent pool is initialised in VMALLOC

2. CMA
  • Configurations: a device CMA or system CMA is present, and the GFP flags allow blocking (__GFP_DIRECT_RECLAIM set)
  • Comments: by cma_alloc()
  • Mapping: map_vm_area, mapped in VMALLOC

3. Buddy
  • Configurations: no CMA (device or system), or the GFP flags do not allow blocking
  • Comments: by __get_free_pages(), which can only allocate from the DMA/normal zone (lowmem), 32-bit address space
  • Mapping: already mapped in the lowmem area by the kernel at boot

4. SWIOTLB
  • Configurations: no contiguous pages from buddy, or the returned buffer region > device dma_mask
  • Comments: by map_single()
  • Mapping: already mapped at boot, when SWIOTLB is initialised

The following diagram shows the DMA allocation path:

pastedImage_6.png

By default, the kernel uses CMA as the backend for DMA buffer allocation in most cases. That is why the i.MX BSP uses CMA in the default release for GPU/VPU/CSI/ISI and other DMA transfer buffers.
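
The selection order described in this section (coherent pool -> CMA -> Buddy -> SWIOTLB) can be modelled as a small user-space decision function. This is a simplified sketch, not kernel code: the enum, the struct and the function name are all made up for illustration, and the real path has more conditions than the four summarised here.

```c
#include <stdbool.h>

/* Stand-in names for the four backends; these are not kernel identifiers. */
enum dma_backend {
    BACKEND_COHERENT_POOL,
    BACKEND_CMA,
    BACKEND_BUDDY,
    BACKEND_SWIOTLB,
};

/* Hypothetical summary of the inputs the allocation path looks at. */
struct dma_request {
    bool dev_coherent;       /* device DMA is cache coherent */
    bool gfp_allow_blocking; /* __GFP_DIRECT_RECLAIM is set */
    bool cma_present;        /* a device or system CMA area exists */
    bool buddy_has_pages;    /* buddy can supply contiguous pages */
    bool within_dma_mask;    /* those pages are reachable by the device */
};

/* Walk the backends in order and return the first one that applies. */
static enum dma_backend pick_backend(const struct dma_request *r)
{
    if (!r->dev_coherent && !r->gfp_allow_blocking)
        return BACKEND_COHERENT_POOL;
    if (r->cma_present && r->gfp_allow_blocking)
        return BACKEND_CMA;
    if (r->buddy_has_pages && r->within_dma_mask)
        return BACKEND_BUDDY;
    return BACKEND_SWIOTLB; /* bounce buffer as the last resort */
}
```

In this model, a blocking request on a system with CMA always lands in CMA, which is the default behaviour the rest of this document works around.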

CMA Pros & Cons

Pros

  • Well designed for large contiguous memory allocation, even when memory is fragmented.
  • Pages in CMA can be shared with the buddy system; it is not a reserved pool.
  • A CMA area can be device specific: used for allocations by that device only, while still shared with the system.
  • Its start address and size are easy to configure at run time, without recompiling the kernel.

Cons

  • Allocation is slow when pages have to be migrated.
  • Easily eaten up by system memory allocations. Customers may hit cma_alloc failures when the system is out of memory, which causes a bad user experience when a foreground application needs graphic buffers for rendering or the rear view camera (RVC) needs buffers when the car reverses.
  • Potential deadlock when cma_alloc() needs to migrate pages that are still being flushed to storage. (Some customers have already hit a deadlock when a page was in the writeback path of a FUSE file system and cma_alloc wanted to migrate it.) This was the initial motivation for writing this document.

Why get rid of CMA

Read the Cons above. The key point is to reserve memory for critical allocation paths, such as GPU graphic buffers and camera/VPU preview/recording buffers, so that allocation failures (which would cause a black screen, a stuck preview, etc.) do not spoil the user experience. It also avoids the potential deadlock when CMA and FUSE work together.

How to get rid of CMA

To get rid of CMA, the basic idea is to cut the CMA path out of DMA allocation and turn to the coherent pool (atomic pool) instead. Please note that the coherent pool can only be used through the DMA allocation API; it is not shared with the system buddy allocator.

1. Enable coherent pool

Add “coherent_pool=<size>” to the kernel command line. The coherent pool is actually allocated from the system default CMA, so the CMA size must be larger than coherent_pool.

There is no reference value for this size, as it varies from system to system and from use case to use case:

  • The biggest DMA consumer is the GPU; its usage can be monitored with the gmem_info tool. Monitor gmem_info under the typical use cases and settle on the memory the GPU requires.
  • The second DMA consumer to check is ISI/Camera; its usage depends on the V4L2 reqbuf size and count.
  • Check the VPU; its usage depends on the multimedia framework.
  • Add the usage of ALSA sound, USB and FEC.

The size must be verified by testing to make sure the system is stable.
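
The sizing exercise above amounts to adding up the measured needs of each DMA consumer and checking the total against the CMA size. The following is a minimal sketch of that bookkeeping; the struct, the function names and any numbers fed into it are hypothetical, and real figures have to come from measurement on the target system.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-consumer estimate, in bytes, measured under typical use. */
struct dma_consumer {
    const char *name;
    size_t bytes;
};

/* V4L2-style estimate: number of requested buffers times one buffer's size. */
static size_t v4l2_bytes(size_t buffer_count, size_t buffer_bytes)
{
    return buffer_count * buffer_bytes;
}

/* Sum all consumer estimates to get a candidate coherent_pool size. */
static size_t pool_budget(const struct dma_consumer *c, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += c[i].bytes;
    return total;
}

/* The pool is allocated from the default CMA area, so it must be smaller. */
static bool pool_fits_in_cma(size_t pool_bytes, size_t cma_bytes)
{
    return pool_bytes < cma_bytes;
}
```

A candidate value from pool_budget() that passes pool_fits_in_cma() is only a starting point for coherent_pool=<size>; it still needs the stability testing described above.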

2. DMA allocation hack

Patch arch/arm64/mm/dma-mapping.c to remove the gfpflags_allow_blocking check in the __dma_alloc() function:

diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 7015d3e..ef30b46 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -147,7 +147,7 @@ static void *__dma_alloc(struct device *dev, size_t size,

         size = PAGE_ALIGN(size);

-        if (!coherent && !gfpflags_allow_blocking(flags)) {
+        if (!coherent) { // && !gfpflags_allow_blocking(flags)) {
                 struct page *page = NULL;
                 void *addr = __alloc_from_pool(size, &page, flags);
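
The effect of this one-line change can be illustrated with a small user-space model of the condition before and after the patch. The two function names are made up for illustration; only the boolean logic is taken from the diff.

```c
#include <stdbool.h>

/* Original condition: only non-coherent, non-blocking requests use the pool. */
static bool pool_path_original(bool coherent, bool allow_blocking)
{
    return !coherent && !allow_blocking;
}

/* Patched condition: every request from a non-coherent device uses the pool. */
static bool pool_path_patched(bool coherent, bool allow_blocking)
{
    (void)allow_blocking; /* no longer consulted after the patch */
    return !coherent;
}
```

Note that in this model a coherent device still skips the pool after the patch, so the change only redirects allocations made on behalf of non-coherent devices.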

3. ION allocator

In both the Android and Yocto releases, the ION allocator (an Android staging driver) is used for VPU buffers, and by default it allocates from the ION CMA heap. This means that ION requests for contiguous memory go directly to CMA. To avoid CMA, we can use the carveout heap instead of the CMA heap in ION:

3.1 Android

Enable CARVEOUT heap, disable CMA heap:

CONFIG_ION=y
CONFIG_ION_SYSTEM_HEAP=y
-CONFIG_ION_CMA_HEAP=y
+CONFIG_ION_CARVEOUT_HEAP=y
+CONFIG_ION_CMA_HEAP=n

Adjust the carveout reserved heap base address and size in the dts:

/ {
    reserved-memory {
        #address-cells = <2>;
        #size-cells = <2>;
        ranges;

        carveout_region: imx_ion@0 {
            compatible = "imx-ion-pool";
            reg = <0x0 0xf8000000 0 0x8000000>;
        };
    };
};
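
When adjusting the reg property, it helps to decode the four cells into a base address and a size. With #address-cells = <2> and #size-cells = <2>, the first two cells form the 64-bit base and the last two the 64-bit size. The sketch below does that decoding in user space; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Join two 32-bit devicetree cells (high, low) into one 64-bit value. */
static uint64_t cells_to_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical decoded form of a reg entry with 2 address and 2 size cells. */
struct region {
    uint64_t base;
    uint64_t size;
};

/* Decode <addr_hi addr_lo size_hi size_lo> into base and size. */
static struct region decode_reg(const uint32_t cells[4])
{
    struct region r = {
        .base = cells_to_u64(cells[0], cells[1]),
        .size = cells_to_u64(cells[2], cells[3]),
    };
    return r;
}
```

For the reg value shown above, this gives a base of 0xf8000000 and a size of 0x8000000 (128 MiB), so the carveout ends at 0x100000000. Both values need to match the DRAM layout of the actual board.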

3.2 Linux

  • Kernel - refer to the attached patch for i.MX8QM. It is almost the same as for Android, but the ION carveout heap driver needs to be patched.
  • Gstreamer - apply the patch below to allocate from the carveout heap:

yocto/build-8qm/tmp/work/aarch64-mx8-poky-linux/gstreamer1.0-plugins-base/1.14.4.imx-r0/git:

diff --git a/gst-libs/gst/allocators/gstionmemory.c b/gst-libs/gst/allocators/gstionmemory.c
index 1218c4a..12e403d 100644
--- a/gst-libs/gst/allocators/gstionmemory.c
+++ b/gst-libs/gst/allocators/gstionmemory.c
@@ -227,7 +227,8 @@ gst_ion_alloc_alloc (GstAllocator * allocator, gsize size,
   }

   for (gint i=0; i<heapCnt; i++) {
-       if (ihd[i].type == ION_HEAP_TYPE_DMA) {
+       if (ihd[i].type == ION_HEAP_TYPE_DMA ||
+             ihd[i].type == ION_HEAP_TYPE_CARVEOUT) {
               heap_mask |= 1 << ihd[i].heap_id;
         }
   }
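
The heap selection that this patch changes can be sketched outside of GStreamer as follows. The enum and struct here are simplified local stand-ins, not the real ION definitions, and the sketch assumes heap IDs below 32; it only shows how accepting the carveout type as well keeps the mask non-empty once the CMA heap is disabled.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the ION heap types; check ion.h for the real ones. */
enum heap_type {
    HEAP_TYPE_SYSTEM,
    HEAP_TYPE_CARVEOUT,
    HEAP_TYPE_DMA,
};

/* Simplified stand-in for one entry of the heap query result. */
struct heap_info {
    enum heap_type type;
    uint32_t heap_id; /* assumed to be below 32 in this sketch */
};

/* Build a mask with one bit per heap whose type is DMA (CMA) or carveout. */
static uint32_t contiguous_heap_mask(const struct heap_info *heaps, size_t count)
{
    uint32_t mask = 0;

    for (size_t i = 0; i < count; i++) {
        if (heaps[i].type == HEAP_TYPE_DMA ||
            heaps[i].type == HEAP_TYPE_CARVEOUT)
            mask |= UINT32_C(1) << heaps[i].heap_id;
    }
    return mask;
}
```

With only a system heap and a carveout heap registered, a check for the DMA type alone would leave the mask at 0, which is why the carveout type has to be accepted too.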

Last update: 01-05-2020 07:55 PM