imx6 memory bandwidth problem

ericb
Contributor II

Dear all,

The performance of my video processing project on the i.MX6 seems to be limited by the memory bandwidth.

I have benchmarked the memory bandwidth, and the result is only about 10% of the theoretical value.

Configuration 1:

Freescale SABRE board (MCIMX6Q-SDB)

Memory: 1 GB DDR3 SDRAM, up to 533 MHz (1066 MT/s)

Ubuntu L3.0.35_1.1.0_121218

Configuration 2:

BoundaryDevices Nitrogen6X

Memory: 1 GB of 64-bit wide DDR3 @ 532 MHz

Ubuntu L3.0.35_4.0.0_UBUNTU_RFS / or LTIB 4.0.0 / or Original Demo SD

Theoretical memory bandwidth:

DDR3-1066 x 64 bits => 1066 x 8 bytes = 8528 MB/s
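The peak figure above is just the transfer rate multiplied by the bus width in bytes. A minimal sketch of that arithmetic follows; the function name `ddr_peak_mb_per_s` is made up for illustration, and the 64-bit bus width is taken from the board descriptions in this post.

```c
#include <assert.h>

/* Peak DDR bandwidth in MB/s (1 MB = 10^6 bytes).
 * mega_transfers_per_s: data rate in MT/s (1066 for DDR3-1066).
 * bus_width_bits:       data bus width in bits (64 here, as stated in the post).
 * This is an upper bound that assumes one transfer on every clock edge. */
double ddr_peak_mb_per_s(double mega_transfers_per_s, unsigned bus_width_bits)
{
    assert(bus_width_bits > 0 && bus_width_bits % 8 == 0);
    return mega_transfers_per_s * (bus_width_bits / 8.0);
}
```

For example, `ddr_peak_mb_per_s(1066.0, 64)` evaluates to 8528 MB/s.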

Measured memory bandwidth:

I use the mbw tool. I have checked the source: method 1 is the memcpy test (method 0 is a plain 'for' loop, and method 2 copies a small memory block that stays cached).

# mbw -t1 100

AVG     Method: DUMB    Elapsed: 0.24973        MiB: 100.00000  Copy: 400.426 MiB/s

This 'memcpy' both reads and writes memory, so the real memory traffic is 2 x 400 = 800 MB/s, just 10% of the theoretical value.

I made my own memory test project, and the results are equivalent.
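For readers who want to reproduce this kind of measurement, here is a minimal sketch of a memcpy timing loop. It is not the attached MemTest project or the mbw source; the function names are hypothetical, it assumes a POSIX system with `clock_gettime`, and it ignores cache effects such as write-allocate traffic.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Average memcpy copy rate in MiB/s over `iterations` copies of `bytes` bytes.
 * Returns a negative value if allocation or timing fails. */
double memcpy_copy_rate_mib_s(size_t bytes, int iterations)
{
    assert(bytes > 0 && iterations > 0);

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);

    volatile unsigned char sink = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        src[0] = (unsigned char)i;  /* vary the input between copies */
        memcpy(dst, src, bytes);
        sink ^= dst[bytes - 1];     /* consume the result so the copy is kept */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(src);
    free(dst);

    double seconds = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (seconds <= 0.0) {
        return -1.0;
    }
    return ((double)bytes * iterations / (1024.0 * 1024.0)) / seconds;
}

/* Convert a copy rate in MiB/s to memory traffic in MB/s (10^6 bytes/s),
 * counting one read and one write for every copied byte. */
double copy_rate_to_traffic_mb_s(double copy_mib_s)
{
    return copy_mib_s * 1.048576 * 2.0;
}
```

Applied to the mbw result quoted above, `copy_rate_to_traffic_mb_s(400.426)` gives roughly 840 MB/s, which is still about 10% of 8528 MB/s.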

I guess the display uses part of the memory bandwidth, but where is the other 90%?

If this result is normal, this limitation will stop my project.

Are there any ideas to help me move forward?

Thank you very much.


3 Replies

Yuri
NXP Employee

1.
Regarding "Theoretical memory bandwidth: DDR3 1066 x 64 bits => 1066*8 = 8528 MB/s":

Such a calculation, which assumes that data is transferred on every clock edge, is very "theoretical" :-).
A real DDR access burst needs a preparation stage (bus arbitration, RAS phase, CAS phase, CAS latency),
and only after that do we get data on every clock edge. So, as a rough estimate, let's divide 8528 MB/s by 2.


2. 

Next, screen refresh for high-resolution modes may require significant bus throughput.
For example, at 1600 x 1200, 32 bits per pixel, 60 Hz refresh: 1600 x 1200 x 4 bytes x 60 = ~460 MB/s

3.
If the VPU codecs are used, each frame must at least be read from the frame buffer, encoded or decoded, and written back,
so for 1920x1088 @ 30 fps: 1920 x 1088 x 4 bytes x 30 x 2 = ~500 MB/s

4.

You wrote about memcpy tests: the question is whether memcpy is fully optimized for ARMv7,
for example, whether it uses NEON instructions.
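The estimates in points 1 to 3 can be combined into a rough bandwidth budget. The sketch below only restates that arithmetic; the function names are hypothetical, and the efficiency factor of 0.5 is the rule of thumb from point 1, not a measured value.

```c
#include <assert.h>

/* Memory traffic in MB/s (10^6 bytes/s) needed to stream frames.
 * passes: how many times each frame crosses the bus
 *         (1 for display refresh, 2 for a read plus a write-back). */
double stream_mb_per_s(unsigned width, unsigned height,
                       unsigned bytes_per_pixel, unsigned frames_per_s,
                       unsigned passes)
{
    return (double)width * height * bytes_per_pixel * frames_per_s * passes / 1e6;
}

/* Bandwidth left for the CPU after derating the theoretical peak and
 * subtracting display and VPU traffic. All values are in MB/s. */
double remaining_mb_per_s(double peak, double efficiency,
                          double display, double vpu)
{
    assert(efficiency > 0.0 && efficiency <= 1.0);
    return peak * efficiency - display - vpu;
}
```

With the figures from this reply, `stream_mb_per_s(1600, 1200, 4, 60, 1)` gives 460.8 MB/s, and `remaining_mb_per_s(8528.0, 0.5, 460.8, 0.0)` leaves about 3803 MB/s when the VPU is idle.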

ericb
Contributor II

Thank you for your answers.

3.

I do nothing during the test. The VPU is not used.

4.

I use standard memcpy provided by gcc.

The project is compiled with : -mfloat-abi=softfp -mfpu=neon -mcpu=cortex-a9 -march=armv7-a -fprefetch-loop-arrays

I have attached the source code of my MemTest project (mbw code based).

I have added multithreading to the test, and it improves the bandwidth (R/W):

1 thread  : 407.559 MB/s
2 threads : 605.705 MB/s
3 threads : 642.874 MB/s
4 threads : 636.358 MB/s

Including the display, the limit seems to be 640 x 2 + 460 = 1740 MB/s.

a.

How can I check the DDR frequency?

- dmidecode is not implemented on ARM

- lshw fails with an error:

# lshw -short -C memory

> Unhandled fault: external abort on non-linefetch (0x018) at 0x2b3c0000

b.

Is it possible for an application to do a memory copy with a DMA transfer?


Yuri
NXP Employee

You may try the GNU assembler example below for copying; it uses NEON instructions:

/* void transfer_eight_words_vld(int *dst, const int *src) */
transfer_eight_words_vld:
  vld1.64 {d0,d1,d2,d3}, [r1]        /* load 32 bytes (eight words) from src */
  vst1.64 {d0,d1,d2,d3}, [r0]        /* store them to dst */
  mov pc, lr                         /* return */

Note that there is no need to save and restore general-purpose registers,
as is usually required for C functions, because none of them is modified.

Also, before using it, NEON must be enabled.

/* Enable NEON */

/* --- Enable NEON/VFP coprocessor access */
    ldr     r0, =ARM_CACR_CONFIG                /* r0 = Coprocessor Access Control Register (CPACR) configuration */
    mcr     p15, 0, r0, c1, c0, 2               /* update CPACR */
    isb

/* --- Enable VFP */
    fmrx     r0, fpexc
    orr      r0, r0, #VFP_NEON_ENABLE
    fmxr     fpexc, r0

/* --- Set VFP to runfast mode */
    fmrx     r0, fpscr
    orr      r0, r0, #VFP_RUN_FAST
    fmxr     fpscr, r0
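For a test program written in C, the same 32-byte load/store pattern as transfer_eight_words_vld can be sketched with NEON intrinsics. This is an illustration, not part of the reply above: the function name `copy_32b_blocks` is hypothetical, the NEON path is only compiled when the compiler defines `__ARM_NEON` (for example with -mfpu=neon), and other builds fall back to memcpy. Note that the enable sequence above uses privileged instructions, so it applies to bare-metal or kernel code rather than to a user-space application.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Copy `bytes` bytes from src to dst, 32 bytes per loop iteration,
 * mirroring the vld1.64/vst1.64 {d0-d3} pair. Buffers must not overlap. */
void copy_32b_blocks(uint8_t *dst, const uint8_t *src, size_t bytes)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 32 <= bytes; i += 32) {
        uint8x16_t lo = vld1q_u8(src + i);       /* load bytes 0..15  */
        uint8x16_t hi = vld1q_u8(src + i + 16);  /* load bytes 16..31 */
        vst1q_u8(dst + i, lo);
        vst1q_u8(dst + i + 16, hi);
    }
#endif
    /* Tail bytes (or the whole buffer on builds without NEON). */
    if (i < bytes) {
        memcpy(dst + i, src + i, bytes - i);
    }
}
```

Whether this beats the library memcpy on a given board has to be measured; a real copy routine would typically also add prefetching and alignment handling.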