Dear all,
The performance of my video processing project on the i.MX6 seems to be limited by memory bandwidth.
I have benchmarked the memory bandwidth and the result is just 10% of the theoretical value.
Configuration 1:
Freescale SABRE board (MCIMX6Q-SDB)
Memory: 1 GB DDR3 SDRAM up to 533 MHz (1066 MT/s)
Ubuntu L3.0.35_1.1.0_121218
Configuration 2:
BoundaryDevices Nitrogen6X
Memory: 1 GB of 64-bit wide DDR3 @ 532 MHz
Ubuntu L3.0.35_4.0.0_UBUNTU_RFS / or LTIB 4.0.0 / or Original Demo SD
Theoretical memory bandwidth:
DDR3 1066 x 64 bits => 1066 * 8 = 8528 MB/s
Measured memory bandwidth:
I use the mbw tool. I checked the source: the memcpy method is test 1 (0 is a 'for' loop, and 2 copies a small, cached memory block).
# mbw -t1 100
AVG Method: DUMB Elapsed: 0.24973 MiB: 100.00000 Copy: 400.426 MiB/s
This 'memcpy' reads and writes the memory, and the real memory bandwidth is 2x400 = 800 MB/s, just 10% of the theoretical value.
I made my own memory test project, and the results are equivalent.
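For readers who want to reproduce this kind of measurement, here is a minimal sketch of a memcpy-based copy-rate test. It is not the MemTest or mbw source; the function name memcpy_rate_mib_s and its parameters are hypothetical, and the empty asm statement is a GCC/Clang-specific compiler barrier. It reports the copy rate only, so the read plus write traffic is twice the returned figure, as in the 2x400 estimate above.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not part of mbw): copy `bytes` bytes `iterations`
 * times with memcpy and return the average copy rate in MiB/s.
 * Returns -1.0 if allocation fails or the elapsed time is not positive. */
double memcpy_rate_mib_s(size_t bytes, int iterations)
{
    assert(bytes > 0 && iterations > 0);

    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers so the pages are mapped before timing starts. */
    memset(src, 0xA5, bytes);
    memset(dst, 0x00, bytes);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        memcpy(dst, src, bytes);
        /* GCC/Clang compiler barrier: keeps the copies from being elided. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mib = (double)bytes * (double)iterations / (1024.0 * 1024.0);

    free(src);
    free(dst);
    return seconds > 0.0 ? mib / seconds : -1.0;
}
```

A call such as memcpy_rate_mib_s(100u * 1024u * 1024u, 10) would use the same 100 MiB array size as the mbw run above; the iteration count is an arbitrary choice.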
I guess the display uses part of the memory bandwidth, but where is the other 90%?
If this result is normal, this limitation will stop my project.
Are there any ideas that could help me move forward?
Thank you very much.
1.
As for "Theoretical memory bandwidth:
DDR3 1066 x 64 bits => 1066*8 = 8528 MB/s".
Such calculations, assuming that data are provided at every clock edge, are very "theoretical" :-).
A real DDR burst access needs some preparation stages: bus arbitration, RAS phase, CAS phase,
and CAS latency; only after that do we get data at every clock edge. So, as a rough estimate, let's divide 8528 MB/s by 2.
2.
Next, screen refresh for high resolution modes may require significant bus throughput.
For example, at 1600 x 1200, 32 bits per pixel, 60 Hz refresh: 1600 x 1200 x 4 bytes x 60 = ~460 MB/s
3.
If VPU codecs are used, each frame must at least be read from the frame buffer, encoded or decoded, and written back,
so for 1920x1088@30fps: 1920 x 1088 x 4 bytes x 30 x 2 = ~500 MB/s
4.
You wrote about memcpy tests: the question is whether memcpy is fully optimized for ARMv7,
for example, whether it uses NEON instructions.
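The estimates in points 1 to 3 can be checked with a few lines of C. This is only an arithmetic sketch; the function names are hypothetical, and the factor of 2 for the VPU case (one read plus one write-back per frame) is taken from point 3 above, using the 1920x1088 frame size named there.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: peak DDR rate in MB/s from the transfer rate in
 * mega-transfers per second and the bus width in bits. */
uint64_t ddr_peak_mb_per_s(uint32_t mega_transfers_per_s, uint32_t bus_bits)
{
    assert(bus_bits % 8 == 0);
    return (uint64_t)mega_transfers_per_s * (bus_bits / 8);
}

/* Hypothetical helper: bytes per second needed to move frames of the given
 * size, where `passes` counts how many times each frame crosses the bus
 * (1 for display refresh, 2 for a read plus a write-back). */
uint64_t frame_traffic_bytes_per_s(uint32_t width, uint32_t height,
                                   uint32_t bytes_per_pixel,
                                   uint32_t frames_per_s, uint32_t passes)
{
    return (uint64_t)width * height * bytes_per_pixel * frames_per_s * passes;
}
```

With these, ddr_peak_mb_per_s(1066, 64) gives 8528, frame_traffic_bytes_per_s(1600, 1200, 4, 60, 1) gives 460800000 (about 460 MB/s), and frame_traffic_bytes_per_s(1920, 1088, 4, 30, 2) gives 501350400 (about 500 MB/s).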
Thank you for your answers.
3.
I do nothing during the test. The VPU is not used.
4.
I use standard memcpy provided by gcc.
The project is compiled with : -mfloat-abi=softfp -mfpu=neon -mcpu=cortex-a9 -march=armv7-a -fprefetch-loop-arrays
I have attached the source code of my MemTest project (mbw code based).
I have added multi-threading to the test, and it improves the bandwidth (R/W):
1 thread : 407.559 MB/s
2 threads : 605.705 MB/s
3 threads : 642.874 MB/s
4 threads : 636.358 MB/s
The limit seems to be (with the display) 640*2 + 460 = 1740 MB/s
a.
How can I check the DDR frequency ?
- dmidecode is not implemented on arm
- lshw has an error
# lshw -short -C memory
> Unhandled fault: external abort on non-linefetch (0x018) at 0x2b3c0000
b.
Is it possible, for an application, to do a memory copy with a DMA transfer ?
You may try the (GNU asm) example below for copying (it uses NEON instructions) :
/* void transfer_eight_words_vld(int* dst, const int *src) */
transfer_eight_words_vld:
vld1.64 {d0,d1,d2,d3}, [r1]
vst1.64 {d0,d1,d2,d3}, [r0]
mov pc, lr /* return */
Note that there is no need to save and restore general-purpose registers,
as is usually required for C functions, because none are used.
Also, before using it, NEON must be enabled.
/* Enable NEON */
/* --- enable NEON VFP */
ldr r0, =ARM_CACR_CONFIG /* ; r0 = CPACR configuration */
mcr p15, 0, r0, c1, c0, 2 /* ; update CPACR (Coprocessor Access Control Register) */
isb
/* --- Enable VFP */
fmrx r0, fpexc
orr r0, r0, #VFP_NEON_ENABLE
fmxr fpexc, r0
/* --- Set VFP to runfast mode */
fmrx r0, fpscr
orr r0, r0, #VFP_RUN_FAST
fmxr fpscr, r0
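For comparison with transfer_eight_words_vld above, the same 32-byte copy can be sketched in C with NEON intrinsics from arm_neon.h. The function name transfer_eight_words_c is hypothetical, and the memcpy fallback is only there so the file still builds on a compiler without NEON support; whether the intrinsic version is faster than the library memcpy on a given board would have to be measured.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical C counterpart of transfer_eight_words_vld:
 * copies eight 32-bit words (32 bytes) from src to dst. */
void transfer_eight_words_c(int32_t *dst, const int32_t *src)
{
    assert(dst != NULL && src != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Two 128-bit loads followed by two 128-bit stores. */
    int32x4_t lo = vld1q_s32(src);
    int32x4_t hi = vld1q_s32(src + 4);
    vst1q_s32(dst, lo);
    vst1q_s32(dst + 4, hi);
#else
    /* Fallback for builds without NEON (assumption: correctness only). */
    memcpy(dst, src, 8 * sizeof(int32_t));
#endif
}
```

On the i.MX6 this would be built with the same -mfpu=neon option quoted earlier in the thread; a copy loop would call it once per 32-byte block.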