The i.MX6Q has four Cortex-A9 cores. How do they perform on floating-point computation?
Here is our test:
The i.MX6Q has four Cortex-A9 cores running at 1.2 GHz. We first ran a virtual computation test on one core.
Virtual computation means computing entirely in registers: the core performs no data input or output, so the test measures the full working speed of the CPU core. We wrote the following code:
__asm__ volatile (
"vmov.i32 q0, #0 \n" /* clear q0..q4 (all-zero bits are 0.0f) */
"vmov.i32 q1, #0 \n"
"vmov.i32 q2, #0 \n"
"vmov.i32 q3, #0 \n"
"vmov.i32 q4, #0 \n"
"mov r5, %[count] \n" /* r5 = loop counter */
"1: \n"
"vmla.f32 q0, q1, q1 \n" /* each vmla.f32: 4 multiplies + 4 adds */
"vmla.f32 q1, q2, q2 \n"
"vmla.f32 q2, q3, q3 \n"
"vmla.f32 q3, q4, q4 \n"
"vmla.f32 q4, q2, q2 \n"
"subs r5, r5, #1 \n"
"bgt 1b \n" /* runs exactly ncount iterations */
: /* no outputs */
: [count] "r" (ncount)
: "q0", "q1", "q2", "q3", "q4", "r5", "cc");
Each loop iteration performs 20 multiplications and 20 additions, i.e. 40 floating-point operations. We ran 500,000 iterations.
The result is 3.98 GFLOPS (32-bit float). This is the computing power of one core. As you can see, the Cortex-A9 is a rather powerful core.
Then we wrote a multi-threaded program to test the real computing power of the i.MX6Q. The core of the program is the thread function:
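As a rough sketch of how such a figure can be derived from a timed run: multiply the operations per iteration by the iteration count and divide by the elapsed time. The post does not show its timing code, so everything below is an assumption: the helper names (now_seconds, gflops, time_kernel, spin_kernel) are made up for illustration, and spin_kernel is only a portable placeholder for the NEON loop.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical helper: seconds read from a monotonic clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* GFLOPS = operations per loop * loops / seconds / 1e9.
 * For the NEON loop above, flop_per_loop would be 40. */
static double gflops(double flop_per_loop, double loops, double seconds)
{
    assert(seconds > 0.0);
    return flop_per_loop * loops / seconds * 1e-9;
}

/* Portable placeholder kernel (NOT the NEON loop): 2 FLOP per iteration. */
static void spin_kernel(long loops)
{
    volatile float acc = 0.0f;
    for (long i = 0; i < loops; i++)
        acc = acc * 0.5f + 1.0f; /* 1 multiply + 1 add */
}

/* Time one call of kernel(loops) and convert the result to GFLOPS.
 * Returns 0.0 if the clock did not advance. */
static double time_kernel(void (*kernel)(long), long loops, double flop_per_loop)
{
    double t0 = now_seconds();
    kernel(loops);
    double elapsed = now_seconds() - t0;
    return elapsed > 0.0 ? gflops(flop_per_loop, (double)loops, elapsed) : 0.0;
}
```

For example, 500,000 iterations of a 40-operation loop finishing in 5 ms would give gflops(40.0, 500000.0, 0.005), i.e. 4 GFLOPS. A run that short is sensitive to timer overhead, so a larger loop count or several repetitions may give a steadier number.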
void* multiply(void * slice)
{
    int s = ((int *)slice)[0];        // retrieve the thread number
    int from = (s * dim)/num_thrd;    // start row of matrix computing
    int to = ((s+1) * dim)/num_thrd;  // end row of matrix computing
    int i, j, k, kk;
    float tmp, *pA, *pB;
    for(kk=0; kk<(to-from)*dim; kk++) {
        j = kk%dim; i = from+kk/dim;
        // dot product of row i of A with row j of B; dim must be a multiple of 8
        tmp=0.0; k=dim; pA = &(A[i*dim]); pB = &(B[j*dim]);
        // the loads run one iteration ahead, so the last pass reads
        // 8 floats past the end of each row
        __asm__ volatile (
        "vmov.f32 q8, #0.0 \n\t"               // two accumulators, 4 lanes each
        "vmov.f32 q9, #0.0 \n\t"
        "vld1.f32 {d0,d1,d2,d3}, [%1]! \n\t"   // q0,q1 = 8 floats from A
        "vld1.f32 {d4,d5,d6,d7}, [%2]! \n\t"   // q2,q3 = 8 floats from B
        "1: \n\t"
        "vmla.f32 q8, q0, q2 \n\t"
        "vmla.f32 q9, q1, q3 \n\t"
        "vld1.f32 {d0,d1,d2,d3}, [%1]! \n\t"
        "vld1.f32 {d4,d5,d6,d7}, [%2]! \n\t"
        "subs %3, %3, #8 \n\t"
        "bgt 1b \n\t"
        "vadd.f32 q8, q8, q9 \n\t"             // reduce 8 partial sums to 1
        "vpadd.f32 d0, d16, d17 \n\t"
        "vadd.f32 %0, s0, s1 \n\t"
        : "=w"(tmp), "+r"(pA), "+r"(pB), "+r"(k)
        : /* No inputs */
        : "q0", "q1", "q2", "q3", "q8", "q9", "cc", "memory");
        C[i*dim+j] = tmp;
    }
    return NULL;                               // required for a void* thread function
}
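The thread function expects "slice" to point to an int holding the thread index, from 0 to num_thrd-1. The sketch below shows one way such a function could be launched with pthreads. The post does not show its driver, so this is an assumption throughout: run_threads is a made-up name, the global definitions are guesses based on the names the function uses, and multiply_ref is a portable scalar stand-in for the NEON version with the same row slicing. It also assumes, from the indexing &(B[j*dim]), that B holds the right-hand matrix transposed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Assumed globals: the names come from the post, the definitions do not. */
static int dim, num_thrd;
static float *A, *B, *C;

/* Scalar stand-in for the NEON thread function; same slicing by rows. */
static void *multiply_ref(void *slice)
{
    int s = *(int *)slice;                 /* thread index */
    int from = (s * dim) / num_thrd;       /* first row for this thread */
    int to = ((s + 1) * dim) / num_thrd;   /* one past the last row */
    for (int i = from; i < to; i++)
        for (int j = 0; j < dim; j++) {
            float sum = 0.0f;
            for (int k = 0; k < dim; k++)
                sum += A[i * dim + k] * B[j * dim + k];
            C[i * dim + j] = sum;
        }
    return NULL;
}

/* Hypothetical driver: fills A with 1.0 and B with 2.0, runs the threads,
 * and leaves the product in C. Returns 0 on success, -1 on failure.
 * The caller frees A, B and C. */
static int run_threads(int n, int threads)
{
    assert(n > 0 && threads > 0);
    dim = n;
    num_thrd = threads;
    size_t count = (size_t)n * (size_t)n;
    A = malloc(count * sizeof *A);
    B = malloc(count * sizeof *B);
    C = malloc(count * sizeof *C);
    pthread_t *tid = malloc((size_t)threads * sizeof *tid);
    int *ids = malloc((size_t)threads * sizeof *ids);
    if (!A || !B || !C || !tid || !ids) {
        free(tid);
        free(ids);
        return -1;
    }
    for (size_t x = 0; x < count; x++) {
        A[x] = 1.0f;
        B[x] = 2.0f;
    }
    int created = 0;
    for (int t = 0; t < threads; t++) {
        ids[t] = t;                        /* each thread gets its own slot */
        if (pthread_create(&tid[t], NULL, multiply_ref, &ids[t]) != 0)
            break;
        created++;
    }
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);
    free(tid);
    free(ids);
    return created == threads ? 0 : -1;
}
```

Each thread receives a pointer to its own element of ids rather than to a shared loop variable, so the index cannot change before the thread reads it. With the test data above, every element of C should come out as 2*n.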
We tested the program by multiplying two 1000x1000 matrices. The total computing load is 2 GFLOP (2 x 1000^3 floating-point operations), and the results are:
1 thread (using one core): 190 MFLOPS
2 threads (using two cores): 380 MFLOPS
4 threads (using four cores): 740 MFLOPS
You may be able to optimize this code further, but we tried very hard with all kinds of optimization methods and the results did not change much. The four cores' combined performance is above the speed of a single 64-bit AXI bus. I think the reason is that the i.MX6Q has two 64-bit AXI buses, each assigned to two cores.
Compared with the virtual computation speed of 3.98 GFLOPS per core, there is a big difference. I do not understand why the performance of one core is so much lower than the throughput of the 64-bit AXI bus (2.1G) and the performance of the GC2000 GPU. Maybe two cores share one 64-bit bus so that one core can use only half of its bandwidth, or maybe reading and writing memory takes more time than on the GPU?
Still, the real computing power of the i.MX6Q is impressive.
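One way to relate the measured MFLOPS to memory traffic is to count the bytes the inner loop asks for. Each pass through the loop issues two vld1.f32 loads of 32 bytes each and two vmla.f32 on 4 lanes, which is 64 bytes for 16 floating-point operations, or 4 bytes per operation. The sketch below only does that arithmetic. The names are made up for illustration, and the result is the rate of load requests from the core, not bus traffic: how much of it actually reaches the AXI bus depends on cache hits, which this does not model.

```c
#include <assert.h>
#include <stdio.h>

/* Per inner-loop pass of the NEON dot product:
 * 2 loads x 32 bytes, 2 vmla.f32 x 4 lanes x 2 operations. */
enum { BYTES_PER_PASS = 64, FLOP_PER_PASS = 16 };

/* Load-request rate in MB/s implied by a measured MFLOPS figure. */
static double load_rate_mb_per_s(double mflops)
{
    assert(mflops >= 0.0);
    return mflops * BYTES_PER_PASS / FLOP_PER_PASS;
}

/* Print the implied rate for the three measured results. */
static void print_load_rates(void)
{
    const double measured[] = { 190.0, 380.0, 740.0 }; /* MFLOPS, 1/2/4 threads */
    for (int i = 0; i < 3; i++)
        printf("%.0f MFLOPS -> %.0f MB/s of loads\n",
               measured[i], load_rate_mb_per_s(measured[i]));
}
```

By this count, 190 MFLOPS corresponds to 760 MB/s of loads for one thread and 740 MFLOPS to 2960 MB/s for four threads.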
Could you show the compilation flags you used?
gcc -pthread -mcpu=cortex-a9 -mfpu=neon -lrt
Hi, sorry to trouble you. Could you send me the whole program? I am not familiar with C programming, so I do not know what "slice" should be. Thank you. zhangqingsdu@gamil.com