Performance issue about Cortex-M4 of i.MX6 SoloX

guohu · ‎04-28-2016

Hi NXP experts,

While developing on the i.MX6 SoloX, I ran into a confusing result when measuring memory read/write performance with the cache enabled.

The following are my test results, which count the number of memory accesses (read, write, or both read and write) completed in 1 second on SoloX and Vybrid:

Cache enabled

+----------------------------------------------------+------------+-------------+
| Operation type                                     | SoloX data | Vybrid data |
+----------------------------------------------------+------------+-------------+
| Large Array Read&Write with 4-byte interleave      |    4929846 |     7102300 |
| Large Array Read with 4-byte interleave            |   15981136 |     9388063 |
| Large Array Write with 4-byte interleave           |   11219092 |     9742647 |
| Large Array Read&Write with cache line interleave  |    3224001 |     2894259 |
| Large Array Write with cache line size interleave  |   10966110 |     2937975 |
| Large Array Read with cache line size interleave   |    6235732 |     3746016 |
+----------------------------------------------------+------------+-------------+

Cache disabled

+----------------------------------------------------+------------+-------------+
| Operation type                                     | SoloX data | Vybrid data |
+----------------------------------------------------+------------+-------------+
| Large Array Read&Write with 4-byte interleave      |    1275320 |      790198 |
| Large Array Read with 4-byte interleave            |    1505184 |      927335 |
| Large Array Write with 4-byte interleave           |    1778737 |     1112651 |
| Large Array Read&Write with cache line interleave  |    1275292 |      790146 |
| Large Array Write with cache line size interleave  |    1778841 |     1112651 |
| Large Array Read with cache line size interleave   |    6235732 |     3746016 |
+----------------------------------------------------+------------+-------------+

The Cortex-M4 on SoloX runs at 227 MHz, while the one on Vybrid runs at 166 MHz.

The 4-byte interleave means 1 cache miss followed by 7 cache hits.

With cache line size (32 bytes) interleave, every access is a cache miss.
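
The two interleave patterns described above can be sketched as a strided walk over the array. This is only an illustration, not the original benchmark: the function names are hypothetical, the buffer is assumed to be aligned to a 32-byte cache line, and the line counter is a simple model of a sequential walk over an array much larger than the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u   /* line size stated in the post */

/* Hypothetical helper: read-modify-write one 32-bit word every
 * stride_bytes. Returns the number of words touched. */
static size_t walk_read_write(volatile uint32_t *buf, size_t buf_bytes,
                              size_t stride_bytes)
{
    size_t touched = 0;

    assert(stride_bytes >= sizeof(uint32_t));
    assert(stride_bytes % sizeof(uint32_t) == 0);
    for (size_t off = 0; off + sizeof(uint32_t) <= buf_bytes; off += stride_bytes) {
        buf[off / sizeof(uint32_t)] += 1u;   /* one read and one write */
        touched++;
    }
    return touched;
}

/* Hypothetical helper: number of distinct cache lines the same walk
 * enters, assuming buf starts on a cache line boundary. */
static size_t lines_touched(size_t buf_bytes, size_t stride_bytes)
{
    size_t lines = 0;
    size_t last_line = (size_t)-1;

    assert(stride_bytes > 0);
    for (size_t off = 0; off + sizeof(uint32_t) <= buf_bytes; off += stride_bytes) {
        size_t line = off / CACHE_LINE_BYTES;
        if (line != last_line) {
            lines++;
            last_line = line;
        }
    }
    return lines;
}
```

For a 1024-byte buffer, a 4-byte stride touches 256 words in 32 lines (one new line per 8 accesses), while a 32-byte stride touches 32 words in 32 lines (a new line on every access).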

SoloX outperforms Vybrid in most cases, which is reasonable because it runs at a higher frequency.

However, when each iteration does both a read and a write with 4-byte interleave, SoloX performs worse than Vybrid.

With cache line size (32 bytes) interleave, on the other hand, the read-and-write numbers look fine.
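
One way to compare the two parts independently of clock speed is to convert each count into core cycles per access. The sketch below assumes that each table entry is the number of loop iterations completed in one second and that the cores run at the stated 227 MHz and 166 MHz; the helper name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: core clock cycles spent per counted access,
 * given the core frequency in Hz and the accesses completed in 1 s. */
static double cycles_per_access(double core_hz, unsigned long accesses_per_second)
{
    assert(accesses_per_second > 0);
    return core_hz / (double)accesses_per_second;
}
```

With the cache-enabled figures, the 4-byte read&write case works out to roughly 46 cycles per access on SoloX against roughly 23 on Vybrid, whereas the 4-byte read-only case is roughly 14 cycles on SoloX against roughly 18 on Vybrid. Under these assumptions the regression is specific to the mixed read/write pattern rather than to cached accesses in general.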

Is there a reasonable explanation for the worse performance of SoloX when doing both memory reads and writes with a high cache hit rate (about 7/8)?

Is there any configuration that would affect this?

Thanks.

Best Regards,

Guohu

guohu · ‎06-07-2016

Hi Igor,

I have been trying ProSupport@nxp.com for many days, but there has been no response.

Is there any way to ping them?

Thanks.

Regards,

Guohu

igorpadykov · ‎06-07-2016

Hi Guohu

please forward your request through your local FAE channel

http://www.nxp.com/files/abstract/global/FREESCALE_SALES_OFFICES.html

Best regards

igor

guohu · ‎05-25-2016

Hi Igor,

Thank you!

But when I selected Professional Engineer Services, clicked "Request a quote for Services", and submitted the contents, the website redirected me to a "Page Not Found" page.

Do you have any idea about it?

Best Regards,

Guohu

igorpadykov · ‎05-26-2016

Hi Guohu

please try

ProSupport@nxp.com

Best regards

igor

guohu · ‎05-26-2016

Thanks.

I have tried it and am waiting for a response.

igorpadykov · ‎05-25-2016

Hi Guohu

this is a specialized question that requires analysis of the different processor architectures, and I am not aware of any such performance analysis efforts.

For that reason, I suggest applying to NXP Professional Services:

http://www.nxp.com/support/nxp-professional-services:PROFESSIONAL-SERVICE

Best regards

igor

karina_valencia · ‎05-25-2016

igorpadykov can you continue with the follow up?

guohu · ‎05-15-2016

Hi,

Does anyone have any idea about this?

guohu · ‎05-05-2016

To reduce interference from the operating system, I lock interrupts and use the SysTick timer for the timing measurement.

The following is my test routine and its objdump output:

    #define CNT_1MHZ    800000
    #define SYSTICK_RVR (*(volatile UINT32 *)0xe000e014)   /* SysTick reload value register */
    #define SYSTICK_CVR (*(volatile UINT32 *)0xe000e018)   /* SysTick current value register */

    void testLockIntPerf(void)
    {
        volatile UINT32 counter = 0;
        UINT32 savedRVR;
        UINT32 startCount = 0x00ffffff;   /* maximum 24-bit reload value */
        UINT32 endCount;
        int lockVal;

        savedRVR = SYSTICK_RVR;
        lockVal = intLock();

        /* Restart the down-counter from startCount. */
        SYSTICK_RVR = startCount;
        SYSTICK_CVR = 0;

        while (counter < CNT_1MHZ)
        {
            counter++;
        }

        endCount = SYSTICK_CVR;
        intUnlock (lockVal);

        /* Restore the original reload value. */
        SYSTICK_RVR = savedRVR;
        SYSTICK_CVR = 0;

        printf ("last %u ticks, endCount = 0x%x\n", (startCount - endCount), endCount);
        return;
    }

    10000248 <testLockIntPerf>:
    10000248: e92d 41f0   stmdb sp!, {r4, r5, r6, r7, r8, lr}
    1000024c: b082        sub sp, #8
    1000024e: 4e16        ldr r6, [pc, #88] ; (100002a8 <testLockIntPerf+0x60>)
    10000250: 2700        movs r7, #0
    10000252: 9700        str r7, [sp, #0]
    10000254: f8d6 8000   ldr.w r8, [r6]
    10000258: f001 ff70   bl 1000213c <intCpuLock>
    1000025c: 4d13        ldr r5, [pc, #76] ; (100002ac <testLockIntPerf+0x64>)
    1000025e: f06f 447f   mvn.w r4, #4278190080 ; 0xff000000
    10000262: 6034        str r4, [r6, #0]
    10000264: 4912        ldr r1, [pc, #72] ; (100002b0 <testLockIntPerf+0x68>)
    10000266: 602f        str r7, [r5, #0]
    10000268: 9a00        ldr r2, [sp, #0]
    1000026a: 428a        cmp r2, r1
    1000026c: d205        bcs.n 1000027a <testLockIntPerf+0x32>
    1000026e: 9a00        ldr r2, [sp, #0]
    10000270: 3201        adds r2, #1
    10000272: 9200        str r2, [sp, #0]
    10000274: 9a00        ldr r2, [sp, #0]
    10000276: 428a        cmp r2, r1
    10000278: d3f9        bcc.n 1000026e <testLockIntPerf+0x26>
    1000027a: 682f        ldr r7, [r5, #0]
    1000027c: f001 ff66   bl 1000214c <intCpuMicroUnlock>
    10000280: f8c6 8000   str.w r8, [r6]
    10000284: 2300        movs r3, #0
    10000286: 480b        ldr r0, [pc, #44] ; (100002b4 <testLockIntPerf+0x6c>)
    10000288: 602b        str r3, [r5, #0]
    1000028a: 1be1        subs r1, r4, r7
    1000028c: 463a        mov r2, r7
    1000028e: f027 fe9f   bl 10027fd0 <printf>
    10000292: b002        add sp, #8
    10000294: e8bd 41f0   ldmia.w sp!, {r4, r5, r6, r7, r8, lr}
    10000298: 4770        bx lr
    1000029a: 00000000    andeq r0, r0, r0
    1000029e: 01e98050    mvneq r8, r0, asr r0
    100002a2: 4d181000    ldcmi 0, cr1, [r8, #-0]
    100002a6: e0141006    ands r1, r4, r6
    100002aa: e018e000    ands lr, r8, r0
    100002ae: 3500e000    strcc lr, [r0, #-0]
    100002b2: 4d2f000c    stcmi 0, cr0, [pc, #-48]! ; 10000288 <testLockIntPerf+0x40>
    100002b6: 47701006    ldrbmi r1, [r0, -r6]!

Looping 800000 times takes 15640878 timer ticks on i.MX6SX, or approximately 0.0689 seconds.

On Vybrid it takes 8000003 ticks, or approximately 0.0606 seconds (the Vybrid timer runs at 132 MHz).

I also used an external chronograph to time 100000000 (100M) loop iterations: i.MX6SX took 8.9 seconds and Vybrid took 7.9 seconds.

So the i.MX6SX performs worse even though it runs at a higher frequency.
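
The tick arithmetic above can be reproduced with a few small helpers. This is a sketch with hypothetical function names; it assumes the i.MX6SX SysTick is clocked at 227 MHz (which is what the 0.0689 s figure implies) and relies on SysTick being a 24-bit down-counter.

```c
#include <assert.h>
#include <stdint.h>

#define SYSTICK_MAX 0x00ffffffu   /* SysTick is a 24-bit down-counter */

/* Ticks elapsed between two reads of the current-value register, with the
 * reload value set to SYSTICK_MAX. Valid only if the counter wrapped at
 * most once between the two reads. */
static uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MAX;
}

/* Convert a tick count to seconds for a timer clocked at timer_hz. */
static double ticks_to_seconds(uint32_t ticks, double timer_hz)
{
    assert(timer_hz > 0.0);
    return (double)ticks / timer_hz;
}

/* Average timer ticks spent per loop iteration. */
static double ticks_per_iteration(uint32_t ticks, uint32_t iterations)
{
    assert(iterations > 0);
    return (double)ticks / (double)iterations;
}
```

With the measured values this gives about 19.6 ticks per iteration on i.MX6SX and 10.0 on Vybrid. Note that 15640878 ticks is already close to the 24-bit limit of 16777215, so a noticeably longer loop on i.MX6SX would wrap the counter and need a different measurement scheme.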

In this scenario the operand should always be in the cache; the main loop spans addresses 0x1000026e to 0x10000278.

The code and data for i.MX6SX are placed at 0x10000000 and 0x80500000, respectively, so instructions are fetched via the PC bus and data is accessed via the PS bus.

guohu · ‎05-04-2016

Hi Igor,

I first compared the path between the Cortex-M4 and DDR. On i.MX6SX it goes through the M4 (GPV6) and Main (GPV0) blocks, while on Vybrid it goes through switch0 and switch3. Both have two levels of routing. Although the concepts differ (GPV block vs. switch), could they be considered similar in this respect?

Secondly, I read out read_qos/write_qos for the Cortex-M4 Core (port number m0) and System (port number m1) ports. The default read_qos values of m0 differ: 2 on i.MX6SX, 1 on Vybrid. Setting it to 1 on i.MX6SX does not take effect. The wr_tidemark values also differ: 0 on i.MX6SX, 4 on Vybrid. I tried to change it to 4 on i.MX6SX, but it seems that the register cannot be overwritten.

I don't know whether wr_tidemark matters or how it would take effect. I got a base address of 0x01142000 for the m0 interface on i.MX6SX, and 0x4000a280 for that on Vybrid.
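
A write followed by a read-back is one way to confirm whether a NIC setting such as wr_tidemark actually sticks. The sketch below is only an illustration: the helper names are hypothetical, and the register offsets and 4-bit field width follow the generic ARM NIC-301 slave-interface layout, which is an assumption here and should be checked against sect. 43.3.4 of the i.MX6SX Reference Manual.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: offsets within one NIC-301 slave-interface block, taken
 * from the generic NIC-301 register layout. Verify them against the
 * SoC reference manual before use. */
#define NIC_WR_TIDEMARK_OFFSET 0x040u
#define NIC_READ_QOS_OFFSET    0x100u
#define NIC_WRITE_QOS_OFFSET   0x104u
#define NIC_FIELD_MASK         0xfu    /* assumed 4-bit fields */

/* Hypothetical helper: address of a register inside a port's block. */
static uintptr_t nic_reg_addr(uintptr_t port_base, uint32_t offset)
{
    assert((offset & 0x3u) == 0);      /* registers are word aligned */
    return port_base + offset;
}

/* Hypothetical helper: write a field and read it back.
 * Returns 1 if the new value stuck, 0 if the register holds another value. */
static int nic_write_and_verify(volatile uint32_t *reg, uint32_t value)
{
    *reg = value & NIC_FIELD_MASK;
    return (*reg & NIC_FIELD_MASK) == (value & NIC_FIELD_MASK);
}
```

On the target, calling `nic_write_and_verify((volatile uint32_t *)nic_reg_addr(0x01142000u, NIC_WR_TIDEMARK_OFFSET), 4u)` and getting 0 back would confirm that the write is being ignored rather than silently accepted.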

In addition, cache accesses would not go through the NIC, because the cache is located inside the Cortex-M4 platform. This can be seen in Figure 1-2, Simplified Block Diagram, in the technical manual for i.MX6SX, and in Figure 2-1, Detailed Block Diagram, in the technical manual for Vybrid.

So, could the cause of the worse performance on i.MX6SX lie entirely within the Cortex-M4 core platform, rather than in the interconnect?

Thanks!

Best Regards,

Guohu

igorpadykov · ‎05-01-2016

Hi Guohu

I am not aware of similar internal test results, but it seems the difference may be explained by different NIC priorities; one can compare the Vybrid settings with those of i.MX6SX.

The i.MX6SX M4 is connected to DDR through two NIC-301 instances, GPV_6 and GPV_0. Compare this with the Vybrid connections depicted in Figure 2, NIC internal structure, in AN4947, Understanding Vybrid Architecture:

http://cache.nxp.com/files/microcontrollers/doc/app_note/AN4947.pdf

One can check sect. 3.3 NIC priorities and sect. 3.11.4 SDRAM throughput and latencies, and, using sect. 43.3.4 NIC-specific parameters of the i.MX6SX Reference Manual, apply the same NIC settings:

http://cache.freescale.com/files/32bit/doc/ref_manual/IMX6SXRM.pdf

Best regards

igor


guohu · ‎05-02-2016

Hi Igor,

Thank you for the guidance.

I'll check the NIC properties.

Best Regards,

Guohu
