Performance issue about Cortex-CM4 of i.MX6 SoloX

guohu
Contributor I

Hi NXP experts,

While developing on the i.MX6 SoloX, I found a confusing result for memory read/write performance when the cache is enabled.

Following are my test results, which give the number of memory accesses (read, write, or both read and write) completed in 1 second on SoloX and Vybrid:

Cache enabled

--------------------------------------------------+--------------------+-------------------+
Operation type                                    |     SoloX data     |     Vybrid data   |
--------------------------------------------------+--------------------+-------------------+
Large Array Read&Write with 4-byte interleave     |      4929846       |        7102300    |
--------------------------------------------------+--------------------+-------------------+
Large Array Read with 4-byte interleave           |      15981136      |        9388063    |
--------------------------------------------------+--------------------+-------------------+
Large Array Write with 4-byte interleave          |      11219092      |        9742647    |
--------------------------------------------------+--------------------+-------------------+
Large Array Read&Write with cache line interleave |      3224001       |        2894259    |
--------------------------------------------------+--------------------+-------------------+
Large Array Write with cache line size interleave |      10966110      |        2937975    |
--------------------------------------------------+--------------------+-------------------+
Large Array Read with cache line size interleave  |      6235732       |        3746016    |
--------------------------------------------------+--------------------+-------------------+

Cache disabled

--------------------------------------------------+--------------------+-------------------+
Operation type                                    |     SoloX data     |     Vybrid data   |
--------------------------------------------------+--------------------+-------------------+
Large Array Read&Write with 4-byte interleave     |      1275320       |        790198     |
--------------------------------------------------+--------------------+-------------------+
Large Array Read with 4-byte interleave           |      1505184       |        927335     |
--------------------------------------------------+--------------------+-------------------+
Large Array Write with 4-byte interleave          |      1778737       |        1112651    |
--------------------------------------------------+--------------------+-------------------+
Large Array Read&Write with cache line interleave |      1275292       |        790146     |
--------------------------------------------------+--------------------+-------------------+
Large Array Write with cache line size interleave |      1778841       |        1112651    |
--------------------------------------------------+--------------------+-------------------+
Large Array Read with cache line size interleave  |      6235732       |        3746016    |
--------------------------------------------------+--------------------+-------------------+

The SoloX runs at 227 MHz, while the Vybrid runs at 166 MHz.

A 4-byte interleave means 1 cache miss followed by 7 cache hits.

A cache-line-size (32-byte) interleave means a cache miss on every access.
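To make the two access patterns concrete, here is a minimal sketch that counts how many accesses in a strided walk land on a new cache line. It assumes a 32-byte line, an array that starts on a line boundary, and an array too large to stay in the cache, so every new line is a miss. The function name count_new_lines is hypothetical and is not part of the original test code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Count the accesses that touch a different cache line than the previous
 * access, for n_accesses accesses spaced stride_bytes apart, starting at a
 * line-aligned address. With a cold cache and a large array, each such
 * access is assumed to be a miss. */
static uint32_t count_new_lines(uint32_t n_accesses, uint32_t stride_bytes,
                                uint32_t line_bytes)
{
    uint64_t prev_line = 0;
    bool have_prev = false;
    uint32_t misses = 0;

    assert(stride_bytes > 0 && line_bytes > 0);
    for (uint32_t i = 0; i < n_accesses; i++) {
        /* Index of the cache line holding byte offset i * stride_bytes. */
        uint64_t line = ((uint64_t)i * stride_bytes) / line_bytes;
        if (!have_prev || line != prev_line) {
            misses++;
            prev_line = line;
            have_prev = true;
        }
    }
    return misses;
}
```

With a 4-byte stride, 8 accesses give 1 new line (1 miss, 7 hits); with a 32-byte stride, every access is on a new line.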

SoloX outperforms Vybrid on most items, which is reasonable given its higher frequency.

But when both reading and writing with a 4-byte interleave in each iteration, SoloX performs worse than Vybrid.

However, when the interleave is the cache line size (32 bytes), the results look fine.

Is there a reasonable explanation for the worse SoloX performance on combined memory reads and writes with a high cache hit rate (about 7/8)?

Is there any configuration that would affect this?

Thanks.

Best Regards,

Guohu

12 Replies

guohu
Contributor I

Hi Igor,

I have been trying ProSupport@nxp.com for many days, but there has been no response.

Is there any way to ping them?

Thanks.

Regards,

Guohu

igorpadykov
NXP Employee

Hi Guohu

Please forward your request through your local FAE channel:

http://www.nxp.com/files/abstract/global/FREESCALE_SALES_OFFICES.html

Best regards

igor

guohu
Contributor I

Hi Igor,

Thank you!

But when I selected Professional Engineer Services, clicked "Request a quote for Services", and submitted the form, the site jumped to a "Page Not Found" page, as shown here:

[Screenshot: QQ-20160526143218.png]

Do you have any idea about it?

Best Regards,

Guohu

igorpadykov
NXP Employee

Hi Guohu

please try

ProSupport@nxp.com

Best regards

igor

guohu
Contributor I

Thanks.

I have tried it and am waiting for a response.

igorpadykov
NXP Employee

Hi Guohu

This is a specialized question that requires analysis of different processor architectures, and I am not aware of any such performance analysis efforts.

For that reason, I suggest applying to NXP Professional Services:

http://www.nxp.com/support/nxp-professional-services:PROFESSIONAL-SERVICE

Best regards

igor

karina_valencia
NXP Apps Support

igorpadykov, can you continue with the follow-up?

guohu
Contributor I

Hi,

Are there any ideas on this?

guohu
Contributor I

To reduce interference from the operating system, I lock interrupts and use the SysTick timer for timing measurement.

Following is my test routine and its objdump output:

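#define CNT_1MHZ    800000                              /* loop iterations */
#define SYSTICK_RVR (*(volatile UINT32 *)0xe000e014)    /* SysTick reload value register */
#define SYSTICK_CVR (*(volatile UINT32 *)0xe000e018)    /* SysTick current value register */

void testLockIntPerf(void)
    {
    volatile UINT32 counter = 0;      /* volatile: every access goes to the stack */
    UINT32 savedRVR;
    UINT32 startCount = 0x00ffffff;   /* maximum 24-bit reload value */
    UINT32 endCount;
    int lockVal;

    savedRVR = SYSTICK_RVR;
    lockVal = intLock();
    SYSTICK_RVR = startCount;
    SYSTICK_CVR = 0;                  /* any write clears the counter, which then reloads from RVR */
    while (counter < CNT_1MHZ)
        {
        counter++;
        }
    endCount = SYSTICK_CVR;           /* SysTick counts down, so elapsed = start - end */
    intUnlock (lockVal);
    SYSTICK_RVR = savedRVR;
    SYSTICK_CVR = 0;
    printf ("last %d ticks, endCount = 0x%x\n", (startCount - endCount), endCount);
    return;
    }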

10000248 <testLockIntPerf>:
10000248:    e92d 41f0     stmdb    sp!, {r4, r5, r6, r7, r8, lr}
1000024c:    b082          sub    sp, #8
1000024e:    4e16          ldr    r6, [pc, #88]    ; (100002a8 <testLockIntPerf+0x60>)
10000250:    2700          movs    r7, #0
10000252:    9700          str    r7, [sp, #0]
10000254:    f8d6 8000     ldr.w    r8, [r6]
10000258:    f001 ff70     bl    1000213c <intCpuLock>
1000025c:    4d13          ldr    r5, [pc, #76]    ; (100002ac <testLockIntPerf+0x64>)
1000025e:    f06f 447f     mvn.w    r4, #4278190080    ; 0xff000000
10000262:    6034          str    r4, [r6, #0]
10000264:    4912          ldr    r1, [pc, #72]    ; (100002b0 <testLockIntPerf+0x68>)
10000266:    602f          str    r7, [r5, #0]
10000268:    9a00          ldr    r2, [sp, #0]
1000026a:    428a          cmp    r2, r1
1000026c:    d205          bcs.n    1000027a <testLockIntPerf+0x32>
1000026e:    9a00          ldr    r2, [sp, #0]
10000270:    3201          adds    r2, #1
10000272:    9200          str    r2, [sp, #0]
10000274:    9a00          ldr    r2, [sp, #0]
10000276:    428a          cmp    r2, r1
10000278:    d3f9          bcc.n    1000026e <testLockIntPerf+0x26>
1000027a:    682f          ldr    r7, [r5, #0]
1000027c:    f001 ff66     bl    1000214c <intCpuMicroUnlock>
10000280:    f8c6 8000     str.w    r8, [r6]
10000284:    2300          movs    r3, #0
10000286:    480b          ldr    r0, [pc, #44]    ; (100002b4 <testLockIntPerf+0x6c>)
10000288:    602b          str    r3, [r5, #0]
1000028a:    1be1          subs    r1, r4, r7
1000028c:    463a          mov    r2, r7
1000028e:    f027 fe9f     bl    10027fd0 <printf>
10000292:    b002          add    sp, #8
10000294:    e8bd 41f0     ldmia.w    sp!, {r4, r5, r6, r7, r8, lr}
10000298:    4770          bx    lr
1000029a:    00000000     andeq    r0, r0, r0
1000029e:    01e98050     mvneq    r8, r0, asr r0
100002a2:    4d181000     ldcmi    0, cr1, [r8, #-0]
100002a6:    e0141006     ands    r1, r4, r6
100002aa:    e018e000     ands    lr, r8, r0
100002ae:    3500e000     strcc    lr, [r0, #-0]
100002b2:    4d2f000c     stcmi    0, cr0, [pc, #-48]!    ; 10000288 <testLockIntPerf+0x40>
100002b6:    47701006     ldrbmi    r1, [r0, -r6]!

When looping 800000 times, the i.MX6SX takes 15640878 timer ticks, approximately 0.0689 seconds.

The Vybrid takes 8000003 ticks, approximately 0.0606 seconds (the Vybrid timer runs at 132 MHz).

I also used an external stopwatch to time 100000000 (100M) loop iterations: the i.MX6SX took 8.9 seconds and the Vybrid took 7.9 seconds.

So the i.MX6SX performs worse even though it runs at a higher frequency.
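The conversions above can be reproduced with a small sketch. It assumes the i.MX6SX SysTick is clocked at 227 MHz (which is what the 0.0689 second figure implies) and the Vybrid timer at 132 MHz as stated. The helper names ticks_to_seconds and ticks_per_iteration are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick count to seconds for a timer running at timer_hz. */
static double ticks_to_seconds(uint32_t ticks, double timer_hz)
{
    assert(timer_hz > 0.0);
    return (double)ticks / timer_hz;
}

/* Average number of timer ticks spent per loop iteration. */
static double ticks_per_iteration(uint32_t ticks, uint32_t iterations)
{
    assert(iterations > 0);
    return (double)ticks / (double)iterations;
}
```

For example, ticks_per_iteration(15640878, 800000) is about 19.6 ticks per iteration on the i.MX6SX, while ticks_per_iteration(8000003, 800000) is about 10.0 on the Vybrid, each counted in ticks of that chip's own timer.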

In this scenario the operand should always be in the cache; the main loop occupies the address range 0x1000026e to 0x10000278.

The code and data for i.MX6SX are placed at 0x10000000 and 0x80500000, respectively, so instructions are accessed via the PC bus and data via the PS bus.

guohu
Contributor I

Hi Igor,

I first compared the path between the Cortex-M4 and DDR. The i.MX6SX goes through the M4 (GPV6) and Main (GPV0) blocks, while Vybrid goes through switch0 and switch3. Both have 2 levels of routing. Although the concepts differ (GPV block vs. switch), could they be considered similar in this respect?

Secondly, I read out read_qos/write_qos for the Cortex-M4 Core (port number: m0) and System (port number: m1) ports. The default read_qos values of m0 differ: 2 for i.MX6SX, 1 for Vybrid. Setting it to 1 on i.MX6SX does not take effect. The wr_tidemark values also differ: 0 for i.MX6SX, 4 for Vybrid. I tried to change it to 4 on i.MX6SX, but it seems that it cannot be overwritten.

I don't know whether wr_tidemark matters, or how it would take effect. I used a base address of 0x01142000 for the m0 interface of i.MX6SX, and 0x4000a280 for that of Vybrid.
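For reference, the kind of write-then-read-back check described above could look like the following minimal sketch. The helper reg_write_verify is hypothetical, and the register offsets within the NIC interface block are deliberately left out because they must come from the reference manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Write the bits selected by mask in a memory-mapped register, then read the
 * register back and report whether those bits now hold the requested value.
 * A false result only says the read-back differs; it does not say why. */
static bool reg_write_verify(volatile uint32_t *reg, uint32_t value,
                             uint32_t mask)
{
    uint32_t old;

    assert(reg != NULL);
    old = *reg;
    *reg = (old & ~mask) | (value & mask);
    return (*reg & mask) == (value & mask);
}
```

On the target, reg would be the interface base address (0x01142000 for m0 on i.MX6SX) plus the register offset given in the reference manual, cast to volatile uint32_t *. A false return matches the "does not take effect" behaviour described above.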

In addition, cache accesses would not go through the NIC, because the cache is located inside the Cortex-M4 platform. This can be seen in Figure 1-2, Simplified Block Diagram, in the i.MX6SX technical manual, and in Figure 2-1, Detailed Block Diagram, in the Vybrid technical manual.

So, would the cause of the worse performance on i.MX6SX lie only within the Cortex-M4 core platform, and not extend to the interconnect?

Thanks!

Best Regards,

Guohu

igorpadykov
NXP Employee

Hi Guohu

I am not aware of similar internal test results, but it seems the difference may be explained by different NIC priorities; you can compare the Vybrid settings with those of i.MX6SX.

The i.MX6SX M4 is connected to DDR through two NIC-301 blocks, GPV_6 and GPV_0. Compare this with the Vybrid connections depicted in Figure 2, NIC internal structure, of AN4947, Understanding Vybrid Architecture:

http://cache.nxp.com/files/microcontrollers/doc/app_note/AN4947.pdf

You can check sect. 3.3 NIC priorities and sect. 3.11.4 SDRAM throughput and latencies and, using sect. 43.3.4 NIC-specific parameters of the i.MX6SX Reference Manual, apply the same NIC settings:

http://cache.freescale.com/files/32bit/doc/ref_manual/IMX6SXRM.pdf

Best regards

igor


guohu
Contributor I

Hi Igor,

Thank you for the guidance.

I'll check the NIC properties.

Best Regards,

Guohu
