MSC8156 memory bandwidth

matt8156 · 11-11-2011

Hi All,

I have been experimenting with the MSC8156EVM board over the last few days using CodeWarrior 10.2.2, and I am struggling to get the expected memory bandwidth from the device. Do you have any tips?

My naive application is below. Note that all levels of cache are enabled and the timing overhead has been pre-calculated. Many iterations (10000+) are performed and an average is taken. All measurements were performed with a release build.

Copying 16 KBytes of data from M2 to M3 takes 68 us, which I believe is about 240 MBytes/s. I would expect something closer to 4+ GBytes/s, given that M3 memory is 128 bits wide and clocked at 500 MHz (theoretical maximum of 8 GBytes/s).

Copying 16 KBytes of data from M3 to M2 takes 100 us (163 MBytes/s).

Copying 16 KBytes of data from M2 to M2 takes 5 us (3276 MBytes/s) <- expected 8000 MBytes/s???
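The arithmetic behind these figures can be checked with a small sketch. The function names below are hypothetical helpers written for illustration, not part of any Freescale API, and 1 MByte is assumed to be 10^6 bytes because that is the convention the quoted numbers match.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a transfer size and its duration into MBytes/s.
 * Assumes 1 MByte = 10^6 bytes, so bytes per microsecond is
 * numerically the same as MBytes per second. */
double bandwidth_mbytes_per_s(uint32_t bytes, double microseconds)
{
    assert(microseconds > 0.0);
    return (double)bytes / microseconds;
}

/* Theoretical peak of a memory port: bytes per clock times clock in MHz. */
double peak_mbytes_per_s(uint32_t bus_width_bits, double clock_mhz)
{
    assert(bus_width_bits % 8 == 0);
    return (bus_width_bits / 8.0) * clock_mhz;
}
```

Calling `bandwidth_mbytes_per_s(16384, 68.0)` gives roughly 240.9, and `peak_mbytes_per_s(128, 500.0)` gives 8000, which agrees with the numbers quoted above.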

Thanks

    M2_Array = (uint32_t *) osAlignedMalloc((testLength * sizeof(uint32_t)), OS_MEM_LOCAL, ALIGNED_16_BYTES);
    OS_ASSERT_COND(M2_Array != NULL);
    M3_Array = (uint32_t *) osAlignedMalloc((testLength * sizeof(uint32_t)), OS_MEM_SHARED, ALIGNED_16_BYTES);
    OS_ASSERT_COND(M3_Array != NULL);

    for(i = 0; i < testLength; i++)
    {
        srand(i*2);
        from[i] = rand();
    }

#if (DCACHE_ENABLE == ON)
    status = osCacheDataSweepGlobal(CACHE_FLUSH);
    if (status != OS_SUCCESS) OS_ASSERT;
#endif
#if (L2CACHE_ENABLE == ON)
    status = osCacheL2UnifiedSweepGlobal(CACHE_FLUSH);
    if (status != OS_SUCCESS) OS_ASSERT;
#endif

    timeStart = ReadFullPerfMonCount();

    for (i = 0; i < NUM_TEST_ITERATIONS; i++)
    {
        memcpy(&M3_Array[0], &M2_Array[0], (testLength * sizeof(uint32_t)));             // M2 to M3
    }

#if (DCACHE_ENABLE == ON)
    status = osCacheDataSweepGlobal(CACHE_FLUSH);
    if (status != OS_SUCCESS) OS_ASSERT;
#endif
#if (L2CACHE_ENABLE == ON)
    status = osCacheL2UnifiedSweepGlobal(CACHE_FLUSH);
    if (status != OS_SUCCESS) OS_ASSERT;
#endif

    timeEnd = ReadFullPerfMonCount();

    duration = timeEnd - timeStart - overhead;
    duration /= NUM_TEST_ITERATIONS;

    printf("Memory Copy %d byte M2 -> M3 duration %llu HSSI clock cycles, %f microseconds\n", (testLength * sizeof(uint32_t)), duration, ((double)duration / osHssiClockGet()));

AndrewinApps · 12-16-2011

A few things to think about here.

First of all, the SC3850 can execute 6 instructions in parallel, and 2 of those can be AGU operations for moves. You also want to pipeline these moves so that you are moving data every cycle.
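As a rough illustration of what "pipelining the moves" can look like at the C level, here is a minimal sketch of a copy loop unrolled by four, with the loads grouped ahead of the stores. `copy_words_unrolled` is a hypothetical helper, not a SmartDSP OS or CodeWarrior routine. It also assumes the compiler supports `uint64_t` and `restrict`. Whether the compiler actually schedules two AGU moves per cycle from this source is not guaranteed, so the generated assembly should be inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `count` 64-bit words from src to dst, four words per iteration.
 * `restrict` tells the compiler the buffers do not overlap, which leaves
 * it free to reorder and overlap the loads and stores. */
void copy_words_unrolled(uint64_t *restrict dst,
                         const uint64_t *restrict src,
                         size_t count)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);

    /* Main loop: issue all four loads before the four stores so that
     * independent moves are available to pair in the same cycle. */
    for (; i + 4 <= count; i += 4) {
        uint64_t a = src[i];
        uint64_t b = src[i + 1];
        uint64_t c = src[i + 2];
        uint64_t d = src[i + 3];
        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }

    /* Tail: copy the remaining 0 to 3 words one at a time. */
    for (; i < count; i++) {
        dst[i] = src[i];
    }
}
```

In the benchmark from the original post, a call such as `copy_words_unrolled((uint64_t *)M3_Array, (const uint64_t *)M2_Array, (testLength * sizeof(uint32_t)) / 8)` would stand in for the `memcpy`, assuming the byte count is a multiple of 8 and the buffers are at least 8-byte aligned (the post allocates them with 16-byte alignment).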

There are many ways to improve the throughput, but the first thing I would suggest is to turn on compiler optimization in CodeWarrior at level -O3.

Also consider that multiple cores can access system-level memory, so accesses from the other cores can reduce the bandwidth available to any one core.

Regards,

-Andrew