Hi All,
I have been experimenting with the MSC8156EVM board over the last few days using CodeWarrior 10.2.2, and I am struggling to get the expected memory bandwidth from the device. I was wondering if you had any tips.
My naive application is below. Note that all levels of cache are enabled and a timing overhead has been pre-calculated. Many iterations (10000+) are performed and an average is taken. All measurements were performed with a release build:
The result for 16 KBytes of data copied from M2 to M3 is 68 us, which I believe is about 240 MBytes/s. I would expect something closer to 4+ GBytes/s, given that M3 memory is 128 bits wide and clocked at 500 MHz (theoretical max of 8 GBytes/s).
16 KBytes of data copied from M3 to M2 takes 100 us (163 MBytes/s).
16 KBytes of data copied from M2 to M2 takes 5 us (3276 MBytes/s) <- expected 8000 MBytes/s???
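For reference, the arithmetic behind these figures can be sketched as below. This is only a minimal illustration, assuming 1 MByte = 10^6 bytes and 16 KBytes = 16384 bytes; the function names are made up for this post and are not part of any Freescale library.

```c
#include <stdint.h>

/* Measured bandwidth in MBytes/s (1 MByte = 10^6 bytes).
 * Bytes per microsecond is numerically the same as MBytes per second. */
static double bandwidth_mbytes_per_s(uint32_t bytes, double microseconds)
{
    return (double)bytes / microseconds;
}

/* Theoretical peak in MBytes/s, assuming one full-width transfer per clock:
 * (bus width in bytes) * (clock in MHz). */
static double peak_mbytes_per_s(uint32_t bus_width_bits, double clock_mhz)
{
    return ((double)bus_width_bits / 8.0) * clock_mhz;
}
```

With these, bandwidth_mbytes_per_s(16384, 68.0) is roughly 241, and peak_mbytes_per_s(128, 500.0) is 8000, which is where the 8 GBytes/s figure comes from.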
Thanks
M2_Array = (uint32_t *) osAlignedMalloc((testLength * sizeof(uint32_t)), OS_MEM_LOCAL, ALIGNED_16_BYTES);
OS_ASSERT_COND(M2_Array != NULL);
M3_Array = (uint32_t *) osAlignedMalloc((testLength * sizeof(uint32_t)), OS_MEM_SHARED, ALIGNED_16_BYTES);
OS_ASSERT_COND(M3_Array != NULL);

for(i = 0; i < testLength; i++)
{
    srand(i*2);
    from[i] = rand();
}

#if (DCACHE_ENABLE == ON)
status = osCacheDataSweepGlobal(CACHE_FLUSH);
if (status != OS_SUCCESS) OS_ASSERT;
#endif
#if (L2CACHE_ENABLE == ON)
status = osCacheL2UnifiedSweepGlobal(CACHE_FLUSH);
if (status != OS_SUCCESS) OS_ASSERT;
#endif

timeStart = ReadFullPerfMonCount();
for (i = 0; i < NUM_TEST_ITERATIONS; i++)
{
    memcpy(&M3_Array[0], &M2_Array[0], (testLength * sizeof(uint32_t))); // M2 to M3
}

#if (DCACHE_ENABLE == ON)
status = osCacheDataSweepGlobal(CACHE_FLUSH);
if (status != OS_SUCCESS) OS_ASSERT;
#endif
#if (L2CACHE_ENABLE == ON)
status = osCacheL2UnifiedSweepGlobal(CACHE_FLUSH);
if (status != OS_SUCCESS) OS_ASSERT;
#endif

timeEnd = ReadFullPerfMonCount();
duration = timeEnd - timeStart - overhead;
duration /= NUM_TEST_ITERATIONS;
printf("Memory Copy %d byte M2 -> M3 duration %llu HSSI clock cycles, %f microseconds\n", (testLength * sizeof(uint32_t)), duration, ((double)duration / osHssiClockGet()));
A few things to think about here.
First of all, the SC3850 can execute 6 instructions in parallel, 2 of which can be AGU operations for moves. You also want to pipeline these moves so that you are moving data every cycle.
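As a rough illustration of the idea, a copy loop can be written so that several independent wide loads and stores are visible to the compiler in each iteration. This is only a sketch in portable C99: copy_unrolled64 is a made-up helper (not a SmartDSP OS or CodeWarrior function), and whether the compiler actually schedules these as paired AGU moves and software-pipelines the loop is an assumption that depends on the optimization level and on buffer alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies n_bytes using 64-bit moves, unrolled by 4.
 * Assumes dst and src do not overlap (restrict), are 8-byte aligned,
 * and that n_bytes is a multiple of 32. */
static void copy_unrolled64(uint64_t *restrict dst,
                            const uint64_t *restrict src,
                            size_t n_bytes)
{
    size_t n_words = n_bytes / sizeof(uint64_t);
    size_t i;

    assert(n_bytes % (4 * sizeof(uint64_t)) == 0);

    for (i = 0; i < n_words; i += 4) {
        /* Four independent loads first, then four stores, so the
         * scheduler is free to overlap loads and stores. */
        uint64_t a = src[i];
        uint64_t b = src[i + 1];
        uint64_t c = src[i + 2];
        uint64_t d = src[i + 3];
        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
}
```

In the test above this would stand in for the memcpy call, with the buffer pointers cast to uint64_t * and the same byte count passed in; it is worth checking the generated assembly to see how many moves are issued per cycle.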
There are many ways to improve the throughput, but the first thing I would suggest is to turn on software optimization in the CodeWarrior compiler at level -O3.
Also consider that multiple cores can access system level memory.
Regards,
-Andrew