Instruction cache and QSPI XIP performance



2,417 Views
nancyb
Contributor III

I am exploring the QSPI XIP performance for an MQX application running on a TWR-VF65GS10 processor board. I added the capability to enable and disable the instruction cache.

if ( value )
   _ICACHE_ENABLE();
else
   _ICACHE_DISABLE();

The execution performance is the same for either setting. What would account for this behavior? Does the instruction cache help QSPI XIP performance?

17 Replies

alejandrolozan1
NXP Employee

Hi Nancy,

Are you executing the MQX application from iRAM or DDR?
The performance increase should be easy to notice when executing from DDR.

Best Regards,

Alejandro

nancyb
Contributor III

Hi Alejandro,

I executed from iRAM and QSPI. I will try DDR.

Regards,

Nancy

nancyb
Contributor III

My application sums the contents of a variable size buffer. For the following tests I used a 65 KB buffer in DDR with the application running in DDR.

I and D caches enabled: 7 ms.

I cache disabled and D cache enabled: 31 ms.

I cache enabled and D cache disabled: 94 ms.

D and I caches disabled: 120 ms.

alejandrolozan1
NXP Employee

Hi Nancy,

So it seems the performance increase is noticeable in DDR, correct?

Best Regards,

Alejandro

nancyb
Contributor III

Cache affects DDR execution performance.

I notice no difference in either QSPI XIP or SRAM performance when enabling or disabling the I cache.

I started this investigation in order to determine how the I cache affects QSPI XIP performance, which seems to be not at all. Is that accurate, or are there initialization settings that would boost performance?

Given the reduced bandwidth of the QSPI interface I was hoping the I cache would help boost performance.

kef2
Senior Contributor IV

Nancy,

I tried to run Dhrystone from SRAM some time ago: 500 MHz CPU clock, 166 MHz bus clock, 83 MHz IP clock. I got the following execution time ratios. I can't tell what optimization settings were used.

I=0 D=0  / I=0 D=1  = 1.92

I=0 D=0  / I=1 D=0  = 2.73

I=0 D=0  / I=1 D=1  = 17.7

I tried to find the effect of the L2 cache and found no difference. I think this is because the Dhrystone code and data are quite small, and the available L1 cache size is enough to cache both data and code.

It just can't be worse. The L1 cache operates, I think, at the core clock. SRAM and QSPI operate at much slower clocks.

nancyb
Contributor III

I understand that QSPI operates at a much slower clock rate, but what I don't understand is why the I cache does not appear to help one way or the other. Enabled or disabled, the performance is the same. I would have thought that the I cache would speed things up, and that with it disabled, execution time would increase.

kef2
Senior Contributor IV

In fact the I cache speeds things up a lot. Make sure the I cache is actually enabled or disabled when you expect it to be. For example, the I-cache is enabled out of reset. Are you using DS-5? If so, then in the debugger you can check whether the I-cache is enabled here:

Registers -> CP15 -> CP15_SCTLR -> I = Enabled/Disabled

What does your code look like? If it is a very simple single loop, then perhaps all the code is preloaded into the CPU prefetch unit. Up to 4 instructions can be prefetched on the Cortex-A5, though I don't know whether a loop of <=4 instructions can run from the PFU without further fetches from the I-cache or the memory bus.

It also matters what you are doing in your code. If you are summing words over a large QSPI area, I think all the performance will be lost waiting for QSPI transfers to complete. Instead of summing a large area, try enabling the D-cache for QSPI and summing the same area of ~10 KB N times. You should see the difference.

nancyb
Contributor III

I am using IAR. I verified that Registers -> CP15 -> CP15_SCTLR -> I is set or cleared when I issue _ICACHE_ENABLE() or _ICACHE_DISABLE(). In either case the performance is the same, and that puzzles me. I can see performance differences when running in DDR.

I will experiment with your suggestion to sum a smaller buffer multiple times so that the code or data can be accessed from cache.

When I am summing the contents of a given buffer, I alternate: on even iterations I sum from the buffer's lower addresses incrementing upwards, and on odd iterations I sum from the upper addresses decrementing downwards. I can sum the buffer as 8, 16, and 32 bit quantities. Here is the summing loop:

   for ( i = 0; i < value; i++ )
   {
      switch ( param )
      {
         case 1:
            if ( i & 0x1 )
            {
               temp += *( ( uint_8 * )memPtr + value - i );
            }
            else
            {
               temp += *( ( uint_8 * )memPtr + i );
            }
            break;
         case 2:
            if ( i & 0x1 )
            {
               temp += *( ( uint_16 * )memPtr + value - i );
            }
            else
            {
               temp += *( ( uint_16 * )memPtr + i );
            }
            break;
         case 4:
         default:
            if ( i & 0x1 )
            {
               temp += *( ( uint_32 * )memPtr + value - i );
            }
            else
            {
               temp += *( ( uint_32 * )memPtr + i );
            }
            break;
      }
   }

kef2
Senior Contributor IV

I tried executing your code with memPtr pointed at QSPI, param = 1, and executing from SRAM, with the D-cache enabled even for the QSPI area and the L2 cache not enabled. With the buffer length (value) set to 10000 I got ~330 us vs ~2600 us execution times with and without the I cache enabled, almost an 8x difference. With a buffer length of 100000 I got ~50 ms vs ~100 ms, only 2x. With buffer length = 1000000 I got ~600 ms vs ~1000 ms, or 1.6x. At larger areas you will get a smaller difference or no difference at all, provided you use slower QSPI clocks or modes.

It doesn't surprise me at all. The I-cache always helps speed up code, but in the case of incremental iteration over a large area of slow memory, even the D-cache doesn't help much. While the D-cache is enough to keep the slow data cached, we see a huge difference switching the I-cache on and off. When the buffer is large, and perhaps when QSPI is slower than in my case, we should see a small difference or none at all. Do you agree?

nancyb
Contributor III

Edward,

Thank you for your support and explanation. I wonder if I am not handling the cache enable/disable correctly, because I do not see any performance difference when enabling and disabling either cache using _ICACHE_DISABLE() and _DCACHE_DISABLE(). I added _DCACHE_INVALIDATE() and _ICACHE_INVALIDATE() and still get the same performance: with memPtr = 0x20000000, value = 0x10000, and param = 1, I get an execution time of 51 ms when executing from SRAM.

With the same parameters executing from DDR I get 175 ms (I and D caches enabled), 199 ms (I cache disabled), 241 ms (I and D caches disabled), and 187 ms (D cache disabled). I had to comment out the call to _DCACHE_INVALIDATE() when running in DDR to keep execution from hanging.

If I saw numbers like you presented I would have been satisfied with my investigation into cache performance. Ultimately I want to understand how I cache affects QSPI XIP performance.

I would like to duplicate your cache numbers. Can you share your cache handling code?

kef2
Senior Contributor IV

Nancy,

I tried benchmarking the same thing with the MMU table set up so that QSPI is not D-cached. value = 100000 now completes in 420 ms with the I-cache enabled and in 440 ms with it disabled. It seems your QSPI is not D-cacheable.

It looks like you are using MQX. The MMU table is set up in init_bsp.c. You need to find the _mmu_add_vregion() calls and then add a call for QSPI:

_mmu_add_vregion((pointer)0x20000000,    /* QSPI base */
                 (pointer)0x20000000,    /* QSPI base */
                 (_mem_size)0x00100000,  /* QSPI size, change to appropriate */
                 PSP_PAGE_TABLE_SECTION_SIZE(PSP_PAGE_TABLE_SECTION_SIZE_1MB) |
                 PSP_PAGE_TYPE(PSP_PAGE_TYPE_CACHE_WBNWA) |
                 PSP_PAGE_DESCR(PSP_PAGE_DESCR_ACCESS_RW_ALL));

If you don't want to recompile the BSP, I think you need to 1) disable both I- and D-caches (_ICACHE_DISABLE, _DCACHE_DISABLE), 2) disable the MMU (_mmu_vdisable()), 3) call the above _mmu_add_vregion(), and then re-enable the MMU and caches. I'm using bare metal and didn't try this with MQX, so apologies in advance if the above doesn't work.
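Putting those three steps together, the runtime sequence might look like the following. This is an untested sketch against the MQX PSP API; in particular, _mmu_venable() is my assumption for the re-enable counterpart of _mmu_vdisable(), so check it against your PSP sources:

```c
/* Sketch: make the QSPI XIP window D-cacheable at runtime (MQX PSP calls).
   Untested; _mmu_venable() is an assumed name, verify against your PSP. */
_ICACHE_DISABLE();
_DCACHE_DISABLE();
_mmu_vdisable();

_mmu_add_vregion((pointer)0x20000000,    /* virtual:  QSPI base           */
                 (pointer)0x20000000,    /* physical: QSPI base           */
                 (_mem_size)0x00100000,  /* size: adjust to your flash    */
                 PSP_PAGE_TABLE_SECTION_SIZE(PSP_PAGE_TABLE_SECTION_SIZE_1MB) |
                 PSP_PAGE_TYPE(PSP_PAGE_TYPE_CACHE_WBNWA) |
                 PSP_PAGE_DESCR(PSP_PAGE_DESCR_ACCESS_RW_ALL));

_mmu_venable();          /* assumed re-enable call */
_DCACHE_ENABLE();
_ICACHE_ENABLE();
```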

The INVALIDATE calls shouldn't help there! They are used when you need to synchronize the CPU's view of memory with what's really in memory, for example after a DMA transfer completes, when you receive something from USB, or when you get a video frame from the VIU3 module. In such cases, provided the memory being written is cacheable, the INVALIDATE calls make the cache reload from memory so the CPU again sees the right picture. You say it hangs when executing from DDR without invalidate. You need to figure out what makes the cache go out of sync with memory and hang. For example, it could be the M4 core writing lparam to cacheable memory while the A5 sees some old, huge buffer size value from its cache. If no similar explanation applies, and only the A5 is writing and reading the loop variables etc. in your code, then perhaps you have a hardware problem with DDR!

Hm, is DDR slower than QSPI? Didn't you make a mistake: 50 ms from QSPI and 175 ms from DDR with the same loop size setting?

nancyb
Contributor III

Edward,

DDR was faster than QSPI. For the same benchmark configuration (value = 0x10000, param = 1), execution from SRAM was 50 ms, DDR was 175 ms, and QSPI was 746 ms.

Thank you for the information on cache control. The DDR hang issue is uninteresting; it's just something I observed while experimenting with various memory configurations. I am not using the M4 core, so I assume it never comes out of reset. Our ultimate application will be a graphical interface where the program runs out of QSPI and some graphics elements are stored in QSPI. I've been tasked with getting as much execution performance out of QSPI as possible. In that exploration I wanted to see what impact the I cache had on execution speed. I would imagine that if the code could fit entirely in the I cache, then execution speed would be quite good compared with running out of QSPI.

I added the MMU code you provided to my application. The QSPI XIP performance went from 747 ms to 129 ms. That's better than DDR and only about 3x slower than SRAM.

Thank you for that bit of code. I will consider how to add it to the BSP so that when MQX is rebuilt for QSPI XIP, the region will be added.

nancyb
Contributor III

Edward,

I rebuilt the BSP with the added region and the performance was outstanding. The benchmark (value 0x10000, param 1) ran in 25 ms.

I can't explain why the performance increased from 129 ms to 25 ms after adding the region during MQX boot vs adding the region from an application task.

Thank you for your help!

kef2
Senior Contributor IV

Nancy,

I'm glad you got this speed improvement. I thought you were executing from SRAM/DDR/QSPI and comparing data accesses to QSPI. Now it's clearer.

For speed purposes, D-cacheability should be enabled for the area your code executes from. Code contains a lot of address constants that are loaded into CPU registers from memory as data operations, and this is where the D-cache helps.

Before trying your code I used to think that we shouldn't see a big effect from D-cache on/off when doing incremental reads over a large area of data. I thought it should lead to continuous cache misses and no hits, but in fact I saw a 50 to 400 ms difference! Well, your code is not exactly incremental reads, but I tried a pure incremental read of the buffer and still see a big difference. So it seems my strategy of making some buffers non-cached was wrong. I'm going to try caching those buffers instead and use cache clean and cache invalidate when appropriate. Thanks for the hint!

Regarding the big difference between enabling D-cache for QSPI at boot time and enabling it later, I don't know how to explain it. Different code or data alignment could affect the cache miss rate, and perhaps many other things, but the difference is very big. I don't know.

Best regards

Edward

nancyb
Contributor III

Hi Edward,

I have forwarded your MMU configuration to our team. I did some further experimentation with enabling the I and D caches, and now I can measure the performance impact each has.

Is there a document I can explore to familiarize myself with the MMU and how to use it? I searched the Vybrid Reference Manual and did not find much information.

Regards,

Nancy

kef2
Senior Contributor IV

Hi

You need to go to infocenter.arm.com and look there. The MMU is described in the ARMv7-A Architecture Reference Manual (not ARMv7-M; the Cortex-A5 is an ARMv7-A core, and the M profile has no MMU). Here's a direct link to this multi-thousand-page document.

ARM Information Center

Edward
