Instruction cache and QSPI XIP performance

nancyb
Contributor III

I am exploring the QSPI XIP performance for an MQX application running on a TWR-VF65GS10 processor board. I added the capability to enable and disable the instruction cache.

if ( value )
   _ICACHE_ENABLE();
else
   _ICACHE_DISABLE();

The execution performance is the same for either setting. What would account for this behavior? Does the instruction cache help QSPI XIP performance?

alejandrolozan1
NXP Employee

Hi Nancy,

Are you executing the MQX application from iRAM or DDR?
The performance increase should be easy to notice when executing from DDR.

Best Regards,

Alejandro

nancyb
Contributor III

Hi Alejandro,

I executed from iRAM and QSPI. I will try DDR.

Regards,

Nancy

nancyb
Contributor III

My application sums the contents of a variable-size buffer. For the following tests I used a 65 KB buffer in DDR, with the application also running from DDR.

I and D caches enabled: 7 ms.

I cache disabled and D cache enabled: 31 ms.

I cache enabled and D cache disabled: 94 ms.

D and I caches disabled: 120 ms.
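
To make the relative effect of each cache easier to compare, the four DDR timings can be turned into speedups over the all-caches-disabled case (roughly 17x, 3.9x and 1.3x). The sketch below only does that arithmetic; the function names are made up for illustration, and the times are the ones quoted above.

```c
#include <stdio.h>

/* Speedup of a configuration relative to a baseline time (same units). */
double speedup(double baseline_ms, double ms)
{
    return baseline_ms / ms;
}

/* Print one row: configuration label, measured time, speedup vs. baseline. */
void print_row(const char *label, double baseline_ms, double ms)
{
    printf("%-14s %6.1f ms  %5.2fx\n", label, ms, speedup(baseline_ms, ms));
}

/* Times in ms are copied from the DDR measurements in this post. */
void print_ddr_table(void)
{
    const double baseline = 120.0; /* I and D caches both disabled */

    print_row("I on,  D on",  baseline, 7.0);
    print_row("I off, D on",  baseline, 31.0);
    print_row("I on,  D off", baseline, 94.0);
    print_row("I off, D off", baseline, baseline);
}
```

Calling `print_ddr_table()` from a host-side `main` prints the table. In this test the D cache accounts for most of the gain, which fits a loop that is dominated by data reads.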

alejandrolozan1
NXP Employee

Hi Nancy,

Then it seems that the increase is noticeable in DDR, correct?

Best Regards,

Alejandro

nancyb
Contributor III

Cache affects DDR execution performance.

I notice no difference in either QSPI XIP or SRAM performance when I enable or disable the I cache.

I started this investigation in order to determine how the I cache affects QSPI XIP performance, which seems to be not at all. Is that accurate, or are there initialization settings that would boost performance?

Given the reduced bandwidth of the QSPI interface, I was hoping the I cache would help boost performance.

kef2
Senior Contributor V

Nancy,

I ran Dhrystone from SRAM some time ago with a 500 MHz CPU clock, 166 MHz bus clock and 83 MHz IP clock, and got the following execution time ratios. I can't tell what optimization settings were used.

I=0 D=0  / I=0 D=1  = 1.92

I=0 D=0  / I=1 D=0  = 2.73

I=0 D=0  / I=1 D=1  = 17.7

I tried to find the effect of the L2 cache and found no difference. I think this is because the Dhrystone code and data are quite small, and the available L1 cache is large enough to hold both.

It just can't be worse. The L1 cache operates, I think, at the core clock. SRAM and QSPI operate at much slower clocks.

nancyb
Contributor III

I understand that QSPI operates at a much slower clock rate, but what I don't understand is why the I cache does not appear to help one way or another. Enabled or disabled, the performance is the same. I would have thought that the I cache would speed things up, and that execution time would increase when it is disabled.

kef2
Senior Contributor V

In fact the I cache speeds things up a lot. Make sure the I cache is really enabled or disabled when you expect it to be. For example, the I-cache is enabled out of reset. Are you using DS-5? If so, you can check in the debugger whether the I-cache is enabled here:

Registers -> CP15 -> CP15_SCTLR -> I = Enabled/Disabled
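
The same check can be done on a raw SCTLR value, for example one copied from the debugger's register view. The bit positions below are the ARMv7-A ones (M = bit 0, C = bit 2, I = bit 12); the helper names are only illustrative. On the target the register is read with `MRC p15, 0, <Rt>, c1, c0, 0`, which this host-side sketch deliberately leaves out.

```c
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-A SCTLR bit positions. */
#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data and unified cache enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* Each helper takes a raw SCTLR value and reports one enable bit. */
bool sctlr_mmu_enabled(uint32_t sctlr)
{
    return (sctlr & SCTLR_M) != 0;
}

bool sctlr_dcache_enabled(uint32_t sctlr)
{
    return (sctlr & SCTLR_C) != 0;
}

bool sctlr_icache_enabled(uint32_t sctlr)
{
    return (sctlr & SCTLR_I) != 0;
}
```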

What does your code look like? If it is a very simple single loop, then perhaps all of the code is preloaded into the CPU prefetch unit. Up to 4 instructions can be prefetched on the Cortex-A5, though I don't know whether a loop of <=4 instructions can run from the PFU without further fetches from the I-cache or memory bus.

It also matters what you are doing in your code. If you are summing the words of a large QSPI area, I think all of the performance will be lost waiting for QSPI transfers to complete. Instead of summing a large area, try enabling the D-cache for QSPI and summing the same area of ~10k bytes N times. You should see the difference.
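
A minimal sketch of that experiment, assuming a plain byte-wise sum is representative enough: the same small buffer is summed several times, so that after the first pass the data can be served from the D-cache if the region is cacheable. The function name and the `volatile` qualifier (used so the compiler cannot collapse the repeated passes) are choices made for this sketch, not taken from the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum `len` bytes starting at `buf`, repeating the whole pass `passes` times.
 * The pointer is volatile so every read is really performed on each pass. */
uint32_t sum_repeated(const volatile uint8_t *buf, size_t len, unsigned passes)
{
    uint32_t total = 0;
    unsigned p;
    size_t i;

    for (p = 0; p < passes; p++) {
        for (i = 0; i < len; i++) {
            total += buf[i];
        }
    }
    return total;
}
```

Timing `sum_repeated(buf, 10000, N)` for the four cache combinations, with `buf` pointing into QSPI, should separate the cost of the first (cold) pass from the cached passes.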

nancyb
Contributor III

I am using IAR. I verified that Registers -> CP15 -> CP15_SCTLR -> I is set or reset when I issue either _ICACHE_ENABLE() or _ICACHE_DISABLE(). In either case the performance is the same, and that puzzles me. I can see performance differences when running in DDR.

I will experiment with your suggestion to sum a smaller buffer multiple times so that the code or data can be accessed from cache.

When I am summing the contents of a given buffer I alternately sum the contents of even buffer addresses starting at the buffer's lower addresses and incrementing upwards, and sum the contents of odd buffer addresses decrementing from the upper address. I can sum the buffer as 8, 16, and 32 bit quantities. Here is the summing loop:

   for ( i = 0; i < value; i++ )
   {
      switch ( param )
      {
         case 1:
            if ( i & 0x1 )
            {
               temp += *( ( uint_8 * )memPtr + value - i );
            }
            else
            {
               temp += *( ( uint_8 * )memPtr + i );
            }
            break;

         case 2:
            if ( i & 0x1 )
            {
               temp += *( ( uint_16 * )memPtr + value - i );
            }
            else
            {
               temp += *( ( uint_16 * )memPtr + i );
            }
            break;

         case 4:
         default:
            if ( i & 0x1 )
            {
               temp += *( ( uint_32 * )memPtr + value - i );
            }
            else
            {
               temp += *( ( uint_32 * )memPtr + i );
            }
            break;
      }
   }

kef2
Senior Contributor V

I tried executing your code with memPtr pointing to QSPI and param=1, executing from SRAM, with the D-cache enabled even for the QSPI area and the L2 cache not enabled. With buffer length (value) set to 10000 I got ~330 us vs ~2600 us execution times with and without the I-cache enabled respectively, almost an 8x difference. With a buffer length of 100000 I got ~50 vs 100 ms, or only 2x. With buffer length = 1000000 I got ~600 ms vs 1000 ms, or 1.6x. With larger areas you will get a smaller difference or no difference at all, provided you use slower QSPI clocks or modes.

It doesn't surprise me at all. The I-cache always helps speed up code, but when iterating incrementally over a large area of slow memory even the D-cache doesn't help a lot. While the D-cache is large enough to keep the slow data cached, we see a huge difference switching the I-cache on and off. When the buffer is large, and perhaps QSPI is slower than in my case, we should see a small difference or none. Do you agree?
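
One way to read those numbers is as time per buffer element and as an I-cache on/off ratio. The sketch below does only that arithmetic on the approximate figures quoted in this post; the function names are invented for illustration.

```c
/* Average time per element in nanoseconds, given a total time in microseconds. */
double ns_per_element(double total_us, double elements)
{
    return total_us * 1000.0 / elements;
}

/* How many times slower the run is with the I-cache off than with it on. */
double icache_ratio(double off_us, double on_us)
{
    return off_us / on_us;
}
```

With the figures above, `ns_per_element(330.0, 10000.0)` is about 33 ns with the I-cache on, against about 500 ns per element for the 100000-element run (50 ms) and about 600 ns for the 1000000-element run (600 ms). As the per-element cost of fetching data rises, `icache_ratio` falls from roughly 7.9 to 2 and then to about 1.7.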

nancyb
Contributor III

Edward,

Thank you for your support and explanation. I wonder if I am not handling the cache enable/disable correctly, because I do not see any performance difference when enabling and disabling either cache using _ICACHE_DISABLE() and _DCACHE_DISABLE(). I added _DCACHE_INVALIDATE() and _ICACHE_INVALIDATE() and I still get the same performance: with memPtr = 0x20000000, value = 0x10000, and param = 1, I get an execution time of 51 ms when executing from SRAM.

With the same parameters, executing from DDR I get 175 ms (I and D enabled), 199 ms (I cache disabled), 241 ms (I and D cache disabled), and 187 ms (D cache disabled). I had to comment out the call to _DCACHE_INVALIDATE() when running in DDR to keep execution from hanging.

If I saw numbers like you presented I would have been satisfied with my investigation into cache performance. Ultimately I want to understand how I cache affects QSPI XIP performance.

I would like to duplicate your cache numbers. Can you share your cache handling code?

kef2
Senior Contributor V

Nancy,

I tried benchmarking the same with the MMU table set up to not D-cache QSPI. Value=100000 now completes in 420 ms with the I-cache enabled and in 440 ms with the I-cache disabled. It seems your QSPI is not D-cacheable.

It looks like you are using MQX. The MMU table is set up in init_bsp.c. You need to find the _mmu_add_vregion() calls and then add a call for QSPI:

_mmu_add_vregion((pointer)0x20000000 /* QSPI base */,
                 (pointer)0x20000000 /* QSPI base */,
                 (_mem_size)0x00100000 /* QSPI size, change to appropriate */,
                 PSP_PAGE_TABLE_SECTION_SIZE(PSP_PAGE_TABLE_SECTION_SIZE_1MB) |
                 PSP_PAGE_TYPE(PSP_PAGE_TYPE_CACHE_WBNWA) |
                 PSP_PAGE_DESCR(PSP_PAGE_DESCR_ACCESS_RW_ALL));
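
Because the call above maps memory in 1 MB sections, the size argument decides how many first-level entries get the cacheable attribute. In the ARMv7-A short-descriptor format the first-level table is indexed by address bits [31:20]. The sketch below, with helper names made up for illustration, shows which entry a base address lands in and how many sections a region spans. How MQX rounds a size that is not a multiple of 1 MB is not covered here.

```c
#include <stdint.h>

#define SECTION_SHIFT 20u   /* 1 MB sections: index is address bits [31:20] */

/* First-level translation table index for an address. */
uint32_t section_index(uint32_t addr)
{
    return addr >> SECTION_SHIFT;
}

/* Number of 1 MB sections touched by [base, base + size). size must be > 0. */
uint32_t section_count(uint32_t base, uint32_t size)
{
    uint32_t first = base >> SECTION_SHIFT;
    uint32_t last = (base + size - 1u) >> SECTION_SHIFT;

    return last - first + 1u;
}
```

For the QSPI base used in this thread, `section_index(0x20000000)` is 0x200, and the 0x00100000 size covers exactly one section. A larger flash, for example a hypothetical 16 MB part, would need `section_count(0x20000000, 0x01000000)`, which is 16 sections, so the size argument has to be raised accordingly.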

If you don't want to recompile the BSP, I think you need to 1) disable both the I- and D-caches (_ICACHE_DISABLE, _DCACHE_DISABLE), 2) disable the MMU (_mmu_vdisable()), and 3) call the above _mmu_add_vregion() and re-enable the MMU and caches. I'm using bare metal and didn't try this with MQX, so apologies in advance if the above doesn't work.
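
The ordering of those steps can be sketched independently of MQX. In the sketch below the hook table `remap_ops_t` is hypothetical: on the target each pointer would wrap the corresponding MQX call named above, and the exact names and arguments of the re-enable calls are an assumption to check against the BSP. The trace hooks exist only so the sequence can be dry-run on a host.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hook table; on target each entry would wrap an MQX call. */
typedef struct {
    void (*icache_disable)(void);
    void (*dcache_disable)(void);
    void (*mmu_disable)(void);
    int  (*add_qspi_region)(void);  /* returns 0 on success */
    void (*mmu_enable)(void);
    void (*dcache_enable)(void);
    void (*icache_enable)(void);
} remap_ops_t;

/* Caches off, MMU off, add the region, then MMU and caches back on. */
int remap_qspi_cacheable(const remap_ops_t *ops)
{
    int status;

    ops->icache_disable();
    ops->dcache_disable();
    ops->mmu_disable();
    status = ops->add_qspi_region();
    /* Re-enable even if adding the region failed, so the system keeps running. */
    ops->mmu_enable();
    ops->dcache_enable();
    ops->icache_enable();
    return status;
}

/* Host-side dry run: every hook appends one letter to a trace string. */
static char remap_trace[16];

static void trace_step(char c)
{
    size_t n = strlen(remap_trace);

    if (n + 1 < sizeof remap_trace) {
        remap_trace[n] = c;
        remap_trace[n + 1] = '\0';
    }
}

static void t_icache_off(void) { trace_step('i'); }
static void t_dcache_off(void) { trace_step('d'); }
static void t_mmu_off(void)    { trace_step('m'); }
static int  t_add_region(void) { trace_step('R'); return 0; }
static void t_mmu_on(void)     { trace_step('M'); }
static void t_dcache_on(void)  { trace_step('D'); }
static void t_icache_on(void)  { trace_step('I'); }

const remap_ops_t remap_trace_ops = {
    t_icache_off, t_dcache_off, t_mmu_off, t_add_region,
    t_mmu_on, t_dcache_on, t_icache_on
};
```

Running `remap_qspi_cacheable(&remap_trace_ops)` on a host leaves the step letters in `remap_trace` in call order, which is a cheap way to check the sequence before wiring in the real calls.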

INVALIDATE calls shouldn't help here! They are used when you need to synchronize the CPU's view of memory with what is really in memory, for example after a DMA transfer completes, when you receive something from USB, or when you get a video frame from the VIU3 module. In such cases, provided the memory the transfer targets is cacheable, INVALIDATE calls make the cache reload from memory so the CPU sees the right picture again. You say it hangs when executing from DDR without invalidate. You need to figure out what makes the cache go out of sync with memory and hang. For example, it could be the M4 core writing lparam to cacheable memory and the A5 seeing some old, huge buffer size value cached. If you see no similar explanation and only the A5 is writing and reading loop variables etc. in your code, then perhaps you have HW problems with DDR!

Hm, is DDR slower than QSPI? Didn't you make a mistake, 50 ms from QSPI and 175 ms from DDR with the same loop size setting?

nancyb
Contributor III

Edward,

DDR was faster than QSPI. For the same benchmark configuration (value 0x10000, param 1), execution from SRAM was 50 ms, DDR was 175 ms, and QSPI was 746 ms.

Thank you for the information dealing with cache control. The DDR hang issue is uninteresting; just something I observed while experimenting with various memory configurations. I am not using the M4 core so I assume it never comes out of reset. Our ultimate application will be a graphical interface where the program runs out of QSPI and some graphics elements are stored in QSPI. I've been tasked with getting as much execution performance out of QSPI as possible. In that exploration I wanted to see what impact the I cache had on execution speed. I would imagine that if the code could fit entirely in the I cache then execution speed would be quite good compared with running out of QSPI.

I added the MMU code you provided to my application. The QSPI XIP performance went from 747 ms to 129 ms. That's better than DDR and only about 3x slower than SRAM.

Thank you for that bit of code. I will consider how to add it to the BSP so that the region is added whenever MQX is rebuilt for QSPI XIP.

nancyb
Contributor III

Edward,

I rebuilt the BSP with the added region and the performance was outstanding. Benchmark (value 0x10000, param 1) ran in 25 ms.

I can't explain why the performance increased from 129 ms to 25 ms after adding the region during MQX boot vs adding the region from an application task.

Thank you for your help!

kef2
Senior Contributor V

Nancy,

I'm glad you got this speed improvement. I thought you were executing and comparing data accesses to QSPI while executing from SRAM/DDR/QSPI. Now it's more clear.

For speed purposes, D-cacheability should be enabled for the area your code executes from. Code contains a lot of address constants which are loaded into CPU registers from memory as data operations, and this is where the D-cache helps.

Before trying your code I used to think that we shouldn't see a big difference with the D-cache on or off when doing incremental reads from a large area of data. I thought it should lead to continuous cache misses and no hits, but in fact I saw a 50 to 400 ms difference! Well, your code is not exactly incremental reads, but I tried a pure incremental read of the buffer and still see a big difference. So it seems my strategy of making some buffers uncached was the wrong one. I'm going to try caching those buffers instead and use cache clean and cache invalidate when appropriate. Thanks for the hint!

Regarding the big difference between enabling the D-cache for QSPI at boot time and enabling it later, I don't know how to explain it. Different code or data alignment should affect the cache miss rate and perhaps many other things, but the difference is very big. I don't know.

Best regards

Edward

nancyb
Contributor III

Hi Edward,

I have forwarded your MMU configuration to our team. I did some further experimentation with enabling the I and D caches, and now I can measure the performance impact each has.

Is there a document that I can explore to familiarize myself with the MMU and how to use it? I searched the Vybrid Reference Manual and did not find much information.

Regards,

Nancy

kef2
Senior Contributor V

Hi

You need to go to infocenter.arm.com and look there. The MMU is described in the ARMv7-A and ARMv7-R Architecture Reference Manual (the Cortex-A5 is an ARMv7-A core; ARMv7-M parts have no MMU). Here's a direct link to this multi-thousand-page document.

ARM Information Center

Edward
