Quadspi XIP Best MMU Settings

oldyear
Contributor II

Background:

I'm using the same processor as the Vybrid tower board, the MVF61NS. This is the package with L1 and L2 caches, clocked at 500 MHz. We are running our entire application bare-metal, executing in place (XIP) from Spansion QuadSPI flash (no external RAM at all). We're using a processing-heavy commercial library that is a black box to us (provided as a .lib). We are also using the ARM DS-5 development suite.

Issue:

The library purportedly runs 7 times faster than the results we're getting. The supplier provided us a binary that supports this claim, but it only runs on top of the Timesys Linux distro packaged with the tower board. I was able to run the binary on our board under Linux and see the fast results, which leads me to believe the speed is achievable, though we can't use Linux in our application. I have enabled caching on the QuadSPI section of our image and see significant performance gains, but we are still 7 times slower than when running under Linux on the same board.

I did a little more sleuthing to make sure the QuadSPI program data was indeed being cached: using an oscilloscope, I probed the clock line on the QuadSPI chip during our library function calls and saw clocks less than 5% of the time. This suggests to me that the majority of the program is already cached.

Questions:

What would be the best MMU settings for speed when running XIP from QuadSPI?

Can anyone think of any other register settings that might need to be enabled to get a performance boost like this?

Many thanks in advance!

1 Solution
CommunityBot
Community Manager
This is an automatic process.

We are marking this post as solved due to either low activity or a reply being marked as correct.

If you have additional questions, please create a new post and reference this closed post.

NXP Community!

15 Replies
kef2
Senior Contributor IV

Nicholas,

Did you enable the data cache? I think the instruction cache is enabled out of reset, but the data cache is not. And if you are using the DS-5 compiler (perhaps this applies to others too), it likes to put a lot of constants into the code as loadable data constants (literal pools) rather than immediate constants. Immediate constants are part of the instruction stream and are cached by the instruction cache, but literal-pool constants are fetched with data loads, so they only benefit from the data cache. Yes, by enabling the data cache and making the code areas cacheable as well, you can easily get a ~7x speedup. To enable the data cache you first need to set up the MMU tables properly, which is not trivial and differs from app to app.

  • I did a little bit more sleuthing to make sure that the quadspi program data was indeed being cached: Using an oscilloscope I probed the clock line on the quadspi chip during our library function calls and only see clocks during less than 5% of the time.  This suggests to me that a majority of the program is already cached.

Not at all. The QSPI controller is itself a caching device. Once a row of data has been read from the NAND device into the QSPI RAM, the QSPI controller won't read from the NAND device again until you request data that is not held in the QSPI RAM.
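A minimal sketch of the bring-up order described above (translation table in place first, then the MMU and caches switched on), assuming an ARMv7-A core such as the Cortex-A5, a privileged execution mode, and GCC-style inline assembly. The function names `sctlr_with_mmu_and_caches` and `mmu_enable` are hypothetical, the translation table is assumed to be already filled and 16 KB aligned, and the data-cache invalidate by set/way that a real bring-up needs is only marked, not implemented. armcc (ARM Compiler 5) in DS-5 uses a different inline-assembly syntax, so this would need adapting there.

```c
#include <stdint.h>

/* ARMv7-A SCTLR bits that matter here. */
#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data (and unified) cache enable */
#define SCTLR_Z (1u << 11)  /* branch prediction enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* Hypothetical helper: SCTLR value with MMU, caches and branch prediction on,
   leaving every other bit as it was read. */
uint32_t sctlr_with_mmu_and_caches(uint32_t sctlr)
{
    return sctlr | SCTLR_M | SCTLR_C | SCTLR_Z | SCTLR_I;
}

#if defined(__arm__)
/* Hypothetical bring-up routine. l1_table: 4096 section entries, 16 KB aligned,
   already filled in. Must run in a privileged mode. */
void mmu_enable(const uint32_t *l1_table)
{
    uint32_t zero = 0, sctlr;

    __asm__ volatile("mcr p15, 0, %0, c2, c0, 2" : : "r"(zero));                          /* TTBCR = 0: TTBR0 only */
    __asm__ volatile("mcr p15, 0, %0, c2, c0, 0" : : "r"((uint32_t)(uintptr_t)l1_table)); /* TTBR0, non-cacheable walks */
    __asm__ volatile("mcr p15, 0, %0, c3, c0, 0" : : "r"(0x1u));                          /* DACR: domain 0 = client */
    __asm__ volatile("mcr p15, 0, %0, c8, c7, 0" : : "r"(zero));                          /* TLBIALL */
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 0" : : "r"(zero));                          /* ICIALLU */
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 6" : : "r"(zero));                          /* BPIALL */
    /* A real bring-up also invalidates the data cache by set/way here (not shown). */
    __asm__ volatile("dsb\n\tisb" : : : "memory");

    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
    sctlr = sctlr_with_mmu_and_caches(sctlr);
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(sctlr) : "memory");
    __asm__ volatile("isb" : : : "memory");
}
#endif
```

The pure helper is kept separate from the CP15 accesses so the bit arithmetic can be checked on a host; only `mmu_enable` depends on the target.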

oldyear
Contributor II

Both the I-cache and the D-cache are enabled at startup. Here is a copy of the SCTLR:

Field  Value       Size  Access
SCTLR  0x00C5187D  32    R/W
TE     Disabled    1     R/W
AFE    Disabled    1     R/W
TRE    Disabled    1     R/W
EE     Clear       1     R/W
HA     Clear       1     R/W
RR     Clear       1     R/W
V      Normal      1     R/W
I      Enabled     1     R/W
Z      Enabled     1     R/W
SW     Disabled    1     R/W
C      Enabled     1     R/W
A      Disabled    1     R/W
M      Enabled     1     R/W

The C and I bits are the data and instruction cache enables, respectively.
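To confirm the cache enables from software rather than from the debugger view, the SCTLR read-back value can be decoded bit by bit. A small sketch, assuming the ARMv7-A bit positions of the fields in the table above; `sctlr_field`, `sctlr_print` and `sctlr_read` are hypothetical helpers, and the CP15 read is only compiled for ARM targets with GCC-style inline assembly.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Name and bit position of the SCTLR fields shown in the DS-5 register view. */
struct sctlr_field_desc {
    const char *name;
    unsigned bit;
};

static const struct sctlr_field_desc sctlr_fields[] = {
    {"TE", 30}, {"AFE", 29}, {"TRE", 28}, {"EE", 25}, {"HA", 17},
    {"RR", 14}, {"V", 13},   {"I", 12},   {"Z", 11},  {"SW", 10},
    {"C", 2},   {"A", 1},    {"M", 0},
};

#define SCTLR_NFIELDS (sizeof sctlr_fields / sizeof sctlr_fields[0])

/* Returns 1 if the named field is set, 0 if clear, -1 for an unknown name. */
int sctlr_field(uint32_t sctlr, const char *name)
{
    for (size_t i = 0; i < SCTLR_NFIELDS; i++)
        if (strcmp(sctlr_fields[i].name, name) == 0)
            return (int)((sctlr >> sctlr_fields[i].bit) & 1u);
    return -1;
}

/* Prints one "NAME value" line per field, in the same order as the debugger view. */
void sctlr_print(uint32_t sctlr)
{
    for (size_t i = 0; i < SCTLR_NFIELDS; i++)
        printf("%-4s %d\n", sctlr_fields[i].name, sctlr_field(sctlr, sctlr_fields[i].name));
}

#if defined(__arm__)
/* Reads SCTLR through CP15 (privileged modes only). */
uint32_t sctlr_read(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v));
    return v;
}
#endif
```

On the target, `sctlr_print(sctlr_read())` would print the same information as the register view; for 0x00C5187D it reports M, C, Z and I set and TRE clear.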

The 0x20000000 block is the only part of the QuadSPI region I use for executable code.

The 0x3F000000 block corresponds to the usable part of the general-purpose iRAM; the heap and stack live here. I have tried disabling caching on this section but found no difference.

Memory Map:

Virtual address          Physical address          Type              AP  Cacheable  Shared  Executable
S:0x00000000-0x000FFFFF  SP:0x00000000-0x000FFFFF  Normal            RW  Y          N       Y
S:0x00100000-0x1FFFFFFF  SP:0x00100000-0x1FFFFFFF  Strongly-ordered  RW  N          Y       Y
S:0x20000000-0x200FFFFF  SP:0x20000000-0x200FFFFF  Normal            RW  Y          N       Y
S:0x20100000-0x3EFFFFFF  SP:0x20100000-0x3EFFFFFF  Strongly-ordered  RW  N          Y       Y
S:0x3F000000-0x3F0FFFFF  SP:0x3F000000-0x3F0FFFFF  Normal            RW  Y          N       Y
S:0x3F100000-0x3F3FFFFF  SP:0x3F100000-0x3F3FFFFF  Strongly-ordered  RW  N          Y       Y
S:0x3F400000-0x3F4FFFFF  SP:0x3F400000-0x3F4FFFFF  Normal            RW  Y          N       Y
S:0x3F500000-0xFFFFFFFF  SP:0x3F500000-0xFFFFFFFF  Strongly-ordered  RW  N          Y       Y

Specific Table Entries:

0x20000000  Section  SP:0x20000000  NS=0, nG=0, S=0, AP=0x3, TEX=0x0, Domain=0, XN=0, C=1, B=0, PXN=0
0x3F000000  Section  SP:0x3F000000  NS=0, nG=0, S=0, AP=0x3, TEX=0x0, Domain=0, XN=0, C=1, B=0, PXN=0
0x3F100000  Section  SP:0x3F100000  NS=0, nG=0, S=0, AP=0x2, TEX=0x0, Domain=0, XN=0, C=0, B=0, PXN=0
0x3F400000  Section  SP:0x3F400000  NS=0, nG=0, S=0, AP=0x3, TEX=0x0, Domain=0, XN=0, C=1, B=0, PXN=0
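The field lists above map onto a raw 32-bit ARMv7 short-descriptor section entry. A sketch of that packing, assuming no LPAE and AP[2] = 0; `struct section_attrs` and `section_encode` are hypothetical names. Under this layout the 0x20000000 entry above should come out as 0x20000C0A, and the same entry with B=1 as 0x20000C0E.

```c
#include <stdint.h>

/* Fields of an ARMv7 short-descriptor section entry (1 MB mapping, no LPAE). */
struct section_attrs {
    uint32_t base;    /* physical base address, 1 MB aligned */
    unsigned ap;      /* AP[1:0]; AP[2] (bit 15) is left at 0 */
    unsigned tex;     /* TEX[2:0] */
    unsigned c, b;    /* cacheable / bufferable bits */
    unsigned xn;      /* execute-never */
    unsigned domain;  /* 0..15 */
    unsigned s, ng, ns, pxn;
};

/* Packs the fields into the 32-bit first-level descriptor. */
uint32_t section_encode(const struct section_attrs *a)
{
    return (a->base & 0xFFF00000u)
         | ((a->ns & 1u) << 19)
         | ((a->ng & 1u) << 17)
         | ((a->s & 1u) << 16)
         | ((a->tex & 7u) << 12)
         | ((a->ap & 3u) << 10)
         | ((a->domain & 0xFu) << 5)
         | ((a->xn & 1u) << 4)
         | ((a->c & 1u) << 3)
         | ((a->b & 1u) << 2)
         | 0x2u                 /* bit 1 set: section descriptor */
         | (a->pxn & 1u);       /* bit 0 = PXN on cores that implement it */
}
```

The entry for a given 1 MB region lives at index `base >> 20` of the 4096-entry table (0x200 for the QuadSPI block). If the table is already in use, the modified entry generally has to be made visible to the table walker (for example, cleaned from the data cache) and the TLB invalidated before the new attributes take effect.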
kef2
Senior Contributor IV

Cortex-A5 rev r0p1 Technical Reference Manual:

6.2.1 Memory types

Although various different memory types can be specified in the page tables, the Cortex-A5 processor does not implement all possible combinations:

• Write-through caches are not supported. Any memory marked as write-through is treated as Non-cacheable.

• The outer shareable attribute is not supported. Anything marked as outer shareable is treated in the same way as inner shareable.

• Write-back no write allocate is not supported. It is treated as write-back write-allocate.

It looks like TEX=0, C=1, B=0 is write-through, no write-allocate, which per the first point above is treated as Non-cacheable. Try setting B=1.

Edward
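The TRM quote can be combined with the ARMv7 TEX/C/B encoding (with TEX remap disabled, matching TRE = Disabled in the SCTLR dump earlier in the thread) as sketched below. The enum and function names are hypothetical; for TEX=1xx only the inner (L1) policy is decoded. Under these assumptions, TEX=0, C=1, B=0 decodes to write-through, which the quoted section says the Cortex-A5 treats as non-cacheable, while B=1 decodes to write-back no write-allocate, which it treats as write-back write-allocate.

```c
/* Memory types reachable through TEX/C/B with SCTLR.TRE = 0 (no TEX remap). */
typedef enum {
    MEM_STRONGLY_ORDERED,
    MEM_DEVICE,
    MEM_NORMAL_NC,       /* Normal, non-cacheable */
    MEM_NORMAL_WT,       /* write-through, no write-allocate */
    MEM_NORMAL_WB_NWA,   /* write-back, no write-allocate */
    MEM_NORMAL_WB_WA,    /* write-back, write-allocate */
    MEM_RESERVED         /* reserved or IMPLEMENTATION DEFINED encodings */
} mem_type;

/* Architectural decode. For TEX = 1xx, C:B is the inner (L1) policy and TEX[1:0]
   the outer (L2) policy; only the inner policy is returned here. */
mem_type tex_cb_decode(unsigned tex, unsigned c, unsigned b)
{
    unsigned cb = ((c & 1u) << 1) | (b & 1u);

    if (tex & 4u) {
        switch (cb) {
        case 0:  return MEM_NORMAL_NC;
        case 1:  return MEM_NORMAL_WB_WA;
        case 2:  return MEM_NORMAL_WT;
        default: return MEM_NORMAL_WB_NWA;
        }
    }
    switch (((tex & 3u) << 2) | cb) {
    case 0x0: return MEM_STRONGLY_ORDERED;  /* TEX=000 C=0 B=0 */
    case 0x1: return MEM_DEVICE;            /* TEX=000 C=0 B=1, shareable */
    case 0x2: return MEM_NORMAL_WT;         /* TEX=000 C=1 B=0 */
    case 0x3: return MEM_NORMAL_WB_NWA;     /* TEX=000 C=1 B=1 */
    case 0x4: return MEM_NORMAL_NC;         /* TEX=001 C=0 B=0 */
    case 0x7: return MEM_NORMAL_WB_WA;      /* TEX=001 C=1 B=1 */
    case 0x8: return MEM_DEVICE;            /* TEX=010 C=0 B=0, non-shareable */
    default:  return MEM_RESERVED;
    }
}

/* What the Cortex-A5 does with the decoded type, per TRM section 6.2.1 quoted above. */
mem_type cortex_a5_effective(mem_type t)
{
    if (t == MEM_NORMAL_WT)
        return MEM_NORMAL_NC;      /* write-through treated as non-cacheable */
    if (t == MEM_NORMAL_WB_NWA)
        return MEM_NORMAL_WB_WA;   /* WB no-WA treated as WB write-allocate */
    return t;
}
```

Separating the architectural decode from the core-specific adjustment keeps the table readable and makes it clear which behaviour comes from the ARMv7 encoding and which from the Cortex-A5 TRM.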

oldyear
Contributor II

Edward,

Thanks for the feedback. I just tried your suggestion, but there was no perceptible change. If I set TEX=0, C=0 and B=0, there is clearly a significant drop in speed, as if the code were no longer being cached. I'm not sure what to make of that.

Thanks,

Nick

kef2
Senior Contributor IV

Did you enable the L2 cache?

Edward

oldyear
Contributor II

I am calling the L2 cache enable functions. Do you know which registers I can check to confirm that it is actually enabled?

Thanks,

Nick

kef2
Senior Contributor IV

Try getting the following document from arm.com:

AMBA® Level 2 Cache Controller (L2C-310), Revision r3p1, Technical Reference Manual (DDI0246E_l2c310_r3p1_trm.pdf)

The L2 controller base address in the VF6 is 0x40060000. The reg1_control register, which holds the enable bit (bit 0), is at L2 base + 0x100, but I guess you will need to configure other registers as well.
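A sketch of checking and enabling the controller through reg1_control, assuming the VF6 base address given above and the L2C-310 register offsets from the ARM TRM (reg1_control at 0x100, reg7_inv_way at 0x77C). The function names and the `way_mask` parameter are hypothetical, and the other registers the post alludes to (for example auxiliary control and RAM latency control, as an assumption) are not configured here, since the right values depend on the Vybrid setup. The register block is passed in as a pointer so the logic can also be exercised against a plain array.

```c
#include <stdint.h>

#define VF6_L2C_BASE      0x40060000u  /* L2C-310 base on VF6, per the post above */
#define L2C_REG1_CONTROL  0x100u       /* bit 0 = L2 cache enable */
#define L2C_REG7_INV_WAY  0x77Cu       /* invalidate by way; bits clear when done */

static volatile uint32_t *l2c_reg(volatile uint32_t *base, uint32_t offset)
{
    return base + offset / 4u;
}

/* Returns 1 if the L2 cache is enabled, 0 otherwise. */
int l2c310_is_enabled(volatile uint32_t *base)
{
    return (int)(*l2c_reg(base, L2C_REG1_CONTROL) & 1u);
}

/* Hypothetical enable routine: invalidate all ways, wait, then set the enable bit.
   way_mask is 0xFF for an 8-way cache (assumption: check the associativity first).
   Returns 0 on success, -1 if the invalidate did not finish within max_spins polls. */
int l2c310_enable(volatile uint32_t *base, uint32_t way_mask, unsigned long max_spins)
{
    if (l2c310_is_enabled(base))
        return 0;  /* already on: invalidating now would discard live data */

    *l2c_reg(base, L2C_REG7_INV_WAY) = way_mask;
    while (*l2c_reg(base, L2C_REG7_INV_WAY) & way_mask) {
        if (max_spins-- == 0)
            return -1;
    }
    *l2c_reg(base, L2C_REG1_CONTROL) |= 1u;
    return 0;
}
```

On the target this would be called as `l2c310_enable((volatile uint32_t *)VF6_L2C_BASE, 0xFF, 1000000)`; reading `l2c310_is_enabled` afterwards (or reg1_control in the debugger's memory view) shows whether the enable actually took effect.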

karina_valencia
NXP Apps Support
NXP Apps Support

jiri-b36968, do you have an update?

karina_valencia
NXP Apps Support
NXP Apps Support

jiri-b36968, can you comment?

karina_valencia
NXP Apps Support
NXP Apps Support

Reminder: Jiri Kotzian, can you comment?

timesyssupport
Senior Contributor II

Hi Nicholas,

Unfortunately, I'm not familiar with the library you are using. We will likely need further details on the library to assist. Can you provide those details here, or would you prefer to do so directly? If you would not mind creating an account and submitting a ticket at linuxlink.timesys.com, we can handle it through there.

Thank you,

Timesys Support

oldyear
Contributor II

Yes, I can make a Timesys forums account and post the question there. Just to be clear, though: our application runs bare-metal, executing in place (XIP) from QuadSPI NAND. I'm merely trying to replicate what seems to be possible with the settings Timesys uses for its application space. Granted, I'm running Timesys from a microSD card rather than QuadSPI, which I would have expected to make the application run slower if anything. I understand applications are copied to the DDR on the tower board before running, but I don't believe the DDR interface would be faster than the iRAM in the Vybrid.

The library provides image processing capabilities for biometric purposes (fingerprint verification) and is platform independent, except that it was provided to us in compiled form for the ARM core in the Vybrid. They provided us four versions of the library:

  • Regular
  • Regular with Hardware floating point support
  • Neon Core
  • Neon Core with Hardware Floating point support

I tried each one, but none of them had a significant effect on the speed of the algorithm.

timesyssupport
Senior Contributor II

Hi Nicholas,

As we aren't familiar with this library and support Linux on Vybrid, I would suggest you contact the third-party vendor in this case and ask about their optimizations. XIP/bare-metal questions on Vybrid can be addressed by a Freescale engineer, correct, karinavalencia?

Regards,

Timesys Support

karina_valencia
NXP Apps Support
NXP Apps Support

timesyssupport, can you help review this case?
