Background:
I'm using the same processor as on the Vybrid Tower board, the MVF61NS: the package with L1 and L2 cache, clocked at 500 MHz. We are running our entire application bare-metal, XIP from Spansion QuadSPI flash (no external RAM at all). We're using a processing-heavy commercial library that is a black box to us (provided as a .lib), and we are using the ARM DS-5 development suite.
Issue:
The library purportedly runs 7 times faster than the results we're getting. The supplier provided a binary that supports that claim, but it only runs on top of the Timesys Linux distribution packaged with the Tower board. I was able to run the binary on our board under Linux and see the fast results, which leads me to believe the speed is achievable, though we can't use Linux in our application. I have enabled caching on the QuadSPI region in our image and see significant performance gains, but we are still 7 times slower than when running on the Linux board.
I did a little more sleuthing to confirm that the QuadSPI program data was indeed being cached: using an oscilloscope I probed the clock line on the QuadSPI chip during our library function calls and saw clock activity less than 5% of the time. This suggests that the majority of the program is already cached.
Questions:
What would be the best MMU settings for speed when running XIP from QuadSPI?
Can anyone think of any other register settings that might need to be enabled to get a performance bump like this?
Many thanks in advance!
Nicholas,
did you enable the data cache? The instruction cache, I think, is enabled out of reset, but the data cache is not. And if you are using the DS-5 compiler (this perhaps applies to others too), it likes to place a lot of constants in the code as loadable literal data rather than immediate operands, and those loads go through the data cache, not the instruction cache. Yes, by enabling the data cache and marking memory (code areas as well) as cacheable, you can easily get a ~7x speedup. To enable the data cache you first need to set up the MMU tables properly, which is not trivial and not universal for every application.
Not at all. The QSPI controller is itself a caching device. Once a row of data has been read from the NAND device into the QSPI controller's RAM buffer, the QSPI won't read from the NAND again until you request data that is not already in that buffer.
Both the icache and the dcache are enabled at startup. Here is a copy of the SCTLR:
Field | Value | Bits | Access |
SCTLR | 0x00C5187D | 32 | R/W |
TE | Disabled | 1 | R/W |
AFE | Disabled | 1 | R/W |
TRE | Disabled | 1 | R/W |
EE | Clear | 1 | R/W |
HA | Clear | 1 | R/W |
RR | Clear | 1 | R/W |
V | Normal | 1 | R/W |
I | Enabled | 1 | R/W |
Z | Enabled | 1 | R/W |
SW | Disabled | 1 | R/W |
C | Enabled | 1 | R/W |
A | Disabled | 1 | R/W |
M | Enabled | 1 | R/W |
The C and I fields are the data and instruction cache enables, respectively.
The 0x20000000 block corresponds to the only part of the QuadSPI address space I use for executable code.
The 0x3F000000 block corresponds to the usable part of the general-purpose iRAM; the heap and stack live here. I have tried disabling caching on this section but found no difference.
Memory Map:
Virtual Address | Physical Address | Type | AP | Cacheable | Shared | Executable |
S:0x00000000-0x000FFFFF | SP:0x00000000-0x000FFFFF | Normal | RW | Y | N | Y |
S:0x00100000-0x1FFFFFFF | SP:0x00100000-0x1FFFFFFF | Strongly-ordered | RW | N | Y | Y |
S:0x20000000-0x200FFFFF | SP:0x20000000-0x200FFFFF | Normal | RW | Y | N | Y |
S:0x20100000-0x3EFFFFFF | SP:0x20100000-0x3EFFFFFF | Strongly-ordered | RW | N | Y | Y |
S:0x3F000000-0x3F0FFFFF | SP:0x3F000000-0x3F0FFFFF | Normal | RW | Y | N | Y |
S:0x3F100000-0x3F3FFFFF | SP:0x3F100000-0x3F3FFFFF | Strongly-ordered | RW | N | Y | Y |
S:0x3F400000-0x3F4FFFFF | SP:0x3F400000-0x3F4FFFFF | Normal | RW | Y | N | Y |
S:0x3F500000-0xFFFFFFFF | SP:0x3F500000-0xFFFFFFFF | Strongly-ordered | RW | N | Y | Y |
Specific Table Entries:
0x20000000 | Section | SP:0x20000000 | NS=0, nG=0, S=0, AP=0x3, TEX=0x0, Domain=0, XN=0, C=1, B=0, PXN=0 |
0x3F000000 | Section | SP:0x3F000000 | NS=0, nG=0, S=0, AP=0x3, TEX=0x0, Domain=0, XN=0, C=1, B=0, PXN=0 |
0x3F100000 | Section | SP:0x3F100000 | NS=0, nG=0, S=0, AP=0x2, TEX=0x0, Domain=0, XN=0, C=0, B=0, PXN=0 |
0x3F400000 | Section | SP:0x3F400000 | NS=0, nG=0, S=0, AP=0x3, TEX=0x0, Domain=0, XN=0, C=1, B=0, PXN=0 |
Cortex-A5 rev r0p1 Technical Reference Manual:
6.2.1 Memory types
Although various different memory types can be specified in the page tables, the Cortex-A5 processor does not implement all possible combinations:
• Write-through caches are not supported. Any memory marked as write-through is treated as Non-cacheable.
• The outer shareable attribute is not supported. Anything marked as outer shareable is treated in the same way as inner shareable.
• Write-back no write allocate is not supported. It is treated as write-back write-allocate.
It looks like TEX=0, C=1, B=0 is write-through, no write-allocate, which per the note above the Cortex-A5 treats as Non-cacheable. Try setting B=1 (TEX=0, C=1, B=1 is outer and inner write-back).
Edward
Edward,
Thanks for the feedback. I just tried your suggestion: there was no perceptible change. If I set TEX=0, C=0, B=0, there is clearly a significant drop in speed, as if the code is no longer being cached. Not sure what to make of that.
Thanks,
Nick
Did you enable L2 cache?
Edward
I am calling the L2 cache enable functions. Do you know which registers I can check to confirm that it is actually enabled?
Thanks,
Nick
Try getting the following from arm.com:
AMBA® Level 2 Cache Controller (L2C-310)
Revision: r3p1
Technical Reference Manual
(DDI0246E_l2c310_r3p1_trm.pdf)
The L2 controller base on the VF6 is 0x40060000. The reg1_control register with the enable bit is at L2 base + 0x100, but I guess you will need to configure other registers as well.
jiri-b36968 do you have an update?
jiri-b36968 can you comment?
reminder
Jiri Kotzian can you comment?
Hi Nicholas,
Unfortunately, I'm not familiar with the library you are leveraging. We will likely need further details on the library to assist. Can you provide details here, or would you prefer to do so directly? If you would not mind creating an account and submitting a ticket at linuxlink.timesys.com, we can facilitate through there.
Thank you,
Timesys Support
Yes, I can make a Timesys forums account and post the question over there. Just to be clear, though: our application runs bare-metal, XIP from QuadSPI NAND. I'm merely trying to replicate what seems to be possible using the settings Timesys uses for its application space, granted I'm running Timesys from a microSD card and not QuadSPI. I would have expected that to make the application run slower, if anything. I understand applications are copied to the Tower board's onboard DDR before running, but I don't believe the DDR interface would be faster than the iRAM in the Vybrid.
The library provides image processing capabilities for biometric purposes (fingerprint verification) and is platform independent, with the exception that it was provided to us in compiled form for the ARM core in the Vybrid. They provided us four versions of the library; I attempted to use each one, but none of them had a significant effect on the speed of the algorithm.
Hi Nicholas,
As we aren't aware of this library, and we support Linux on Vybrid, I would suggest you contact the third-party vendor in this case and inquire about their optimizations. XIP/bare-metal queries on Vybrid can be addressed by a Freescale engineer, correct karinavalencia?
Regards,
Timesys Support
timesyssupport can you help to review this case?