I'm hoping someone out there is familiar with the boot-up flow of the Vybrid processor. I have an OTS SOM that will boot the A5 core from QSPI flash (located on the SOM) in DDR x2 mode, with the code ending up in DDR memory. I'm trying to fill in the data structures and understand the boot flow. I understand the QuadSPI Configuration Parameter block (Table 7-21) and the basics of the IVT.
My understanding of the first part of the boot flow with QSPI is:
I assume that the following (or something like it) is what happens next:
Questions
How much of the above is actually correct?
When the QSPI clock is at 66 MHz, what speed is the A5 core running at? (I assume 396 MHz)
What location is the “boot stuff” copied into in OCRAM? If it is 0x3f00_0000, what happens when the app code is also loaded into the same space? How many bytes are read in?
How does the boot loader know where the app code is in flash memory?
Can the QSPI controller function properly in DDR mode during boot? I haven’t seen any of the examples use it. They all use SDR mode.
Hi,
I was quite familiar with it a few years ago and can give some hints.
Well, using non-XIP mode (copying everything to RAM) makes sense with a big DDR RAM. With the rather small on-chip SRAM, however, I use XIP mode, because most of the initialization routines and the non-time-critical code can run happily from QSPI, saving RAM space for data. Time-critical code, and anything like the code that writes data back to QSPI memory, is copied to RAM by the C startup routine. You may initialize the DCD structure with DDR RAM initialization commands, but if you have doubts about how QSPI boot initializes the clocks, and even more doubts about how that may interfere with the DDR RAM setup, then you should rethink what's better for you: an XIP boot followed by clock reconfiguration for the best possible CPU and DDR RAM speeds, or a non-XIP boot with non-optimal clocks. I don't know whether it's safe to reconfigure DDR without losing code bits already copied to DDR.
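The startup-time copy of time-critical code mentioned above can be sketched like this. In a real project the load/run addresses come from linker-script symbols; here they are plain parameters (my naming, not from any vendor file) so the helper stands alone:

```c
#include <stdint.h>
#include <string.h>

/* Copy a code section from its load address (QSPI flash) to its run
 * address (on-chip SRAM), as the C startup routine would. The three
 * addresses would normally be linker-script symbols. */
static void copy_section(void *run_addr, const void *load_addr, const void *run_end)
{
    size_t n = (size_t)((const char *)run_end - (const char *)run_addr);
    memcpy(run_addr, load_addr, n);
}
```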
-The QSPI clock is configured to run at 18 MHz. What speed is the processor clock?
Well, in XIP mode it doesn't matter, I reconfigure clocks in my code.
1 KB (0x400) is read, containing QSPI configuration settings such as memory sizes, two-chip parallel vs. single-chip mode, CS hold/setup times, etc. With a parallel setup, this 1 KB is read from chip A.
Yes
Again, it doesn't matter for XIP boot.
I assume that the following (or something like it) is what happens next:
-The boot loader reads the first 4KB(??) from flash starting at location 0 into OCRAM memory starting at 0x3f00_0000(??). It then knows that the IVT will be at 0x3f00_0400.
It's not clear whether it is copied to OCRAM or handled on the fly, but next the DCD table should be processed; otherwise, how could you perform, for example, the DDR RAM setup before copying data to it? BTW, you may try using the DCD to reconfigure the device clocks for an optimal setup!
-Using the info in the BOOT data struct Table 7-54 (start and length), “length” bytes (the code image) are loaded into memory starting at location “start”. How does the boot loader know where the image is in flash memory?
Yes, something like this.
When the QSPI clock is at 66 MHz, what speed is the A5 core running at? (I assume 396 MHz)
I don't know, but it doesn't matter in XIP mode. Also, you can craft your DCD table to reconfigure it as you wish.
You may have problems debugging the DCD; try using the device clock monitor pins. Yes, you need to set them up from the same DCD table. Since I used XIP, my DCD has zero commands, just the table size and the DCD version.
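A zero-command DCD like that is just the 4-byte header. As a sketch (the tag and version bytes below are the usual values for this boot-ROM family, but verify them against the Vybrid RM):

```c
#include <stdint.h>

/* Minimal "empty" DCD: header only, zero commands. */
static const uint8_t empty_dcd[4] = {
    0xD2,       /* DCD tag */
    0x00, 0x04, /* big-endian length = 4 bytes (header only) */
    0x40        /* DCD version (assumed value; check the RM) */
};
```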
What location is the “boot stuff” copied into in OCRAM? If it is 0x3f00_0000, what happens when the app code is also loaded into the same space? How many bytes are read in?
Since the datasheets don't specify any reserved boot locations in OCRAM, you shouldn't worry about it; I'm sure all of the RAM is available for your purposes.
How does the boot loader know where the app code is in flash memory?
The source is the start address of the QSPI memory; the destination is specified in the Boot data structure. (There is a little wrinkle with the source image in parallel mode: the first kilobyte is read from device A, addresses 0..0x3FF. The rest is read in parallel mode, which means device addresses 0x400..0x4FF correspond to destination offsets 0x??400..0x??5FF, i.e. 2x more data.)
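That parallel-mode address doubling can be written down as a little helper (illustrative arithmetic only, not a boot-ROM API): the first 1 KB maps one-to-one, and from 0x400 on each per-device byte yields two destination bytes.

```c
#include <stdint.h>

/* Map a per-chip flash address to its destination image offset for a
 * two-chip parallel setup: first 1 KB from chip A alone, everything
 * after 0x400 read from both chips, so 2 bytes out per device byte. */
static uint32_t parallel_dest_offset(uint32_t device_addr)
{
    if (device_addr < 0x400u)
        return device_addr;                       /* chip A only */
    return 0x400u + 2u * (device_addr - 0x400u); /* parallel region */
}
```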
Can the QSPI controller function properly in DDR mode during boot? I haven’t seen any of the examples use it. They all use SDR mode.
Yes, why not? You just need to set up the configuration struct accordingly. I tried to squeeze as much as possible out of my QSPI memory; it has a quad pin mode and also supports DDR mode, but at a slower clock speed. 99 MHz SDR performed faster than the fastest available DDR mode.
Hope this helps
Edward
I used your verified files and they worked!! A second set of eyes really helps. I think the key thing I was doing wrong was loading the BFGENCR register AFTER enabling the controller (clearing the MDIS bit) instead of before. I see you added resetting of the AHB and serial domains; however, I don't know whether those statements actually did anything, because they need to come after enabling the controller.
NOTE: The software resets need the clock to be running to propagate to the design. The MCR[MDIS] should therefore be set to 0 when the software reset bits are asserted. Also, before they can be deasserted again (by setting MCR[SWRSTSD] to 0), it is recommended to set the MCR[MDIS] bit to 1. Once the software resets have been deasserted, normal operation can be started by setting the MCR[MDIS] bit to 0.
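The order the note prescribes can be sketched as follows. The MCR bit positions below match the usual QuadSPI MCR layout (MDIS bit 14, SWRSTSD bit 0, SWRSTHD bit 1), but verify them against your device header; the register itself is mocked here so the sequence stands alone.

```c
#include <stdint.h>

#define QuadSPI_MCR_MDIS_MASK    (1u << 14) /* verify positions in the header */
#define QuadSPI_MCR_SWRSTSD_MASK (1u << 0)
#define QuadSPI_MCR_SWRSTHD_MASK (1u << 1)

static volatile uint32_t mcr = QuadSPI_MCR_MDIS_MASK; /* stands in for QSPI->MCR */

static void qspi_soft_reset(void)
{
    mcr &= ~QuadSPI_MCR_MDIS_MASK;                               /* clock running */
    mcr |= QuadSPI_MCR_SWRSTSD_MASK | QuadSPI_MCR_SWRSTHD_MASK;  /* assert resets */
    /* ...let the resets propagate... */
    mcr |= QuadSPI_MCR_MDIS_MASK;                                /* disable before deassert */
    mcr &= ~(QuadSPI_MCR_SWRSTSD_MASK | QuadSPI_MCR_SWRSTHD_MASK); /* deassert */
    mcr &= ~QuadSPI_MCR_MDIS_MASK;                               /* normal operation */
}
```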
I moved them to after the statement enabling the controller, but it didn't make any difference in the outcome.
I really appreciate your help with this, as I was about ready to give up. You're a great resource!
Glad you are making progress.
Like many other unexplained details, the RM doesn't state why BFGENCR would need to be set up prior to MDIS->0. Of course, unless something immediately makes read requests to the QSPI AHB, which I guess doesn't happen in your case. It only states that SPTRCTL.BFPTRC should be set when changing the SEQID in BFGENCR. I guess it won't make any difference if you follow this requirement.
Yes, resetting the serial and AHB domains doesn't make any change; I just copied it from my artefacts while comparing against your init routine. What I tried to chase previously is what is actually needed to reconfigure QSPI before continuing to execute in place (XIP). I told you previously that a cache clear may be required. No. What is really required is forcing QSPI to read something before returning execution from RAM to QSPI, and waiting until it completes. Again, why wait? Shouldn't the AHB stall while busy? Does it trigger any bus aborts? Dark box.
QSPI->MCR &= ~(QuadSPI_MCR_MDIS_MASK);
v = *(volatile long *)(some_address_in_qspi); /* something not at the FLASH base and not important */
while (QSPI->SR & (QuadSPI_SR_BUSY_MASK | QuadSPI_SR_AHBTRN_MASK | QuadSPI_SR_AHBGNT_MASK))
{
}
Another undocumented feature is the CCM->CGPR.QSPIn_ACCZ bit. The RM says it forces the QSPI clock to be generated from the platform bus clock. OK, but why would one need it? Is it really a useless feature? No, not useless: QSPI (AHB?) seems to fail if the QSPI clock is faster than the platform bus clock (or IPG?). So for XIP execution, if there's a need to reconfigure the PLL1 clock, one has to set QSPIn_ACCZ before switching to the slow oscillator clock, and only then switch to the oscillator clock.
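As a sketch of that ordering (the register and bit names below are placeholders mocked so the snippet stands alone; the real ones are the CCM registers and field positions in the device header, and restoring ACCZ at the end is my assumption):

```c
#include <stdint.h>

#define CGPR_QSPI0_ACCZ   (1u << 10) /* placeholder position; check the RM */
#define CCSR_FAST_CLK_SEL (1u << 0)  /* placeholder: PLL1 vs. oscillator   */

static volatile uint32_t ccm_cgpr;                     /* mock CCM->CGPR */
static volatile uint32_t ccm_ccsr = CCSR_FAST_CLK_SEL; /* mock CCM->CCSR */

static void reconfigure_pll1_for_xip(void)
{
    ccm_cgpr |= CGPR_QSPI0_ACCZ;    /* 1. QSPI clock from platform bus clock */
    ccm_ccsr &= ~CCSR_FAST_CLK_SEL; /* 2. only now switch core to oscillator */
    /* 3. ...reconfigure PLL1 here... */
    ccm_ccsr |= CCSR_FAST_CLK_SEL;  /* 4. back to PLL1 */
    ccm_cgpr &= ~CGPR_QSPI0_ACCZ;   /* 5. restore normal QSPI clocking (assumed) */
}
```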
Edward
Hi,
some observations regarding MDIS->0. It looks like toggling MCR_SWRSTxD was doing something good in my case: not the action these bits are dedicated to, but the delay they introduced. A dummy read from QSPI memory is not enough (it probably doesn't happen at all); just a long enough delay is required between MDIS->0 and the exit from RAM back to QSPI XIP. It looks like the while loop wasn't doing any job in my case either; the SR status bits seem not to work at all for a few cycles after setting MDIS 1->0. And this is what is enough at a 500 MHz CPU + 500/3 platform bus clock:
QSPI->MCR &= ~(QuadSPI_MCR_MDIS_MASK);
QSPI->MCR &= ~(QuadSPI_MCR_MDIS_MASK);
QSPI->MCR &= ~(QuadSPI_MCR_MDIS_MASK);
QSPI->MCR &= ~(QuadSPI_MCR_MDIS_MASK);
Yes, this depends on compiler optimizations and may take shorter or longer. But I have no other good choice; the QSPI status bits seem useless during those several dead cycles after setting MDIS low. I guess the same happens in your case with BFGENCR. Previously I had it working with a combination of toggling the SWRSTxD bits + waiting while busy + invalidating the instruction cache... Argh, a simple delay is all that was necessary.
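A slightly sturdier variant of the repeated-write delay above, which doesn't rely on the optimizer keeping the redundant register writes, is a volatile counter loop (the loop count here is a guess to be tuned against your CPU/platform-bus clock ratio; the register is mocked so the snippet runs standalone):

```c
#include <stdint.h>

#define QuadSPI_MCR_MDIS_MASK (1u << 14)

static volatile uint32_t mock_mcr = QuadSPI_MCR_MDIS_MASK; /* stands in for QSPI->MCR */

static void qspi_enable_with_delay(void)
{
    mock_mcr &= ~QuadSPI_MCR_MDIS_MASK; /* MDIS 1->0 */
    /* volatile loop survives optimization; tune the count empirically */
    for (volatile int i = 0; i < 16; i++) { }
}
```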
Would be nice if anyone at NXP could look at the issue.
Edward
I did finally get some success. As far as I can tell, writing the DLP into the flash is working fine. I added some code to read back the DLP value from the flash (command 0x41), and it displays the correct value, so I left the DLP-writing code the way it was. I did change the way I write the QUAD_ENB and LC bits to the way you suggested, and I think that may be what made it work. The only combination I can get to work is LC = 0 with 6 dummy cycles (at 66 MHz); anything else produces garbage. The only glitch I have now is that the data appears one word early: for instance, the data actually programmed at location 0x40000 appears to be at location 0x3fff0. I know the data is programmed correctly, as I can verify it in other ways. If I read the data (as it appears below) using DDR x2 (no DLP), it appears correct: the value 0xea000007 appears at location 0x40000. I played with different LC and dummy-cycle values and cannot get it to change. I don't know what the differences are between DDR x2 and x4 other than the number of read lines.
I didn't send all the setup (clocks, etc.) because I figured you already had that from previous programs you've written. I use the IAR tools, and they include some of their proprietary stuff when I build a module. The IAR debugger lets me examine memory, so I just look at location 0x20000000+ to see what's in the flash. I don't bother to check all of memory, just the first few locations; that gives me an idea of whether everything is working. My programming code doesn't use this driver, and it does do a full comparison of the flash vs. the file used.
It's interesting that this is what is in u-boot for programming DLP and QUAD_ENB:
I assume it works.
Wasn't I clear? The code to send QUAD=1 (BTW, where do you see QUAD_ENB? The datasheet specifies it as QUAD) and DLP=0x34 is fine; the problem is that you don't wait for WIP=0 after QUAD=1! It's a proven fact. I added a read-status command and a loop. Perfect, even after cycling power on and off!
Ugh. Yes.
Please specify what dummy cycles you mean here: the total number of dummies, or dummies minus 4 for the DLP, i.e. 6-4=2 in the LUT table for the DDRQIOR command? Yes, LC=0, a total of 6 dummies, and data learn are working fine. I didn't try different LC values; I may try LC=1 or 2 tomorrow.
No, reading is perfect. Perhaps you have another problem with write or again some misunderstanding with LC.
Are you using your write routines? For proper AHB reads, after leaving your routines with IP commands, SFAR has to be restored to point to the bottom of the flash. Perhaps that's your issue; I don't remember whether you did anything to SFAR in your code.
With bad sampling-point settings you may have just a few broken bits within tens of kilobytes...
I don't see what you tried to share. If you mean something suspicious in u-boot, then no wonder, since nobody's perfect :-).
Regards,
Edward
Verified: LC=0,1,2 are fine with a dummy command argument in the LUT of 2,3,4 respectively. LC=3 is not applicable, since a total of 3 (fewer than 5) dummy cycles is not enough to send the DLP.
See the attached sources, modified from your last attached files, with wait-for-WIP, LC=0, dummy=6 (dummy command argument=2), 80 MHz.
I checked my code. I thought you had something on the DLP with
regset->TBDR = dlp;
However dlp is passed as an argument to the function write_DLP:
static void write_DLP
(
/* [IN] Pointer to the controller register set */
QSPI_MemMapPtr regset,
/* [IN] Base flash address */
uint32_t sfar,
/* [IN] DLP learning pattern */
uint32_t dlp
)
and the call to the function is:
// Send the DLP value
write_DLP(regset, FLASH_BASE_ADDRESS_A0, (DLP << 24));
where DLP is 0x34.
I'll try sending the bytes separately as you suggested. It's sad if that's what it takes to make it work.
Hi,
had time to take a closer look at it. (It would be fixed already if you had provided a working file set: no PLL1 config, no clock ungating, etc.)
If I run my code, which sets the DLP properly, and then run your code without cycling power, everything works well. If I cycle power and then run your code, it works up to about 62 MHz or less; perhaps the sampling point happens to match. Yes, I edited your code back to the DDR AHB command and dummy=2 (+4 from data learn). So what do you think is wrong? If I cycle power again, pause the debugger after the QUAD=1 routine, and then continue, it works well at least at 80 MHz DDR (I just didn't try more).
I need to say here that you didn't provide any means to verify memory reading. I added my code to receive a hex file and compare memory contents. It takes a long time to get from initialization to sending the file... Now you see the whole picture...
The problem is that QUAD=1 programming is like flash array programming: it takes time. You should loop reading the flash memory status register until it is ready. Here is the SR1 WIP bit description:
And a more detailed WIP bit description:
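(The register descriptions themselves are in the attached datasheet excerpts.) As a sketch, the wait loop amounts to the following; `qspi_read_sr1()` is a hypothetical wrapper around an IP-command RDSR1 (0x05) read, stubbed here so the loop runs standalone:

```c
#include <stdint.h>

static int fake_busy_cycles = 3; /* stub state: WIP stays set for a few polls */

/* Stub for an IP-command read of the flash SR1 register. */
static uint8_t qspi_read_sr1(void)
{
    return (fake_busy_cycles-- > 0) ? 0x01u : 0x00u;
}

/* After WRR/QUAD=1 (or any program/erase), poll until WIP (SR1 bit 0) clears. */
static void wait_wip_clear(void)
{
    while (qspi_read_sr1() & 0x01u)
        ;
}
```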
Edward
Byte ordering for sending the DLP byte is not applicable, as it is just one byte: 0x34. Sending the QUAD_ENB bit to the Configuration Register (CR1) in the flash is a different matter: to write CR1, you must also write SR1, so it becomes a two-byte transfer. However, I modelled my enable_quad_bit routine on the one in the u-boot driver (which I assume works). Here's the applicable part of the u-boot driver (you may be using this):
Sorry regarding SR/CR; yes, you had it correct:
regset->TBDR = 0x00020000 | (LC << 22);
or
regset->TBDR = (first byte << 24) | ( second byte << 16);
or
regset->TBDR = (status << 24) | (cr1 << 16);
But DLP isn't correct. I see this in your code:
regset->TBDR = dlp;
You need to shift it left by 24. Yes, it is just one byte, but it is the *first* byte, so you need to shift it left:
regset->TBDR = dlp << 24;
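The rule in all of the examples above is the same: the first byte sent to the flash must land in bits 31:24 of TBDR, the second in 23:16, and so on. A small illustrative helper (my naming, not a driver API) makes that explicit:

```c
#include <stdint.h>

/* Pack up to four payload bytes for TBDR, first byte in bits 31:24. */
static uint32_t tbdr_pack(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return ((uint32_t)b0 << 24) | ((uint32_t)b1 << 16) |
           ((uint32_t)b2 << 8)  |  (uint32_t)b3;
}
```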
I used command sequences without arguments to set QUAD and DLP. This made DDR work with DATA_LEARN. The command sequence looks like this for QUAD with LC=0:
LUT( VCMD_CMD, PAD_ONE, WRR)
LUT( VCMD_CMD, PAD_ONE, 0)
LUT( VCMD_CMD, PAD_ONE, 2)
and this for DLP
LUT( VCMD_CMD, PAD_ONE, WVDLR)
LUT( VCMD_CMD, PAD_ONE, 0x34)
Of course, both are preceded by a WREN command.
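For reference, what a macro like LUT(VCMD_CMD, PAD_ONE, WRR) boils down to: each 32-bit QuadSPI LUT register packs two 16-bit instructions, with bits [15:10] the opcode, [9:8] the pad count, and [7:0] the operand. The CMD opcode value of 1 below is my assumption from the usual QuadSPI LUT encoding; verify it against your header/RM.

```c
#include <stdint.h>

/* Encode one 16-bit LUT instruction: opcode[15:10], pads[9:8], operand[7:0]. */
#define LUT_INSTR(opcode, pads, operand) \
    ((uint32_t)(((opcode) & 0x3Fu) << 10) | (((pads) & 0x3u) << 8) | ((operand) & 0xFFu))

/* Pack two instructions into one 32-bit LUT register (first in the low half). */
#define LUT_PAIR(first, second) (((uint32_t)(second) << 16) | (uint32_t)(first))

#define LUT_CMD 1u /* "transmit command byte" opcode (assumed value) */
```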
Edward
Among the different issues, I think the root of your DLP problems is the overlooked byte order of the TBDR/RBDR registers. The first byte written to or read from the flash lands in (or comes from) bits 31:24 of TBDR/RBDR. You need to reverse the byte order before writing to TBDR and after reading from RBDR. Please check at least your DLP and LC-bit programming routines. If that's not clear: instead of CMD+WRITE instructions, you can use CMD+CMD[+CMD...] in one sequence, with the first CMD for the command byte and the next CMDs for the command argument, such as the DLP or status + configuration.
Edward
If what you say is true, then my driver has a bug in it that I cannot find. I've attached it if you want to have a look; kudos if you can find it. I have yet to run across a driver that works with this SOM in DDR x4 operation. The file QSPI Driver.c compiles with a couple of warnings. The write portion of the driver is not functional yet, as I don't use writes in this particular application; the flash is written by another application (where writes work). I included an app, "boot loader.c", so you can see what you need to do to use the driver; it's pretty simple. The flash LC value is defined in QSPI Driver prv.h. I use AHB reads (as you do if you're running XIP). The AHB sequence used is set in QSPI Handler.c. The driver as-is is set to use 66 MHz, DDR x4, LC = 0, DLP = 0x34, but it doesn't work.
Hi,
missing *.h dependencies; it doesn't compile.
Edward
So when I thought I had DDR x4 fixed, I found out I didn't. Concentrating on running at 66 MHz, I tried all combinations of 2 and 4 dummy cycles (even 6) and LC = 0,1,2,3; none of them worked. LC = 2 never worked, and the other combinations just produced scrambled results. I guess I'm convinced that this SOM can't do x4 operations for some reason. I don't know if it's a poor design or what, but I'm getting tired of working on it.
Hi,
certainly any SOM with a Vybrid can do x4 operations. High clocks, and even more so DDR + high clocks, can of course be problematic. Instead of trying all LC and DLP combinations, you should make sure the current LC of your memory matches the dummy-cycle setting in the Vybrid QuadSPI command look-up table (LUT). You should also program a good value into the memory's DLP register and make sure it matches the DATA_LEARN command argument in the Vybrid QuadSPI LUT. With both set properly on both sides (memory + Vybrid), I'm sure it would work even on the poorest PCB. Using PFD fine clock tuning, you could raise the QuadSPI clock until you get bad readings; that's a good way to go. But yes, non-DDR mode is less problematic.
Edward
The SOM is made by Emcraft (www.emcraft.com/products/259). The flash parts are 2x S25FL512SDSBHV210 devices (datasheet attached); these are the enhanced high-performance 80 MHz parts. The SOM manual (attached) says the parts are from Spansion; however, I think they're actually from Cypress. I haven't kept up, but I assume Cypress purchased Spansion, so they inherited the parts (like NXP now owns Freescale). I'll try playing with the latency bits when I get the chance. I did try different numbers of dummy cycles (4 - 8) at 80 MHz, but I still couldn't get 80 MHz to work. Table 23 in the attached spec shows that 80 MHz, 6 dummy cycles, and LC = 00 should work, but it didn't (I'm still waiting for Cypress to get back to me regarding the flash parts). FYI: Emcraft makes an eval board for this SOM so it can fit with the other Tower boards.
Hi,
The LC setting and the DLP pattern in this memory chip are programmable, and both are nonvolatile. So you either need to send them to the chip every time or program the nonvolatile values.
The default DLP setting, 0x00, is wrong and not usable. It should be 0x34, or something equally good for DDR data learning. For the data-learning feature you need to either a) set this value on every power-up using the WREN + WVDLR commands, or b) program the nonvolatile DLP setting using WREN + PNVDLR.
The LC setting is nonvolatile and is set along with the QUAD enable bit in the same WRR command. The LC setting itself is as I wrote above, along with the tables, which are the same. What I don't understand is Table 23, "Latency Codes for DDR Enhanced High Performance", in your datasheet: two rows for LC = 00, 01, 10, one for <66 MHz and another for <80 MHz. I guess which row applies depends on the speed grade of your chip; see the ordering information:
AG = 133 MHz
DP = 66 MHz DDR
DS = 80 MHz DDR
I think DS has the 80 MHz limit and DP has 66, so in that odd Table 23 you should look at the 80 MHz and 66 MHz rows respectively.
The default LC setting is 00 according to Tables 23 and 21; which one applies to you again depends on the code of your chip, see EHLPC vs. HLPC in the ordering information. Fortunately, both tables are the same for the DDR Quad I/O commands. So LC = 0 means 6 dummy cycles, which in the Vybrid LUT should turn into either a) 6 dummy cycles + no data learning, or b) 2 dummy cycles + data learning (4 cycles). How it could work for you with 4 dummy cycles I don't know; perhaps you set LC = 10, in which case you have a total of 8 dummy cycles (4 really dummy + 4 for the DLP).
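Putting the relationship above into two tiny helpers (illustrative only; the LC-to-dummy mapping assumes the DDR Quad I/O table discussed here, LC=0 -> 6 total cycles, each LC step adding one, which matches the verified 2/3/4 LUT arguments earlier in the thread; verify against Tables 21/23 for your exact part):

```c
/* Total dummy cycles for the DDR Quad I/O read as a function of LC. */
static int ddr_qior_total_dummies(int lc)
{
    return 6 + lc;
}

/* LUT DUMMY-instruction argument when the 4-cycle DLP (data learn) is used. */
static int lut_dummy_arg_with_dlp(int lc)
{
    return ddr_qior_total_dummies(lc) - 4;
}
```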
Edward
I use PLL1:PFD4, which with the MQX default settings runs at 528 MHz. Using the dividers in CCM_CSCDR3, I can divide the 528 MHz down to 33 MHz, 44 MHz, or 66 MHz. All three of these speeds now work with 4 dummy cycles and the DLP. I tried 80 MHz using PLL3:PFD4, but I couldn't find any dummy-cycle setting that worked; it always came out garbage. I am checking with Cypress on the LC bits in CR1 of the flash device. Is the user supposed to program those? When I set the QUAD_ENB bit in CR1, I always send 0x02, which means that if these bits can be programmed, I'm always setting them to b00.
Which Cypress device are you using? Datasheet URL? Yes, the proper latency code should be programmed into the memory chip (every time at startup, or by programming the nonvolatile value), or the default value should be used. At higher frequencies you may need a higher latency setting (more dummy cycles); you need to check your datasheets.