Vybrid QSPI boot flow

ogj · 12-30-2018

I'm hoping someone out there is familiar with the boot flow of the Vybrid processor. I have an OTS SOM that will be booting the A5 core from QSPI flash (located on the SOM) in DDR x2 mode, with the code ending up in DDR memory. I’m trying to fill in the data structures and understand the boot flow. I understand the QuadSPI Configuration Parameter block (Table 7-21) and the basics of the IVT.

My understanding of the first part of the boot flow with QSPI is:

-The QSPI clock is configured to run at 18 MHz. What speed is the processor clock?
-The QSPI pins are configured.
-Do basic QSPI read operation starting at flash location 0 (318 bytes) to get configuration parameters
-Re-configure the QSPI controller per the parameters
-Re-configure the clock to run the QSPI controller at 66 MHz (from parameter). What does this do to the processor clock?

I assume that the following (or something like it) is what happens next:

-The boot loader reads the first 4KB(??) from flash starting at location 0 into OCRAM memory starting at 0x3f00_0000(??). It then knows that the IVT will be at 0x3f00_0400.
-Executes any instructions in the DCD (set processor clock to 396 MHz and enable the DDR cntlr).
-Using the info in the BOOT data struct Table 7-54 (start and length), “length” bytes (the code image) are loaded into memory starting at location “start”. How does the boot loader know where the image is in flash memory?
-Boot loader “jumps” to the entry point “entry” given in the IVT.

Questions

How much of the above is actually correct?

When the QSPI clock is at 66 MHz, what speed is the A5 core running at? (I assume 396 MHz)

What location is the “boot stuff” copied into in OCRAM? If it is 0x3f00_0000, what happens when the app code is also loaded into the same space? How many bytes are read in?

How does the boot loader know where the app code is in flash memory?

Can the QSPI controller function properly in DDR mode during boot? I haven’t seen any of the examples use it. They all use SDR mode.

kef2 · 01-04-2019

Hi,

I was quite familiar with it a few years ago, but I can still give some hints.

Well, using non-XIP mode (copy everything to RAM) makes sense with big DDR RAM. But with the quite small on-chip SRAM I use XIP mode, because most of the initialization routines and non-time-critical code can run happily from QSPI, saving RAM space for data. Time-critical code, and also things like the code that writes data back to QSPI memory, is copied to RAM by the C startup routine. You may fill the DCD struct with DDR RAM initialization commands, but if you have doubts about how QSPI boot initializes the clocks, and even more doubts about how this may interfere with DDR RAM setup, then you should rethink which is better for you: XIP boot followed by clock reconfiguration for the best possible CPU and DDR RAM speeds, or non-XIP boot with non-optimal clocks. I don't know whether it's safe to reconfigure DDR without losing code already copied to DDR.

-The QSPI clock is configured to run at 18 MHz. What speed is the processor clock?

Well, in XIP mode it doesn't matter, I reconfigure clocks in my code.

-The QSPI pins are configured.
-Do basic QSPI read operation starting at flash location 0 (318 bytes) to get configuration parameters

1 KB (0x400) is read, containing QSPI configuration settings such as memory sizes, two-chip parallel or single-chip mode, CS hold/setup times, etc. In a parallel setup the 1 KB is read from chip A.

-Re-configure the QSPI controller per the parameters

Yes

-Re-configure the clock to run the QSPI controller at 66 MHz (from parameter). What does this do to the processor clock?

Again, it doesn't matter for XIP boot.

I assume that the following (or something like it) is what happens next:

-The boot loader reads the first 4KB(??) from flash starting at location 0 into OCRAM memory starting at 0x3f00_0000(??). It then knows that the IVT will be at 0x3f00_0400.

It's not clear whether it is copied to OCRAM or handled on the fly, but the DCD table should be processed next; otherwise how could you perform, for example, DDR RAM setup before copying data to it? BTW, you may try using the DCD to reconfigure the device clocks for an optimal setup!

-Executes any instructions in the DCD (set processor clock to 396 MHz and enable the DDR cntlr).

-Using the info in the BOOT data struct Table 7-54 (start and length), “length” bytes (the code image) are loaded into memory starting at location “start”. How does the boot loader know where the image is in flash memory?

-Boot loader “jumps” to the entry point “entry” given in the IVT.

Yes, something like this.
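For orientation, here is a minimal C sketch of the two structures discussed above. Only "start", "length" and "entry" are named in this thread; every other field name, and the idea that flash offset 0 maps to "start", are assumptions modelled on the IVT layout used by related NXP parts, so check them against Table 7-54 and the IVT table in the reference manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of the IVT from the start of flash, as quoted in the thread. */
#define IVT_FLASH_OFFSET 0x400u

/* Boot data: "start" and "length" come from the thread, "plugin" is assumed. */
struct boot_data {
    uint32_t start;   /* destination address of the image in RAM */
    uint32_t length;  /* number of bytes to copy */
    uint32_t plugin;  /* assumed: 0 for a normal image */
};

/* IVT: only "entry" comes from the thread, the other fields are assumed. */
struct ivt {
    uint32_t header;     /* assumed: tag/length/version word */
    uint32_t entry;      /* address the boot ROM jumps to */
    uint32_t reserved1;
    uint32_t dcd;        /* assumed: address of the DCD table */
    uint32_t boot_data;  /* assumed: address of struct boot_data */
    uint32_t self;       /* assumed: address of the IVT itself */
    uint32_t csf;        /* assumed: 0 when secure boot is unused */
    uint32_t reserved2;
};

static_assert(sizeof(struct boot_data) == 12, "boot data is three words");
static_assert(sizeof(struct ivt) == 32, "IVT is eight words");

/* Assumption: flash offset 0 is copied to bd->start, so the IVT follows at +0x400. */
static inline uint32_t ivt_ram_address(const struct boot_data *bd)
{
    return bd->start + IVT_FLASH_OFFSET;
}
```

With a hypothetical start of 0x3f000000 this gives 0x3f000400, which matches the guess in the original question.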

When the QSPI clock is at 66 MHz, what speed is the A5 core running at? (I assume 396 MHz)

I don't know, but it doesn't matter in XIP mode. Also, you may craft your DCD table to reconfigure it as you wish.

You may have problems debugging the DCD; try using the device clock monitor pins. Yes, you need to set them up from the same DCD table. Since I used XIP, my DCD has 0 commands, just the table size and DCD version.

What location is the “boot stuff” copied into in OCRAM? If it is 0x3f00_0000, what happens when the app code is also loaded into the same space? How many bytes are read in?

Since the datasheets don't specify reserved boot locations in OCRAM, you shouldn't worry about it. I'm sure all RAM is available for your purposes.

How does the boot loader know where the app code is in flash memory?

The source is, at the least, the base address of the QSPI memory. The destination is specified in the Boot data structure. (There is a small complication with the source image in parallel mode. The first kilobyte is read from device A, 0..0x3FF. The rest is read in parallel mode, which means device addresses 0x400..0x4FF correspond to destination offsets 0x??400..0x??5FF, i.e. twice as much data.)
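Taking the figures in this reply at face value, the parallel-mode address relationship can be sketched as below. This is only one reading of the example (device 0x400..0x4FF covering image offsets 0x400..0x5FF), not something confirmed by the reference manual, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* First 1 KB (configuration block) is read from device A only. */
#define CONFIG_BLOCK_SIZE 0x400u

/*
 * Image offset of the first byte fetched from per-device address dev_addr.
 * Past the first 1 KB each device address contributes two image bytes,
 * one from device A and one from device B.
 */
static uint32_t parallel_image_offset(uint32_t dev_addr)
{
    if (dev_addr < CONFIG_BLOCK_SIZE)
        return dev_addr;  /* single-device region, 1:1 */
    return CONFIG_BLOCK_SIZE + 2u * (dev_addr - CONFIG_BLOCK_SIZE);
}
```

Under this reading, device address 0x4FF supplies image bytes 0x5FE and 0x5FF, and device address 0x500 starts at image offset 0x600.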

Can the QSPI controller function properly in DDR mode during boot? I haven’t seen any of the examples use it. They all use SDR mode.

Yes, why not. You just need to set up the configuration struct accordingly. I tried to squeeze as much as possible out of my QSPI memory: it has quad pin mode and also supports DDR mode, but at a slower clock speed. 99 MHz SDR performed faster than the fastest available DDR mode.

Hope this helps

Edward


ogj · 02-05-2019

I took a closer look at the Vybrid QuadSPI section in the Datasheet (Rev 9 1/2018). The max frequencies given in Table 49 (SDR mode, 80 MHz) and Table 51 (DDR mode, 45 MHz) are for writes to the flash - not reads. Obviously writes are slower than reads. The read time for DDR mode is:

The Tck shown in this diagram is not the same frequency as the one shown in the write diagram. My SOM has 80 MHz devices so I think running at 80 MHz is not out of the question. Just have to slow the clock for writes.

kef2 · 02-06-2019

Good eye! But what about the address and command phases of an instruction? Those are still output/write, and the table title says "QuadSPI Output/Write timing (xDR mode)". I would love a better 1/Tck limit for reads. If DDR with learning works up to 108 MHz on the Vybrid Tower Board, I'd love 80 or even 66 MHz.

BTW I was testing it in parallel mode.

I also noticed that I was only using 2 dummy cycles in DDR x4 mode. I increased that to 4 and DDR x4 started working.

You should not guess these figures but check your memory datasheet. It is not easy to work out, though, even after reading and scrolling through the datasheet several times. For the Spansion S25FL128S the numbers of dummy cycles are specified in four tables:

Two tables are for "high performance" and two for "enhanced performance". Which pair applies depends on the part code (whether the part was ordered with the enhanced feature or not), which you can also find by reading the part information from the chip. LC in the tables is configurable: you may reprogram a register inside the memory chip(s) for a different LC setting and thus for more or fewer dummy cycles. But you can't pick just any LC setting; you must follow the frequency requirements in these tables. Also, keeping the DLP feature in mind, only settings with more than 4 cycles will work with DLP, so you need 5, 6, ... dummy cycles. It was not clear to me whether the S25s on the Tower board are enhanced high performance or not; fortunately the ED/EE commands have the same dummy cycles in both chip variants. The default LC=00 setting is already fine for DLP. Of course it is a wonder that DDR reading worked up to 108 MHz, while the memory tops out at 66.
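The dummy-cycle rule above can be written down as a small check. This is a sketch with a made-up helper name; it assumes, as stated elsewhere in this thread, that an 8-bit DLP takes 4 clocks in DDR mode and that the flash's total dummy count must exceed that for DLP to fit.

```c
#include <assert.h>

/* Clocks needed to shift out an 8-bit DLP in DDR mode, per the thread. */
#define DLP_CLOCKS 4

/*
 * Split the flash's total dummy cycles (from its latency-code table) into
 * plain dummy cycles followed by the data-learn cycles. Returns the number
 * of plain dummy cycles, or -1 if the setting leaves no room for the DLP.
 */
static int dummy_cycles_before_dlp(int total_dummy_cycles)
{
    assert(total_dummy_cycles >= 0);
    if (total_dummy_cycles <= DLP_CLOCKS)
        return -1;
    return total_dummy_cycles - DLP_CLOCKS;
}
```

For a total of 6 dummy cycles this yields 2 plain dummy cycles plus the 4 learn cycles, while a total of 4 is rejected.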

You said you used PLL1 PFD3 and /3/2/2 dividers. 528/3/2/2 = 44, so it looks like you are not using the PFD but bypassing it? Why not lower the divider and use the finer granularity of the PFD clock setting to reach 45 MHz? I used the second suggestion for QSPI in Table 6-5, Typical PFD Configuration: PLL3 PFD4. This is the pll3_pfd4 fraction setting:

ANADIG->PLL3_PFD = (ANADIG->PLL3_PFD & ~ANADIG_PLL3_PFD_PFD4_FRAC_MASK)
                 | ANADIG_PLL3_PFD_PFD4_FRAC(31);

Fpll3 = 480MHz.

Fpfd4 = Fpll3 * 18 / pfd4_frac; where pfd4_frac ranges from 12 to 35

Dividers 1/2/2 should be fine up to and even above 480/1/2/2 = 120 MHz. Fine tuning of the QSPI clock is done by changing PFD4_FRAC: PFD4_FRAC = 21, 20, 19 gives 102.9, 108, 113.6 MHz. With the higher PLL1 clock I would have slightly finer granularity and perhaps could reach an even faster DDR clock. Not really: 113.1 vs 113.6, still over 5 MHz to go from 108.
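The clock arithmetic in this post can be reproduced with a few lines of C. This is a sketch using only the formula quoted above (Fpfd = Fpll * 18 / frac, frac from 12 to 35); the function names are hypothetical, and the values are in kHz with integer truncation.

```c
#include <assert.h>

/* PFD output in kHz: Fpfd = Fpll * 18 / frac, with frac in 12..35. */
static long pfd_khz(long pll_khz, int frac)
{
    assert(frac >= 12 && frac <= 35);
    return pll_khz * 18 / frac;
}

/* QSPI clock in kHz after the three dividers, e.g. 1/2/2 or 3/2/2. */
static long qspi_khz(long pll_khz, int frac, int d1, int d2, int d3)
{
    assert(d1 > 0 && d2 > 0 && d3 > 0);
    return pfd_khz(pll_khz, frac) / (d1 * d2 * d3);
}
```

For example, qspi_khz(480000, 20, 1, 2, 2) gives 108000 (108 MHz). Note that frac = 18 makes the PFD output equal to the PLL frequency, so qspi_khz(528000, 18, 3, 2, 2) gives the same 44 MHz as bypassing the PFD.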

Edward

ogj · 02-05-2019

Thanks for the feedback. I dropped the speed to 44MHz (PLL1:PFD4 @ 528 MHz /3 /2 /2 = 44 MHz), and I also noticed that I was only using 2 dummy cycles in DDR x4 mode. I increased that to 4 and DDR x4 started working. I know I had this working at 66 MHz at one time but in changing so many things, I must have changed the number of dummy cycles. It's very tempting to run faster, but for now I'm going to leave it at 44 MHz. I am using 0x34 as the DLP value. Since I have two identical parts I might try parallel in the future. Thanks again.

ogj · 02-01-2019

I finally tracked this down: the SOM I am using (manufactured by Emcraft) can't reliably run at 66 MHz DDR x4. The best it can do (at 66 MHz) is x2 - even though the flash devices are rated at 80 MHz. I don't know whether this is a data transfer issue or what, but I don't have any more time to spend on it. I attached a couple of pictures of the data I'm seeing. Let me know what you think. Were you running the flash at 66 MHz or some other speed?

kef2 · 02-02-2019

Hi,

Well, the first thing users should do is read the specifications :-). vybridsec.pdf rev 9 specifies max QSPI 1/Tck = 80 MHz for SDR mode and max 1/Tck = 45 MHz for DDR mode. An almost 2x worse DDR clock spec kind of negates the usefulness of DDR mode on Vybrid.

Yes, your pictures look like the DDR sampling points are bad. Did you try changing the DDR sampling point settings? Perhaps that could help.

Edward

ogj · 02-02-2019

That explains a lot. Since I didn’t design the SOM, I didn’t pay enough attention to the specs. I thought I had seen 80 MHz for DDR mode and 132 MHz for SDR mode. Turns out that was for the flash. It didn’t dawn on me to check the processor specs. I’m using DLP when in DDR mode. I’m under the impression that DLP sets the sample point automatically. Do you know if that is true?

kef2 · 02-05-2019

Yes, DLP should help set the DDRSMP sample point automatically, but

1) Are you sure your memory chip is transmitting the DLP? S25FLxxx parts have a configurable latency code setting, and not all settings have enough dummy clocks to allow the DLP transfer. There's also the DLP pattern setting: are you sure it is something useful like 0x34 (it should toggle on clock edges as well as in the middle between two clock edges) and not zero or 0xFF?

2) Did you check the Vybrid QSPI status bit DLPFF? Once the QuadSPI module executes the DATA_LEARN sequence (what you show in your picture), it should set the DLPFF bit if there is a problem determining DDRSMP.

(Some) smaller QSPI memories don't have DDR instructions, so I decided to go for better availability: the code works with both bigger and smaller memories, so no DDR.
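The point about the pattern value in item 1 can be checked mechanically. The sketch below is a hypothetical sanity check, not part of any Vybrid or Spansion API: it simply counts how often adjacent bits of the DLP byte differ, on the assumption that a pattern with no bit changes (0x00 or 0xFF) gives the controller no edges to learn from.

```c
#include <assert.h>
#include <stdint.h>

/* Count value changes between adjacent bits of the pattern, MSB first. */
static int dlp_transitions(uint8_t pattern)
{
    int count = 0;
    for (int bit = 7; bit > 0; bit--) {
        int a = (pattern >> bit) & 1;
        int b = (pattern >> (bit - 1)) & 1;
        count += (a != b);
    }
    return count;
}

/* A constant pattern has no edges, so treat it as unusable for learning. */
static int dlp_is_usable(uint8_t pattern)
{
    return dlp_transitions(pattern) > 0;
}
```

The suggested 0x34 (binary 00110100) has four bit changes, some after a single bit and some after two identical bits, which fits the description of toggling both on clock edges and between them.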

Regards

Edward

kef2 · 02-05-2019

You've made me curious to try DATA_LEARN. I got the tower board reading QSPI in 4x DDR mode at 45 MHz. Then I enabled DATA_LEARN and raised the frequency to 108 MHz easily, with no data failures. QuadSPI-SR-DLPSMP changes by a few units up and down with frequency. With the DATA_LEARN feature disabled I can't go above 49 MHz without struggling with the DDR sampling settings. Nice feature, but the specifications still don't allow going above 45. Perhaps it is an error, I don't know.

With DATA_LEARN enabled I see read failures even at 45 MHz until the Vybrid and memory DLP patterns match. A DLP match is not enough, though. The DATA_LEARN number-of-pins argument must also match the real number of pins. This is a bit weird: the Spansion QSPI memory says it transfers the same DLP on all pins, so an 8-bit DLP should take the same 4 clocks to transfer in 1x, 2x and 4x pin modes. So perhaps Vybrid checks all pins or something, I don't know.

This is an excerpt from the LUT initialization. The #if 1 case is for DATA_LEARN disabled and the #else case for DATA_LEARN enabled. The Spansion Quad I/O read command (0xED/0xEE) with the default LC==0 latency code setting has the same number of dummy cycles (6) in both the HPLC and EHPLC tables. So you can see two LUT variants: 6 dummy cycles without learning, and 2 dummy + 4 data-learn cycles:

// SEQID 8 - Quad DDR read, 24-bit addresses
QSPI->LUT[32] = LUTi0(lCMD,        pOne,  0xED)
              | LUTi1(lADDR_DDR,   pFour, 24);
QSPI->LUT[33] = LUTi0(lMODE_DDR,   pFour, 0x55)  // <--- complementary nibbles = continue (ex. 0xA5)
#if 1   // DATA_LEARN disabled: 6 plain dummy cycles
              | LUTi1(lDUMMY,      pFour, 6);
QSPI->LUT[34] = LUTi0(lREAD_DDR,   pFour, 128)   // read 128 bytes    // 24013a80
              | LUTi1(lJMP_ON_CS,  pOne,  0);
QSPI->LUT[35] = 0;
#else   // DATA_LEARN enabled: 2 dummy cycles + DLP
              | LUTi1(lDUMMY,      pFour, 2);
QSPI->LUT[34] = LUTi0(lDATA_LEARN, pFour, 0x34)
              | LUTi1(lREAD_DDR,   pFour, 128);  // read 128 bytes
QSPI->LUT[35] = LUTi0(lJMP_ON_CS,  pOne,  0);
#endif
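The LUTi0/LUTi1 macros are not shown in the excerpt. A plausible definition, offered only as a sketch, packs each 16-bit LUT instruction as a 6-bit opcode, a 2-bit pad code and an 8-bit operand, with two instructions per 32-bit LUT register. The field layout and the four encodings below are assumptions; they are limited to the values that can be cross-checked against the hex comment 24013a80 in the excerpt.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of one 16-bit LUT instruction: opcode[15:10] pad[9:8] operand[7:0]. */
#define LUT_INSTR(op, pad, operand) \
    ((uint32_t)((((op) & 0x3Fu) << 10) | (((pad) & 0x3u) << 8) | ((operand) & 0xFFu)))

/* Two instructions per 32-bit LUT register: i0 in the low half, i1 in the high half. */
#define LUTi0(op, pad, operand) (LUT_INSTR(op, pad, operand))
#define LUTi1(op, pad, operand) (LUT_INSTR(op, pad, operand) << 16)

/* Assumed encodings, restricted to those the hex comment lets us check. */
enum { lJMP_ON_CS = 9, lREAD_DDR = 14 };
enum { pOne = 0, pFour = 2 };
```

With these definitions, LUTi0(lREAD_DDR, pFour, 128) | LUTi1(lJMP_ON_CS, pOne, 0) evaluates to 0x24003A80. The comment's 0x24013a80 would correspond to a JMP_ON_CS operand of 1 rather than 0, so it was probably copied from a slightly different variant of the sequence.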

Edward

ogj · 01-31-2019

Ran into an interesting phenomenon recently. As you know I am trying to write a "loader" program that gets loaded by the Vybrid ROM on reset. The loader program (executing from OCRAM) initializes the clocks and DDR controller and, using an optimized QSPI driver, copies my main program (sitting in QSPI flash) into DDR memory. This gets around using the DCD to set up the QSPI, DDR, or clocks.

In doing this, I've written a QSPI driver based on the one in u-boot. One of the problems I'm having is that it produces different results depending on whether I use it with my loader app (which is bare metal), or use it with my main app (which runs under MQX). I don't use the QSPI driver internal to MQX because of some problems it has. The biggest issue is that when I use my driver under MQX, and look at the data using IAR's examine memory function, it appears the way it should be. When I use the exact same driver in my bare metal app, the data is all messed up (nibbles out of order). If I halt my loader program and examine the flash memory (0x20000000+), the nibbles are out of order. When I halt my main app and examine the same area, it looks fine. I even checked the endian bit, but it's the same with both apps (little endian). I examined memory with J-Flash which is independent of code, and it shows the flash to be correct. There is no difference in the driver between the two apps. The only difference I can see is the memory copy function used in the driver. In the bare metal app, memory copy (as supplied by IAR) is done using load/store multiple instructions. Under MQX, memory copy (as supplied by Freescale) is done using the Neon coprocessor.

The IAR debugger has an examine memory function. The thing that's really crazy is that when I use this function to examine flash memory (0x20000000+) while running the two apps (loader and main apps), the results are different. Nothing is being reprogrammed (of course I have to initialize the QSPI controller, but I checked, and the registers are set up the same way in both apps).

I can program my loader program to manually manipulate the nibbles to make them correct, but that takes way too much time. Any ideas?

kef2 · 02-01-2019

Do you really see nibbles swapped and not bytes? Are you using QSPI DDR mode? If you had 4-pin mode + DDR, then perhaps a bad phase delay could produce a >>4 or <<4 shift, and a memory view that looks like swapped nibbles.

I don't know what else could swap nibbles. You can't get a nibble swap even in parallel QUAD mode; for that the Vybrid pin muxes would have to allow swapping pins QSPI_IOn_A with QSPI_IOn_B, which is not available.

Well, out of curiosity I tried all QSPI modes in the past: 1x, 2x, 4x, with and without DDR. All worked well. Bad things may of course happen if you fail to set up the QSPI lookup table (QSPI commands) properly, but I have no idea if and how that could lead to a nibble swap.

Ah, do you see these problems when debugging from the IDE or when booting from QSPI? If the problems appear only when booting, this may indicate that some part of your initialization assumes the reset default state, and something more needs to be reinitialized after the boot ROM. (The DS-5 debugger helped there. After QSPI was programmed, I unchecked the run until _main option, loaded only debug symbols, clicked reset in the debugger, set a HW breakpoint at the program entry point, and clicked continue. This let the boot ROM code execute and stop at my program entry point, which allowed me to figure out all the troubles.)
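To tell these cases apart without eyeballing a memory window, a small comparison against a known-good copy can help. The following is a hypothetical diagnostic, not taken from either poster's code: it classifies a buffer read through QSPI against reference data as identical, nibble-swapped within each byte, shifted by one nibble across the stream, or byte-swapped in 16-bit pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum mismatch { SAME, NIBBLES_SWAPPED, NIBBLE_SHIFTED, BYTES_SWAPPED, OTHER };

static enum mismatch classify(const uint8_t *got, const uint8_t *ref, size_t n)
{
    int same = 1, nib = 1;
    int shift = (n > 1);        /* needs a following byte to compare */
    int bytes = (n % 2 == 0);   /* needs complete 16-bit pairs */

    for (size_t i = 0; i < n; i++) {
        uint8_t r = ref[i];
        if (got[i] != r)
            same = 0;
        /* High and low nibble of each byte exchanged. */
        if (got[i] != (uint8_t)((r << 4) | (r >> 4)))
            nib = 0;
        /* Whole stream moved by 4 bits; the last byte cannot be checked. */
        if (i + 1 < n && got[i] != (uint8_t)((r << 4) | (ref[i + 1] >> 4)))
            shift = 0;
        /* Adjacent bytes exchanged within each 16-bit pair. */
        if (bytes && got[i] != ref[i ^ 1])
            bytes = 0;
    }
    if (same)  return SAME;
    if (nib)   return NIBBLES_SWAPPED;
    if (shift) return NIBBLE_SHIFTED;
    if (bytes) return BYTES_SWAPPED;
    return OTHER;
}
```

Only one shift direction is checked here; the opposite direction would need the byte preceding the buffer. Running it on a block read by both the bare-metal loader and the MQX app, with the J-Flash dump as reference, would show which kind of corruption is actually present.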

Edward
