Problems getting IMX6Q MMU/L1/L2 cache to work

MikeJones · ‎11-16-2013

I have been trying to get an SMP RTOS HAL working using the SDK. I looked at the primes example for how to enable the caches and MMU, but for some reason lookups only work on CPU 0 and return 0 on CPUs 1-3. I verified this by calling mmu_virtual_to_physical. I am testing it at address 0x0021bc430, which is where my startup code for the secondary CPUs gets the CPU number and then gets into trouble. Other addresses in DDR show the same behavior. Without the caches and MMU the HAL works fine, so there are no general memory problems.

I have read all the ARM docs and just don't see what could be wrong. The primes example works fine.

Can you explain the order of initialization of the L1 cache and MMU so I can understand the dependencies?

The ARM docs clearly say the caches must be invalidated before they are "used." They don't say invalidated before enabled, but I assumed that is what they mean. The primes example enables first, then invalidates. That seems dangerous if an instruction fetch happens in the process.

It is also not clear if the caches work when the MMU is disabled. Can you also explain any dependencies when using L1, L2, MMU in different combinations or by themselves?

Can you think of what might be missing that causes the secondary cores to translate to address 0?

One thing that makes me nervous is that the tables are in OCRAM, which is itself configured to be under MMU control. It seems like the MMU has to use itself to translate. I am guessing the MMU is designed to tolerate that.

One more curiosity, when I put the tables at the bottom of OCRAM, all kinds of weird stuff happened on CORE 0. Are there any rules on where to place the tables? There are registers in the init that tell the MMU where they are, but I saw in the docs that the standard layout reserves some memory at the bottom of OCRAM. So I wonder if the MMU has some limitations on where the tables are located.


MikeJones · ‎11-16-2013

I tried a couple more things and have some details on my code that might help. The main thing I did that was new was add the L2 cache with the patch on the previous thread.

Let's start with the code:

// CPU 0

disable_strict_align_check();

mmu_init();

arm_icache_enable();

arm_dcache_invalidate();

mmu_enable();

arm_dcache_enable();

_l2c310_cache_setup();

_l2c310_cache_invalidate();

_l2c310_cache_enable();

scu_enable();

// PER CPU (0-3)

disable_strict_align_check();

arm_branch_target_cache_invalidate();

arm_branch_prediction_enable();

arm_dcache_enable();

arm_dcache_invalidate();

arm_icache_enable();

arm_icache_invalidate();

scu_secure_invalidate(cpu, all_ways);

scu_join_smp();

scu_enable_maintenance_broadcast();

This code is the same as the ordering in the primes example but with the L2 cache enable.

This code will work on CPU 0 just fine.

There are two test cases with different symptoms:

Case 1

----------

Comment out the three L2 lines. In this case, the secondary CPUs will start up. Each one enters some assembly code that is part of my startup code, where it sets SPSR, fixes stacks, etc., and ends up in C code where it tries to get the CPU number. The code for getting the CPU number puts the right value in r0, but when the C call returns, the local variable is zero. Other variables are also 0. It is a little hard to be sure what is what because I don't completely know how the PEEDI JTAG debugger works. I assume it reads variables with a mechanism that does not bypass the caches, so what I see is what the code sees. Given some conditional statements interacting with memory that holds zeros, I believe that to be the case. I conclude that the instruction cache is working because I can step through code. It just gets in trouble with data.

Case 2

----------

Leave the L2 cache active. In this case the secondary CPUs do not start, because I never hit a breakpoint. I have different code for this than the standard SDK. Here is the code:

case 1:

HW_SRC_GPR3_WR((uint32_t) & start);

HW_SRC_GPR4_WR((uint32_t) common_cpu_entry);

if (HW_SRC_SCR.B.CORE1_ENABLE == 1)

HW_SRC_SCR.B.CORE1_RST = 1;

else

HW_SRC_SCR.B.CORE1_ENABLE = 1;

break;

What is different is that the code checks whether the core is already enabled and, if it is, resets it. The reason for this is that the PEEDI SMP debugger enables all four cores after reset and I can't change that behavior.
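The reset-or-enable decision described above can be exercised on a host machine with plain structs standing in for the registers. This is only a sketch: `src_scr_model`, `src_gpr_model` and `start_core1` are hypothetical names, the structs do not reflect the real SRC register layout, and real code would keep using the SDK's HW_SRC_* accessors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for the two SRC_SCR fields used above.
 * It does not model the real bit positions. */
struct src_scr_model {
    bool core1_enable;
    bool core1_rst;
};

/* Hypothetical stand-in for the two SRC general purpose registers
 * written before the core is started. */
struct src_gpr_model {
    uint32_t gpr3;
    uint32_t gpr4;
};

/* Start core 1. If something (for example a debugger) already enabled
 * the core, request a reset instead of enabling it a second time. */
static void start_core1(struct src_scr_model *scr, struct src_gpr_model *gpr,
                        uint32_t gpr3_value, uint32_t gpr4_value)
{
    assert(scr != NULL && gpr != NULL);

    /* Publish the start-up values first, then release or reset the core. */
    gpr->gpr3 = gpr3_value;
    gpr->gpr4 = gpr4_value;

    if (scr->core1_enable)
        scr->core1_rst = true;
    else
        scr->core1_enable = true;
}
```

The point of the model is only that both paths leave the core enabled, and that the start-up values are written before the core is released or reset.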

Working Case

--------------------

If all the code for the MMU and L1/L2 caches is commented out, startup of the other CPUs is fine and the application runs properly. I have worked without these for several weeks without any problems, so I trust the overall application.

My assembly startup is different from the SDK's. But at the end of the day it sets up the vector table, fixes SPSR, creates the stacks for all cores, etc. The main code runs in supervisor mode. I am not trying to run in user mode.

My DCD startup came from the SDK. I took the DCD data and put it in the PEEDI setup. I have used this with JLink and Mentor's Personal Probe, all without any issues. I assume that this code does not depend on anything in the DCD data. If it does, I would like to know what matters.

One last matter: in all three cases I am putting the MMU table in the same place in OCRAM where the SDK puts it. My stacks are in DDR, not in OCRAM.

MikeJones · ‎11-19-2013

I got a little farther by calling flush on the caches. This allows the secondary CPUs to start with L2 enabled. There is a barrier instruction in the dcache flush. It is not clear if I need one more after the L2 flush. Does anyone know?

Then when there is a reset for CPU 1-3, I added assembly code that invalidates the L1 icache and L1 dcache immediately after the reset. This is what I found that Linux code does on starting a secondary cpu. Assembly code does not use memory so it executes properly.

After that, C code executes on a secondary CPU, but it does not get far before it translates a virtual address to physical address 0. That was the original problem. So the only thing I fixed was allowing L2 to be enabled; the code then fails in the same place as before. Somehow, after the reset of cores 1-3, the MMU does not work (with L1/L2 enabled).

So the question is still what can prevent address translation on the secondary CPUs?

One conclusion that can probably be made is that the L2 cache is working, or a flush would not be required. Without the flush, the entry point was probably not yet in place in physical memory.

This kind of hints to the MMU needing some kind of initialization. I saw one comment on the internet suggesting the A9 had one MMU per core. Is this the case?

CHANGE WITH FLUSH

-----------------------------------

void hal_cpu_start_secondary(cyg_uint8 coreNumber, cpu_entry_point_t entryPoint, void * arg)

{

...

s_core_info[coreNumber].entry = entryPoint;

s_core_info[coreNumber].arg = arg;

hal_dcache_flush();

l2c310_cache_flush();

switch (coreNumber)

{

case 1:

HW_SRC_GPR3_WR((uint32_t) & start);

HW_SRC_GPR4_WR((uint32_t) common_cpu_entry);

if (HW_SRC_SCR.B.CORE1_ENABLE == 1)

HW_SRC_SCR.B.CORE1_RST = 1

AnsonHuang · ‎11-19-2013

Hi, Mike

I can NOT tell what is wrong with your code, as I do NOT have an environment to debug it, but I would like to share what I know in answer to your questions.

Can you explain the order of initialization of the L1 cache and MMU so I can understand the dependencies?

[Anson] The L1 cache consists of an I cache and a D cache. The I cache can work without the MMU enabled, so you can invalidate the L1 I cache and then enable it at any time. The L1 D cache can only work when the MMU is enabled, so the process should be: create the MMU table, enable the MMU, invalidate the D cache, then enable the D cache. You can refer to the U-Boot code (with MMU and D cache enabled) or the resume code of the v3.0.35 Linux kernel: arch/arm/mach-mx6/mx6_suspend.S. The safest way is to invalidate both the I cache and the D cache whenever you enable them. The L2 cache is a unified cache; you can only enable it after the L1 D cache is enabled.
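The dependencies in this answer can be written down as a small host-side checker. This is a sketch only: the step names and `init_order_is_valid` are hypothetical, and the prerequisite table encodes just the rules stated here (MMU table before MMU enable, MMU enable and D cache invalidate before D cache enable, L1 D cache enable before L2 enable), not everything the hardware requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum init_step {
    STEP_MMU_TABLE_CREATE,
    STEP_MMU_ENABLE,
    STEP_DCACHE_INVALIDATE,
    STEP_DCACHE_ENABLE,
    STEP_L2_ENABLE,
    STEP_COUNT
};

/* Bit mask of steps that must already be done before each step may run. */
static const unsigned step_prereq[STEP_COUNT] = {
    [STEP_MMU_TABLE_CREATE]  = 0u,
    [STEP_MMU_ENABLE]        = 1u << STEP_MMU_TABLE_CREATE,
    [STEP_DCACHE_INVALIDATE] = 0u,
    [STEP_DCACHE_ENABLE]     = (1u << STEP_MMU_ENABLE) |
                               (1u << STEP_DCACHE_INVALIDATE),
    [STEP_L2_ENABLE]         = 1u << STEP_DCACHE_ENABLE,
};

/* Returns true if every step in the sequence has its prerequisites
 * satisfied by earlier steps in the same sequence. */
static bool init_order_is_valid(const enum init_step *steps, size_t n)
{
    unsigned done = 0u;

    for (size_t i = 0; i < n; i++) {
        unsigned s = (unsigned)steps[i];
        assert(s < STEP_COUNT);
        if ((step_prereq[s] & ~done) != 0u)
            return false;          /* a prerequisite has not run yet */
        done |= 1u << s;
    }
    return true;
}
```

Feeding it the sequence table create, MMU enable, D cache invalidate, D cache enable, L2 enable returns true; enabling the D cache before the MMU returns false.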

Can you think of what might be missing that causes the secondary cores to translate to address 0?

[Anson] Each core has its own MMU and cache, so you need to perform the same actions on every core, and you also need to enable the SCU, which maintains L1 cache coherency among the cores.

This kind of hints to the MMU needing some kind of initialization. I saw one comment on the internet suggesting the A9 had one MMU per core. Is this the case?

[Anson] Yes, every core has its own MMU and L1 cache. You can refer to our BSP's flow for booting the secondary cores. When we reset/enable a secondary core, we pass a physical address as its entry point, which means it boots with the MMU disabled; once it reaches the secondary kernel entry, it initializes its own MMU, caches, etc.

MikeJones · ‎11-19-2013

Yongcai,

I have a question on a detail. You said that you can only enable the L2 cache after the L1 D cache is enabled.

If you are starting multiple cores on an iMX6Q, does this mean the L2 cache should be enabled after all 4 cores have their L1 D Caches enabled?

The Linux resume code seems to configure the L2 cache and enable it on each core. That does not make sense to me because the L2 cache is shared. However, I might not completely understand its code, so I thought I better ask.

I created the following code for enabling cores 1-3 after core 0 is enabled, meaning both L1 caches are enabled, the MMU is set up, and the L2 cache is enabled on core 0. But even this code results in failure of the MMU to translate addresses. Note that this code does not enable the L2 cache. It was enabled when core 0 was set up. Furthermore, there are no problems with virtual to physical translation on core 0.

Perhaps you can look at the order of the actions in the code and see if you can spot something that might be wrong. I based this code on the Linux code as best as I can understand what it does.

.macro _setup2

nop

// Disable ints and turn Supervisor mode

mov r0, #(CPSR_IRQ_DISABLE|CPSR_FIQ_DISABLE|CPSR_SUPERVISOR_MODE)

msr cpsr, r0

// Prepare SPSR_SVC for work

mov r0, #(CPSR_IRQ_DISABLE|CPSR_FIQ_DISABLE|CPSR_SUPERVISOR_MODE)

msr spsr_cxsf, r0

mrs r0, spsr

// Invalidate L1 I-cache

mov r1, #0x0

mcr p15, 0, r1, c7, c5, 0 @ Invalidate I-Cache

mcr p15, 0, r1, c7, c5, 6 @ Invalidate Branch Predictor

mov r1, #0x1800

mcr p15, 0, r1, c1, c0, 0 @ Enable I-Cache and Branch Predictor

isb

// Invalidate L1 D-cache

mov r0, #0

mcr p15, 2, r0, c0, c0, 0

mrc p15, 1, r0, c0, c0, 0

ldr r1, =0x7fff

and r2, r1, r0, lsr #13

ldr r1, =0x3ff

and r3, r1, r0, lsr #3 @ NumWays - 1

add r2, r2, #1 @ NumSets

and r0, r0, #0x7

add r0, r0, #4 @ SetShift

clz r1, r3 @ WayShift

add r4, r3, #1 @ NumWays

10: sub r2, r2, #1 @ NumSets--

mov r3, r4 @ Temp = NumWays

20: subs r3, r3, #1 @ Temp--

mov r5, r3, lsl r1

mov r6, r2, lsl r0

orr r5, r5, r6 @ Reg = (Temp<<WayShift)|(NumSets<<SetShift)

mcr p15, 0, r5, c7, c6, 2

bgt 20b

cmp r2, #0

bgt 10b

dsb

// Set MMU table address

ldr r0, =__mmu_tables_start

mcr p15, 0, r0, c2, c0, 0 @ TTBR0

// Set Client mode for all Domains

ldr r0, =0x55555555

mcr p15, 0, r0, c3, c0, 0 @ DACR

// Invalidate TLB

mov r0, #0

mcr p15, 0, r0, c7, c5, 4 @ Flush prefetch buffer

mcr p15, 0, r0, c8, c5, 0 @ Invalidate ITLB

mcr p15, 0, r0, c8, c6, 0 @ Invalidate DTLB

// Enable MMU

mrc p15, 0, r0, c1, c0, 0

orr r0, r0, #1

mcr p15, 0, r0, c1, c0, 0

isb

dsb

// Enable I and D Caches

mrc p15, 0, r0, c1, c0, 0 @ SCTLR

orr r0, r0, #0x1000

orr r0, r0, #0x0004

mcr p15, 0, r0, c1, c0, 0 @ SCTLR

dsb

isb

.endm
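The D-cache invalidate loop in the macro above builds one set/way operand per cache line from the fields of CCSIDR. A host-side C sketch of the same arithmetic may make the register usage easier to follow. The helper names are hypothetical, the example geometry used below (256 sets, 4 ways, 32-byte lines) is an assumption about the L1 D cache, and the cache level field is left at 0 (L1) as in the assembly.

```c
#include <assert.h>
#include <stdint.h>

struct cache_geometry {
    uint32_t num_sets;
    uint32_t num_ways;
    uint32_t set_shift;   /* log2(line size in bytes)            */
    uint32_t way_shift;   /* CLZ(num_ways - 1), as in the macro  */
};

/* Count leading zeros; returns 32 for 0, like the ARM CLZ instruction. */
static uint32_t clz32(uint32_t x)
{
    uint32_t n = 0;

    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Decode the CCSIDR fields the same way the assembly does. */
static struct cache_geometry decode_ccsidr(uint32_t ccsidr)
{
    struct cache_geometry g;

    g.num_sets  = ((ccsidr >> 13) & 0x7fffu) + 1;  /* NumSets       */
    g.num_ways  = ((ccsidr >> 3) & 0x3ffu) + 1;    /* Associativity */
    g.set_shift = (ccsidr & 0x7u) + 4;             /* LineSize + 4  */
    g.way_shift = clz32(g.num_ways - 1);
    return g;
}

/* Operand for an invalidate-by-set/way operation on cache level 0. */
static uint32_t set_way_operand(const struct cache_geometry *g,
                                uint32_t set, uint32_t way)
{
    assert(set < g->num_sets && way < g->num_ways);
    /* A shift by 32 is undefined in C, so a direct-mapped cache is
     * handled explicitly. */
    uint32_t way_bits = (g->way_shift < 32) ? (way << g->way_shift) : 0u;
    return way_bits | (set << g->set_shift);
}
```

With the assumed geometry, the loop would issue 256 x 4 operations, one for every set and way combination.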

AnsonHuang · ‎11-19-2013

Hi, Mike

If you are starting multiple cores on an iMX6Q, does this mean the L2 cache should be enabled after all 4 cores have their L1 D Caches enabled?

[Anson] No, L2 cache can be enabled after core0's L1 cache enabled.

The Linux resume code seems to configure the L2 cache and enable it on each core.

[Anson] The resume code only executed on core0, kernel will start up secondary cores later.

For your case, you had better look through the Linux kernel code. I have NOT run SMP on multiple cores without Linux, so I cannot say for sure; my understanding is that as long as you set up the MMU table and caches for every core, they should work OK.

MikeJones · ‎11-20-2013

Yongcai,

There was a subtle but important difference in the Linux code. When a secondary core is started, SMP is joined very early in the process, and the snoop control is also enabled very early. With this change, my cores now all run with all the caches and MMU.
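As a sketch of what "joining SMP" and enabling the broadcast amount to on a Cortex-A9, the two Auxiliary Control Register bits involved can be named in C. The macro and function names are hypothetical; the value computed here would still have to be written to ACTLR with an `mcr` instruction on the target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cortex-A9 Auxiliary Control Register (ACTLR) bits. */
#define ACTLR_FW   (1u << 0)   /* broadcast cache/TLB maintenance operations */
#define ACTLR_SMP  (1u << 6)   /* core takes part in SMP coherency           */

/* The combined mask is the same constant as in the assembly further down. */
_Static_assert((ACTLR_SMP | ACTLR_FW) == 0x41u, "unexpected ACTLR mask");

/* Value to write back after reading ACTLR: other bits are preserved. */
static uint32_t actlr_join_smp(uint32_t actlr)
{
    return actlr | ACTLR_SMP | ACTLR_FW;
}

/* True if the core has both coherency-related bits set. */
static bool actlr_is_smp(uint32_t actlr)
{
    return (actlr & (ACTLR_SMP | ACTLR_FW)) == (ACTLR_SMP | ACTLR_FW);
}
```

Doing the read-modify-write rather than writing a constant keeps whatever other ACTLR bits the boot ROM or earlier code has set.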

I will document it here for the next person that attempts to make this work. It will also help someone add an example for the SDK.

First, the code used to startup CPU 0:

hal_disable_strict_align_check();

hal_branch_prediction_disable();

hal_icache_disable();

hal_dcache_disable();

hal_icache_invalidate();

hal_dcache_invalidate();

hal_icache_enable();

hal_mmu_init();

l2c310_cache_setup();

hal_mmu_enable();

hal_dcache_invalidate();

hal_dcache_enable();

l2c310_cache_invalidate();

l2c310_cache_enable();

This code works because the MMU is using the 1:1 mapping in the SDK. At no point will things get confused. In the Linux code, it is a little different because after the MMU is setup and enabled, execution has to change to a new location that I think is being mapped from virtual to physical. The SDK is all 1:1 mapping, which is fine because you don't have to have relocatable programs.
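To illustrate what a 1:1 mapping means at the table level, here is a host-side sketch of a first-level table built from 1 MB section descriptors, plus a software walk of it. This is not the SDK's mmu_init: the names are hypothetical, only the descriptor type and access permission bits are set, and a real table would also choose memory type and cacheability per region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L1_ENTRIES      4096u          /* 4 GB address space / 1 MB sections */
#define SECTION_SHIFT   20u
#define SECTION_TYPE    0x2u           /* bits[1:0] = 0b10: section          */
#define SECTION_AP_RW   (0x3u << 10)   /* AP[1:0] = 0b11: read/write access  */

/* Fill the table so that every virtual section maps to the physical
 * section with the same number. */
static void build_flat_map(uint32_t *table)
{
    assert(table != NULL);
    for (uint32_t i = 0; i < L1_ENTRIES; i++)
        table[i] = (i << SECTION_SHIFT) | SECTION_AP_RW | SECTION_TYPE;
}

/* Software walk handling section descriptors only. Returns 1 and stores
 * the physical address on success, 0 if the descriptor is not a section
 * (for example an all-zero fault entry). */
static int translate(const uint32_t *table, uint32_t va, uint32_t *pa)
{
    assert(table != NULL && pa != NULL);
    uint32_t desc = table[va >> SECTION_SHIFT];

    if ((desc & 0x3u) != SECTION_TYPE)
        return 0;
    *pa = (desc & 0xFFF00000u) | (va & 0x000FFFFFu);
    return 1;
}
```

With a flat map every lookup returns the address it was given, which is why code does not have to jump to a new location after the MMU is switched on.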

Don't worry about hal_ in front of the calls. They are more or less the same functions as the SDK.

Now for starting the secondary CPUs:

nop

mov r0, #(CPSR_IRQ_DISABLE|CPSR_FIQ_DISABLE|CPSR_SUPERVISOR_MODE)

msr cpsr, r0

mov r0, #(CPSR_IRQ_DISABLE|CPSR_FIQ_DISABLE|CPSR_SUPERVISOR_MODE)

msr spsr_cxsf, r0

mrs r0, spsr

// Join SMP and Enable SCU broadcast.

mrc p15, 0, r0, c1, c0, 1 @ Read ACTLR

orr r0, r0, #0x041 @ Set bit 6 (SMP) and bit 0 (FW)

mcr p15, 0, r0, c1, c0, 1 @ Write ACTLR

// Invalidate L1 I-cache

mov r1, #0x0

mcr p15, 0, r1, c7, c5, 0 @ Invalidate I-Cache

mcr p15, 0, r1, c7, c5, 6 @ Invalidate Branch Predictor

mov r1, #0x1800

mcr p15, 0, r1, c1, c0, 0 @ Enable I-Cache and Branch Predictor

isb

// Invalidate L1 D-cache

mov r0, #0

mcr p15, 2, r0, c0, c0, 0

mrc p15, 1, r0, c0, c0, 0

ldr r1, =0x7fff

and r2, r1, r0, lsr #13

ldr r1, =0x3ff

and r3, r1, r0, lsr #3 @ NumWays - 1

add r2, r2, #1 @ NumSets

and r0, r0, #0x7

add r0, r0, #4 @ SetShift

clz r1, r3 @ WayShift

add r4, r3, #1 @ NumWays

10: sub r2, r2, #1 @ NumSets--

mov r3, r4 @ Temp = NumWays

20: subs r3, r3, #1 @ Temp--

mov r5, r3, lsl r1

mov r6, r2, lsl r0

orr r5, r5, r6 @ Reg = (Temp<<WayShift)|(NumSets<<SetShift)

mcr p15, 0, r5, c7, c6, 2

bgt 20b

cmp r2, #0

bgt 10b

dsb

// Set MMU table address

ldr r0, =__mmu_tables_start

mcr p15, 0, r0, c2, c0, 0 @ TTBR0

// Set Client mode for all Domains

ldr r0, =0x55555555

mcr p15, 0, r0, c3, c0, 0 @ DACR

// Invalidate TLB

mov r0, #1

mcr p15, 0, r0, c8, c7, 0 @ TLBIALL - Invalidate entire unified TLB

dsb

// Enable MMU

mrc p15, 0, r0, c1, c0, 0

orr r0, r0, #1

mcr p15, 0, r0, c1, c0, 0

isb

dsb

// Enable I and D Caches

mrc p15, 0, r0, c1, c0, 0 @ SCTLR

orr r0, r0, #0x1000

orr r0, r0, #0x0004

mcr p15, 0, r0, c1, c0, 0 @ SCTLR

dsb

isb

There are a few things to consider here. The first couple of lines fix the SPSR and CPSR. SPSR is not initialized when the core is started. If you don't fix SPSR, you will spend days and days going crazy. In my case, CPU 3 always failed to start, but sometimes CPU 2 would fail, and CPU 1 always worked.
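For reference, the value loaded into CPSR and SPSR at the top of the listing can be spelled out in C. The macro names come from the assembly above, but their definitions here are an assumption based on the ARM program status register layout (mode in bits [4:0], F in bit 6, I in bit 7), not a copy of the actual header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPSR_MODE_MASK        0x1fu
#define CPSR_SUPERVISOR_MODE  0x13u        /* SVC mode                */
#define CPSR_FIQ_DISABLE      (1u << 6)    /* F bit: mask FIQ         */
#define CPSR_IRQ_DISABLE      (1u << 7)    /* I bit: mask IRQ         */

/* Status word written to both CPSR and SPSR on a freshly started core. */
static uint32_t secondary_boot_psr(void)
{
    return CPSR_IRQ_DISABLE | CPSR_FIQ_DISABLE | CPSR_SUPERVISOR_MODE;
}

/* True if a status word selects supervisor mode with IRQ and FIQ masked. */
static bool psr_is_svc_masked(uint32_t psr)
{
    return (psr & CPSR_MODE_MASK) == CPSR_SUPERVISOR_MODE &&
           (psr & CPSR_IRQ_DISABLE) != 0u &&
           (psr & CPSR_FIQ_DISABLE) != 0u;
}
```

Under these assumed definitions the constant works out to 0xD3.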

Second, I was concerned about CPUs 1-3 running code from physical memory during their setup while CPU 0 was already running with the MMU and caches. So I wrote this code in assembly. I may go back and write the CPU 0 code in assembly, or I may try this code in C. The main thing is that this code works, so it documents the order in which things have to happen.

I have not thoroughly tested this, but I have run my app a bunch of times and it works just like it did without any MMU and caching. If I find any problems, I will post what I find.
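For anyone decoding the constants in the listing, the SCTLR and DACR values can be rebuilt from named bits. This is a sketch with hypothetical macro names; the bit positions are those of the ARMv7-A System Control Register and Domain Access Control Register.

```c
#include <assert.h>
#include <stdint.h>

/* System Control Register (SCTLR) bits used in the listing. */
#define SCTLR_M  (1u << 0)    /* MMU enable                 */
#define SCTLR_C  (1u << 2)    /* data cache enable          */
#define SCTLR_Z  (1u << 11)   /* branch prediction enable   */
#define SCTLR_I  (1u << 12)   /* instruction cache enable   */

/* Domain access modes, two bits per domain in DACR. */
#define DOMAIN_NO_ACCESS  0x0u
#define DOMAIN_CLIENT     0x1u   /* accesses checked against table permissions */
#define DOMAIN_MANAGER    0x3u   /* permissions not checked                    */

/* DACR value that puts all 16 domains in the same mode. */
static uint32_t dacr_all_domains(uint32_t mode)
{
    uint32_t value = 0u;

    assert(mode <= 0x3u);
    for (unsigned domain = 0; domain < 16; domain++)
        value |= mode << (2u * domain);
    return value;
}
```

Here 0x1800 is SCTLR_I | SCTLR_Z, 0x1000 and 0x0004 are SCTLR_I and SCTLR_C, and client mode for every domain gives 0x55555555. Note that the first write in the listing stores 0x1800 directly, while the later writes are read-modify-write.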

Yongcai, thanks for the help.

AnsonHuang · ‎11-20-2013

Hi, Mike

Yes, SMP mode must be set before enabling a secondary core's cache, because the SCU needs to maintain coherency among all the cores' caches. Otherwise memory may be corrupted by different operations on different cores' caches. I am glad that you have solved this issue, thanks.

i.MX6Quad