Yongcai,
There was a subtle but important difference in the Linux code. When a secondary core starts, it joins SMP very early in the process, and snoop control is also enabled very early. With this change, all of my cores now run with the MMU and all caches enabled.
I will document it here for the next person that attempts to make this work. It will also help someone add an example for the SDK.
First, the code used to start up CPU 0:
hal_disable_strict_align_check();
hal_branch_prediction_disable();
hal_icache_disable();
hal_dcache_disable();
hal_icache_invalidate();
hal_dcache_invalidate();
hal_icache_enable();
hal_mmu_init();
l2c310_cache_setup();
hal_mmu_enable();
hal_dcache_invalidate();
hal_dcache_enable();
l2c310_cache_invalidate();
l2c310_cache_enable();
This code works because the SDK's MMU tables use a 1:1 (virtual equals physical) mapping, so enabling the MMU never changes the address the code is executing from. At no point will things get confused. The Linux code is a little different: after the MMU is set up and enabled, execution has to change to a new location that, I think, is mapped from virtual to physical. The SDK's 1:1 mapping is fine because the programs don't have to be relocatable.
Don't worry about the hal_ prefix on the calls. They are more or less the same functions as in the SDK.
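To make the 1:1 mapping concrete, here is a minimal sketch of a flat first-level table in the ARMv7-A short-descriptor format, using 1 MB sections. It is not the SDK's table code: build_flat_table, translate, and the macro names are hypothetical, and the memory attributes (TEX, C, B, S) are left to the caller because the post does not say which attributes the SDK uses.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-A short-descriptor first-level "section" entry fields (1 MB each). */
#define SECTION_TYPE      0x2u                  /* bits[1:0] = 0b10: section    */
#define SECTION_AP_RW     (0x3u << 10)          /* AP[1:0] = 0b11: read/write   */
#define SECTION_DOMAIN(d) ((uint32_t)(d) << 5)  /* domain number, bits[8:5]     */
#define SECTION_ATTR_MASK 0x0001700Cu           /* S, TEX[2:0], C and B bits    */
#define SECTION_SHIFT     20u                   /* 1 MB per section             */
#define NUM_SECTIONS      4096u                 /* 4096 x 1 MB covers 4 GB      */

/* Hypothetical helper: fill a first-level table so that VA == PA everywhere.
 * On the target the table must be 16 KB aligned before its address is
 * written to TTBR0 (with TTBCR.N == 0). */
static void build_flat_table(uint32_t table[NUM_SECTIONS], uint32_t mem_attrs)
{
    /* Only the cacheability/shareability bits may be supplied by the caller. */
    assert((mem_attrs & ~SECTION_ATTR_MASK) == 0u);

    for (uint32_t i = 0; i < NUM_SECTIONS; i++) {
        table[i] = (i << SECTION_SHIFT)   /* physical base == virtual base */
                 | mem_attrs
                 | SECTION_AP_RW
                 | SECTION_DOMAIN(0)
                 | SECTION_TYPE;
    }
}

/* Host-side model of the section lookup the MMU performs on a table walk. */
static uint32_t translate(const uint32_t table[NUM_SECTIONS], uint32_t va)
{
    uint32_t entry = table[va >> SECTION_SHIFT];
    assert((entry & 0x3u) == SECTION_TYPE);
    return (entry & 0xFFF00000u) | (va & 0x000FFFFFu);
}
```

With a table like this, translate(table, va) returns va for every address, which is why the program counter stays valid across the instruction that turns the MMU on. A real table would pass device attributes for peripheral regions and cacheable attributes for RAM rather than one value for all 4096 entries.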
Now for starting the secondary CPUs:
nop
mov r0, #(CPSR_IRQ_DISABLE|CPSR_FIQ_DISABLE|CPSR_SUPERVISOR_MODE)
msr cpsr, r0
mov r0, #(CPSR_IRQ_DISABLE|CPSR_FIQ_DISABLE|CPSR_SUPERVISOR_MODE)
msr spsr_cxsf, r0
mrs r0, spsr
// Join SMP and Enable SCU broadcast.
mrc p15, 0, r0, c1, c0, 1 @ Read ACTLR
orr r0, r0, #0x041 @ Set bit 6 (SMP) and bit 0 (FW)
mcr p15, 0, r0, c1, c0, 1 @ Write ACTLR
// Invalidate L1 I-cache
mov r1, #0x0
mcr p15, 0, r1, c7, c5, 0 @ Invalidate I-Cache
mcr p15, 0, r1, c7, c5, 6 @ Invalidate Branch Predictor
mov r1, #0x1800
mcr p15, 0, r1, c1, c0, 0 @ Enable I-Cache and Branch Predictor
isb
// Invalidate L1 D-cache
mov r0, #0
mcr p15, 2, r0, c0, c0, 0 @ Write CSSELR: select the L1 data cache
mrc p15, 1, r0, c0, c0, 0 @ Read CCSIDR for the selected cache
ldr r1, =0x7fff
and r2, r1, r0, lsr #13
ldr r1, =0x3ff
and r3, r1, r0, lsr #3 @ NumWays - 1
add r2, r2, #1 @ NumSets
and r0, r0, #0x7
add r0, r0, #4 @ SetShift
clz r1, r3 @ WayShift
add r4, r3, #1 @ NumWays
10: sub r2, r2, #1 @ NumSets--
mov r3, r4 @ Temp = NumWays
20: subs r3, r3, #1 @ Temp--
mov r5, r3, lsl r1
mov r6, r2, lsl r0
orr r5, r5, r6 @ Reg = (Temp<<WayShift)|(NumSets<<SetShift)
mcr p15, 0, r5, c7, c6, 2 @ DCISW - Invalidate data cache line by set/way
bgt 20b
cmp r2, #0
bgt 10b
dsb
// Set MMU table address
ldr r0, =__mmu_tables_start
mcr p15, 0, r0, c2, c0, 0 @ TTBR0
// Set Client mode for all Domains
ldr r0, =0x55555555
mcr p15, 0, r0, c3, c0, 0 @ DACR
// Invalidate TLB
mov r0, #1
mcr p15, 0, r0, c8, c7, 0 @ TLBIALL - Invalidate entire unified TLB
dsb
// Enable MMU
mrc p15, 0, r0, c1, c0, 0
orr r0, r0, #1
mcr p15, 0, r0, c1, c0, 0
isb
dsb
// Enable I and D Caches
mrc p15, 0, r0, c1, c0, 0 @ SCTLR
orr r0, r0, #0x1000
orr r0, r0, #0x0004
mcr p15, 0, r0, c1, c0, 0 @ SCTLR
dsb
isb
There are a few things to consider here. The first couple of lines fix the SPSR and CPSR. SPSR is not initialized when the core is started. If you don't fix SPSR, you will spend days and days going crazy. In my case, CPU 3 always failed to start, but sometimes CPU 2 would fail, and CPU 1 always worked.
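As a small illustration of what those first lines write, the sketch below composes the same value in C. The macro names come from the listing above, but their values are not shown in the post; the ones here are the architectural ARMv7-A bit positions and are assumed to match. boot_psr and psr_mode_is_valid are hypothetical helpers, and the mode list assumes a core without the Virtualization Extensions (so no Hyp mode).

```c
#include <stdint.h>

/* Architectural CPSR/SPSR fields on ARMv7-A (assumed to match the
 * definitions behind the macro names used in the assembly listing). */
#define CPSR_FIQ_DISABLE     0x40u  /* F bit, bit 6              */
#define CPSR_IRQ_DISABLE     0x80u  /* I bit, bit 7              */
#define CPSR_SUPERVISOR_MODE 0x13u  /* M[4:0] = 0b10011 (SVC)    */
#define CPSR_MODE_MASK       0x1Fu

/* Value loaded into both CPSR and SPSR at the top of the secondary entry. */
static uint32_t boot_psr(void)
{
    return CPSR_IRQ_DISABLE | CPSR_FIQ_DISABLE | CPSR_SUPERVISOR_MODE;
}

/* Returns 1 if the mode field of a PSR value names a defined processor mode.
 * Hyp mode (0x1A) is deliberately left out; see the assumption above. */
static int psr_mode_is_valid(uint32_t psr)
{
    switch (psr & CPSR_MODE_MASK) {
    case 0x10u: /* User       */
    case 0x11u: /* FIQ        */
    case 0x12u: /* IRQ        */
    case 0x13u: /* Supervisor */
    case 0x16u: /* Monitor    */
    case 0x17u: /* Abort      */
    case 0x1Bu: /* Undefined  */
    case 0x1Fu: /* System     */
        return 1;
    default:
        return 0;
    }
}
```

boot_psr() evaluates to 0xD3: Supervisor mode with IRQ and FIQ masked. The SPSR has no defined value after reset, so it may hold a mode field that psr_mode_is_valid would reject; writing 0xD3 to it replaces that with a known state.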
Second, I was concerned about CPUs 1-3 running code from physical memory during their setup while CPU 0 was already running with the MMU and caches enabled. So I wrote this code in assembly. I may go back and write the CPU 0 code in assembly, or I may try this code in C. The main thing is that this code works, so it documents the order in which things have to happen.
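As a starting point for a C version, the set/way arithmetic from the L1 D-cache invalidate loop can be sketched as below. This only models the computation on a host: on the target, the CCSIDR value would come from the coprocessor read and each operand would be written with the c7, c6, 2 operation, neither of which is shown here. The struct, the function names, and the counting callback are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct cache_geom {
    uint32_t num_sets;
    uint32_t num_ways;
    uint32_t set_shift;  /* log2(line size in bytes)              */
    uint32_t way_shift;  /* clz(num_ways - 1), as in the assembly */
};

/* Portable count-leading-zeros, standing in for the ARM clz instruction. */
static uint32_t clz32(uint32_t x)
{
    uint32_t n = 0;
    if (x == 0u)
        return 32u;
    while ((x & 0x80000000u) == 0u) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Decode the CCSIDR fields the same way the assembly does. */
static struct cache_geom decode_ccsidr(uint32_t ccsidr)
{
    struct cache_geom g;
    g.num_sets  = ((ccsidr >> 13) & 0x7FFFu) + 1u;
    g.num_ways  = ((ccsidr >> 3) & 0x3FFu) + 1u;
    g.set_shift = (ccsidr & 0x7u) + 4u;
    g.way_shift = clz32(g.num_ways - 1u);
    return g;
}

/* Operand for a set/way operation on the L1 cache (level field left at 0). */
static uint32_t setway_operand(const struct cache_geom *g,
                               uint32_t set, uint32_t way)
{
    assert(set < g->num_sets && way < g->num_ways);
    /* A direct-mapped cache gives way_shift == 32; avoid an undefined shift. */
    uint32_t way_bits = (g->way_shift < 32u) ? (way << g->way_shift) : 0u;
    return way_bits | (set << g->set_shift);
}

typedef void (*setway_fn)(uint32_t operand, void *ctx);

/* Visit every set/way pair in the same descending order as the assembly. */
static void walk_setway(const struct cache_geom *g, setway_fn fn, void *ctx)
{
    for (uint32_t set = g->num_sets; set-- > 0u; )
        for (uint32_t way = g->num_ways; way-- > 0u; )
            fn(setway_operand(g, set, way), ctx);
}

/* Host-side stand-in for the cache operation: just counts the calls. */
static void count_op(uint32_t operand, void *ctx)
{
    (void)operand;
    ++*(uint32_t *)ctx;
}
```

For an assumed geometry of 256 sets, 4 ways, and 32-byte lines, decode_ccsidr gives set_shift = 5 and way_shift = 30, and walk_setway issues 1024 operations. On the target, count_op would be replaced by a function that performs the invalidate, followed by a dsb once the walk finishes, as in the assembly.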
I have not thoroughly tested this, but I have run my app a bunch of times and it works just as it did without the MMU and caching. If I find any problems, I will post what I find.
Yongcai, thanks for the help.