Regarding L1CSR0 register change.

Simbu · ‎09-20-2012

Hi All,

I'm working on a P4 processor (e500mc) on which I need to modify the L1CSR0 register (bit 60), and I'm trying the pseudocode below:

--> sync

--> Read from L1CSR0

--> sync

--> isync

--> Write to L1CSR0

--> isync

But I could see the changes are not reflected or getting into some weird issues (like panic and other memory related problems). Am I missing something here? Could anyone provide a pointer in this regard?
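For readers following along, here is a small C sketch of how "bit 60" maps to a mask. It assumes the bit number uses the Power-style 64-bit numbering in which bit 0 is the most significant bit; ibm_bit64() is a hypothetical helper name, not something from the kernel or the SDK.

```c
#include <stdint.h>

/* Power documentation numbers bits from the most significant end, so in a
 * 64-bit numbering scheme bit 0 is the MSB and bit 63 is the LSB.
 * ibm_bit64() is a hypothetical helper for illustration only. */
static inline uint64_t ibm_bit64(unsigned bit)
{
    return (uint64_t)1 << (63u - bit);
}

/* Assumption: "bit 60" in the post uses 64-bit numbering, which yields the
 * mask 0x00000008 that is used later in this thread. */
#define L1CSR0_BIT60_MASK ibm_bit64(60)
```

Under that assumption, bit 60 is the same bit that the 0x00000008 mask in the replies below refers to.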

Thanks in Advance.

timurtabi · ‎09-20-2012

What values are you reading from, and then writing to, L1CSR0?

Simbu · ‎09-20-2012

Thanks for the quick reply.

I'm reading the existing value from L1CSR0 and doing an AND operation against 0x00000008 to update the 32-byte compatibility bit (DCBZ32).

scottwood · ‎09-20-2012

That'll clear every other bit and leave DCBZ32 alone -- which will definitely cause problems like you describe.

If you want to set the bit, OR with 0x00000008. If you want to clear the bit, AND with ~0x00000008 (bitwise inversion of 0x00000008).

If I'm misinterpreting your response, could you show the actual code and the actual values read and written?
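A minimal C sketch of the difference, operating on a plain integer rather than the real register. The helper names are hypothetical; only the 0x00000008 mask comes from this thread.

```c
#include <stdint.h>

#define L1CSR0_DCBZ32 0x00000008u   /* mask quoted in this thread */

/* Set DCBZ32 while preserving every other bit. */
static inline uint32_t l1csr0_set_dcbz32(uint32_t v)
{
    return v | L1CSR0_DCBZ32;
}

/* Clear DCBZ32 while preserving every other bit. */
static inline uint32_t l1csr0_clear_dcbz32(uint32_t v)
{
    return v & ~L1CSR0_DCBZ32;
}

/* The mistake: AND with the mask keeps DCBZ32 as it was and clears
 * every other bit, including whatever enabled the cache. */
static inline uint32_t l1csr0_and_with_mask(uint32_t v)
{
    return v & L1CSR0_DCBZ32;
}
```

With a sample value of 0x00010001, the OR version gives 0x00010009, while the AND-with-mask version gives 0.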

Simbu · ‎09-21-2012

Thanks a lot for the reply, Scott.

My statement in the note above was not clear. Let me try once again: I use AND to check that the bit is not already set before enabling the compatibility mode, and then I OR the value with 0x00000008 to update the L1CSR0 register. I do the reverse to clear it.

Below is the snippet of code (partly pseudocode):

I'm not able to insert the code (not sure of the reason), so I have attached it as a JPEG image.

Simbu · ‎09-21-2012

For now (short term), I'm not concerned about the performance penalty or any kernel degradation, so I have tried to set the value in the "__e500_dcache_setup" code by executing the instruction below.

ori r0, r0, 0x00000008

But there is no change in behaviour after the above code either.

Thanks.

scottwood · ‎09-21-2012

You said that "I could see the changes are not reflected or getting into some weird issues (like panic and other memory related problems)". How did you see the changes not being reflected? You read the register back and that bit wasn't set? Or you did a dcbz and saw 64 bytes cleared?

As for kernel crashes and such, are you sure that the kernel is able to work in dcbz32 mode? If you're talking about Linux, I think it'll need some changes for that, to use dcbzl. Otherwise it will only clear every other 32 bytes in a dcbz loop, because the kernel is assuming a 64 byte cache block on this platform.
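To illustrate the "every other 32 bytes" effect, here is a small software model in C. It is only a sketch: model_dcbz() and clear_region() are hypothetical names, and the model assumes that with DCBZ32 set, dcbz zeroes the naturally aligned 32-byte block containing the address.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REAL_BLOCK 64u   /* e500mc cache block size, as stated in this thread */

/* Software model of dcbz: zero the naturally aligned block containing addr.
 * With dcbz32 set the block is 32 bytes, otherwise the real 64 bytes. */
static void model_dcbz(uint8_t *mem, size_t addr, int dcbz32)
{
    size_t len = dcbz32 ? 32u : REAL_BLOCK;
    memset(mem + (addr & ~(len - 1)), 0, len);
}

/* A clear loop written the way code assuming 64-byte blocks would write it.
 * Returns the number of bytes the loop failed to clear. */
static size_t clear_region(uint8_t *mem, size_t size, int dcbz32)
{
    size_t nonzero = 0;

    memset(mem, 0xFF, size);                 /* pretend the region is dirty */
    for (size_t a = 0; a < size; a += REAL_BLOCK)
        model_dcbz(mem, a, dcbz32);          /* one "dcbz" per 64 bytes */
    for (size_t i = 0; i < size; i++)
        nonzero += (mem[i] != 0);
    return nonzero;
}
```

In this model a 256-byte region is fully cleared with DCBZ32 off, and half of it (the upper 32 bytes of every 64-byte block) is left untouched with DCBZ32 on.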

pegasus711_knig · ‎10-09-2012

Hi Scott.

I've been facing this myself, so it's good that someone else has had the same thing to tackle. Getting back to the topic, you say "If you're talking about Linux, I think it'll need some changes for that, to use dcbzl. Otherwise it will only clear every other 32 bytes in a dcbz loop, because the kernel is assuming a 64 byte cache block on this platform".

Now I'd like to understand what you mean here. To help with your answer, here is what I understand of this:

I assume you mean it is NOT possible for the cache line size to be 32 bytes for a certain process and the platform default for the others. Right? Or do you mean something else? Or is it even possible or feasible to have such a per-process arrangement with respect to the cache block size?

Adding to the above, my question is: if I just set the DCBZ32 bit in the L1CSR0 register, will that suffice, or do I need to do something else as well in order to give the bootloader or the OS the impression that my cache line size is 32 bytes and not 64 bytes?

Keen to hear from you guys.


Simbu · ‎10-09-2012

As Scott mentions elsewhere in this thread, just setting the bit in L1CSR0 is not sufficient. I tried that in my first attempt and got a panic during boot. I also tried placing a hook to select between dcbz and dcbzl using a configuration flag (no luck). I need to cross-check where the bit is being set. I'm also facing challenges with my toolchain, which makes my app believe the cache line size is 64 bytes (I'm attempting to fool the processes into believing that the cache block size is 32 bytes).

pegasus711_knig · ‎10-10-2012

Hmm, thanks Scott and Simbu for your inputs.

So Scott, a better way would be to replace dcbz, dcba and dcbzep with their long equivalents for processes that do require the hardware default cache size that the core offers? If not, then have this bit set in L1CSR0 (for the d-cache), follow the sync requirements mentioned in the manuals and be done with it? What I am unable to get my head around is how setting this bit on a per-process basis, with a process being more of a kernel construct, works with the exact same register at the hardware level, with some processes requiring the entire native cache line size while others need only 32 bytes.

To make myself clear: consider two processes, P1 and P2. P1 doesn't need this bit set and P2 does. The scheduler has P1 scheduled before P2. P1 goes ahead, fetches a certain memory location, finds that the dcache does not have it, brings it in from RAM, updates the cache block (line) and sets the appropriate bit marking the cache line as valid. Now the scheduler schedules P2, which happens to want the same memory and wants this memory area cleared from the cache (why it would want the same memory area is beyond me, though). So although this requirement may sound stupid, if it needs to work on the same cache line for some strange reason, and in addition needs it to be 32 bytes only, then what if it goes and sets that bit in this register? What would happen to the other 32 bytes in that line?

To Simbu: I believe the flag, as Scott said above, was process based (although it seems it has to be a global flag). I wonder if adding a separate config option to the Kconfigs under powerpc would be right? Any comments on this, Scott?

Additionally, Simbu, you said your toolchain is giving you problems. Is there a way to specify this to GCC? I mean the cache line size? If yes, why would you want to specify it during compilation, linking or loading?

Hoping to hear from you fellas.

scottwood · ‎10-10-2012

Pegasus711 KnightRider wrote:

Hmm, thanks Scott and Simbu for your inputs.

So Scott, a better way would be to replace dcbz, dcba and dcbzep with their long equivalents for processes that do require the hardware default cache size that the core offers? If not, then have this bit set in L1CSR0 (for the d-cache), follow the sync requirements mentioned in the manuals and be done with it?

In the kernel, yes, though I wouldn't look at it as "requires the hardware default cache size" but "does not have a legacy requirement for a simulated 32-byte cache block". The best way is to avoid the need for DCBZ32 at all.

I do not recommend trying to selectively patch up userspace to use dcbzl. Either implement a per-process mechanism or lie to userspace and say that cache blocks are 32 bytes.

Pegasus711 KnightRider wrote:

What I am unable to get my head around is how setting this bit on a per-process basis, with a process being more of a kernel construct, works with the exact same register at the hardware level, with some processes requiring the entire native cache line size while others need only 32 bytes.

To make myself clear: consider two processes, P1 and P2. P1 doesn't need this bit set and P2 does. The scheduler has P1 scheduled before P2. P1 goes ahead, fetches a certain memory location, finds that the dcache does not have it, brings it in from RAM, updates the cache block (line) and sets the appropriate bit marking the cache line as valid. Now the scheduler schedules P2, which happens to want the same memory and wants this memory area cleared from the cache (why it would want the same memory area is beyond me, though). So although this requirement may sound stupid, if it needs to work on the same cache line for some strange reason, and in addition needs it to be 32 bytes only, then what if it goes and sets that bit in this register? What would happen to the other 32 bytes in that line?

Again, DCBZ32 does not alter the actual structure of cache. A cache line is 64 bytes on e500mc, always. DCBZ32 just changes the behavior of instructions like dcbz. Instead of allocating and zeroing a cache line, it becomes a "zero out 32 bytes at this address" instruction. The performance benefit of dcbz is lost.
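As a rough illustration of the P1/P2 question, the C sketch below models a single 64-byte line. The struct and function names are hypothetical, and the model assumes that with DCBZ32 set the instruction zeroes the aligned 32-byte half that contains the effective address.

```c
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64u   /* fixed on e500mc, per the explanation above */

/* Hypothetical model of one cache line's data; not real hardware state. */
struct line_model {
    uint8_t data[LINE_SIZE];
};

/* offset: byte offset of the effective address within the line (0..63). */
static void model_dcbz_in_line(struct line_model *l, unsigned offset, int dcbz32)
{
    if (dcbz32)
        memset(&l->data[offset & 32u], 0, 32);   /* only the addressed half */
    else
        memset(l->data, 0, LINE_SIZE);           /* the whole line */
}
```

In this model the other 32 bytes of the line simply keep whatever they held before; the line itself is still 64 bytes in both modes.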

Pegasus711 KnightRider wrote:

To Simbu: I believe the flag, as Scott said above, was process based (although it seems it has to be a global flag). I wonder if adding a separate config option to the Kconfigs under powerpc would be right? Any comments on this, Scott?

Additionally, Simbu, you said your toolchain is giving you problems. Is there a way to specify this to GCC? I mean the cache line size? If yes, why would you want to specify it during compilation, linking or loading?

Hoping to hear from you fellas.

I didn't say that it is process based normally -- that was something that Simbu was trying to implement. The current state of the kernel is that DCBZ32 is not supported.

Simbu · ‎10-11-2012

To KnightRider:

The flag I was trying to make work is basically a kind of control bit (TIF__) which you can set when your process is launched and act on at task switch (set or clear depending on the bit value of the previous and current process). I tried to root-cause the issue, but because of time constraints and a few other reasons I switched to another approach.
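The idea described above could be sketched in C roughly as follows. This is a user-space model only, not kernel code: the struct, the simulated register and model_switch_to() are all hypothetical names, and a real implementation would have to access L1CSR0 with the synchronization the core manual requires.

```c
#include <stdbool.h>
#include <stdint.h>

#define L1CSR0_DCBZ32 0x00000008u   /* mask quoted in this thread */

/* Stands in for a per-thread TIF_ flag; hypothetical. */
struct task_model {
    bool wants_dcbz32;
};

/* Simulated register value; real code would read and write the SPR. */
static uint32_t model_l1csr0;

/* Called on a task switch. Returns true if the register had to change. */
static bool model_switch_to(const struct task_model *prev,
                            const struct task_model *next)
{
    if (prev->wants_dcbz32 == next->wants_dcbz32)
        return false;                       /* same mode, nothing to do */

    if (next->wants_dcbz32)
        model_l1csr0 |= L1CSR0_DCBZ32;      /* entering a 32-byte task */
    else
        model_l1csr0 &= ~L1CSR0_DCBZ32;     /* back to native behaviour */
    return true;
}
```

Comparing the previous and next flag first avoids rewriting the register on every switch, which matters because each update needs the surrounding synchronization.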

For user space app, we need to change libc to make sure it always uses

--

Currently, I am trying to replace all the instructions affected by DCBZ32 with their extended forms, for example dcbz with dcbzl. I have been able to make these changes (for the primary and secondary cores) in the kernel source files, but I am still trying to find the source tree of the FSL device drivers such as DPAA, which I suspect are causing the boot hang (assuming they use dcbz or dcba). Scott, any idea here?

Once I am able to boot the kernel with the dcbz change, I have a snippet of kernel-space code ready that clears a memory range using dcbz (to verify the new changes).

It looks simple, but it is not going my way...

pegasus711_knig · ‎10-13-2012

Hi Simbu

Ok. I seem to get what you are trying to do. On browsing some more on the surface of this vast ocean called the kernel mailing lists, I seem to have come across a fellow called Chris who is (or was) affiliated with Genband (the company). He was apparently trying to move some legacy apps from the PPC970 to the e5500. Here's the thread I'm talking about:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2010-September/085772.html

Scott, who I believe is you, then asked him in the follow-up to check with the BSP vendor. Now I have a few questions regarding the same.

1. Could you please elaborate on why one would want to do that in the BSP? I mean, what benefit would one derive by doing it in the BSP vis-a-vis in the early init parts of the Linux kernel?

2. One effect of doing that, as I see it, is that every dcbz will always be translated to an operation on a 32-byte cache line size, right? This would mean that those parts of the kernel where a dcbz actually depends on the 'default' cache line size would need to be changed to dcbzl, as you say. Right?

3. Would you be kind enough to give me a hint on which files would require such a change? I mean, which files or directories under $LINUX=/usr/src/linux/ would one look at? Looking at 2.6.27.10, I am trying to identify which file(s) under $LINUX/arch/powerpc/kernel/ I should be targeting. I believe that since the P4080 contains 8 e500 cores, it is also categorized as belonging to the 'Book E' family, right? This means that I should be looking at head_fsl_booke.S, which is the 'head.S' file for this core, isn't it?

To Simbu:

Why do you think any driver would want to make itself dependent on the width of the processor's internal cache line? That would deliberately make the code HIGHLY NON-PORTABLE. My suspicion is that there is something on the kernel side that you may have missed. Scott, could you please comment?

Regards.

scottwood · ‎10-15-2012

1. By "BSP" I don't mean something separate from the kernel. I mean a package containing a kernel, u-boot, root filesystem, toolchain, etc. provided by Freescale containing support for hardware that may not have yet been merged into upstream Linux. Freescale used to call this a BSP, and more recently calls it an SDK. Note that since then, e5500 support has been merged into upstream Linux (but not all I/O, such as datapath).

2. Right.

3. Use grep to find instances of the instruction. Why are you still working with 2.6.27? Yes, e500mc is a booke core.

Simbu · ‎10-13-2012

Hi KnightRider,

Pegasus711 KnightRider wrote:

Ok. I seem to get what you are trying to do. On browsing some more on the surface of this vast ocean called the kernel mailing lists, I seem to have come across a fellow called Chris who is (or was) affiliated with Genband (the company). He was apparently trying to move some legacy apps from the PPC970 to the e5500. Here's the thread I'm talking about:

Thanks for this information and the link. By the way, I searched a lot some months ago and found this same information useful for my ongoing work.

Pegasus711 KnightRider wrote:

Why do you think any driver would want to make itself dependent on the width of the processor's internal cache line? That would deliberately make the code HIGHLY NON-PORTABLE. My suspicion is that there is something on the kernel side that you may have missed. Scott, could you please comment?

My assumption was that DPAA or USDPAA might use dcbz or other cache instructions, since they provide a more optimized zero-copy path, which is a strength of the e500mc architecture. As Scott mentioned, the drivers do use such instructions, but not dcbz: there are macros like dcbzl_64 that always use dcbzl. I found one occurrence of dcbz and replaced it with dcbzl, and one video source file also uses the dcbz instruction.

With all the above changes I hit a problem during the kernel boot process, which I'm currently investigating.

scottwood · ‎10-11-2012

The datapath drivers are in drivers/staging/fsl_qbman, though it looks like they already use dcbzl.

scottwood · ‎10-09-2012

DCBZ32 doesn't actually make the cache line be 32 bytes, it just changes the behavior of dcbz, dcba, and dcbzep.

The kernel assumes that dcbz operates on the full cache block, as it properly considers the actual cache block size instead of assuming 32 bytes. If only 32 bytes get cleared when it's expecting 64 bytes to get cleared, bad things will happen. There is a dcbzl instruction that always uses the real cache block size regardless of DCBZ32. You'll need to use that in the kernel if you set DCBZ32.

You'll also need to make sure that you don't have DCBZ32 set for any processes that are assuming the real cache block size will be used (or fool the processes into believing that the cache block size is 32 bytes, though there is a performance penalty for DCBZ32 so you may not want to do that). It's more realistic to change it on a per-process basis than on kernel entry/exit, though in either case it's not something that is currently implemented in Linux.

Simbu · ‎09-21-2012

scottwood wrote:

You said that "I could see the changes are not reflected or getting into some weird issues (like panic and other memory related problems)". How did you see the changes not being reflected? You read the register back and that bit wasn't set? Or you did a dcbz and saw 64 bytes cleared?

Basically, I have introduced a per-thread flag in the kernel to enable and clear this functionality. In my app (I am able to dump the memory segment affected by the dcbz instruction), I can see that the data cache operation is not performed on 32 bytes; it always operates on 64 bytes.

I have checked the value after modifying the register and can see that the change is reflected in the register, but not in my app.

Below is an outline of the code changes:

-- Introduced a per-thread flag (TIF_..).
-- Enabled the functionality according to the flag (!TIF_..)
-- Handled clearing the flag and disabling the functionality in sys_execve() and compat_sys_execve()
-- Load app (syscall -> kernel -> set the flag and enable the functionality)
-- Verification of the dcbz instruction.

I'm not sure whether there is a simpler way (compared to the above) to verify my app's execution; for the time being, anything simpler would be good.

I even tried to make the change CPU-specific (CPU 0) or user-program-specific (as the change is required for the user program alone), but did not get the expected results.

scottwood · ‎09-21-2012

How are you testing that the DCBZ32 bit is actually set at the time (i.e. that there's no bug in the per-process mechanism)? E.g. single step with an external debugger, make a system call immediately before and after doing the dcbz to dump L1CSR0, etc.

Have you tried dcbz with DCBZ32 under a simpler environment such as U-Boot, a standalone program, or early kernel boot to verify that it's working as expected?

Are you sure you're using dcbz and not dcbzl (i.e. bit 0x00200000 of the opcode must be unset)?
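One way to check this from a dump of the program text is sketched below in C. It assumes the usual X-form encoding for dcbz (primary opcode 31, extended opcode 1014); the helper names are hypothetical, and only the 0x00200000 bit comes from the question above.

```c
#include <stdbool.h>
#include <stdint.h>

#define PPC_PRIMARY(insn)  ((insn) >> 26)            /* top 6 bits */
#define PPC_XO(insn)       (((insn) >> 1) & 0x3FFu)  /* 10-bit extended opcode */
#define DCBZ_L_BIT         0x00200000u               /* set for dcbzl */

/* True for both dcbz and dcbzl (assumed encoding: opcode 31, XO 1014). */
static bool is_dcbz_family(uint32_t insn)
{
    return PPC_PRIMARY(insn) == 31u && PPC_XO(insn) == 1014u;
}

/* True only for the long form, which ignores DCBZ32. */
static bool is_dcbzl(uint32_t insn)
{
    return is_dcbz_family(insn) && (insn & DCBZ_L_BIT) != 0;
}

/* True only for the plain form, which is the one DCBZ32 affects. */
static bool is_plain_dcbz(uint32_t insn)
{
    return is_dcbz_family(insn) && (insn & DCBZ_L_BIT) == 0;
}
```

Under that encoding assumption, 0x7C0007EC decodes as a plain dcbz and 0x7C2007EC as dcbzl.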

Simbu · ‎09-22-2012

Thanks Scott!

I'm using an external single-step debugger to execute my app's program store instruction by instruction, while in parallel checking the memory location that dcbz is supposed to clear, before and after the instruction executes (as discussed, it is zeroing 64 bytes instead of 32 bytes). I'm not dumping L1CSR0.

My assumption is that on a context switch the thread flag is set, which enables the DCBZ32 functionality for a user app so that the instruction operates on 32 bytes (this may not be 100% clear; I could give more detail from my code).

I'm not able to debug the context switch or task switch scenario, and I'm not sure how wise it is to make a syscall for each execution of the dcbz instruction, as the number of occurrences of the instruction in the program store is quite large.

I haven't tried to check the behaviour during early kernel boot yet, but I may need to give it a try if there is a suspicion that it may not work in kernel mode either. I did set the value and dump L1CSR0 after the update, and could see that it was reflected properly.

Could you please suggest a way to set up a debug platform for dcbz? I can place a breakpoint in the program store (user app) and get a hit, but I can't see how to achieve 32-byte execution for a specific app without making a system-wide change.

scottwood · ‎09-24-2012

Please read L1CSR0 with the external debugger while stepping through the code where you see dcbz affect 64 bytes. Let's focus on whether the hardware is behaving properly -- debugging your Linux modifications is beyond the scope of this forum (especially without actually seeing them). You could ask your sales contact for such a mechanism as an SDK feature request, or if you want to work with the open source community to get this debugged and merged into the kernel, you could post an RFC patch against the latest upstream kernel to the open source linuxppc-dev list (be sure to mention that you need help debugging it).

The suggestion to make a system call was for debug purposes -- to have the kernel print L1CSR0 at that moment -- not something that should be left in the final product. It would be better to use the external debugger to read L1CSR0, though.
