Instruction Cache, MCF523x processor, CodeWarriors V10.0

Dogbert256
Contributor III

Hello!    Please Help!

 

I have my own prototype design up and working.  Now I'm trying to get the cache enabled to increase performance.  For this processor, the cache is 8 KBytes.  The processor also has a 64 KByte internal SRAM (non-cacheable).  Code is executed from an external Flash memory chip (8 MBytes of Flash on CS0 in x16 configuration).  The cache for this processor can be configured as instruction, data, or split 50%/50% instruction and data.  For my application, instruction-only will provide the best performance increase.

 

There are only three relevant registers to configure: CACR, ACR0, and ACR1.  Good to this point.  I want a single unified instruction cache, covering the 8 MBytes on CS0.  Thus I'd think that I would set ACR0 and ACR1 to the same value.  However, the documentation is vague on how best to set these registers so that they both work in conjunction in the same base memory area.  In other words, the address spaces for ACR0 and ACR1 would overlap completely.  Is this a problem?  I don't want something strange happening, such as half of the cache being wasted because the cached instructions in both halves of the cache are exactly the same.

 

Each ACRx register covers an address area 16 MBytes in size (i.e., A31:A24 must match the value in ACRx for an access to be recognized as cacheable).  I could do something strange, such as memory-mapping the Flash on CS0 across a 16 MByte region boundary (say, with 4 MBytes on each side).  That would allow me to split half of the memory into ACR0 and half into ACR1.  However, in reality the total size of the executable code is going to be less than 512 KBytes (maybe even less than 256 KBytes).  It will be impossible to split that across non-overlapping 16 MByte boundaries.

 

If setting ACR0 and ACR1 the same results in wasting 1/2 of the cache, I'd rather use the cache as a 50% instruction, 50% data cache, with the data residing elsewhere, such as my external SRAM.  Though I have to be careful about that as there are DMA processes going on in this SRAM, and that is incompatible with caching.

 

I investigated a little further using the ColdFire Init utility.  It gives me a warning "The cache memory regions defined by ACR0 and ACR1 overlap, which may not be what you intended".

 

Any insight or documentation you can point me to would be appreciated greatly!

 

Dogbert256

TomE
Specialist II

> Thus I'd think that I would set ACR0 and ACR1 to the same value.

 

I can't see why. The two ACRs are for mapping different memory areas.

 

More precisely, CACR sets up a "default mapping", and the two ACRs are "exceptions" for two user-defined memory areas.

 

So you could set up CACR to "cache everything" and then use ACR0 to "uncache" the I/O registers and ACR1 to "uncache" some other memory region (like where you have your DMA buffers if you don't want them cached), or your FLASH if you're using it as a file store and have to program it.
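 

Roughly, that arrangement looks like this.  This is only a sketch: the CACR_*/ACR_* values below are placeholders, not the real MCF523x field encodings (take those from the cache chapter of the Reference Manual), and the GCC-style asm would need translating for CodeWarrior:

    /* CACR as the default policy, ACRs as per-region exceptions.
       All constants here are PLACEHOLDER values - check the RM. */
    #define CACR_ENABLE      0x80000000UL  /* assumed: cache enable bit     */
    #define ACR_ENABLE       0x00008000UL  /* assumed: region enable bit    */
    #define ACR_UNCACHED     0x00000040UL  /* assumed: CM = cache-inhibited */

    /* base = A31:A24 of the region, mask = which A31:A24 bits to ignore */
    #define ACR_REGION(base, mask) \
        (((unsigned long)(base) & 0xFF000000UL) | (((unsigned long)(mask) & 0xFFUL) << 16))

    static void wr_cacr(unsigned long v) { __asm__ volatile ("movec %0,%%cacr" : : "d"(v)); }
    static void wr_acr0(unsigned long v) { __asm__ volatile ("movec %0,%%acr0" : : "d"(v)); }
    static void wr_acr1(unsigned long v) { __asm__ volatile ("movec %0,%%acr1" : : "d"(v)); }

    void cache_setup(void)
    {
        wr_acr0(ACR_REGION(0x40000000UL, 0x00) | ACR_ENABLE | ACR_UNCACHED); /* e.g. IPSBAR   */
        wr_acr1(ACR_REGION(0x30000000UL, 0x00) | ACR_ENABLE | ACR_UNCACHED); /* e.g. DMA area */
        wr_cacr(CACR_ENABLE);   /* everything else cached by default */
    }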

 

> Though I have to be careful about that as there are DMA processes going on in this

> (external) SRAM, and that is incompatible with caching.

 

Then why not write your DMA handling code to be compatible? When the CPU writes data to a DMA buffer, flush it before starting the DMA, and when assigning a buffer for DMA Reads, invalidate it first. That's all you have to do.
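 

In outline it is just this (cache_flush_range() and cache_inv_range() are stand-ins for routines built around CPUSHL; the DMA calls are invented names):

    #include <stddef.h>

    extern void cache_flush_range(const void *p, size_t n); /* write dirty lines to memory */
    extern void cache_inv_range(void *p, size_t n);         /* discard stale cached lines  */
    extern void dma_start_tx(const void *p, size_t n);      /* hypothetical driver calls   */
    extern void dma_start_rx(void *p, size_t n);

    void dma_send(const void *buf, size_t n)
    {
        cache_flush_range(buf, n);   /* DMA must see what the CPU wrote */
        dma_start_tx(buf, n);
    }

    void dma_receive(void *buf, size_t n)
    {
        cache_inv_range(buf, n);     /* CPU must not read stale cached data */
        dma_start_rx(buf, n);
    }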

 

For "small device transfers" (in our case USB and low throughput Ethernet) the DMA buffers can go into the internal static RAM. You've got 64k of internal SRAM. That should be big enough for a lot of things.

 

Make sure you put the CPU Stack in the Static RAM (if you're not doing multithreading with multiple stacks). That speeds the CPU up quite a bit.

 

Otherwise it looks like you'll have all your FLASH cached (good) and the data not cached (not so good).  What is the CPU doing that it doesn't have much data access?  Caching isn't just about speeding up access.  Usually only the cache can perform burst reads and writes to memory.  This is probably more important with SDRAM than with SRAM, but SRAM is usually burstable.  So if it isn't cached it will be read and written one word at a time, wasting a lot of CPU clocks.

 

Play with the "BWE" bit in the ACRs and DBWE in CACR. Depending on what the CPU is doing this may get you a speedup.

 

I'm using the MCF5329. It has "writethrough" and "writeback" modes, with "writethrough" being both faster for what we're doing (writing to RAM acting as an LCD pixel buffer) and also not needing cache flushes.

 

Would you believe the fastest SDRAM-to-SDRAM memory copy I've found involves copying from SDRAM to the internal SRAM first and then separately copying it back out again?  It is a lot faster than any direct SDRAM-to-SDRAM copy.  These are the kinds of things that caching and bursting can result in.

 

Tom

 

Dogbert256
Contributor III

Thanks for your reply!

 

I'm aware that you can set CACR to cache everything and ACR0, ACR1 to uncache two specific areas.  But I have two Flash chips (a firmware Flash chip plus a non-volatile data Flash chip) and an external SRAM.  I don't want to cache the IPSBAR space, the space of one of the Flash chips, or the external SRAM.  That is too many areas to uncache.  Since I only want the firmware to be cached, I only want one address space to be cached, and it seemed more straightforward to set CACR to uncache everything and set one or both of the ACRn registers to cache the Flash firmware address space.

 

I have totally avoided DRAM; I just don't like it.  It adds a lot of complexity and is slow.  It is cheap for large chunks of memory, but I code real tight and don't need a massive amount of read/write memory.

 

I cannot cache the data Flash chip because it simply wouldn't work.  The cache doesn't understand the Flash programming algorithm.

 

The reason I don't want the external SRAM to be cached is that it mostly exists to receive large data chunks from the UART via the DMA controllers (firmware upgrade data).  That occupies the upper half of the external SRAM.  The lower half is reserved for running firmware code from SRAM.  When this occurs, I will be writing to both of the Flash chips, so neither of them should be cached, plus IPSBAR, and again I run up against the two-region limit of ACR0 and ACR1.

 

I have set my DMA transfer buffers to be flushed (buffer values set to 0x00, the UART NULL byte).  In addition, I have pointers to the buffer which are set up to point to the current buffer address and to indicate whether it is valid (I do this in the DMA ISR).  And yes, for the small data transfers that are typical, I place the DMA buffer into internal SRAM.  There are other DMA transfers which are much larger and go into the external SRAM (the firmware upgrades).

 

The CPU stack and the heap are located in the internal SRAM, and yes that has given me a big performance boost.  Virtually all data access is from the internal SRAM, which is of course non-cacheable.  The stack is large, so is the heap, and I'm not anywhere close to using them to 100% capacity.

 

The external data Flash is to be used for read/write data storage, and needs to be uncached.  If I want to run the firmware at high speed, I use one of the DMA controllers to transfer the firmware from Flash to external SRAM.  When running code from the external SRAM, I invalidate the cache and reconfigure it as an instruction cache at the external SRAM address.

 

Since the cache is set up as a 100% instruction cache, when running code from external SRAM I'm pretty sure that it will simply ignore any data references to the upper half of the SRAM, because an instruction cache should ignore data read/write operations (you'd think, and I need to confirm this).

 

Last night I programmed the cache and did some benchmark tests at Fsys = 16.88 MHz:

cache disabled, burst writing disabled, BWE disabled:  Erase and blank check data Flash chip in 29 seconds.

Burst writing enabled:  23 seconds.

Burst writing and BWE enabled:  22 seconds.

Cache set up as instruction cache (ACR0 = ACR1), burst writing, and BWE enabled:  13 to 14 seconds!

 

It is working well right now.  I am currently just setting both ACR0 and ACR1 to be the same.  I just want to maximize my usage of the cache, and if I'm only using 4K I'm not happy.  Reading through the documentation, I don't get the feeling that only half of the cache is being used if I set both ACRn to be the same; it reads more as though they define the attributes of the memory areas they refer to.  After all, in the mode where CACR specifies what to cache and the ACRn specify what not to cache, it would make no sense for each ACRn to control half of the cache.

 

So I conclude that I'm using the full 8 Kbytes of cache.  Just want to be sure, though!

 

Thanks for any thoughts,

 

Dogbert256

  

TomE
Specialist II
(Accepted Solution)

I see your problem. Too many different memory areas, not enough control registers.

 

I get the idea of allowing large DMA buffers into memory for download so the FLASH programming code doesn't have to be bothered by I/O.

 

One of the good things about the ColdFire, inherited from the 68000, is the multiple interrupt levels.  You could always have a DMA interrupt coming in at IPL6 and interrupting the FLASH-burning code.  So you could have the serial DMA happening into small internal-SRAM buffers, and then take interrupts that just block-move the data into external SRAM buffers.  If that's the only thing the ISR does, it can safely interrupt the FLASH-burning code.
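 

The shape of that ISR is about this simple (every name below is invented, and the level-6 assignment happens in the interrupt controller setup, not shown):

    #include <string.h>

    #define CHUNK 256                        /* small buffer in internal SRAM */
    static unsigned char sram_chunk[CHUNK];
    static unsigned char *ext_dst;           /* fill pointer in external SRAM */

    extern void dma_rearm(void *buf, unsigned len);  /* hypothetical driver call */

    /* Runs at IPL6, so it can preempt the FLASH-burning code safely:
       it only touches SRAM, never the FLASH chip being programmed. */
    void dma_chunk_isr(void)
    {
        memcpy(ext_dst, sram_chunk, CHUNK);  /* the block move */
        ext_dst += CHUNK;
        dma_rearm(sram_chunk, CHUNK);        /* ready for the next chunk */
    }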

 

> I cannot cache the data Flash chip because it simply wouldn't work.  The cache

> doesn't understand the Flash programming algorithm.

 

Firmware updating is a "special case". It isn't exactly something you need to do quickly either.


So why not globally disable all caches while you're programming the firmware? Or disable it around the "write commands and data to FLASH" functions?
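 

Something like this, where cache_off()/cache_on() stand for the CACR writes (plus an invalidate on re-enable) and the command cycles are whatever your FLASH device wants; none of this is from a real driver:

    #include <stdint.h>

    extern void cache_off(void);     /* clear the CACR enable bit  */
    extern void cache_on(void);      /* re-enable (and invalidate) */
    extern int  flash_ready(void);   /* device status poll         */

    void flash_program_word(volatile uint16_t *addr, uint16_t data)
    {
        cache_off();                 /* FLASH now acts like plain memory   */
        *addr = data;                /* command/data cycles, chip-specific */
        while (!flash_ready())
            ;                        /* wait out the device program time   */
        cache_on();
    }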

 

> Erase and blank check data Flash chip in 29 seconds.

> Burst writing enabled:  23 seconds.

> Burst writing and BWE enabled:  22 seconds.

> Cache set up as instruction cache (ACR0 = ACR1), burst writing, and BWE enabled:  13 to 14 seconds!

 

The Erase time should be constant and controlled by the FLASH. So what you're measuring there is the "blank check time".

 

Don't pass up the possibility of copying small fast functions to the internal SRAM. Or copying all the FLASH-burning code to the internal SRAM.
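 

The rough shape of that, with invented linker symbols bracketing the routine (the routine must be position-independent or linked to run at the SRAM address):

    #include <string.h>

    extern char __burn_start[], __burn_end[];  /* from the linker command file   */
    static char burn_sram[1024];               /* buffer placed in internal SRAM */

    void run_burner_from_sram(void)
    {
        void (*burn)(void) = (void (*)(void))burn_sram;

        memcpy(burn_sram, __burn_start, (size_t)(__burn_end - __burn_start));
        burn();                                /* executes out of internal SRAM */
    }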

 

> internal SRAM, which is of course non-cacheable.

 

More to the point, it is single-cycle access, so it would be pointless to cache it.

 

> The external data Flash is to be used for read/write data storage, and needs to be uncached.

 

Or needs cache invalidation around the DMA buffer allocation functions. This isn't all that hard. I've been able to avoid doing it on the project I'm working on, but someone out there (google?) should have some working sample code that does this properly.

 

> Fsys = 16.88 MHz:

 

Why not the full 150MHz if speed is at all important?

 

> I am currently just setting both ACR0 and ACR1 to be the same.

 

I'd recommend against that, as it may cause some sort of internal address contention.  For what you're doing, shouldn't you just enable one and disable the other?

 

Tom

 

Dogbert256
Contributor III

Hello,

 

On those rare occasions when I upgrade firmware on the code Flash, I will invalidate the cache.  My firmware must update itself via the Internet over a WiFi connection, and it must do so while still maintaining primary functionality!  And all in a totally fault-tolerant manner: I can't have a bricked device if the power fails.

 

The second Flash is non-volatile memory, and it is being written to frequently (maybe 2000X per day).  The data is stored in a circular buffer on Flash, with program and erase occurring.  At this rate I've calculated that the Flash will only last 2000 years!  (It has 100,000 erase cycles).

 

So this second Flash should be excluded from the cache - maybe.  It's not code, and the processor should simply not cache it if I've configured it as an instruction-only cache.

 

Well, yes, I could rev the CPU up to 150 MHz.  But the cost is increased power, so I try to get the optimal efficiency first, then only rev up the CPU to the necessary speed to accomplish the goal at hand, then crank it up another notch for safety.

Everything is optimized for very-low power.

 

I think there is a point in disabling one of the ACRn registers.

 

By the way, I love the hardware.  My first embedded design was back in '94.  My new design has 10X more functions and 10X more speed, is 10X smaller, uses less power, and costs a fraction as much!

 

There is still a thing (or two...) that I haven't figured out.  In the MCF523x Reference Manual, there are many references to "clock cycle".  But some of them refer to Fsys (CPU core speed), and others refer to Fsys/2, or CLKOUT speed.  I see very clearly that the cache and internal SRAM will operate in a single clock cycle.  And I think that is referring to Fsys, not Fsys/2.  Would that be correct?  If so, the internal memory operates at least 4X faster than a zero wait state external SRAM.

 

Thx,

Dogbert256

 

TomE
Specialist II

> Well, yes, I could rev the CPU up to 150 MHz.  But the cost is increased power,

 

Not to a first approximation.

 

If you look in the Reference Manual -- silly me, this is obviously in the Data Sheet...

 

if you look in the Data Sheet and find the section that lists power consumption versus CPU modes and frequency, you'll find...

 

They haven't bothered to write this section for this chip. Even the line that lists "Pad Operating Supply Current - Low Power Modes" is still listed as "TBD" after all these years.


So that tells me that this chip is not meant for "low power operation", and if you want low power you should select a chip where the manufacturer cares about this and gives you enough documentation to allow you to "engineer a design" rather than "reverse-engineer a design".  Otherwise you're likely to find a bunch of things that prevent the "low power modes" from operating properly.  For instance, the "Keep Alive Power" pin on the MPC860 drew 20mA instead of 10uA due to "floating nodes", and the workaround was "provide 20mA keep-alive power" for a chip that was meant to stay alive from a coin cell, but never could.

 

The CPU core draws 135mA at maximum frequency.  You should be able to get that down to 1/10 by running at 1/10 clock speed, as long as there aren't a lot of parasitics in there you don't know about.  Note that the "Pad Operating Supply Current" is listed as 100mA, and that's with it not doing anything (not obviously driving any signals), so if that is true it is useless trying to save CPU core power, since that's a "static power draw" from the I/O logic.

 

Back to the original topic.

 

With a "well designed chip and system" the power consumption doesn't depend on the clock frequency. Surprised?

 

The power consumption should depend on how many instructions per second get executed, and not how fast they are doing it.

 

That assumes that when there's nothing for the CPU to do, it does nothing in a "low power stop mode".  For the ColdFire this is:

 

    __asm("  stop #0x2000\n");

 

Or whatever the syntax is for your compiler. When stopped the CPU effectively draws no current.
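 

The surrounding structure is then just a run-to-completion loop; a sketch, with work_pending()/do_work() standing in for whatever the application does:

    for (;;) {
        while (work_pending())         /* flags set by the various ISRs */
            do_work();
        __asm("  stop #0x2000\n");     /* sleep at IPL0 until the next interrupt */
    }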

 

So on a slow clock it may be running 50% of the time, but run the clock 10 times faster and it is now drawing 10 times the current but is only running 5% of the time. These should cancel out to a first approximation.

 

Your memory busses take power per access cycle, and not based on the speed. So running from internal SRAM and cache is always a win here.

 

The secondary effects that do increase the power with the clock speed are due to the peripherals.  Peripheral modules usually draw more power when fed with a faster clock, but not all of them do.  They don't draw nearly as much as the CPU, so to a first approximation they can be ignored (with the exception of things like Ethernet, which is high-powered anyway).  The oscillator also scales with frequency, but that's the crystal frequency and not the PLL one, and the PLL draws more current when running faster, but usually not much.

 

You should power off peripherals you're not using (or not using currently). You could even turn the PLL on and off if you really need to save power and can handle the peripheral clock changing speed during the switches (timers, UARTs and so on).

 

I've worked with chips that separately list all the CPU, peripheral, crystal and PLL current consumptions and their variation with clock speed.  That allows you to DESIGN to a specific power consumption.  Would you believe a complete board with the CPU running and keeping time taking 0.5mA?  That needed a lot of work, including programming the ADC inputs as OUTPUTS when not sampling, as they drew 0.2mA each if I didn't do that.

 

If the Data Sheet doesn't detail the power consumption like this then that chip isn't meant for low power operation.

 

> Everything is optimized for very-low power.

 

How low do you need? The WiFi has to be taking a lot of power.

 

Are you using the FEC to get to the WiFi hardware? Minimum FSYS/2 is 50MHz for that.

 

> And I think that is referring to Fsys, not Fsys/2.  Would that be correct? 

 

Reference Manual "Figure 7-1. MCF5235 Clock Connections" tells you all you need to know about this.

 

> If so, the internal memory operates at least 4X faster than a zero wait state external SRAM.

 

Look at the Data Sheet, specifically "Figure 11. Read/Write (Internally Terminated) SRAM Bus Timing"

 

One read seems to take SIX clocks minimum, and these are FSYS/2 clocks.  So they're 12 times slower than internal SRAM or cache for 32-bit accesses (or 24 times slower for 16-bit, or 48 times slower for 8-bit).  There are probably some extra hidden "arbitration delays" in there as well, which might add another clock or two.


This is trivially simple to benchmark.  Write some memory-reading code using MOVEM.L instructions and time it from internal and from external SRAM.  Report back here.
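 

A minimal probe would look something like this (GCC-style asm, so it needs translating for CodeWarrior; read_timebase() is a stand-in for any free-running counter you have):

    #include <stdint.h>

    extern uint32_t read_timebase(void);   /* hypothetical free-running timer */

    /* Read 'blocks' x 32 bytes starting at 'p' with MOVEM.L - one of the few
       ways a ColdFire core issues line-sized, burstable reads. */
    uint32_t time_movem_reads(const uint32_t *p, uint32_t blocks)
    {
        uint32_t t0 = read_timebase();
        __asm__ volatile (
            "1:\n\t"
            "movem.l (%0),%%d1-%%d7/%%a1\n\t"   /* 8 longwords = 32 bytes */
            "lea     32(%0),%0\n\t"
            "subq.l  #1,%1\n\t"
            "bne     1b"
            : "+a"(p), "+d"(blocks)
            :
            : "d1", "d2", "d3", "d4", "d5", "d6", "d7", "a1", "memory");
        return read_timebase() - t0;   /* run once for internal, once for external SRAM */
    }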

 

Tom

 

Dogbert256
Contributor III

No reverse engineering here.  Believe it or not, I was concerned about the (ambiguous, T.B.D.) power draw for the MCF523x, and I actually was able to get through to apps support at Freescale.  It was as difficult as pulling teeth, but I got answers.  The bottom line is that in the lowest power mode (STOP mode), the processor draws "a couple of tens of milliamps".  Certainly not micro-powered like the Motorola 68332 processor, which draws in the uA range when stopped.

 

I didn't want to design using a 21 year old processor, and I absolutely needed the TPU (or eTPU) unit.  Everything was carefully considered.

 

// With a "well designed chip and system" the power consumption doesn't depend on the clock frequency. Surprised?

 

Come on, you can't be serious!  The following should be true to first order, all things being equal (same code, zero-wait-state external memory, etc.):

 

CMOS Processor Power Draw ~ (proportional to) C x V^2 x Fsys + Kq

Where C is the "capacitance of the CPU", V is the CPU voltage, Fsys is the CPU frequency, and Kq is the zero frequency quiescent current.

 

Since C, V, and Kq are constants, we take the derivative, note that the result is a linear equation, and take the expectation:

 

mean { (delta) CMOS Processor Power Draw } ~(proportional to) mean { (delta) Fsys }

 

It's not surprising: if you run the CPU at 150 MHz for 5% of the time and stop it for 95% of the time, the power draw is proportional to BOTH the average frequency and the average number of instructions being run (neglecting Kq).

 

The correct answer for external memory access is found from figure 7-11.  For back-to-back external memory access, the processor takes 3 Fsys/2 cycles, or 6 Fsys cycles.  Of course, there are additional bus latency issues that will certainly add to this.

 

Actual power draw for my "CPU board", which includes the CPU, a buck regulator for 3.3 V, a linear regulator for 1.5 V, all memory, a real-time clock, a supervisor, and some digital chips:

 

Stop mode, CLKOUT disabled:  .156 W

CPU in fault-on-fault state:  .228 W

CPU executing firmware from external Flash, Fsys = 16.5888 MHz, full I/O drive strength:  .276 W

 

A quick calculation shows, to a first-order approximation:

delta power draw per delta MHz ~ .0029 (W / MHz)

 

Extrapolating to 150 MHz, we get somewhere around .66 W.  With an understanding of the external circuitry and power supplies, this works out to less than 175 mA for both the core and the I/O.  This is understandably less than the nominal value of 235 mA given in MCF523x_Specs.pdf, because I'm not running all the CPU modules concurrently.

 

Nothing on the power data has surprised me.

 

// How low do you need? The WiFi has to be taking a lot of power

 

Of course, the WiFi module takes a lot of power, as much as ~1 watt in transmit mode.  Fortunately, this does not happen very often, and the average power is less.

 

Some more timing benchmarks:

 

Program entire data Flash memory chip, code in internal SRAM:  100 seconds

Program entire data Flash memory chip, code in external code Flash:  515 seconds

 

Yes, I know that the program time is constant: 7 uS per word, for 4 Mwords, or about 30 seconds.  And yes, data access time is constant; both programs access data in internal SRAM.  The cache was disabled.

 

That comes out to about a 7X improvement in speed, probably better.

 

Running the CPU at 150 MHz and stopping it on a dime is not the best way to run an embedded processor.  Many of the firmware "things" that are happening are wait loops, such as 30 uS for the QSPI to finish, as much as 1 mS waiting for the UART TXD buffer to clear.  While in some cases useful code can be executed during this time, in most cases it can't, as the processor needs to execute in a linear fashion; i.e., I need that info from the QSPI to proceed to the next step.

 

I'd rather not have my processor in small "wait loops" running at 150 MHz.

 

I'm aware that the DMA can be used to transfer data to the UART TXD.  But I've got 3 active UARTs and only 4 DMA controllers.  I must use three of them for UART RXD, leaving me with only one left.  I have to work with what I have.

 

Dogbert256

TomE
Specialist II

First, I'm interested in the following that I asked in my first post:

 

> Are you using the FEC to get to the WiFi hardware?

 

I'm interested in the interface in case I have to interface to WiFi someday. You can't be using USB, so is your WiFi chip on I2C, SPI, the FEC, Serial or what?

 

> > With a "well designed chip and system" the power consumption doesn't

> > depend on the clock frequency. Surprised?

>

> Come on, you can't be serious!  The following should be true to first order,

> all things being equal (same code, zero-wait-state external memory, etc.)

>

> CMOS Processor Power Draw ~ (proportional to) C x V^2 x Fsys + Kq

 

I did say "well designed", and that means the CPU should be stopped and drawing minimal power when it has nothing to do.

 

> Running the CPU at 150 MHz and stopping it on a dime is not the

> best way to run an embedded processor.

 

I've always found it to be a very good way. I've never coded any other way, or found any reason to.

 

> Many of the firmware "things" that are happening are wait loops,

> such as 30 uS for the QSPI to finish, as much as 1 mS waiting

> for the UART TXD buffer to clear.

 

Wait loops! That's what I meant when I said "well designed".  There's no good reason to ever use wait loops except in a simple bootstrap (or an application as trivially simple as a boot about to load the "real application").

 

> I didn't want to design using a 21 year old processor,

 

But you wanted to use the software from way back then? Actually that's unfair. Software 21 years back was way better than that. :smileyhappy:

 

30us is 4,500 CPU instructions. 1ms is 150,000. Less at your clock speed, but still a waste.

 

The current product I'm working on has three CAN chips and an ADC on the QSPI bus.  All run independently through the QSPI.  Nothing "waits".  Everything writes QSPI "message buffers" to a queue, together with callbacks.  So the code that handles CAN interrupts has to read status registers and then read or write messages.  These are all done with a "chain of functions" running under the QSPI interrupts and the callbacks in the "message buffers".  Simple, expandable, and the ADC code doesn't even know the CAN code is there, and vice versa.
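 

In outline it's nothing more than a linked list of transfer descriptors, each carrying a completion callback (all names invented for illustration):

    #include <stdint.h>

    struct qspi_msg {
        const uint8_t   *tx;                         /* bytes to clock out          */
        uint8_t         *rx;                         /* buffer for bytes clocked in */
        uint16_t         len;
        void            (*done)(struct qspi_msg *);  /* completion callback         */
        struct qspi_msg *next;
    };

    static struct qspi_msg *q_head, *q_tail;

    extern void qspi_hw_start(struct qspi_msg *m);   /* loads the QSPI RAM, kicks off */

    void qspi_submit(struct qspi_msg *m)    /* called by the CAN code, ADC code, ... */
    {
        m->next = 0;
        /* interrupt masking around the list update omitted for brevity */
        if (q_tail) q_tail->next = m; else { q_head = m; qspi_hw_start(m); }
        q_tail = m;
    }

    void qspi_isr(void)                     /* QSPI "transfer complete" interrupt */
    {
        struct qspi_msg *m = q_head;
        q_head = m->next;
        if (q_head) qspi_hw_start(q_head); else q_tail = 0;
        m->done(m);   /* e.g. a CAN state machine posts its next transfer here */
    }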

 

> I'm aware that the DMA can be used to transfer data to the UART TXD.

 

But interrupts can do this a lot better than polling. 21-year-old CPUs took a while to service interrupts, but this one should be able to service a UART interrupt in a microsecond or two, or even less.

 

> The correct answer for external memory access is found from figure 7-11

 

Figure 7.1 of what manual? There's no "7.1" in the Data Sheet (MCF5235EC.pdf) or Reference Manual (MCF5235RM.pdf). Do you mean something like "Figure 16-3. Secondary Wait State Reads" in the Reference Manual that shows burst reads (4 32-bit reads in 10 clocks total)?

 

> For back-to-back external memory access, the processor takes 3 Fsys/2 cycles, or 6 Fsys cycles.

 

I think you'll find that only applies to BURST accesses. These are "special things" in the ColdFire chips and can only be generated by DMA cycles, MOVEM.L (Move Multiple Registers) assembly instructions, and cache line fills or writes. Normal writes to memory are not only non-burstable, but they also attract 2-clock "pipeline stalls" that MOVEM.L doesn't. See "3.8.1 Timing Assumptions" in MCF5235RM.pdf.

 

Please prove me wrong. Please measure your external bus read and write speed using something simple that doesn't involve writing to the FLASH. Cached reads should be a lot faster due to the bursting (assuming you read blocks far longer than the cache size), but I'd expect normal C-code reading (like your checking for all-ones erased FLASH) to run fairly slowly.

 

> Come on, you can't be serious!

 

You've collected some good measurements already. Please prove me wrong with a few simple extra measurements.

 

> Extrapolating to 150 MHz,

 

Please measure a tight loop running in SRAM at a low speed and at a high speed (150MHz if you can), and then measure at both speeds with the CPU stopped. I'd expect the stopped power to be pretty similar, and the 100% power to scale with clock speed.

 

Tom

 

 

 

TomE
Specialist II

I just typed in a long and detailed analysis of a set of current measurements I've taken on my MCF5329 board, but I clicked "submit" without taking a precautionary copy, and the Freescale Forum site lost the lot.

 

In summary (and from memory) my board takes 86mA at 12V at 16MHz when stopped and 15mA more when running at 16MHz.

 

It takes 30mA more than that when stopped at 240MHz and 15 * 15 = 225mA more when running.

 

So the CPU takes 15mA@12V per 16MHz running, but takes an extra 30mA of "overhead" for the chip (and PLLs and clock distribution) to support 240MHz.


To execute the same number of instructions at 240MHz that it can at 16MHz takes 45mA (30 overhead, 15 due to instructions) versus 15mA at 16MHz. But the advantage is a peak throughput 15 times the 16MHz rate if and when needed.

 

So it isn't the same "power per instruction" like I thought, but it isn't 15 times either. It is between our polarised opinions.

 

Tom

 

Dogbert256
Contributor III

TomE,

 

> Are you using the FEC to get to the WiFi hardware?

 

The quick answer is no.  The reason is that the WiFi module itself provides that functionality.  The WiFi module has its own embedded processor that operates at a very high frequency (it must).  It provides WiFi access plus Ethernet access.  The interface between my CPU and the WiFi is to be CMOS logic UART data.  Thus the 50 MHz minimum Fsys does not apply.

 

I will agree as follows: when dealing with typical benchmark code, the following is THEORY for CMOS processors:

 

mean{ CPU power } proportional to mean{ instructions per second } proportional to mean{ CPU frequency }

 

When I was in CMOS classes, this was repeated over and over, and when I worked at Intel everybody would repeat the same.

 

In the real world, things are apparently NOT always the same.

 

// But you wanted to use the software from way back then? Actually that's unfair. Software 21 years back way way better

// than that. :smileyhappy:

 

No, I'm not using 21-year-old software.  Processor Expert does not exist in CodeWarrior for the MCF523x processor, so I've had to do everything from scratch.  Even if Processor Expert existed, I would have rewritten the code.  Many of my queries to this forum have been because I have no code examples to work with.  But I think you meant that in jest.

 

>> The correct answer for external memory access is found from figure 7-11

 

> Figure 7.1 of what manual? There's no "7.1" in the Data Sheet (MCF5235EC.pdf) or Reference Manual

> (MCF5235RM.pdf).

> Do you mean something like "Figure 16-3. Secondary Wait State Reads" in the Reference Manual that shows burst

> reads (4 32-bit reads in 10 clocks total)?

 

My mistake.  I meant Figure 17-11, on page 17-11 of the MCF523x_Reference_Manual.pdf.  The timing diagram for back-to-back bus cycles shows that the external memory access takes place in 6 Fsys cycles (Fsys/2 is shown directly above the diagram).  From the manual: "For example, when a longword read is started on a word-size bus, the processor performs two back-to-back word read accesses." (page 17-10).  My memory is x16, so this would apply.

 

From Figure 17-13 on page 17-13, a burst read cycle with internal termination can actually be completed in 2 Fsys cycles!  I'm not so sure I believe this...  I didn't write the manual.

 

Again, the bus can't maintain this rate, but those are the values.  (The Figure 16-3 values include a wait state, equal to 2 Fsys clocks, for a total of 4 Fsys clocks.)

 

Rather than read any more documentation, I scoped my firmware Flash memory on an instruction fetch, and the time from the rising edge of CS0 to the next rising edge of CS0 is 12 Fsys cycles.  Well, I would like to have some empirical data to back up Figure 17-11, but I just don't have it.  Underneath Figure 17-11, it says "the initiation of a back-to-back cycle is not user definable", so I'm probably out of luck spending more time on that.

 

I would agree with you in general:  Wait loops within a cycle of code in general are not desirable.  When you've executed the current "embedded cycle", and are waiting for the next cycle to arrive, use the "waiting" part of the CPU at that junction.  

 

My application is extremely I/O intensive, but is not fast.  There are from 12 to 30 analog signals that I read.  Accuracy is required, but not speed.  The analog signals go through analog muxes into a two-channel delta-sigma A/D converter.  I can only get 15 A/D channel readings per second from that A/D converter.  This is the main bottleneck I'm dealing with, and it's orders of magnitude beyond the (not so pressing for me) 20 to 30 uS QSPI timings.  My world is about filling in that 66 mS of wait time with useful code before the A/D converter signals back to me that another conversion is ready to be shifted out, and I process the analog data.

 

I'm looking into a better solution for UART TXD.  Maybe an interrupt scheme with a buffer; maybe there is a way to get at those FlexCAN buffers and use them.

 

Yet I'm still able to process a cycle of code with plenty of time to spare.  A cycle for me is from 5 seconds to 30 seconds.  I have to read some of the analog channels many times in a cycle, but it gets done.  I did not select this processor because of speed, mostly because of functionality, and those eTPU channels.

 

There are some problems with running at 150 MHz, and I'll just list a few of them.  I will admit these are application specific:

 

1.

I have an eTPU unit, four DMA channels, and a CPU core that all run as completely parallel processes.  There are many other peripherals that are "mostly" parallel, such as the UARTs, etc.  When my processor is issued the STOP instruction, all clocks stop.  Everything comes to a dead halt, but that is not an option I have available.  So only one of the low power modes that halts the core and keeps the peripherals running would work.  The core uses up 135/235 = 57.5% of the overall power of the processor (if the linear regulator that supplies the 1.5 V for the core from the 3.3 volt supply is added into the equation).  The other 42.5% of the overall CPU power is used in the peripheral bus, and, being CMOS logic, it in THEORY should draw power proportional to the frequency it is clocked at (Fsys/2).  I don't mean to be too precise, as the actual results can vary widely in the real world.

   

2.

My application is cost sensitive.  I have no memory bus buffers like the demo board has.  The documentation says that the MCF523x does not have a lot of drive capability on the memory bus, compared to the 68332 processor.  While the SRAM is directly underneath the CPU and the Flash chips are directly above, with a short bus, the capacitance of the pads is the dominant factor.  With full I/O drive strength, the rise time on my memory bus is 8 ns (the scope probe capacitance probably causes some of this).  This becomes an issue if the memory needs to be accessed really fast.  Mine does not.

 

3.

I'm currently in the solar thermal world, tons of I/O, but overall not fast.  I just don't need the speed.  If my processor is capable of 64 MHz, and it looks like it is (maybe with a wait state or two running from external SRAM, but assuming the wait states don't bog down the more typical 12 Fsys cycles, something I'm going to have to research), that is enough to claim "capable of anything reasonable" status in this world I'm currently in.

 

This is all I have time for.

 

The whole issue of "speed" is reletive to the the application at hand.  Not a universal rule. 

 

Dogbert256

TomE
Specialist II

> The interface between my CPU and the WiFi is to be CMOS logic UART data.

I've had to do PPP over modems enough times. It isn't fun. I hope you've got a simpler interface.

> when I worked at Intel everybody would repeat the same.
> In the real world, things are apparently NOT always the same.

Intel has some pretty neat power saving features in their later CPUs.

> I meant Figure 17-11, on page 17-11 of the MCF523x_Reference_Manual.pdf.
> The timing diagram for back-to-back bus cycles shows that the external
> memory access takes place in 6 Fsys cycles (Fsys/2 is shown directly above the diagram).

Six "CLKOUT" cycles. CLKOUT (Figure 7-2. Clock Module Block Diagram) is definitely Fsys.

> From the manual "For example, when a longword read is started on a word-size
> bus, the processor performs two back-to-back word read accesses."
> (Page 17-10).  My memory is x16, this would apply.

So your accesses to external RAM take 12 times longer for 32-bits, but only 6 times longer for 8 or 16 bit accesses. Except there's probably no reason for you to force "back to back mode". You should allow "burst mode" which would do 32-bit reads in about 8 or 10 clocks of 2 16-bit bursts.

> From Figure 17-13 on page 17-13, a burst read cycle with internal termination
> can actually be completed in 2 Fsys cycles!  I'm not so sure I believe this...
> I didn't write the manual.

That's the standard "page mode" or "memory burst mode" I mentioned. DMA can usually read the memory that fast. So can what they call "line transfers". I'm pretty sure that means if you program the chip select decoders to NOT be "burst inhibited" and you have the cache enabled on that memory space, then cache loads and flushes can run that fast.

You don't have speed problems (yet :smileyhappy:).  You don't need to optimise this.  Getting the instruction cache "burst loaded" from your instruction FLASH would be simple and worthwhile though.  I've almost never programmed anything where speed wasn't an issue.
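
If the usual Freescale header names are available to you, enabling that is a one-liner per chip select; treat the macro names as assumptions to check against your header file:

    /* Allow burst reads on CS0 so cache line fills from the boot FLASH run
       as 4-beat bursts instead of individual back-to-back cycles. */
    MCF_CS_CSCR0 |= MCF_CS_CSCR_BSTR;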

> Underneath Figure 17-11, it says "the initiation of
> a back-to-back cycle is not user definable"

That means you can't turn it off. If you read 32 bits from a 16-bit bus (as you're doing) it will do a back-to-back cycle and you can't stop it. You should program for "burst cycles" instead though. It is very simple and would load a cache line in about 20 clocks instead of 48 using "back to back".

> I would agree with you in general:  Wait loops within a cycle
> of code in general are not desirable.  When you've executed
> the current "embedded cycle", and are waiting for the next
> cycle to arrive, use the "waiting" part of the CPU at that junction.  

That can be difficult to do "by hand". It is a lot easier if you can get the "delayed cycles" executed in a different thread (if you have a threaded OS on the hardware) or get the same effect by using linked interrupt functions, or one interrupt function with an internal state machine.

> My application is extremely I/O intensive, but is not fast.
> There are from 12 to 30 analog signals that I read...
> I can only get 15 A/D channel readings per second from that A/D converter.

And here's me complaining about the ADC on our unit taking 6us per conversion, and that was severely affecting other devices on the SPI bus. We had two sets of 11 conversions stacked up in two queued QSPI transactions, and that delayed any other transfers by 180us. They're now done 2 channels at a time.

> I'm looking into a better solution for UART TXD.  Maybe
> an interrupt scheme with a buffer,

That is simple coding. Find an example somewhere and copy it. Google finds (but I haven't checked):

http://www.ganssle.com/articles/auarts.htm
http://ww1.microchip.com/downloads/en/DeviceDoc/uartintc.readme.pdf
http://www.atmel.com/atmel/acrobat/doc1451.pdf
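
The usual shape is a ring buffer drained by the TX-ready interrupt; a sketch, where uart_tx_irq_enable()/uart_tx_irq_disable() and uart_write_txb() wrap whatever UIMR/UTB access your headers give you (they are not real API names):

    #define TXQ 256                          /* power of two for cheap wrap-around */
    static volatile unsigned char txq[TXQ];
    static volatile unsigned int head, tail;

    extern void uart_tx_irq_enable(void);    /* unmask TXRDY in UIMR (shadowed) */
    extern void uart_tx_irq_disable(void);
    extern void uart_write_txb(unsigned char c);  /* write one byte to UTB      */

    void uart_putc(unsigned char c)          /* main-loop side */
    {
        unsigned int next = (head + 1) & (TXQ - 1);
        while (next == tail)                 /* queue full: block (or drop) */
            ;
        txq[head] = c;
        head = next;
        uart_tx_irq_enable();                /* make sure the drain is running */
    }

    void uart_txrdy_isr(void)                /* TX-ready interrupt handler */
    {
        if (tail == head)
            uart_tx_irq_disable();           /* nothing left: stop interrupting */
        else {
            uart_write_txb(txq[tail]);
            tail = (tail + 1) & (TXQ - 1);
        }
    }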

> maybe there is a way to get at those FlexCAN buffers and use them.

CAN is weird and strange and is nothing like a UART. Use interrupts.
 
> I did not select this processor because of speed, mostly because
> of functionality, and those eTPU channels.

Does the eTPU give you enough resolution at 8MHz? I thought you might be generating overlapped PWM waveforms for power converters, but at 8MHz an 8-bit PWM has a maximum frequency of 32kHz.
 
> There are some problems with running at 150 MHz,
>
> When my processor is issued the STOP instruction, all clocks stop.
> Everything comes to a dead halt, but that is not an option I
> have available.  So only one of the low power modes that halts
> the core and keeps the peripherals running would work.

There are four options for what happens on a "STOP". Only one of them stops all the clocks. The one that just stops the CPU works in all cases and saves power, and the one that stops the memory clocks too works as long as you don't have SDRAM or DMA.

> The core uses up 135/235 = 57.5% of the overall power
> of the processor

My testing on the MCF5329 showed that the clock distribution used a lot less than the CPU at full speed. I'd expect yours would do the same, except it is possible the TPU is a "high power peripheral" and that changes the balance if it is enabled and running, as it is in your case.

> My application is cost sensitive.  I have no memory bus
> buffers like the demo board has.  The documentation says
> that the MCF523x does not have a lot of drive capability
> on the memory bus, compared to the 68332 processor.

The demo boards are very unusual. They have the buffers for other reasons: to allow expandability, to put boards on busses, and maybe to drive 5V memory busses. The CPUs are usually capable of driving memory SIMMs without buffers in "high power mode".

The MCF5235 should default the busses to "low power mode". Check this. We migrated from the MCF5235 to the MCF5329, and the latter defaults to "high power mode". I didn't notice this, and the overshoots and undershoots were so severe on our bus (CPU, SDRAM and FLASH chips) that the memory chips were returning corrupted data. I had to reprogram the busses to "low power mode" (running at 80MHz) to get it reliable. This also reduces EMI. "Low power mode" is more than enough for your design, even if you did run it at full speed.

> I'm currently in the solar thermal world,

More environmentally responsible than me then. My product is in cars that have up to 600 kW of power (800 HP). No need to save a few milliamps in that product...

> The whole issue of "speed" is reletive to the the application
> at hand.  Not a universal rule.

True. But I've never worked on anything where it wasn't a problem. Sometimes it was other programmers writing really stupid code though (how about a 200 word alphabetic sort taking 30 minutes on a 68000!).

Tom


Dogbert256
Contributor III

Hey TomE,

 

All the stuff I do with my eTPU would be considered low bandwidth: driving a large ECM (electronically commutated motor), PWM with huge periods (~1 second), generating a 50% duty cycle 32 kHz oscillator for the A/D converter, etc.

 

My whole gripe about power was not really so much about saving electricity.  It is driven mostly by the need to power my electronics from AC/DC power supplies.  Besides the power figures I gave you, which were only for the Freescale CPU and the 3.3 volt sub-system, there is the added power of the analog section, the WiFi module, and the power for all the sensors.  All of this power starts adding up.  I'm in the industrial world, and I have to guarantee operation from -40 to +70 (or +85) deg C.  It is difficult to find inexpensive but high-quality AC/DC power supplies.  Aimtec has some really good ones in this category.

 

Bye 4 Now,

Dogbert256

 
