Internal Memory-Mapped Registers as Local RAM?

ChrisNielsen · 02-11-2014

I'm doing detailed pipeline analysis for a critical section of high-speed DSP code. I'm considering a trick to speed up the code -- place a heavily used (but small) data item in on-board RAM (IMMR) instead of pulling the data from cache/DDR. I have other, larger data items that have no chance of fitting on-board, so the data cache will already be under pressure from the large data. I'm just trying to get a single 16-word data element on-board with guaranteed single-cycle access, to minimize data cache traffic.

Question: Are any wait states imposed on IMMR reads or writes or should they execute as fast as any data in cache?

Question: Ref Manual section 3.2 says:

To guarantee that the results of any sequence of writes to configuration registers are in effect, the final configuration register write should be followed immediately by a read of the same register, and that should be followed by a sync instruction. Then accesses can safely be made to memory regions affected by the configuration register write.

If I have to execute W/R/Sync for every IMMR access then it will be too slow and I will pursue another path. The W/R/Sync seems quite unusual; I've never seen this requirement before for on-chip peripheral registers in any other CPU. Or, am I misinterpreting the language and it ONLY applies to setting up IMMBAR? If the triple-access is indeed the case, can you share another solution to achieve my goal?

Question: I need 16 32-bit RAM locations (the trick is to treat some IMMR regs as RAM). My candidates are:

- PCI Mailbox Registers (128 words)

- Ethernet (eTSEC) MACXXADDRY Registers (32 words)

Do you see any issues (side effects) from using these registers as RAM or should they function just fine?

I'm open to any solution. Is there a better way? The above are just a few examples of what I discovered quickly. The basic trick is to repurpose unused peripheral regs as local RAM.

Have others used this trick before? Success?

Thanks, Chris

scottwood · 02-11-2014

I think locking two cache lines is a better solution.

Nothing you can do will get you "guaranteed single cycle access" -- the best you can do is locking in the L1 cache, which should guarantee an access latency of two or three cycles, depending on which chip you're using (you should always specify such details...). Accessing IMMR should be slower than that since it has to leave the core -- you'd have to benchmark it to see how much slower.
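As a rough illustration of the "two cache lines" figure: 16 32-bit words are 64 bytes, which is exactly two lines if the L1 data cache uses 32-byte lines and the block starts on a line boundary. The sketch below assumes a 32-byte line (check the core manual for your chip). `hot_block_t`, `hot_block`, and `lines_spanned` are made-up names for illustration, and the chip-specific locking step itself is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte L1 data cache lines; verify against the core manual. */
#define CACHE_LINE_BYTES 32u
#define HOT_WORDS        16u

/* Hypothetical type for the small, heavily used 16-word data item. */
typedef struct {
    _Alignas(CACHE_LINE_BYTES) uint32_t w[HOT_WORDS];
} hot_block_t;

/* 16 words * 4 bytes = 64 bytes = two lines under the assumption above. */
static_assert(sizeof(hot_block_t) == 2u * CACHE_LINE_BYTES,
              "hot block should fill exactly two cache lines");

/* The instance you would lock into the L1 data cache. */
hot_block_t hot_block;

/* Number of cache lines touched by nbytes starting at address p. */
unsigned lines_spanned(const void *p, size_t nbytes)
{
    uintptr_t first = (uintptr_t)p / CACHE_LINE_BYTES;
    uintptr_t last  = ((uintptr_t)p + nbytes - 1u) / CACHE_LINE_BYTES;
    return (unsigned)(last - first + 1u);
}
```

Without the forced alignment, a 64-byte block can straddle three lines (for example, `lines_spanned((const void *)(uintptr_t)4, 64)` gives 3), so you would end up locking more of the cache than you need.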

It is not necessary to do a "write, read, sync" sequence for every IMMR write (I can't look up section 3.2 of the refman because I don't know what chip this is, much less the revision of the manual). Normally you need to do appropriate synchronization (could be sync or eieio depending on circumstances) to ensure that any ordering you care about is preserved from the device's point of view. A readback is required only when you need to be sure the write has completed before you take some other action that depends on it and that can't be ordered using sync/eieio (e.g. making sure the interrupt controller sees that an interrupt is masked before you enable MSR[EE]). For IMMR registers that you're just using as storage, you don't care how the device views the ordering and thus none of this matters.
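To make the distinction concrete, here is a minimal sketch of the two access styles. All function names are hypothetical, and the register is passed in as a plain pointer; on real hardware it would be an address inside the IMMR window. On non-PowerPC hosts the barriers fall back to a generic full fence so the sketch still builds, which is an assumption made for testing only.

```c
#include <assert.h>
#include <stdint.h>

/* Full barrier: "sync" on PowerPC, a generic fence elsewhere (host builds). */
static inline void io_sync(void)
{
#if defined(__powerpc__) || defined(__PPC__)
    __asm__ __volatile__("sync" ::: "memory");
#else
    __sync_synchronize();
#endif
}

/* Lighter ordering barrier for I/O accesses: "eieio" on PowerPC. */
static inline void io_eieio(void)
{
#if defined(__powerpc__) || defined(__PPC__)
    __asm__ __volatile__("eieio" ::: "memory");
#else
    __sync_synchronize();
#endif
}

/* Configuration-style write: write, read back, then sync.
 * Only needed when a later action depends on the write having completed. */
static inline void reg_write_confirmed(volatile uint32_t *reg, uint32_t val)
{
    assert(((uintptr_t)reg & 3u) == 0u);  /* 32-bit registers: word aligned */
    *reg = val;
    uint32_t readback = *reg;             /* forces the write to complete */
    (void)readback;
    io_sync();
}

/* Storage-style access: the register is used as scratch RAM, so the
 * device's view of ordering is irrelevant and no barrier is issued. */
static inline void scratch_write(volatile uint32_t *reg, uint32_t val)
{
    *reg = val;
}

static inline uint32_t scratch_read(const volatile uint32_t *reg)
{
    return *reg;
}
```

Only the plain `scratch_write`/`scratch_read` style would apply to the "registers as RAM" trick. Those accesses still leave the core, so they are expected to be slower than a locked L1 line.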

ChrisNielsen · 02-11-2014

Good answers. Sorry about the missing processor detail: it's an MPC8308. Can you share the L1 access latency for this chip? Thanks.

scottwood · 02-11-2014

The load latency is two cycles, as noted in the "Instruction Timing" section of the e300 core reference manual.