L1 D-Cache Flushing

I recently worked on a problem involving flushing the e500 data cache and studied the topic in some detail. Below is a summary of what I found, posted here for anyone who may be interested.

1. L1 D-Cache Flushing

The e500 reference manual describes data cache flushing as follows:

Any modified entries in the data cache can be copied back to memory (flushed) by using a dcbf instruction or by executing a series of 12 uniquely addressed load or dcbz instructions to each of the 128 sets. The address space should not be shared with any other process to prevent snoop hit invalidations during the flushing routine. Exceptions should be disabled during this time so that the PLRU algorithm is not disturbed.

The following methods can be used to flush a region in the L1 cache:

• Perform reads to any 48-Kbyte region, then execute dcbf instructions to that region. Note that a 48-Kbyte region must be used to ensure that the PLRU algorithm flushes all of the cache entries (12 x 128 sets x 32 bits = 48 Kbytes).

• Perform reads from any 48-Kbyte region that is guaranteed to not be modified in the L1 cache (for example, a ROM region).

• Execute dcbz instructions to any 48-Kbyte scratch section, then invalidate the cache. Note that it is necessary to use a scratch region because some zeroed lines will be cast out.

…

On the e500v2 the HID0 register contains a field, DCFA (data cache flush assist), that, when set, forces the data cache to ignore invalid sets on miss replacement selection and follow the replacement sequence defined by the PLRU bits. This reduces the series of uniquely addressed load or dcbz instructions to eight per set. The bit should be set just before beginning a cache flush routine and should be cleared when the series of instructions is complete.

2. The Questions

The data cache is only 32 Kbytes, so why must the flush cover a 48-Kbyte region? The manual gives the equation “12 x 128 sets x 32 bytes = 48 Kbytes” (I assume 32 bytes is meant where the manual prints 32 bits) and says that a series of 12 uniquely addressed loads is needed for each of the 128 sets. It does not explain why the number is 12, or why the addresses must be unique.

The manual also mentions the HID0 register field DCFA and says that setting it reduces the series of uniquely addressed loads to eight per set. Why? Here the manual adds that the field “forces the data cache to ignore invalid sets on miss replacement selection and follow the replacement sequence defined by the PLRU bits”. How should this be understood?

3. Data Cache Basics

To answer these questions, we need to look at the miss replacement selection algorithm that the manual refers to. Before that, we should be clear about how the data cache is organized and what the “invalid sets” status means.

The e500 reference manual uses the figure below to describe the L1 D-Cache organization.

The figure shows 128 sets, 8 ways per set, and 32 bytes per way, which gives 128 x 8 x 32 = 32 Kbytes. A block, also called a line (each way of a set holds one), contains 8 words, or 32 bytes.

Where in the D-Cache is a given piece of memory data placed? In other words, how are memory addresses mapped to the cache?

Each block is loaded from an 8-word boundary, i.e. the block's starting address has physical address bits PA[27:31] equal to zero. A byte within a block is located by PA[27:31].

The set is selected by physical address bits PA[20:26], giving 2⁷ = 128 sets. Each value of PA[20:26] maps to exactly one set, so this is a one-to-one mapping.

The tag consists of physical address bits PA[0:19], so there are 2²⁰ possible tags but only 8 ways in each set; this is a many-to-one mapping. A replacement algorithm is therefore needed, and the e500 uses the PLRU (pseudo-least-recently-used) algorithm.
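The mapping above can be checked with a short sketch. This is only an illustration, assuming a 32-bit physical address with PowerPC bit numbering (PA[0] is the most significant bit); the names `decode` and `lines_per_set` and the base address `0x00100000` are hypothetical and not taken from the manual.

```python
LINE_BYTES = 32   # 8 words per block, byte offset in PA[27:31]
NUM_SETS = 128    # set index in PA[20:26]

def decode(pa):
    """Split a 32-bit physical address into (tag, set index, byte offset)."""
    offset = pa & 0x1F              # PA[27:31], the low 5 bits
    set_index = (pa >> 5) & 0x7F    # PA[20:26], the next 7 bits
    tag = (pa >> 12) & 0xFFFFF      # PA[0:19], the top 20 bits
    return tag, set_index, offset

def lines_per_set(base, size):
    """Count the distinct tags that the region [base, base + size) puts in each set."""
    tags = [set() for _ in range(NUM_SETS)]
    for pa in range(base, base + size, LINE_BYTES):
        tag, set_index, _ = decode(pa)
        tags[set_index].add(tag)
    return [len(t) for t in tags]

if __name__ == "__main__":
    tag, set_index, offset = decode(0x12345678)
    print(hex(tag), set_index, offset)
    # A contiguous 48-Kbyte region supplies 12 unique lines to every set,
    # a 32-Kbyte region only 8.
    print(set(lines_per_set(0x00100000, 48 * 1024)))
    print(set(lines_per_set(0x00100000, 32 * 1024)))
```

Walking a contiguous region in 32-byte steps cycles through all 128 sets, which is why the manual can express the number of loads per set as a region size.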

4. Miss Replacement

The reference manual says: “This algorithm prioritizes the replacement of invalid entries over valid ones (starting with way 0). Otherwise, if all ways are valid, one is selected for replacement according to the PLRU bit encodings shown in Table 11-8.”

This is where the difference arises. Note first that each load must use a unique address so that it misses and forces a replacement; a repeated address would simply hit in the cache. Let's analyze the easy case first. If the HID0 register field DCFA is set, all ways are treated the same: whether a way is valid or invalid is ignored, and only the PLRU bits take effect. Assuming the PLRU bits are all zeros, the table below shows the order in which the ways are selected:

B0	B1	B2	B3	B4	B5	B6	Ways selected
0	0	0	0	0	0	0	L0
1	1	0	1	0	0	0	L4
0	1	1	1	0	1	0	L2
1	0	1	1	1	1	0	L6
0	0	0	1	1	1	1	L1
1	1	0	0	1	1	1	L5
0	1	1	0	1	0	1	L3
1	0	1	0	0	0	1	L7

After 8 replacements, all 8 ways have been selected. Starting from any other PLRU value it still takes 8 replacements, which is why setting DCFA reduces the 12 uniquely addressed loads to 8.
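The claim that any starting PLRU value needs exactly 8 replacements can be checked by brute force. The sketch below is a rough model, not the manual's definition: the selection and update rules are inferred from the tables in this section and should be checked against Table 11-8, and the names `PLRU_UPDATE`, `plru_victim`, `touch`, and `replacement_order` are hypothetical.

```python
from itertools import product

# Bits written when a way is filled: way -> {PLRU bit index: new value}.
# Assumption: inferred from the tables in this article.
PLRU_UPDATE = {
    0: {0: 1, 1: 1, 3: 1},
    1: {0: 1, 1: 1, 3: 0},
    2: {0: 1, 1: 0, 4: 1},
    3: {0: 1, 1: 0, 4: 0},
    4: {0: 0, 2: 1, 5: 1},
    5: {0: 0, 2: 1, 5: 0},
    6: {0: 0, 2: 0, 6: 1},
    7: {0: 0, 2: 0, 6: 0},
}

def plru_victim(bits):
    """Return the way the PLRU bits (B0..B6) point at for replacement."""
    if bits[0] == 0:
        if bits[1] == 0:
            return 0 if bits[3] == 0 else 1
        return 2 if bits[4] == 0 else 3
    if bits[2] == 0:
        return 4 if bits[5] == 0 else 5
    return 6 if bits[6] == 0 else 7

def touch(bits, way):
    """Return the PLRU bits after the given way has been filled."""
    new_bits = list(bits)
    for index, value in PLRU_UPDATE[way].items():
        new_bits[index] = value
    return tuple(new_bits)

def replacement_order(bits, count=8):
    """Ways replaced by `count` consecutive misses with DCFA set."""
    order = []
    for _ in range(count):
        way = plru_victim(bits)
        order.append(way)
        bits = touch(bits, way)
    return order

if __name__ == "__main__":
    print(replacement_order((0,) * 7))
    # Every one of the 128 starting states should cover all 8 ways.
    print(all(sorted(replacement_order(b)) == list(range(8))
              for b in product((0, 1), repeat=7)))
```

Under these assumed rules the all-zeros start reproduces the order in the table (L0, L4, L2, L6, L1, L5, L3, L7).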

Then why is it 12 when the DCFA field is not set? Assume ways 0, 1, 2, and 3 are invalid, ways 4, 5, 6, and 7 are valid, and the PLRU bits are all zeros. The invalid ways are filled first, starting with way 0, and the table becomes:

B0	B1	B2	B3	B4	B5	B6	Ways selected
0	0	0	0	0	0	0	L0
1	1	0	1	0	0	0	L1
1	1	0	0	0	0	0	L2
1	0	0	0	1	0	0	L3
1	0	0	0	0	0	0	L4
0	0	1	0	0	1	0	L0
1	1	1	1	0	1	0	L6
0	1	0	1	0	1	1	L2
1	0	0	1	1	1	1	L5
0	0	1	1	1	0	1	L1
1	1	1	0	1	0	1	L7

After 11 replacements, all 8 ways have been selected. Now assume ways 0, 1, 2, 3, 6, and 7 are invalid, ways 4 and 5 are valid, and PLRU bit B5 is zero. The other bits are overwritten before they are used, so they are shown as "-". The table becomes:

B0	B1	B2	B3	B4	B5	B6	Ways selected
-	-	-	-	-	0	-	L0
1	1	-	1	-	0	-	L1
1	1	-	0	-	0	-	L2
1	0	-	0	1	0	-	L3
1	0	-	0	0	0	-	L6
0	0	0	0	0	0	1	L7
0	0	0	0	0	0	0	L0
1	1	0	1	0	0	0	L4
0	1	1	1	0	1	0	L2
1	0	1	1	1	1	0	L6
0	0	0	1	1	1	1	L1
1	1	0	0	1	1	1	L5

After 12 replacements, all 8 ways have been selected. 12 is the maximum: in every other case, all 8 ways are selected within 12 replacements.
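The maximum of 12 can also be checked by enumerating every combination of valid ways and PLRU bits. As before, this is a sketch under assumptions: the PLRU rules are inferred from the tables in this article rather than copied from Table 11-8, invalid ways are assumed to be filled lowest-numbered first and to update the PLRU bits like any other fill, and all function names are hypothetical.

```python
from itertools import product

# Assumed PLRU update rules: way -> {PLRU bit index: new value}.
PLRU_UPDATE = {
    0: {0: 1, 1: 1, 3: 1}, 1: {0: 1, 1: 1, 3: 0},
    2: {0: 1, 1: 0, 4: 1}, 3: {0: 1, 1: 0, 4: 0},
    4: {0: 0, 2: 1, 5: 1}, 5: {0: 0, 2: 1, 5: 0},
    6: {0: 0, 2: 0, 6: 1}, 7: {0: 0, 2: 0, 6: 0},
}

def plru_victim(bits):
    """Return the way the PLRU bits (B0..B6) point at for replacement."""
    if bits[0] == 0:
        if bits[1] == 0:
            return 0 if bits[3] == 0 else 1
        return 2 if bits[4] == 0 else 3
    if bits[2] == 0:
        return 4 if bits[5] == 0 else 5
    return 6 if bits[6] == 0 else 7

def touch(bits, way):
    """Return the PLRU bits after the given way has been filled."""
    new_bits = list(bits)
    for index, value in PLRU_UPDATE[way].items():
        new_bits[index] = value
    return tuple(new_bits)

def replacements_needed(bits, valid_ways):
    """Misses needed (DCFA clear) until every way has been selected once."""
    valid = set(valid_ways)
    selected = set()
    count = 0
    while len(selected) < 8:
        invalid = [w for w in range(8) if w not in valid]
        # Invalid ways take priority, lowest way first; otherwise use PLRU.
        way = invalid[0] if invalid else plru_victim(bits)
        valid.add(way)
        selected.add(way)
        bits = touch(bits, way)
        count += 1
    return count

def worst_case():
    """Largest count over all 128 PLRU states and all 256 valid masks."""
    worst = 0
    for bits in product((0, 1), repeat=7):
        for mask in range(256):
            valid = {w for w in range(8) if (mask >> w) & 1}
            worst = max(worst, replacements_needed(bits, valid))
    return worst

if __name__ == "__main__":
    print(replacements_needed((0,) * 7, {4, 5, 6, 7}))
    print(replacements_needed((0,) * 7, {4, 5}))
    print(worst_case())
```

The first two calls correspond to the two tables above (ways 4 to 7 valid, then only ways 4 and 5 valid), and `worst_case` searches the remaining combinations.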

NCSW contains cache operation code examples for the e500, including L1 D-Cache flushing; they are attached. The whole article is also attached as a separate document.