P3041: Why can a DMA transfer influence L1 misses? Any ideas?


a_
Contributor I

Hello everyone!

I am investigating how one CPU core can degrade the performance of another CPU core on the P3041, and I've got unexpected results.

Test case:

Core 0 task: Read-modify-write in L1 data.

Core 1 task: Copying memory areas using the DMA.

Both tasks run bare metal, without an OS, in an endless loop. The memory regions are separate: Core 0 cannot read or write the memory accessible to Core 1, and Core 1 cannot read or write the memory accessible to Core 0. The coherency protocol is off.

I measure the minimum, maximum, and average task execution time, as well as L1 and L2 misses.

Test results:

With only Core 0 running: the L1 data cache on Core 0 has zero misses.

After releasing Core 1: the L1 data cache on Core 0 shows 5-7 misses and the maximum task execution time grows by about 0.5%.

After disabling Core 1: the L1 data cache on Core 0 has zero misses again.

Does anybody have an idea why a DMA transfer can influence the L1 data cache of a CPU?

P.S. The coherency protocol is off. The cores use different memory regions.

7 Replies
lunminliang
NXP Employee

Can you please share your project so we can reproduce the issue?

a_
Contributor I

Unfortunately, I can share only the basic functions and the test results. All tests are launched via a custom driver for LynxOS-178 that runs on CPU core 0. The driver just initializes the TLB and releases the CPU core using the U-Boot API. The OS doesn't know about the additional CPU cores.


Source code of memory test function:

/* buffer - 0x00080000 virtual, 0x04480000 physical */
/* size = 24576 = 6*4096 */
unsigned long test(unsigned int *buffer, unsigned long size, unsigned long *l1_miss, unsigned long *l2_miss)
{
    register unsigned int *b, *bp;
    register unsigned int v;
    unsigned long long begin, end;
    unsigned long l1_b, l2a_b, l2h_b;
    unsigned long l1_e, l2a_e, l2h_e;
    register unsigned int r = 0x6d5;
    register int i;

    /* RMW loop, first pass */
    for (b = buffer; ((unsigned long)b - (unsigned long)buffer) < (size); b++) {
        v = *b;
        v++;
        *b = v;
    }

    /* Store start timebase, L1 data reloads, L2 data accesses and hits */
    l1_b = perf_get_0();  /* pmr 41 */
    l2h_b = perf_get_2(); /* pmr 113 */
    l2a_b = perf_get_1(); /* pmr 112 */
    begin = timebase_get();

    for (i = 0; i < (PAGE_SIZE / 4); i++) {
        r = (r*4013 + 5) & 0x3ff; /* calculate "random" element on page */

        /* modify "random" element on every page of buffer */
        for (b = buffer; ((unsigned long)b - (unsigned long)buffer) < (size); b += 1024) {
            bp = (b + r);
            v = *bp;
            v++;
            *bp = v;
        }
    }

    /* Store finish timebase, L1 data reloads, L2 data accesses and hits */
    end = timebase_get();
    l1_e = perf_get_0();
    l2h_e = perf_get_2();
    l2a_e = perf_get_1();

    /* Calculate elapsed timebase, L1 data misses and L2 misses */
    *l1_miss = l1_e - l1_b;
    *l2_miss = (l2a_e - l2a_b) - (l2h_e - l2h_b);
    return (end - begin);
}


Source code of the DMA test function (based on Freescale examples):

/* src - 0x00080000 virtual, 0x04c80000 physical */
/* dst - 0x00180000 virtual, 0x04d80000 physical */
/* size = 1 048 576 */
unsigned long dma_copy(struct dma_control *dma, char *src, char *dst, unsigned int size)
{
    unsigned long long begin, end;
    uint32_t xfer_size, status;

    src = get_phys(src);
    dst = get_phys(dst);

    out_dma32(&dma->satr, FSL_DMA_SATR_SREAD_SNOOP);
    out_dma32(&dma->datr, FSL_DMA_DATR_DWRITE_SNOOP);
    out_dma32(&dma->sr, 0xffffffff); /* clear any errors */
    dma_sync();

    xfer_size = size; /* MIN(FSL_DMA_MAX_SIZE, count) */

    out_dma32(&dma->dar, (uint32_t)dst);
    out_dma32(&dma->sar, (uint32_t)src);
    out_dma32(&dma->satr, (in_dma32(&dma->satr) & 0xfffffc00) | (uint32_t)((uint64_t)src >> 32));
    out_dma32(&dma->datr, in_dma32(&dma->datr) | (uint32_t)((uint64_t)dst >> 32));
    out_dma32(&dma->bcr, xfer_size);
    dma_sync();

    out_dma32(&dma->mr, FSL_DMA_MR_DEFAULT | 0x00028000);
    dma_sync();

    begin = timebase_get();
    out_dma32(&dma->mr, 0x08000000 | FSL_DMA_MR_DEFAULT | FSL_DMA_MR_CS | 0x00028000);
    dma_sync();

    do {
        status = in_dma32(&dma->sr);
    } while (status & FSL_DMA_SR_CB);

    out_dma32(&dma->mr, in_dma32(&dma->mr) & ~FSL_DMA_MR_CS);
    end = timebase_get();
    dma_sync();

    if (status != 0) {
        printf("DMA Error: status = %x\n", status);
        return 0;
    }
    return (end - begin);
}


Memory mapped parameters:

Core 0 (memory test): TLB1; RPN: 0x04400; EPN: 0; size: 4M; WIMG = 0000b

Core 1 (DMA test): TLB1; RPN: 0x04c00; EPN: 0; size: 4M; WIMG = 0000b


Typical memory test result:

min. / max. / avg. loop execution time: 46097 / 46097 / 46097 ns

min. / max. / avg. L1 cache misses: 0 / 0 / 0


Memory test result with a parallel DMA test on the other CPU core:

min. / max. / avg. loop execution time: 46444 / 46580 / 46505 ns

min. / max. / avg. L1 cache misses: 0 / 5 / 0

lunminliang
NXP Employee

Also, please check the snoop events as LPP suggests; this could be useful for the analysis.

I am also interested in how you implemented the Core 1 release here.

a_
Contributor I

"The cache miss count of 5-7 does not look high."

It depends on the tasks.

"How did you do the test?"

I developed a driver for this, and now I just type a command in the LynxOS shell. If you want to launch a bare-metal application, you can use the U-Boot shell. For example:

> cpu <core number> release <entry point> - - -

"What's the endless loop?"

Yes, essentially, but I call those functions about 1000 times, write the measurements to the serial port, and then repeat.
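In outline it is roughly the following (a simplified sketch; report_result() is just a placeholder for the real serial-port output and the min/max/avg bookkeeping in my driver):

/* Simplified sketch of the endless loops on both cores. report_result() is a
 * placeholder for the real serial-port reporting and min/max/avg bookkeeping. */
#define ITERATIONS 1000

void core0_task(unsigned int *buffer, unsigned long size)
{
    unsigned long t, l1_miss, l2_miss;
    int i;

    while (1) {
        for (i = 0; i < ITERATIONS; i++) {
            t = test(buffer, size, &l1_miss, &l2_miss);
            report_result(t, l1_miss, l2_miss);  /* hypothetical helper */
        }
    }
}

void core1_task(struct dma_control *dma, char *src, char *dst, unsigned int size)
{
    /* Core 1 just keeps the DMA engine busy */
    while (1)
        dma_copy(dma, src, dst, size);
}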

"Also, please check the snoop events as LPP suggests; this could be useful for the analysis."

Sure, I've posted the results in the comments above.

lunminliang
NXP Employee

Thank you. It seems not easy to reproduce on my side, as you are using an OS driver.

The cache miss count of 5-7 does not look high. We have read your code, and I would like to confirm a few more things:

1. How did you do the test? Boot up Core 0 and release Core 1 manually? Can you please share more info about how you "release Core 1" and "disable Core 1"? Is it OK to share some of the critical code for this?

2. What's the endless loop? Is it something like

while (1) {
    test();     /* core0 */
}

and

while (1) {
    dma_copy(); /* core1 */
}

LPP
NXP Employee

>Coherency protocol is off.

The coherency protocol can't be turned off. If a memory range has the MMU attribute M=0, then the transactions issued by this core to that memory do not require memory coherency. However, other cores, I/O masters, and the DMA may issue transactions which do require coherency. All the cores must snoop such transactions regardless of their individual MMU settings.
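For reference, M is the "memory coherence required" bit in the WIMGE attributes of a TLB entry (set via MAS2 on e500mc-class cores). A minimal sketch using the common Linux/U-Boot style bit definitions; verify them against your own TLB setup code:

/* Common Book-E MAS2 attribute bits (Linux/U-Boot style definitions; check
 * against your own headers). WIMG = 0000b, as in your TLB entries, leaves all
 * of them clear: cacheable, and M=0 for this core's own accesses. */
#define MAS2_W  0x00000010  /* write-through             */
#define MAS2_I  0x00000008  /* caching inhibited         */
#define MAS2_M  0x00000004  /* memory coherence required */
#define MAS2_G  0x00000002  /* guarded                   */
#define MAS2_E  0x00000001  /* little-endian             */

/* M=0 only keeps this core's own accesses from being broadcast as global
 * transactions; it does not stop the core from snooping global transactions
 * issued by other masters such as the DMA engine. */
unsigned int core_private_attrs = 0;       /* WIMG = 0000b */
unsigned int coherent_attrs     = MAS2_M;  /* what a coherent mapping would use */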

In your case, the DMA performs global transactions (FSL_DMA_SATR_SREAD_SNOOP, FSL_DMA_DATR_DWRITE_SNOOP), so these DMA transactions are always snooped by the cores.
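If the DMA buffers are truly private and never cached by any core, the transfers could instead be programmed as non-snooped; a sketch, assuming your DMA header provides the corresponding NO_SNOOP attribute macros (the exact names depend on the header version):

/* Sketch: select non-snooped DMA transactions so the transfers are not
 * broadcast to the cores for snooping. Only safe if no core ever caches the
 * source or destination buffers. The NO_SNOOP macro names are assumed to
 * exist alongside the SNOOP variants used above. */
out_dma32(&dma->satr, FSL_DMA_SATR_SREAD_NO_SNOOP);
out_dma32(&dma->datr, FSL_DMA_DATR_DWRITE_NO_SNOOP);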

>Anybody have idea why DMA transfer can influence on L1 data cache of CPU ?

1.

As global bus transactions are performed on the CoreNet bus, the core's bus snooping logic monitors the addresses that are referenced. These addresses are compared with the cache tags. If there is a snoop hit, the bus snooping logic responds with the appropriate actions.

In addition to the snoop-hit case, the core may stall a CoreNet transaction due to internal conflicts that prevent the appropriate snooping, for example a snoop attempt during the tag allocation period of a load or store operation. So global transactions may have some performance impact even if the snoop address does not hit the L1/L2 caches or the contents of internal buffers.

2.

Your test reports L1 data misses when the DMA is active. This is evidently a consequence of the snooping. You can program the PM counters for the snoop events (E500MCRM Table 9-47) to see whether any snoop hits actually occur.
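For example, following the pattern of the perf_get_* helpers in your test code (a sketch only; perf_select_event() and the EVT_* codes are hypothetical placeholders, and the real event encodings have to be taken from Table 9-47):

/* Sketch: count snoop requests and snoop hits around the test loop.
 * perf_select_event() and the EVT_* codes are hypothetical placeholders in
 * the style of the existing perf_get_* helpers; take the real snoop event
 * encodings from E500MCRM Table 9-47. Note this reprograms two counters, so
 * the L1/L2 numbers reported by test() are not meaningful in this run. */
#define EVT_SNOOP_REQUESTS  0   /* placeholder, see Table 9-47 */
#define EVT_SNOOP_HITS      0   /* placeholder, see Table 9-47 */

void measure_snoops(unsigned int *buffer, unsigned long size)
{
    unsigned long req_b, req_e, hit_b, hit_e;
    unsigned long l1_miss, l2_miss;

    perf_select_event(0, EVT_SNOOP_REQUESTS);  /* program counter 0 (hypothetical helper) */
    perf_select_event(1, EVT_SNOOP_HITS);      /* program counter 1 (hypothetical helper) */

    req_b = perf_get_0();
    hit_b = perf_get_1();

    test(buffer, size, &l1_miss, &l2_miss);    /* the RMW test loop from the post above */

    req_e = perf_get_0();
    hit_e = perf_get_1();

    printf("snoop requests: %lu, snoop hits: %lu\n",
           req_e - req_b, hit_e - hit_b);
}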


Have a great day,
Pavel


a_
Contributor I

Of course! It's the snoop transactions. It's my fault: I copied the DMA example and should have understood what it was doing. But I really couldn't imagine that global transactions would influence the L1 cache.

The snoop hit counter is probably not incremented because the addresses held in the L1 cache and the addresses of the DMA transactions are different. But during the DMA test the snoop request counter grew to about 3000 transactions per test cycle. Perhaps these transactions led to bus locking or something similar. I will check the other CPU counters.

Memory test results with a parallel DMA test (with the FSL_DMA_*_SNOOP attributes) on the other CPU core:

min. / max. / avg. loop execution time: 46429 / 46577 / 46504 ns

min. / max. / avg. L1 cache misses: 0 / 6 / 0

min. / max. / avg. snoop hits: 0 / 0 / 0

min. / max. / avg. snoop requests: 2887 / 3247 / 3122

Memory test results with a parallel DMA test (with the FSL_DMA_*_NO_SNOOP attributes) on the other CPU core:

min. / max. / avg. loop execution time: 46097 / 46097 / 46097 ns

min. / max. / avg. L1 cache misses: 0 / 0 / 0

min. / max. / avg. snoop hits: 0 / 0 / 0

min. / max. / avg. snoop requests: 354 / 368 / 355

Thanks so much for the detailed answer, and I'm glad that someone in Russia is still working on this.
