T4240 calculation speed

andrewesterholz · ‎10-08-2015

Hello,

I've compared the calculation speed of a T4240DSQ and a P4080DS. I found that the T4 is faster at the calculation itself (especially double multiplications), but reading and writing values takes more time than on the P4080:

Here is the test code:

volatile double dA;
volatile double dB;
volatile double dC;
volatile double dD;
volatile double dE;
volatile double dF;
volatile double dG;
volatile double dH;
volatile double dI;
volatile double dJ;

void MemWrite10Int(int NoOfRuns, int CountVal)
{
    while(NoOfRuns != 0)
    {
        iA = 0;
        iB = 1;
        iC = 2;
        iD = 3;
        iE = 4;
        iF = 5;
        iG = 6;
        iH = 7;
        iI = 8;
        iJ = 9;
        NoOfRuns -= CountVal;
    }
}

NoOfRuns is 5e6

CountVal is 1

The execution time of these functions is ~31 ms on the P4080 and ~71 ms on the T4240, so the T4 is much slower.
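To make the measurement reproducible, here is a minimal sketch of how the loop could be timed. It assumes the ten variables are volatile ints named iA to iJ (the declarations are not shown in the post) and that a POSIX clock_gettime() is available on the target. TimeMemWrite10IntMs is a hypothetical helper, not part of the original test.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Assumption: the stored-to variables are volatile ints (not shown in the post). */
volatile int iA, iB, iC, iD, iE, iF, iG, iH, iI, iJ;

/* Same loop as in the post: ten stores per iteration. */
void MemWrite10Int(int NoOfRuns, int CountVal)
{
    while (NoOfRuns != 0)
    {
        iA = 0; iB = 1; iC = 2; iD = 3; iE = 4;
        iF = 5; iG = 6; iH = 7; iI = 8; iJ = 9;
        NoOfRuns -= CountVal;
    }
}

/* Hypothetical helper: elapsed wall-clock time of one call, in milliseconds. */
double TimeMemWrite10IntMs(int NoOfRuns, int CountVal)
{
    struct timespec t0, t1;

    /* The loop only terminates if NoOfRuns counts down to exactly zero. */
    assert(CountVal > 0 && NoOfRuns >= 0 && NoOfRuns % CountVal == 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    MemWrite10Int(NoOfRuns, CountVal);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Calling TimeMemWrite10IntMs(5000000, 1) corresponds to the parameters given above. Running it several times and taking the minimum would reduce the influence of interrupts and task switches.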

The switches on the T4240DSQ board are configured for SYS_CLK and DDR_CLK at 66.66 MHz.

The RCW is configured as in the Quick start guide:

Reset Configuration Word (RCW):

00000000: 14180019 0c101914 00000000 00000000
00000010: 04383063 30548c00 1c020000 1d000000
00000020: 00000000 ee0000ee 00000000 000307fc
00000030: 00000000 00000000 00000000 00000020

We are currently using a T4240 Revision 1. The operating system is VxWorks.

The only discrepancy I have found is that the default RCW configuration sets MEM_PLL_RAT to 0x18 (24d). Both the Reference Manual and the Technical Data Sheet describe this value as reserved.
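For reference, the field can be read out of the first RCW word with a few lines of C. This is a sketch under an assumption: it takes MEM_PLL_RAT to occupy RCW bits 10 to 15 in Power Architecture bit numbering (bit 0 is the most significant bit), which should be checked against the Reference Manual. With the first word 0x14180019 from the dump above, that position yields the 0x18 mentioned here. RcwField and MemPllRat are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Extract bits first_bit..last_bit from a 32-bit word, where bit 0 is the
 * most significant bit (Power Architecture numbering). */
uint32_t RcwField(uint32_t word, unsigned first_bit, unsigned last_bit)
{
    unsigned width;
    uint32_t mask;

    assert(first_bit <= last_bit && last_bit <= 31);
    width = last_bit - first_bit + 1;
    mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);

    return (word >> (31 - last_bit)) & mask;
}

/* Assumption: MEM_PLL_RAT is RCW[10:15]; verify against the Reference Manual. */
uint32_t MemPllRat(uint32_t rcw_word0)
{
    return RcwField(rcw_word0, 10, 15);
}
```

MemPllRat(0x14180019u) returns 0x18 (24 decimal) under that assumption.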

Could this be the reason for the slow writes? Or is there another reason why the T4 is significantly slower at writing values?

LPP · ‎10-14-2015

This test code performs store accesses to the L1 cache. The e6500 (T4240) and e500mc (P4080) cores have the same store access latency of 3:1, so a similar sequence of store instructions should show the same performance.

The performance difference you observed might be caused by the compiler generating different code for the two processors. Please check the assembler listings to verify this.

Have a great day,
Pavel


scottwood · ‎10-15-2015

3:1 is the fastest that load/store instructions can execute. It's not guaranteed and does not mean that the store queue can keep up with a constant stream of stores. My guess is that the difference comes from the cache architecture. On p4080 rev3, L1 can hold dirty cache lines, and L2 is part of the core. On t4240, L1 is read-only and stores must go to L2, and L2 is in the cluster which is shared by multiple cores.

Also, the test above is incomplete -- it declares a variety of double variables that are never used, and it references a bunch of "iA", "iB", etc that are not declared. If they were declared as "long" rather than "int", and the program were built as 64-bit on t4240, then you'd be writing out twice as much data on t4240 as p4080.
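A small sketch of the data-size point, assuming ten stores per loop iteration as in the posted function. The helper names are hypothetical. On a typical 32-bit build sizeof(long) equals sizeof(int), while on an LP64 64-bit build it is twice as large, which is the case described above.

```c
#include <assert.h>
#include <stddef.h>

/* The posted loop performs ten stores per iteration. */
#define STORES_PER_ITERATION 10u

/* Bytes written per loop iteration for a given element size. */
size_t BytesStoredPerIteration(size_t element_size)
{
    assert(element_size > 0);
    return STORES_PER_ITERATION * element_size;
}

/* Bytes per iteration if the variables are declared as int. */
size_t IntBytesPerIteration(void)
{
    return BytesStoredPerIteration(sizeof(int));
}

/* Bytes per iteration if the variables are declared as long;
 * this depends on whether the program is built as 32-bit or 64-bit. */
size_t LongBytesPerIteration(void)
{
    return BytesStoredPerIteration(sizeof(long));
}
```

Comparing IntBytesPerIteration() with LongBytesPerIteration() on each build shows whether the two targets are really writing the same amount of data.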

LPP · ‎10-27-2015

I have some thoughts on your reply:

>does not mean that the store queue can keep up with a constant stream of stores

The test code from Andre references the same 10 variables in a loop. The sequence of stores will hit the L1 cache (after the first iteration), and the 3:1 latency is applicable to this case.

>On t4240, L1 is read-only and stores must go to L2

On t4240, L2 cache is a victim cache for data lines. The L2 contains only those cache entries that have been cast out from the L1 data cache.

The code discussed in this thread doesn't need to cast out data from L1 (only 10 variables are involved). So the differences between the L2 architectures of the P4080 and T4240 should not affect the results of this test.

scottwood · ‎10-27-2015

See section 5.4.3 ("Write-through cache") of the e6500 RM: "The L1 data cache is a write-through cache and does not contain modified data."

LPP · ‎10-27-2015

Agreed. My mistake: L2 is a victim cache on the e5500, not the e6500.

andrewesterholz · ‎10-26-2015

You are right. I copied the wrong values into the post (sorry for that).

Based on your answer, I think the test results are not caused by a wrong setting or a defect, so my results are plausible.

Thank you for your answer.

andrewesterholz · ‎10-14-2015

Hello,

Thank you for the answer. For the test I created a VxWorks DKM (Downloadable Kernel Module). The same DKM was used on both systems without recompiling, so the assembler code is identical.

Best regards

Andre
