Calculate time for 64 bit load from uncached memory in i.MX8

abdu_jaleel · ‎04-20-2020

Hi,

We want to calculate the time taken to read and compare two large buffers in uncached memory. The buffers are allocated using the dma_alloc_coherent kernel function. For this we need to find out how long a single 64-bit register load instruction takes. Can you give some pointers on how to find the time taken for a load from uncached memory? We use the NXP i.MX 8MQuad Evaluation Kit.

Thank you

-abdul

Rita_Wang · ‎04-29-2020

I will confirm it for you.

abdu_jaleel · ‎04-29-2020

Thank you. Looking forward to your answer.

Rita_Wang · ‎04-30-2020

I have confirmed this with our expert. The reply is as follows; I hope it helps:

It is hard to measure the duration of a single 64-bit register load instruction in a modern system, because DDR memory is unlike SRAM or flash memory, where a read is a simple operation on the data bus. The DDR protocol is complicated: there are single reads and burst reads, and the choice between them is not controlled by software.

If your customer wants to measure data throughput, you can use standard tools such as tinymembench and perf to get real benchmark figures.

abdu_jaleel · ‎05-21-2020

Hi,

We did some further analysis and measurements (using software) and here is what we find.

It seems like the minimum burst length for DDR read is 16.

This means we will read 64 bytes from DDR even if we need only 8 bytes.

From the DDR parameters we calculate that one burst of length 16, i.e. a 64-byte read, takes around 36 to 40 ns.

This matches exactly with the time we measure for bulk read from cached space.

And from experiments we confirmed that 64-byte reads (one cache line) are used in the case of cached space.

In the case of uncached access 8 bytes is expected to take the same 36 ns.

But from measurements we see that it is 4 times more.

It appears that in uncached access a burst length of 64 is used instead of 16.

Or a clock at 1/4 of the rate is used.

Can you please tell us why we see 4 times the expected time?

thanks

-abdul

Rita_Wang · ‎05-22-2020

This is the confirmed information from our expert, for your reference.

I don't think the customer can manually control the DDR PHY to access only 8 bytes.

I suggest the customer capture a large number of DDR accesses with a logic analyzer (LA) and derive the timing they need from those captures.

If they want to study the DDR timing, they can follow the JEDEC specification for validation.

abdu_jaleel · ‎05-22-2020

Thank you for your answer.

We do not want to alter the DDRPHY or DDRC configuration, we just want to find out the reason for the extra time.

We see that the minimum burst length is 16 (or 64 bytes), which also corresponds to the cache line size.

So when we do an uncached access of 8 bytes, we would expect it to still take the time of a 64-byte access, since that is the minimum.

But we see that it takes the time of a 256-byte access.

Can you please confirm the following:

1. When we do an 8-byte read from uncached space, what burst length is used in the read command issued to the DDR?

2. Is it in any way different from the read command issued for an access from cached space?

thanks

-abdul

Rita_Wang · ‎05-25-2020

You can read the LPDDR4 JEDEC specification.

abdu_jaleel · ‎07-03-2020

Hi,

The timing calculation we shared above is already based on JEDEC.

Since you keep insisting on looking at the LPDDR4 JEDEC spec, we have checked it again.

To repeat, our issue is that we can't match the timing we calculate from the spec with the observed values.

Is it possible to be more specific about what you think we have missed from the JEDEC spec?

Below I share some more information regarding the calculation and observation.

Can you please go through this and help us find out the reason for the difference?

Looking at the DDR Performance Monitor Unit data we get the following information.

For every 8-byte read from the uncached buffer, there is one read command, one activate command and one precharge command issued to the DDR.

Refresh commands occur roughly once for every 50 read commands.

Based on this the major contributor to time for reading 8 bytes are the following.

tRCD (activate to read command), tRL (read latency), tRP (precharge to activate) and the burst read time.

We have read the DDR timing registers in the DDRC to get those values.

tRCD = 15, tRL = 14, tRP = 15, all in clock cycles.

The read burst adds 16/2 = 8 clock cycles, for a total of 15 + 14 + 15 + 8 = 52 cycles.

With a 1.6 GHz clock (625 ps period), the total time is

52 × 625 ps = 32.5 ns, or roughly 35 ns (neglecting the refresh time).

The observed value is approximately four times this, that is, around 140 ns.

We are not able to find the reason for this difference.

Looking at the DDR performance monitor unit data, there isn't much other access to DDR during the measurement time.

We can reproduce this with a memcpy performed on buffers allocated using imxdmabuffers.

I have attached a small program to reproduce this using imxdmabuffers.

Can you check the above data and tell us what is the possible reason for the mismatch between the calculated and observed values?

Or what we have possibly missed in the calculation?

abdu_jaleel · ‎04-30-2020

Thank you.

We have already measured the throughput. Compared to reading and comparing from cached memory, reading and comparing from uncached memory takes around 30 times longer. We want to analyze theoretically whether this is expected from the architecture. Assuming consecutive 64-bit loads are from totally unrelated addresses, and assuming there is no other access to DDR during that time, is there a way to calculate an average time for 16 MB of data? Currently we get around 250 ms. That is approximately 125 ns per 64-bit load.
