i.MX53 i.RAM (OCRAM) seems very slow, test code included.

TomE
Specialist II

(Edit: Definitely Not Answered. What put that "Assumed" thing there?)

 

Starting with the questions.

 

Am I doing the right thing here? Am I missing anything?

 

Could there be something wrong with the way the Kernel sets up the CPU? With the way the Bootstrap left the CPU set up?

 

Can others please run this test and let me know what results they get?

 

I'm running mainstream Linux 3.4 on an i.MX53 board.

 

I'll try to repeat these tests on an i.MX53 QSB on Monday.

 

I wrote some code in a FlexCAN driver to measure how long it took to read and write the FlexCAN device registers, and as expected, it took a L O N G time. When using the standard I/O Macros (which added a Memory Barrier instruction to each read) it took about 180ns. Removing the Barrier reduced it to 130ns. For an 800MHz CPU that's still over 100 CPU clocks!
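The measurement was essentially the following (a sketch rather than the actual driver patch; the register chosen, the loop count and the ktime-based timing are illustrative):

/* Time 1000 back-to-back reads of a FlexCAN register, with and without the
 * barrier.  "regs" is assumed to be the ioremap()'d FlexCAN base address. */
#include <linux/io.h>
#include <linux/ktime.h>
#include <linux/hrtimer.h>
#include <linux/kernel.h>

static void time_flexcan_reads(void __iomem *regs)
{
    ktime_t t0, t1;
    u32 sum = 0;
    int i;

    t0 = ktime_get();
    for (i = 0; i < 1000; i++)
        sum += readl(regs);             /* read with the memory barrier */
    t1 = ktime_get();
    pr_info("readl:         %lld ns for 1000 reads (sum %u)\n",
            ktime_to_ns(ktime_sub(t1, t0)), sum);

    t0 = ktime_get();
    for (i = 0; i < 1000; i++)
        sum += readl_relaxed(regs);     /* same read, no barrier */
    t1 = ktime_get();
    pr_info("readl_relaxed: %lld ns for 1000 reads (sum %u)\n",
            ktime_to_ns(ktime_sub(t1, t0)), sum);
}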

 

I then wrote some code to measure the "speed" of reading the OCRAM from Linux user-space. The OCRAM is meant to be capable of 1 or 2 clock access, admittedly at 133MHz (the ahb_clk_root clock), but that should still mean only a small multiple of 7.5ns per read.

 

Except I'm measuring 129ns. That's 17 clocks at 133MHz, or 103 CPU clocks.

 

Here's my measured results for reading Cached RAM, OCRAM, GPU RAM, raw DDR and the Boot ROM:

 

Memory Type        Address     Time us  MiW/Sec  ns/word
=========================================================
Normal Cached RAM  User           2730   366        2.55
i.RAM/OCRAM        0xf8000000   135493     7.38   129
GPU3D GMEM         0xf8020000   189736     5.27   181
NAND FLASH Buffer  0xf7ff0000   155717     6.42   139
CSD0 to raw DDR    0x70000000   178597     5.60   171
Boot ROM           0x00000000   157992     6.33   151

 

I've attached the source and executable (should run under Linux on i.MX53 or any i.MX6) as "memio.tar.gz".

 

Here's how to run the program. WARNING: it WRITES to the memory and reads it back (to check that I've got the right address and am addressing read-write memory), so if you run it on peripheral space or on the external DRAM the OS is running in, you might cause a crash.

 

root@triton1:/tmp# nice --20 ./memio 0xf8000000 4

Offset 0xf8000000, count 4, pagesize 4096

[0] = 0x00000000

[1] = 0x00000001

[2] = 0x00000002

[3] = 0x00000003

DUMMY 1MiB runs took 2730 usec, giving 6

REAL 1MiB runs took 135493 usec, giving 71937030

 

GPU3D GMEM at 0xf8020000:

 

root@triton1:/tmp# nice --20 ./memio 0xf8020000 4

Offset 0xf8020000, count 4, pagesize 4096

[0] = 0x00000000

[1] = 0x00000001

[2] = 0x00000002

[3] = 0x00000003

DUMMY 1MiB runs took 2676 usec, giving 6

REAL 1MiB runs took 189736 usec, giving -805021690

 

Unless I can't count and got my code wrong, "135493 usec" for 1 MiB of reads (1,048,576 words) is 129ns for a single word read.

As a sanity-check, it reads cached RAM (meaning it is really reading the cache) in about 2.5ns per word, or roughly 390 million reads per second on an 800MHz CPU.

 

I'll try to paste the code in here.

 

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <ctype.h>
#define __USE_BSD
#include <unistd.h>     /* for getpagesize() */
#undef __USE_BSD
#include <string.h>     /* for memset() */

int main(int argc, char *argv[])
{
    struct timeval now, then;

    if (argc < 3) {
        printf("Usage: %s <phys_addr> <count>\n", argv[0]);
        return 0;
    }

    off_t offset = strtoul(argv[1], NULL, 0);
    size_t count = strtoul(argv[2], NULL, 0);

    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open(/dev/mem) failed");
        return -1;
    }

    int pagesize = getpagesize();
    unsigned char *mem = mmap(NULL, pagesize,
                              PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, offset);
    if (mem == MAP_FAILED) {
        printf("Can't map memory\n");
        return -1;
    }

    printf("Offset 0x%8x, count %d, pagesize %d\n",
           (unsigned int)offset, (int)count, pagesize);

    int i, j;
    time_t usec;
    int acc = 0;
    int dummy[4096];
    uint32_t *p32 = (uint32_t *)(&mem[0]);

    memset(dummy, 0, sizeof(dummy));

    /* Write then read back the first 'count' words to prove we really are
     * addressing read-write memory at the requested physical address. */
    for (i = 0; i < count; ++i) {
        p32[i] = i;
    }
    for (i = 0; i < count; ++i) {
        printf("[%d] = 0x%08x\n", i, p32[i]);
        acc += p32[i];
    }

    /* Baseline: the same timed loops over an on-stack (cached) buffer. */
    p32 = (uint32_t *)(&dummy[0]);
    gettimeofday(&now, NULL);
    for (j = 0; j < 1024; j++) {
        for (i = 0; i < pagesize/4/4; ++i) {
            acc += p32[i];
            acc += p32[i + 1];
            acc += p32[i + 2];
            acc += p32[i + 3];
        }
    }
    gettimeofday(&then, NULL);
    usec = 1000000 * (then.tv_sec - now.tv_sec) +
                     (then.tv_usec - now.tv_usec);
    printf("DUMMY 1MiB runs took %u usec, giving %d\n",
           (unsigned int)usec, acc);

    /* The real measurement: 1,048,576 word reads from the mmap()'d region. */
    p32 = (uint32_t *)(&mem[0]);
    gettimeofday(&now, NULL);
    for (j = 0; j < 1024; j++) {
        for (i = 0; i < pagesize/4/4; ++i) {
            acc += p32[i];
            acc += p32[i + 1];
            acc += p32[i + 2];
            acc += p32[i + 3];
        }
    }
    gettimeofday(&then, NULL);
    usec = 1000000 * (then.tv_sec - now.tv_sec) +
                     (then.tv_usec - now.tv_usec);
    printf("REAL 1MiB runs took %u usec, giving %d\n",
           (unsigned int)usec, acc);

    return 0;
}

 

Tom

 

Message was edited by: Tom Evans to dispute the "Assumed Answered".

Original Attachment has been moved to: memio.tar.gz

16 Replies

TomE
Specialist II

I complained about this issue being set to "Assumed Answered" when it hadn't been. After over 2 years and 6 months, this has been removed:

https://community.nxp.com/thread/355572

Tom

TomE
Specialist II

I've repeated these measurements on a Freescale i.MX53 Quick Start Board (QSB) running at 800MHz.

root@lucid-desktop:/tmp# nice --20 ./memio 0xf8000000 1
Offset 0xf8000000, count 1, pagesize 4096
[0] = 0x00000000
DUMMY 1MiB runs took 2367 usec, giving 0
REAL 1MiB runs took 131272 usec, giving 2139552768

So updating my previous table:

> Memory Type       Address     Time us  MiW/Sec  ns/word
> ========================================================
> Normal Cached RAM User           2730   366        2.55
> i.RAM/OCRAM       0xf8000000   135493     7.38   129
> GPU3D GMEM        0xf8020000   189736     5.27   181
> NAND FLASH Buffer 0xf7ff0000   155717     6.42   139
> CSD0 to raw DDR   0x70000000   178597     5.60   171
> Boot ROM          0x00000000   157992     6.33   151
> QSB i.RAM/OCRAM   0xf8000000   131727     7.59   125
> QSB Cached RAM    User           2367   422        2.26

That indicates there's nothing fundamentally wrong with the internal bus, bridge and priority settings of the CPU on our board compared to the way Freescale program theirs.

I also tried to copy data from the iRAM using a NEON-based memory copy function that uses "vld1.8" and "pld" instructions. That immediately throws a "Bus error" and logs the following:

[ 1117.412351] Unhandled fault: external abort on non-linefetch (0x018) at 0xb6fb5000
[ 1453.849783] Unhandled fault: external abort on non-linefetch (0x018) at 0xb6f4d000
[ 1821.839978] Unhandled fault: external abort on non-linefetch (0x018) at 0xb6f78000
[ 1858.528467] Unhandled fault: external abort on non-linefetch (0x018) at 0xb6f15000
[ 1924.515497] Unhandled fault: external abort on non-linefetch (0x018) at 0xb6f6f000
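
For reference, the copy is doing something like this (a hypothetical reconstruction using NEON intrinsics rather than the real routine; vld1q_u8/vst1q_u8 compile to vld1.8/vst1.8 and __builtin_prefetch emits pld):

/* Minimal NEON block copy.  When "src" points into the mmap()'d OCRAM/device
 * mapping, the vld1.8 generated here is the access that takes the abort. */
#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

void neon_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i;

    for (i = 0; i + 16 <= len; i += 16) {
        __builtin_prefetch(src + i + 64);      /* pld */
        uint8x16_t v = vld1q_u8(src + i);      /* vld1.8 */
        vst1q_u8(dst + i, v);
    }
    for (; i < len; i++)                       /* tail bytes */
        dst[i] = src[i];
}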

So why can't this memory be read by those instructions?

Tom

Yuri
NXP Employee

As for the NEON-based memory copy function: there is "ENGcm11413 ARM: A Neon load from device memory type can result in data abort" in the Errata:

http://cache.freescale.com/files/32bit/doc/errata/IMX53CE.pdf

Regards,

Yuri

TomE
Specialist II

> As for the NEON-based memory copy function: there is "ENGcm11413 ARM: A Neon load from device memory type can result in data abort"

Thanks. I should have spotted that in amongst all the other bugs. I didn't think to look in the Errata for what looked more like a software or OS problem.

Tom

admin
Specialist II

Yuri,

It seems like Tom has more questions; is that something you can help with?

Yuri
NXP Employee

We can hardly expect to get maximal performance from the OCRAM when accessing it
via the Linux file system (because of the high overhead cost).


Have a great day,
Yuri


TomE
Specialist II

Yuri says:

> We can hardly expect to get maximal performance from the OCRAM when accessing it
> via the Linux file system (because of the high overhead cost).

I'm pretty sure that's not how "mmap()" works. As long as the device supports it, "real memory" is directly mapped via the MMU into user space. You get an MMU trap on the FIRST access, but all subsequent ones are "for free" (the same as reading ordinary memory) until you walk off the end of the mapped region and trigger a fault for the next page.
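
If anyone wants to check that, here is a minimal sketch (separate from memio; the OCRAM address and the clock_gettime() timing are just for illustration) that times the very first access separately from the later ones:

/* Map one page of OCRAM and compare the cost of the first access with the
 * cost of the following 1024 accesses.  If every access went through the
 * kernel they would all be equally slow; with a plain MMU mapping only the
 * first touch can carry any extra setup cost. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static long long ns_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    volatile uint32_t *p;
    long long t0, t1, t2;
    int i;

    if (fd < 0)
        return 1;
    p = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
             MAP_SHARED, fd, (off_t)0xf8000000);   /* i.MX53 OCRAM */
    if (p == MAP_FAILED)
        return 1;

    t0 = ns_now();
    (void)p[0];                     /* the first access */
    t1 = ns_now();
    for (i = 0; i < 1024; i++)      /* the rest of the page */
        (void)p[i];
    t2 = ns_now();

    printf("first access %lld ns, next 1024 accesses %lld ns\n",
           t1 - t0, t2 - t1);
    return 0;
}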

In my original post I mentioned that I wrote code in the FlexCAN driver to measure how long it took to read the FlexCAN registers. I also added code to that to map the OCRAM, and read that directly and timed those reads. They took 130ns each.

That's exactly the same time I'm getting when reading from user space via mmap()'d memory with the program I wrote. So I'm pretty sure it is reading memory without any overheads.

If you can point me to any documentation that teaches me otherwise, then please do.

There's a bug in the program exposed by higher optimisation levels. It should be "i += 4" and not "i++", and the termination condition needs one "/4" removed.
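
That is, the timing loops should read:

    for (j = 0; j < 1024; j++)
    {
        for (i = 0; i < pagesize/4; i += 4)    /* was: i < pagesize/4/4; ++i */
        {
            acc += p32[i];
            acc += p32[i + 1];
            acc += p32[i + 2];
            acc += p32[i + 3];
        }
    }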

Tom

Yuri
NXP Employee

Even with 133 MHz clocking of the OCRAM, the internal latency (of 170 ns) is mainly caused by the internal bus.

Regards,

Yuri.

TomE
Specialist II

> Even with 133 MHz clocking of the OCRAM, the internal latency (of 170 ns) is mainly caused by the internal bus.

170ns is over TWENTY-TWO clocks at 133MHz. The OCRAM documentation mentions "one or two clocks". Where are the other 20 or so clocks being wasted?

Can you provide better references or some technical details?

Is there a way to fix this? 170ns is 1990's memory speed.

Tom

Yuri
NXP Employee

For GMEM my results (ARM core accesses) : ~250 ns per 32-bit word

~Yuri.

Yuri
NXP Employee

The i.MX53 has a complicated structure: it includes many peripheral modules and
several internal buses, so some delays may be observed because of arbitration,
bus turn-arounds, etc. On my i.MX53 board I see ~100 ns per 32-bit word in a
bare-metal (OS-less) environment, when using ARM block copy instructions
(LDM / STM) of 32 bytes. Note that the OCRAM is mainly intended for boot, and
it is recommended to use DRAM for performance.

Regards,

Yuri.

TomE
Specialist II

> The i.MX53 has a complicated structure: it includes many peripheral modules and
> several internal buses, so some delays may be observed because of arbitration,
> bus turn-arounds, etc.

"Some"? You're measuring 100ns, which is 40 400MHz AXI clocks which is what it uses to get through the AXI fabric. It is still 13 133MHz AHB clocks to get to what is documented as a one-or-two-clock device.

> Note that the OCRAM is mainly intended for boot, and
> it is recommended to use DRAM for performance.

Not according to the Reference Manual. Maybe you're reading the i.MX6 Manuals, which do detail OCRAM's use by the bootstrap. The i.MX53 manual doesn't mention the OCRAM at all in the Boot chapter. The i.MX53 manual only says:

Table 1-1. Digital and Analog Blocks (continued)

In i.MX53, the OCRAM is used for controlling the 128KB
multimedia RAM, via a 64-bit AXI bus.

So the Reference Manual says it is "Multimedia RAM", which I'd read as being there to improve graphics performance somehow. There's nothing in there saying the DRAM is faster or that the OCRAM is so slow. The i.MX6 manual even suggests using the OCRAM as an IPU buffer memory! Maybe this speed problem exists in the i.MX53 and was fixed in the i.MX6: there the OCRAM is connected to "MX6FAST3" instead of hanging off the AHB, so maybe it is a lot better in the i.MX6 than in the i.MX5?

> For GMEM my results (ARM core accesses) : ~250 ns per 32-bit word

That's the GARB getting in there, but an extra 150ns for it is excessive, and also a lot slower than I measured at 180ns. I hope the drivers don't try to send graphics elements  to the GPU via the GMEM. That would make it very slow.

Chapter 52 isn't very good. "Figure 52-1. On-chip RAM Block Diagram" is completely missing, and that omission made it through multiple manual versions. The only timing information in the OCRAM chapter says things like:

52.3.1 Read Data Wait State
When the wait state is enabled, it will cost 2 cycles for each read access, (each beat of a
read burst).
This can avoid the potential timing problem caused by the relatively longer memory
access time at higher frequency.
When this feature is disabled, it only costs 1 clock cycle to finish a read transaction, that
is, to get read data back in the next cycle of read request becomes valid on the bus.

That is confusing as the clock provided to the OCRAM isn't adjustable. Is the 133MHz clock considered a "higher frequency" or not? That chapter is a "general purpose description" of the "OCRAM Module", but what is missing is a section saying how that module has been integrated into this specific SOC. Either that or it needs a "Data Sheet" section giving the timing details of the OCRAM so as to know if it needs the extra wait-states or not.

Is there a "Data Sheet" giving details of the timings and timing requirements of the internals of the SOC? Can you provide better references or some technical details? Is there something there that needs NDA to access?

Tom

Yuri
NXP Employee

Below are some considerations about the (at least theoretical) performance of SDRAM and OCRAM.

Typical OCRAM access (to read a 32-bit word):

Bus arbitration -> address phase -> wait -> data phase -> bus turn-around

So, 5 cycles (at least three of them at 133 MHz) per single 32-bit word.

Typical SDRAM access (burst of 8 beats):

9 cycles (arbitration, RAS, CAS, CL) + 4 cycles (DDR, 8 beats).

So, 13 cycles (of 400 MHz) per eight 32-bit words.

In summary: SDRAM performance is significantly higher than the OCRAM's.
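
Converting those cycle counts to time per word (a rough calculation that treats all five OCRAM cycles as 133 MHz clocks and all thirteen SDRAM cycles as 400 MHz clocks):

/* Rough ns-per-word implied by the cycle counts above. */
#include <stdio.h>

int main(void)
{
    double ocram = 5.0  * (1e9 / 133e6);        /* ~37.6 ns per word          */
    double sdram = 13.0 * (1e9 / 400e6) / 8.0;  /* ~4.1 ns per word, in burst */

    printf("OCRAM ~%.1f ns/word, SDRAM ~%.1f ns/word\n", ocram, sdram);
    return 0;
}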

Regards,
Yuri.

Yuri
NXP Employee

The scheme in Figure 1-2 (i.MX53 Simplified Block Diagram) of the i.MX53
Reference Manual is very "simplified". The ARM core accesses the OCRAM via the
following path:

ARM platform -> (AXI I/F, 200 MHz ) -> EXTMC (M4IF) -> OCRAM

As for the EXTMC, please take a look at Figure 5-1 (EXTMC High Level Block Diagram).

Regards,

Yuri.

TomE
Specialist II

> ARM platform -> (AXI I/F, 200 MHz ) -> EXTMC (M4IF) -> OCRAM

That still doesn't explain 100ns. At 200MHz (5ns) that's still 20 clocks.

It still isn't information that anyone can use to DESIGN a system to meet performance requirements.

I shouldn't have to be reverse-engineering the timing. It is generally too late to find that out then.

You work for Freescale, and seem to be "reverse-engineering" this as well, as you're taking measurements rather than referring to documents or experts.

Tom

TomE
Specialist II

We sent a query on this back through the manufacturer of the board we use, and got this reply:

Got this feedback from Freescale:

170ns is correct.

"The OCRAM runs at 133MHz 64-bit, while the DRAM interface is 400MHz 128-bit for 64-bit DRAM."

If I was measuring 30ns, then the above would explain it. It would mean that the memory was taking 4 clocks at 133MHz, which is 7.5ns per clock.

Except I reported 170ns, not 30ns, so that "explanation" didn't help at all. It doesn't explain my 130ns measurement from userspace either.

Tom
