(Edit: Definitely Not Answered. What put that "Assumed" thing there?)
Starting with the questions.
Am I doing the right thing here? Am I missing anything?
Could there be something wrong with the way the Kernel sets up the CPU? With the way the Bootstrap left the CPU set up?
Can others please run this test and let me know what results they get?
I'm running mainline Linux 3.4 on an i.MX53 board.
I'll try to repeat these tests on an i.MX53 QSB on Monday.
I wrote some code in a FlexCAN driver to measure how long it took to read and write the FlexCAN device registers, and as expected, it took a L O N G time. When using the standard I/O Macros (which added a Memory Barrier instruction to each read) it took about 180ns. Removing the Barrier reduced it to 130ns. For an 800MHz CPU that's still over 100 CPU clocks!
I then wrote some code to measure, from Linux user-space, the "speed" of reading the OCRAM, which is meant to be capable of 1 or 2 clock access. Admittedly those are clocks at 133MHz (the ahb_clk_root clock), so each access should still take a small multiple of 7.5ns.
Except I'm measuring 129ns. That's 17 133MHz clocks or 103 CPU clocks.
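The clock counts above follow from multiplying the access time by the clock frequency. A minimal sketch of that arithmetic, where ns_to_clocks() is just an illustrative helper name and not part of memio:

```c
#include <assert.h>

/* Convert an access time in nanoseconds into clock cycles at a given
 * frequency in MHz. ns_to_clocks() is a hypothetical helper name. */
static double ns_to_clocks(double ns, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    /* 1 ns * 1 MHz = 1e-9 s * 1e6 Hz = 1e-3 cycles */
    return ns * clock_mhz / 1000.0;
}
```

With the 129ns measurement, ns_to_clocks(129.0, 133.0) comes to a little over 17 AHB clocks and ns_to_clocks(129.0, 800.0) to a little over 103 CPU clocks.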
Here's my measured results for reading Cached RAM, OCRAM, GPU RAM, raw DDR and the Boot ROM:
Memory Type          Address       Time (us)   MiW/sec   ns/word
Normal Cached RAM    User               2730    366         2.55
i.RAM/OCRAM          0xf8000000       135493      7.38    129
GPU3D GMEM           0xf8020000       189736      5.27    181
NAND FLASH Buffer    0xf7ff0000       155717      6.42    139
CSD0 to raw DDR      0x70000000       178597      5.60    171
Boot ROM             0x00000000       157992      6.33    151
I've attached the source and executable (should run under Linux on i.MX53 or any i.MX6) as "memio.tar.gz".
Here's how to run the program. WARNING: it WRITES to the memory and reads it back (to check that I've got the right address and am addressing read-write memory), so if you run it on Peripheral Space or on the external DRAM the OS is running in, you might cause a crash.
root@triton1:/tmp# nice --20 ./memio 0xf8000000 4
Offset 0xf8000000, count 4, pagesize 4096
[0] = 0x00000000
[1] = 0x00000001
[2] = 0x00000002
[3] = 0x00000003
DUMMY 1MiB runs took 2730 usec, giving 6
REAL 1MiB runs took 135493 usec, giving 71937030
GPU3D GMEM at 0xf8020000:
root@triton1:/tmp# nice --20 ./memio 0xf8020000 4
Offset 0xf8020000, count 4, pagesize 4096
[0] = 0x00000000
[1] = 0x00000001
[2] = 0x00000002
[3] = 0x00000003
DUMMY 1MiB runs took 2676 usec, giving 6
REAL 1MiB runs took 189736 usec, giving -805021690
Unless I can't count and got my code wrong, "135493 usec" for 1 MiB (1048576) is 129ns for a single word read.
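The conversion from the printed "usec" figure to a per-word time can be checked with a few lines of arithmetic. This is only a sketch: the helper names are made up for illustration, and the word count assumes the timed loops in memio (1024 passes, pagesize/4/4 iterations, 4 reads each) with a 4096-byte page.

```c
#include <assert.h>
#include <stdint.h>

/* Reads issued by one timed run: 1024 passes * (pagesize/4/4) iterations
 * * 4 reads per iteration. Hypothetical helper, mirrors the loop bounds. */
static uint32_t words_read(uint32_t pagesize)
{
    return 1024u * (pagesize / 4u / 4u) * 4u;
}

/* Average time per 32-bit read, in nanoseconds. */
static double ns_per_word(uint32_t usec, uint32_t words)
{
    assert(words > 0);
    return usec * 1000.0 / words;
}

/* Throughput in MiW (2^20 words) per second. */
static double miw_per_sec(uint32_t usec, uint32_t words)
{
    assert(usec > 0);
    return (words / 1048576.0) / (usec / 1e6);
}
```

For the OCRAM run, words_read(4096) is 1048576, so ns_per_word(135493, 1048576) is roughly 129 and miw_per_sec(135493, 1048576) is roughly 7.38.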
As a sanity-check, it is able to read cached RAM (meaning it is reading the cache) in 2.5ns, or at about 390MHz on an 800MHz CPU.
I'll try to paste the code in here.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <ctype.h>
#define __USE_BSD
#include <unistd.h>   /* for getpagesize() */
#undef __USE_BSD
#include <string.h>   /* for memset() */

int main(int argc, char *argv[])
{
    struct timeval now, then;

    if (argc < 3) {
        printf("Usage: %s <phys_addr> <count>\n", argv[0]);
        return 0;
    }
    off_t offset = strtoul(argv[1], NULL, 0);
    size_t count = strtoul(argv[2], NULL, 0);

    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open(/dev/mem) failed");
        return -1;
    }

    /* Only one page is mapped, so count must not exceed pagesize/4 words. */
    int pagesize = getpagesize();
    unsigned char *mem = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, offset);
    if (mem == MAP_FAILED) {   /* mmap() signals failure with MAP_FAILED, not NULL */
        printf("Can't map memory\n");
        return -1;
    }
    printf("Offset 0x%8x, count %zu, pagesize %d\n",
           (unsigned int)offset, count, pagesize);

    int i, j;
    time_t usec;
    int acc = 0;   /* accumulating the reads keeps them from being optimised away */
    int dummy[4096];
    uint32_t *p32 = (uint32_t *)(&mem[0]);

    memset(dummy, 0, sizeof(dummy));

    /* Write a pattern and read it back to check the address is read-write memory. */
    for (i = 0; i < count; ++i) {
        p32[i] = i;
    }
    for (i = 0; i < count; ++i) {
        printf("[%d] = 0x%08x\n", i, p32[i]);
        acc += p32[i];
    }

    /* Reference run: 1Mi word reads from a (cached) buffer on the stack. */
    p32 = (uint32_t *)(&dummy[0]);
    gettimeofday(&now, NULL);
    for (j = 0; j < 1024; j++) {
        for (i = 0; i < pagesize/4/4; ++i) {
            acc += p32[i];
            acc += p32[i + 1];
            acc += p32[i + 2];
            acc += p32[i + 3];
        }
    }
    gettimeofday(&then, NULL);
    usec = 1000000 * (then.tv_sec - now.tv_sec) + (then.tv_usec - now.tv_usec);
    printf("DUMMY 1MiB runs took %u usec, giving %d\n", (unsigned int)usec, acc);

    /* Real run: the same 1Mi word reads from the mapped physical memory. */
    p32 = (uint32_t *)(&mem[0]);
    gettimeofday(&now, NULL);
    for (j = 0; j < 1024; j++) {
        for (i = 0; i < pagesize/4/4; ++i) {
            acc += p32[i];
            acc += p32[i + 1];
            acc += p32[i + 2];
            acc += p32[i + 3];
        }
    }
    gettimeofday(&then, NULL);
    usec = 1000000 * (then.tv_sec - now.tv_sec) + (then.tv_usec - now.tv_usec);
    printf("REAL 1MiB runs took %u usec, giving %d\n", (unsigned int)usec, acc);
    return 0;
}
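For anyone repeating the measurement, here is a hedged variant of the timed loop. It is not what memio does: it swaps gettimeofday() for clock_gettime(CLOCK_MONOTONIC) and reads through a volatile pointer so the compiler has to issue every load. The function name time_reads_ns() and its parameters are assumptions made for this sketch.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Sweep `words` 32-bit words `passes` times and return the average time
 * per read in nanoseconds. The volatile qualifier forces each load to be
 * performed; the running sum is handed back through sum_out (may be NULL). */
static double time_reads_ns(const volatile uint32_t *p, size_t words,
                            unsigned passes, uint32_t *sum_out)
{
    struct timespec t0, t1;
    uint32_t sum = 0;

    assert(words > 0 && passes > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned j = 0; j < passes; j++) {
        for (size_t i = 0; i < words; i++) {
            sum += p[i];
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (sum_out) {
        *sum_out = sum;
    }
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)words * passes);
}
```

It can be pointed at an ordinary buffer for the cached-RAM baseline, or at the pointer returned by mmap() on /dev/mem for the real run, for example time_reads_ns((volatile uint32_t *)mem, pagesize / 4, 1024, NULL). On older C libraries clock_gettime() may also need -lrt at link time.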
Tom
Message was edited by: Tom Evans to dispute the "Assumed Answered".
Original Attachment has been moved to: memio.tar.gz
We sent a query on this back through the manufacturer of the board we use, and got this reply:
If I was measuring 30ns, then the above would explain it. It would mean that the memory was taking 4 clocks at 133MHz, which is 7.5ns per clock.
Except I reported 170ns, not 30ns, so that "explanation" didn't help at all. It doesn't explain my 130ns measurement from userspace either.
Tom