MMDC profiling on i.MX6 Quad yields wrong number of read bytes

christophfreund · ‎11-24-2016

Hi,

during performance measurements on our custom i.MX6 Quad board (using 32-bit DDR3 RAM and running a Linux 3.14 kernel with RT patches, Update: Toolchain (based on Yocto) uses a gcc 4.9.2 compiler) I stumbled upon a strange result which I cannot explain: it seems that MMDC profiling undoubtedly returns wrong values in the MMDC0_MADPSR4 register.

I used the following C program for my test:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

void convert(const float* input, uint16_t* output, const int size) {
    int i;
    for (i = 0; i < size; ++i) {
        output[i] = input[i] * 1000.0f;
    }
}

int main(int argc, char* argv[])
{
    const int size = (argc < 2) ? 32768 : atoi(argv[1]);
    const int niter = (argc < 3) ? 1 : atoi(argv[2]);
    int i;

    for (i = 0; i < niter; ++i) {
        float* input = (float*) malloc(size * sizeof(float));
        uint16_t* output = (uint16_t*) malloc(size * sizeof(uint16_t));

        convert(input, output, size);

        free(input);
        free(output);
    }

    return 0;
}‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Note that I freshly allocate the memory before calling the convert function such that I can (in conjunction with a sufficiently large array size) rule out that we are working in the cache. In this test, I do not initialize the memory such that the whole input array should be read once and be written exactly once during the execution.

For the MMDC profiling itself I use the version of the mmdc tool that can be found here: GitHub - FrankBau/mmdc: Memory Bandwidth Measurement Tool for i.MX6

When calling the mmdc tool, I have set the environment variable MMDC_SLEEPTIME to 3 (seconds).

Profiling of the "idle" system yields this result:

MMDC new Profiling results:
***********************
Measure time: 3000ms 
Total cycles count: 1584049192
Busy cycles count: 9636657
Read accesses count: 230740
Write accesses count: 45605
Read bytes count: 6878948
Write bytes count: 1453322
Avg. Read burst size: 29
Avg. Write burst size: 31
Read: 2.19 MB/s /  Write: 0.46 MB/s  Total: 2.65 MB/s 
Utilization: 5%
Bus Load: 0%
Bytes Access: 30

A single run of my test program with array size 100,000,000 and only one iteration should read 400,000,000 bytes and write 200,000,000 bytes but instead it shows numbers like these:

MMDC new Profiling results:
***********************
Measure time: 3000ms 
Total cycles count: 1584044016
Busy cycles count: 187396205
Read accesses count: 1308446
Write accesses count: 6621328
Read bytes count: 39956968
Write bytes count: 211873984
Avg. Read burst size: 30
Avg. Write burst size: 31
Read: 12.70 MB/s /  Write: 67.35 MB/s  Total: 80.05 MB/s 
Utilization: 8%
Bus Load: 11%
Bytes Access: 31

The amount of bytes written is exactly what I expect but number of bytes read (roughly 40,000,000 bytes minus the idle load) is way below my expectancy. You can also vary the array size and the number of iterations, the result is always like that. I ruled out any overflows in the MMDC0_MADPSR4 register, which is responsible for count the total number of bytes read, by constantly monitoring it during the profile run.

My question is: can anybody reproduce this behavior on another i.MX6 Quad (or also Dual/DualLite/Solo) system? And in case it is not reproducible somwhere else, what could possibly go wrong in my system that could explain these profiling numbers?

Update: I ran the same test on the same board but using a different kernel (3.11 without RT patches) and a different toolchain (Yocto based using gcc 4.7.2 with the SoftFP ABI) for compilation. It yields the following numbers

MMDC new Profiling results:
***********************
Measure time: 3000ms
Total cycles count: 1584038592
Busy cycles count: 442998552
Read accesses count: 7119647
Write accesses count: 12846785
Read bytes count: 226276278
Write bytes count: 411094862
Avg. Read burst size: 31
Avg. Write burst size: 31
Read: 71.93 MB/s / Write: 130.68 MB/s Total: 202.61 MB/s
Utilization: 8%
Bus Load: 27%
Bytes Access: 31

The number of bytes read is now roughly 200,000,000 which is now significantly higher but still less than expected. Additionally, the number of bytes written has approximately doubled to about 400,000,000 bytes. I am aware that the used compiler influences the result and though I checked the generated code and did not see an additional write operation I could somehow accept the reported number of bytes written. However, it still does not make sense that according to the profiling data fewer bytes have been read than necessary in order to produce the correct result.

Am I making a wrong assumption anywhere? How can the differences in the profiling numbers between the two environments on the same board be explained? Any explanations or hints are welcome.

Update 2: Another test on a Nitrogen board (Kernel 3.10) with the same SoftFP built binaries of above results in

MMDC new Profiling results:
***********************
Measure time: 3000ms
Total cycles count: 1584077648
Busy cycles count: 125561174
Read accesses count: 524494
Write accesses count: 6532135
Read bytes count: 16401940
Write bytes count: 209027072
Avg. Read burst size: 31
Avg. Write burst size: 31
Read: 5.21 MB/s /  Write: 66.45 MB/s  Total: 71.66 MB/s
Utilization: 11%
Bus Load: 7%
Bytes Access: 31

The numbers are very similar to those of the "original" environment: the number of bytes written is as expected but - again - the number of bytes read is too low.

Many regards,

Christoph

frankba · ‎01-03-2017

Hi Christoph,

I can reproduce small read byte values using your program. This certainly has something to do with some clever optimization of malloc implementation. I added a line memset( input, 1, size * sizeof(float) ); before the call to convert and get realistic numbers:

# mmdc ARM &! ./convert 100000000
[1] 824
MMDC ARM 
root@marsboard:~# 
MMDC new Profiling results:
***********************
Measure time: 5001ms 
Total cycles count: 2640117768
Busy cycles count: 815230972
Read accesses count: 14243612
Write accesses count: 19325921
Read bytes count: 454107904
Write bytes count: 618429208
Avg. Read burst size: 31
Avg. Write burst size: 31
Read: 478.53 MB/s / Write: 117.93 MB/s Total: 204.53 MB/s 
Utilization: 8%
Bus Load: 30%
Bytes Access: 31

malloc does not allocate physical memory in SDRAM. To prove this, add an endless for(;;); loop immediately after malloc before the memory is accessed, compile and run the program. Check with free -mh for available system memory before the program is running and while the program is running. There should be barely any difference. You may even fork 100 instances:

# for i in {1..100} ; do ./convert 100000000 & done
# free -mh
 total used free shared buff/cache available
Mem: 1.0G 180M 749M 224K 72M 801M
Swap: 0B 0B 0B
# killall convert

Dump the underlying system calls:

# strace ./convert 100000000
...
mmap2(NULL, 400003072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x5f04a000
mmap2(NULL, 200003584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x5318d000

mmap2 is eventually called from malloc. See http://landley.net/writing/memory-faq.txt for an explanation of anonymous mappings.

Tested on a MarS Board (1 GiB iMX6D) with a yocto image Linux 4.1.15 and gcc 5.3.0

hth

Frank

Yuri · ‎12-02-2016

Hello

Is - in the case - applied the "enhance the mmdc tool to use bus width information
for utilization calculation so that it can support 16/32bit system properly" ?

I will send You it directly ; let me know e-mail or create request.

https://community.nxp.com/docs/DOC-329745

Have a great day,
Yuri

-----------------------------------------------------------------------------------------------------------------------
Note: If this post answers your question, please click the Correct Answer button. Thank you!
-----------------------------------------------------------------------------------------------------------------------

christophfreund · ‎12-06-2016

Hi Yuri,

I have received the patch and tried it together with the version of mmdc.c you have sent me, unfortunately without success. The patch only corrects the computation of the utilization percentage, however, I seem to have an issue already with the absolute number of read bytes as they are read from the register MMDC0_MADPSR4, and this is still showing the same (too low) value.

Many regards,

Christoph

MMDC profiling on i.MX6 Quad yields wrong number of read bytes

MMDC profiling on i.MX6 Quad yields wrong number of read bytes

i.MX6_All

i.MX6Quad