6 Replies Latest reply on Nov 11, 2009 5:01 PM by jyost

    memory accesses question

    jyost

      I'm kind of new to CodeAnalyst and I find that I frequently get results that I don't expect or have difficulty interpreting.

      Here's an example.  Consider the following little test program:

      #include <stdlib.h>
      #include <string.h>

      int main()
      {
          size_t size = 1000000;
          int iters = 1000;

          unsigned char *buf = (unsigned char *)malloc (size);

          register unsigned char sum = 0;

          for (register int i = 0; i < iters; i++)
          {
              for (register int j = 0; j < size; j++)
                  sum += buf[j];
          }

          return sum;
      }

      Compiled as follows:

      g++ -o simple simple.cpp

      (So - no optimization.)

      I would expect this to do only reads, and no (or very few) writes to main memory.  In fact, if I look at events 0x6C (reads) and 0x6D (writes), it seems to do about as many reads as writes, if I'm interpreting the results correctly.  Hmmm ... Maybe "sum" isn't being put in a register, in spite of the "register" keyword.  That's the only theory I have.  But I'm not sure that I believe that.
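To put rough bounds on that expectation, the read traffic the loop nest asks for can be sketched as below. This is only an illustration: the 64-byte cache line size is an assumption about the processor, the helper names are hypothetical, and buffer alignment and hardware prefetching are ignored.

```cpp
#include <cstdint>

// Assumption: 64-byte cache lines; adjust for the processor being profiled.
constexpr std::uint64_t kCacheLineBytes = 64;

// Number of cache lines needed to cover a buffer of `bytes` bytes (rounded up).
constexpr std::uint64_t lines_per_pass(std::uint64_t bytes) {
    return (bytes + kCacheLineBytes - 1) / kCacheLineBytes;
}

// Upper bound on line fills: every pass misses on every line of the buffer.
constexpr std::uint64_t worst_case_line_fills(std::uint64_t bytes,
                                              std::uint64_t passes) {
    return lines_per_pass(bytes) * passes;
}

// Lower bound on line fills: the buffer is fetched once and then stays cached.
constexpr std::uint64_t best_case_line_fills(std::uint64_t bytes) {
    return lines_per_pass(bytes);
}
```

With size = 1000000 and iters = 1000, one pass covers 15,625 lines, so the number of line fills falls somewhere between 15,625 and 15,625,000 depending on how much of the buffer survives in cache between passes.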

      The actual results I got from one run were 7566 for reads, 31897 for writes and 3128 for DRAM accesses - all with a sample period of 10,000.  And ... hmmm ... maybe that sample period should be 500,000.  But, still ...
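For reference, a sample count from event-based sampling can be scaled to an approximate event count by multiplying it by the sample period. The sketch below assumes the figures quoted above are raw sample counts rather than already-scaled totals; `estimated_events` is a hypothetical helper, not part of CodeAnalyst.

```cpp
#include <cstdint>

// Event-based sampling takes roughly one sample every `period` occurrences
// of the event, so samples * period approximates the total event count.
constexpr std::uint64_t estimated_events(std::uint64_t samples,
                                         std::uint64_t period) {
    return samples * period;
}
```

Under that assumption, with a period of 10,000 the three figures scale to about 75.66 million (0x6C), 318.97 million (0x6D) and 31.28 million (0xE0) events. A different period changes the scale factor but not the ratios between events sampled with the same period.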

      Another question: Why is event 0xE0 (DRAM accesses) not equal to the sum of event 0x6C (reads) and 0x6D (writes)?

      What I'm ultimately trying to determine is if a real program (not the above test) is bumping up against memory bandwidth limits, but I'm not sure which event or events I should look at.  BTW - I have looked at Paul Drongowski's "Basic Performance Measurements ..." document, which is certainly very helpful, but still leaves me with some questions.  (Maybe I'm just thick!)

        • memory accesses question
          leiy

          CA 2.9.5.2-cg launches the app before starting the profile. That causes the profile duration to vary, because starting the profile is delayed by a syscall.

          We will fix this issue.

          Thanks for reporting this.

          • memory accesses question
            pdrongowski

            Hi --

            You're probably not getting the assembler language code that you're
            expecting from the compiler. Here are some quick results using G++
            version 4.1.2 on SLES.

            The assembler language output was generated using the -S option.
            The first example was generated with the command:
                g++ -S simple.cpp
            The -S option asks the compiler to leave the intermediate assembler
            language file simple.s.

            -- pj

            P.S. I'll be sending two examples in the next replies. I'm just trying to keep each reply short.

             

              • memory accesses question
                pdrongowski

                As you mentioned, the default optimization level is -O0, no optimization.
                The keyword "register" is really a hint to the compiler that the variable
                will be frequently used. The compiler is free to use or ignore the hint.
                Since optimization is turned off, the compiler ignores the hint and
                allocates the variables into stack (memory) locations.

                I've annotated the assembler program with the variable to stack location
                bindings.

                 

                 

                *************************** -O0 optimization ***************************

                -8(%rbp)  == Base address of the array (buf)
                -12(%rbp) == Outer loop bound 1000 (iters)
                -24(%rbp) == Inner loop bound 1000000 (size)
                -36(%rbp) == Inner loop counter (j)
                -40(%rbp) == Outer loop counter (i)
                -41(%rbp) == Sum of bytes (sum)

                .L3:
                        movl    $0, -36(%rbp)
                        jmp     .L4
                .L5:
                        movslq  -36(%rbp),%rax
                        addq    -8(%rbp), %rax
                        movzbl  (%rax), %eax
                        addb    %al, -41(%rbp)
                        addl    $1, -36(%rbp)
                .L4:
                        movslq  -36(%rbp),%rax
                        cmpq    -24(%rbp), %rax
                        jb      .L5
                        addl    $1, -40(%rbp)
                .L2:
                        movl    -40(%rbp), %eax
                        cmpl    -12(%rbp), %eax
                        jl      .L3

              • memory accesses question
                pdrongowski

                In the following case, optimization was turned on. The generated code
                is probably more in line with your expectations. Here the variables
                are bound to registers.

                 

                *************************** -O2 optimization ***************************

                rdi     == Base address of the array (buf)
                r8d     == Outer loop counter (i)
                rcx,rdx == Inner loop counter (j)

                .L2:
                        xorl    %esi, %esi
                        movl    $1, %edx
                        jmp     .L4
                .L3:
                        movq    %rdx, %rsi
                        movq    %rcx, %rdx
                .L4:
                        leaq    1(%rdx), %rcx
                        addb    (%rdi,%rsi), %al
                        cmpq    $1000001, %rcx
                        jne     .L3
                        addl    $1, %r8d
                        cmpl    $1000, %r8d
                        jne     .L2

                  • memory accesses question
                    leiy

                    According to the BKDG (BIOS and Kernel Developer's Guide), EventSelect 06Dh, Octwords Written to System, counts the number of octword (16-byte) data transfers from the processor to the system. These may be part of a 64-byte cache line writeback or a 64-byte dirty probe hit response.

                    The counts for event 06Dh in the simple program are probably due to dirty probe hit responses.

                    You can also look at the issue from an instruction-based sampling point of view: set up IBS Op Sampling in "dispatch count" mode. You will find that there are no stores in the simple program.
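Because each 06Dh event is one 16-byte octword, an estimated event count converts directly to bytes, and then to a bandwidth figure once a run time is known. The sketch below is a rough illustration only: the helper names are hypothetical, and the elapsed time is a parameter because the thread does not report one.

```cpp
#include <cstdint>

// One octword is 16 bytes, per the BKDG description of event 06Dh.
constexpr std::uint64_t kOctwordBytes = 16;

// Bytes transferred for a given (estimated) number of octword events.
constexpr std::uint64_t octwords_to_bytes(std::uint64_t octwords) {
    return octwords * kOctwordBytes;
}

// Average bandwidth in bytes per second over the measured interval.
// Returns 0 for a non-positive interval instead of dividing by zero.
constexpr double bandwidth_bytes_per_sec(std::uint64_t octwords,
                                         double elapsed_seconds) {
    return elapsed_seconds > 0.0
               ? static_cast<double>(octwords_to_bytes(octwords)) / elapsed_seconds
               : 0.0;
}
```

For example, if the 31897 figure from the original post is a sample count taken with a period of 10,000, that is roughly 318.97 million octwords, or about 5.1 GB written over the whole run. Comparing the resulting bytes per second against the platform's achievable memory bandwidth gives a first indication of how close a program is to the limit.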

                     

                    • memory accesses question
                      jyost

                      Hi Paul -

                      First off - thanks for your responses.

                      But your responses only address the question of whether things are getting put into registers, no?  I believe that the code is still generating write events - according to CodeAnalyst - even when it's optimized.  So what I'm wondering is where those write events are coming from.  Lei Yu suggested that they're actually "dirty probe hit responses", rather than the result of writes that my code is doing.  I need to learn what dirty probe hit responses are, where they're coming from, and whether they should be considered when investigating memory bandwidth issues (I assume the answer to that would be "yes".)