4 Replies Latest reply on Sep 23, 2011 11:37 PM by notzed

    local memory performance question

    fajsc88
      operations on local memory affecting performance of copy to global mem space


      Using the kernel code located below, why does the number of respins affect the time required to copy the final results from local to global memory?  I'm running a single work-group and single work-item. The barrier is placed after all the respins have occurred. Wouldn't the calculation done on the local memory be done in cache, with one cache flush when the barrier statement is reached? Then the final copy out to global memory shouldn't be dependent on the number of respins.

      What I'm observing is:

      1. If omitting the final copy out to global memory, this kernel runs extremely fast, regardless of the number of respins.

      2. When the final copy to global memory is included, the kernel run time degrades as the number of respins is increased.

      Observation #2 doesn't make sense. Shouldn't the final copy out to global space take a fixed amount of time regardless of the number of respins?

      Thank you.

      __kernel void gpufunc(__global uint *input, __global uint *output,
                    __local uint *inl) {
          unsigned int n;
          unsigned int cnt = 0;

          /* Move input to local memory */
          for (n=0;n<256;n++) {
              inl[n] = input[n];
          }
          barrier(CLK_LOCAL_MEM_FENCE);

      respin:

          /* Do some calculations */
          for (n=0;n<256;n++) {
              inl[n] = inl[n] + 1;
          }

          if (++cnt < 0xFFF) goto respin;

          barrier(CLK_LOCAL_MEM_FENCE);

    /* Move data back to global memory to be read by application */
    for (n=0; n<256; n++) {
        output[n] = inl[n];
    }

          return;
      }
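For reference, the respin loop body above executes 0xFFF (4095) times: `cnt` counts 1 through 0xFFE taking the `goto`, plus the final pass. So the kernel leaves `output[n] = input[n] + 0xFFF`. A plain-C host-side sketch of the same arithmetic (useful for checking the GPU results; it assumes the 256-element size hard-coded above):

```c
#include <stdint.h>

/* Host-side reference for the kernel: each element is incremented
 * by 0xFFF in total, once per execution of the respin loop body. */
void gpufunc_ref(const uint32_t *input, uint32_t *output) {
    for (unsigned n = 0; n < 256; n++)
        output[n] = input[n] + 0xFFF;
}
```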

        • local memory performance question
          LeeHowes

          I would guess that the compiler is optimising the entire kernel away if you don't include that write back. As a result you are seeing what you would hope to see: the slowdown comes from executing a longer loop in the middle.

          • local memory performance question
            genaganna

             

            Originally posted by: fajsc88

            Using the kernel code located below, why does the number of respins affect the time required to copy the final results from local to global memory?  I'm running a single work-group and single work-item. The barrier is placed after all the respins have occurred. Wouldn't the calculation done on the local memory be done in cache, with one cache flush when the barrier statement is reached? Then the final copy out to global memory shouldn't be dependent on the number of respins.

            What I'm observing is:

            1. If omitting the final copy out to global memory, this kernel runs extremely fast, regardless of the number of respins.

            2. When the final copy to global memory is included, the kernel run time degrades as the number of respins is increased.

            Observation #2 doesn't make sense. Shouldn't the final copy out to global space take a fixed amount of time regardless of the number of respins?

            Thank you.

            __kernel void gpufunc(__global uint *input, __global uint *output,
                          __local uint *inl) {
                unsigned int n;
                unsigned int cnt = 0;

                /* Move input to local memory */
                for (n=0;n<256;n++) {
                    inl[n] = input[n];
                }
                barrier(CLK_LOCAL_MEM_FENCE);

            respin:
                /* Do some calculations */
                for (n=0;n<256;n++) {
                    inl[n] = inl[n] + 1;
                }

                if (++cnt < 0xFFF) goto respin;

                barrier(CLK_LOCAL_MEM_FENCE);

                /* Move data back to global memory to be read by application */
                for (n=0; n<256; n++) {
                    output[n] = inl[n];
                }

                return;
            }

            Which device are you running on?

            At least put one dummy statement like output[0] = 10 in the first case; after that, both should behave the same way.
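            The suggestion above amounts to ending the no-copy timing variant with a single write, in place of the final copy loop. Note the difference between the two possible dummy writes (an editorial sketch; these are fragments to drop into the kernel, not complete code):

            ```c
            /* Case 1: constant write -- nothing depends on inl[], so the
             * compiler may still eliminate the respin loop as dead code
             * (fast, but misleading as a timing baseline). */
            output[0] = 10;

            /* Case 2: dependent write -- the result depends on inl[], which
             * forces the respin loop to actually execute. */
            output[0] = inl[0];
            ```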

              • local memory performance question
                fajsc88

                Thanks for responding, much appreciated.  Yes, I tried writing a single value in place of the final loop that copies to global memory.  For example, I did:

                output[0] = 0xFF;

                and I tried:

                output[0] = inl[0];

                When writing a constant value, the performance was good.  When writing a single value from local memory, the performance was identical to copying the entire array from local to global.  Maybe in the constant case the compiler is still smart enough to optimize out all the calculations on inl[].

                I'm running this on Linux.  My next step is to set up a Windows workstation, where the profiling tools can be used.  This should help identify the root cause.  Unfortunately the AMD tools provided for Linux are a subset of what they provide for Windows.
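                As an aside, one thing that does work on Linux without vendor tools is OpenCL's built-in event profiling: create the queue with CL_QUEUE_PROFILING_ENABLE and read start/end timestamps from the kernel's event. A hedged host-side sketch (it assumes `ctx`, `device`, and `kernel` are already set up elsewhere, and a global work size of 1 as in the post):

                ```c
                /* Per-kernel timing via OpenCL event profiling.
                 * Timestamps are in nanoseconds. */
                cl_int err;
                cl_command_queue q = clCreateCommandQueue(ctx, device,
                                                          CL_QUEUE_PROFILING_ENABLE, &err);
                cl_event ev;
                size_t gws = 1;
                clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL, 0, NULL, &ev);
                clWaitForEvents(1, &ev);

                cl_ulong t0, t1;
                clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                        sizeof(t0), &t0, NULL);
                clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                        sizeof(t1), &t1, NULL);
                printf("kernel time: %lu ns\n", (unsigned long)(t1 - t0));
                ```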

              • local memory performance question
                notzed

                 

                Originally posted by: fajsc88

                What I'm observing is:

                1. If omitting the final copy out to global memory, this kernel runs extremely fast, regardless of the number of respins.

                2. When the final copy to global memory is included, the kernel run time degrades as the number of respins is increased.

                Observation #2 doesn't make sense. Shouldn't the final copy out to global space take a fixed amount of time regardless of the number of respins?

                2. You're timing the calculations + global memory copy.  So more calculations should take longer ... shouldn't they?   GPUs are fast, but they still take time to run calculations ...

                'local memory' is not a cache.  If you're writing to local memory, the writes still go to the local memory - the data (in this kernel) is either in registers or in local memory, there's nowhere else for it to go (assuming no register spillage).

                So even without the barrier the loop will have to write the partial answers to local memory anyway since it can't fit everything in registers.  And of course, a barrier doesn't really do anything with a single thread.

                Looking at an assembly dump might be more useful than a profiling tool.  At least you can tell if the loop has been compiled out.   Be wary of making your test cases too simple - compilers are fairly good at constant elimination, so you may find your whole routine vanishes.