3 Replies Latest reply on Aug 31, 2011 7:28 AM by kopfrechner

    #define faster than argument or private variable

    kopfrechner

      Hello everyone,

      does anyone know why there is such a performance difference between the following kernels? Please focus on the "NR_OF_LOOPS" or "nr_of_loops" in the for loop. (NOTE: kernels are simplified!)

      #define NR_OF_LOOPS (256)

      // fast
      __kernel void my_kernel_1 (... args ...) {
         // do something
         for (size_t i = 0; i < NR_OF_LOOPS; i++) {
            // do something
         }
         // do something
      }

      // slow
      __kernel void my_kernel_2 (unsigned nr_of_loops) {
         // do something
         for (size_t i = 0; i < nr_of_loops; i++) {
            // do something
         }
         // do something
      }

      //slow
      __kernel void my_kernel_3 (... args ...) {

         size_t nr_of_loops = 256;

         // do something
         for (size_t i = 0; i < nr_of_loops; i++) {
            // do something
         }
         // do something
      }

      What I need is an implementation like my_kernel_2 or my_kernel_3, where nr_of_loops is either an argument or a variable in private memory. But my_kernel_1 is about a factor of 2 faster than my_kernel_2 or my_kernel_3. Does anyone know how to make my_kernel_2 or my_kernel_3 faster, or even as fast as my_kernel_1? I also appreciate any other ideas.

      Thank you so far.

        • #define faster than argument or private variable
          settle

          The preprocessor constant in my_kernel_1 makes the trip count known at compile time, which allows the compiler to unroll the loop to achieve better performance.  You could try using #pragma unroll some_number_here right above the loop in my_kernel_2 and my_kernel_3, or manually unroll the loop by some factor, keeping in mind that nr_of_loops may not be evenly divisible by the unroll factor.  Something similar can be done by explicitly using vector types.

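          // Unroll loop iterations by a factor of 4.
          // The bound is nr_of_loops rounded down to a multiple of 4,
          // so z[i+3] never reads or writes past the end.
          for (i = 0; i < (nr_of_loops & ~0x3); i += 4) {
             z[i+0] = ...
             z[i+1] = ...
             z[i+2] = ...
             z[i+3] = ...
          }

          // Remaining loop iterations (at most 3)
          for (i = nr_of_loops & ~0x3; i < nr_of_loops; ++i) {
             z[i] = ...
          }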
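The split into an unrolled main loop and a remainder loop is easy to get wrong when nr_of_loops is not a multiple of 4, so here is a minimal plain-C sketch of the same pattern that can be checked on the host. The function names and the loop body (z[i] = 2 * x[i]) are made up for illustration; the real kernel body was not posted.

```c
#include <stddef.h>

/* Reference version: one element per iteration. */
void scale_simple(float *z, const float *x, size_t nr_of_loops) {
    for (size_t i = 0; i < nr_of_loops; ++i) {
        z[i] = 2.0f * x[i];
    }
}

/* Manually unrolled by a factor of 4, plus a remainder loop. */
void scale_unrolled4(float *z, const float *x, size_t nr_of_loops) {
    /* Largest multiple of 4 that does not exceed nr_of_loops. */
    size_t main_end = nr_of_loops & ~(size_t)0x3;
    size_t i;

    for (i = 0; i < main_end; i += 4) {
        z[i + 0] = 2.0f * x[i + 0];
        z[i + 1] = 2.0f * x[i + 1];
        z[i + 2] = 2.0f * x[i + 2];
        z[i + 3] = 2.0f * x[i + 3];
    }

    /* At most 3 leftover iterations; i already equals main_end here. */
    for (; i < nr_of_loops; ++i) {
        z[i] = 2.0f * x[i];
    }
}
```

Calling both functions with a length such as 11 should give identical output arrays. Whether the unrolled form is actually faster on a given device is something to measure rather than assume.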

          • #define faster than argument or private variable
            genaganna

             

            For the second kernel, pass nr_of_loops as a __constant uint* argument.

            For the third kernel, declare nr_of_loops as const size_t nr_of_loops = 256.

            After these changes, kernels 1 and 3 should perform the same, and kernel 2 will still be slower than kernels 1 and 3.
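A related option, not mentioned in the replies above and offered only as a sketch: if nr_of_loops is known on the host before the program is built, it can be passed as a preprocessor definition through the options string of clBuildProgram, using the standard -D name=definition build option. The kernel then sees a compile-time constant, as in my_kernel_1, while the host still picks the value at run time. The helper name make_build_options is hypothetical, and the kernel source would need to drop its own #define NR_OF_LOOPS (or guard it with #ifndef).

```c
#include <stdio.h>

/*
 * Formats "-D NR_OF_LOOPS=<n>" into buf.
 * Returns 0 on success, -1 if buf is too small or formatting fails.
 * Hypothetical helper: the resulting string is meant to be passed as the
 * "options" argument of clBuildProgram.
 */
int make_build_options(char *buf, size_t buf_size, unsigned nr_of_loops) {
    int written = snprintf(buf, buf_size, "-D NR_OF_LOOPS=%u", nr_of_loops);
    if (written < 0 || (size_t)written >= buf_size) {
        return -1;
    }
    return 0;
}
```

The trade-off is that every new value of nr_of_loops needs a rebuild of the program, so this only pays off when the value changes rarely compared with the number of kernel launches.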