kopfrechner

#define faster than argument or private variable

Discussion created by kopfrechner on Aug 30, 2011
Latest reply on Aug 31, 2011 by kopfrechner

Hello everyone,

Does anyone know why there is such a performance difference between the following kernels? Please focus on "NR_OF_LOOPS" or "nr_of_loops" in the for loop. (NOTE: the kernels are simplified!)

#define NR_OF_LOOPS (256)

// fast
__kernel void my_kernel_1 (... args ...) {
   // do something
   for (size_t i = 0; i < NR_OF_LOOPS; i++) {
      // do something
   }
   // do something
}

// slow
__kernel void my_kernel_2 (unsigned nr_of_loops) {
   // do something
   for (size_t i = 0; i < nr_of_loops; i++) {
      // do something
   }
   // do something
}

// slow
__kernel void my_kernel_3 (... args ...) {

   size_t nr_of_loops = 256;

   // do something
   for (size_t i = 0; i < nr_of_loops; i++) {
      // do something
   }
   // do something
}

What I need is an implementation like my_kernel_2 or my_kernel_3, where nr_of_loops is either a kernel argument or a variable in private memory. But my_kernel_1 is about a factor of 2 faster than my_kernel_2 or my_kernel_3. Does anyone know how to make my_kernel_2 or my_kernel_3 faster, or even as fast as my_kernel_1? I would also appreciate any other ideas.
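One direction that might be worth trying, as a sketch only: keep the `NR_OF_LOOPS` macro from my_kernel_1 but supply its value from the host when the program is built, using the `-D name=definition` option that `clBuildProgram` accepts. This rests on the unverified assumption that the speedup comes from the compiler knowing the trip count at compile time (for example, so it can unroll the loop). It also means rebuilding the program whenever the count changes. The helper name `make_build_options` is hypothetical and not part of OpenCL.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL): formats a build option string
 * such as "-D NR_OF_LOOPS=256" into buf.
 * Returns 0 on success, -1 if buf is too small or formatting fails. */
int make_build_options(char *buf, size_t buf_size, unsigned nr_of_loops)
{
    int n = snprintf(buf, buf_size, "-D NR_OF_LOOPS=%u", nr_of_loops);
    return (n < 0 || (size_t)n >= buf_size) ? -1 : 0;
}

/* Intended use on the host. Assumes an already created cl_program `program`
 * and a cl_device_id `device`; error handling is omitted:
 *
 *   char options[64];
 *   if (make_build_options(options, sizeof options, 256) == 0)
 *       clBuildProgram(program, 1, &device, options, NULL, NULL);
 *
 * The kernel source would then use NR_OF_LOOPS as in my_kernel_1,
 * without the #define line in the source itself.
 */
```

Whether this matches the speed of my_kernel_1 would need to be measured on the target device.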

Thanks in advance.
