Hello everyone,
Does anyone know why there is such a performance difference between the following kernels? Please focus on "NR_OF_LOOPS" or "nr_of_loops" in the for loop. (NOTE: the kernels are simplified!)
#define NR_OF_LOOPS (256)

// fast
__kernel void my_kernel_1 (... args ...) {
    // do something
    for (size_t i = 0; i < NR_OF_LOOPS; i++) {
        // do something
    }
    // do something
}

// slow
__kernel void my_kernel_2 (unsigned nr_of_loops) {
    // do something
    for (size_t i = 0; i < nr_of_loops; i++) {
        // do something
    }
    // do something
}

// slow
__kernel void my_kernel_3 (... args ...) {
    size_t nr_of_loops = 256;
    // do something
    for (size_t i = 0; i < nr_of_loops; i++) {
        // do something
    }
    // do something
}
What I need is an implementation like my_kernel_2 or my_kernel_3, where nr_of_loops is either an argument or a variable in private memory. But my_kernel_1 is about a factor of 2 faster than my_kernel_2 or my_kernel_3. Does anyone know how to make my_kernel_2 or my_kernel_3 faster, or even as fast as my_kernel_1? I also appreciate any other ideas.
Thank you so far.
Because NR_OF_LOOPS in my_kernel_1 is a preprocessor constant, the compiler knows the trip count at compile time and can unroll the loop to achieve better performance. You could try putting #pragma unroll some_number_here right above the loop in my_kernel_2 and my_kernel_3, or manually unroll the loop by some factor, keeping in mind that nr_of_loops may not be evenly divisible by the unroll factor. Something similar can be done by explicitly using vector types.
// Unroll loop iterations by a factor of 4
for (i = 0; i + 3 < nr_of_loops; i += 4) {
    z[i+0] = ...
    z[i+1] = ...
    z[i+2] = ...
    z[i+3] = ...
}
// Remaining loop iterations (i continues where the unrolled loop stopped)
for (; i < nr_of_loops; ++i) {
    z[i] = ...
}
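To make the remainder handling concrete, here is a minimal sketch of the unroll-by-4 pattern as plain C rather than as a kernel. The names fill_reference, fill_unrolled and work are hypothetical; work() only stands in for the elided "..." on the right-hand side above. The unrolled part stops at the largest multiple of 4 that fits, and the second loop picks up the leftover 0 to 3 iterations.

```c
#include <stddef.h>

/* Hypothetical per-element work; stands in for the "..." in the post. */
static float work(size_t i) {
    return 2.0f * (float)i + 1.0f;
}

/* Plain loop, kept as a reference to compare against. */
void fill_reference(float *z, size_t nr_of_loops) {
    for (size_t i = 0; i < nr_of_loops; ++i) {
        z[i] = work(i);
    }
}

/* Same result, with the loop body manually unrolled by a factor of 4. */
void fill_unrolled(float *z, size_t nr_of_loops) {
    /* Largest multiple of 4 that is <= nr_of_loops. */
    size_t main_end = nr_of_loops & ~(size_t)3;
    size_t i;

    /* Unrolled part: four elements per iteration. */
    for (i = 0; i < main_end; i += 4) {
        z[i + 0] = work(i + 0);
        z[i + 1] = work(i + 1);
        z[i + 2] = work(i + 2);
        z[i + 3] = work(i + 3);
    }

    /* Remaining 0..3 iterations when nr_of_loops is not a multiple of 4. */
    for (; i < nr_of_loops; ++i) {
        z[i] = work(i);
    }
}
```

Both functions write exactly the same elements for any nr_of_loops, including 0 and values that are not multiples of 4. Whether the unrolled form is actually faster depends on the device and compiler, so it is worth timing both.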
Thanks, unrolling my loop 4 times saved me a lot of time!
Also thanks to genaganna.
Originally posted by: kopfrechner
Does anyone know why there is such a performance difference between the following kernels? [...]
For the second kernel, pass nr_of_loops as a __constant uint* argument.
For the third kernel, declare nr_of_loops as const size_t nr_of_loops = 256.
After these changes, kernels 1 and 3 should perform the same, and kernel 2 will still be slower than kernels 1 and 3.
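As a rough sketch of what those two changes might look like, the code below applies them to simplified versions of my_kernel_2 and my_kernel_3. The out argument and the summing loop body are assumptions added only so the loop does something observable; they are not from the original kernels. The shim block at the top is also an assumption: it is skipped by an OpenCL compiler (which defines __OPENCL_VERSION__) and only exists so the same text compiles as plain C for a quick check.

```c
#ifndef __OPENCL_VERSION__
/* Host-side shim (assumption): lets this sketch compile as plain C. */
#include <stddef.h>
#define __kernel
#define __global
#define __constant const
#endif

/* Kernel 2: the loop count is read through a __constant pointer argument. */
__kernel void my_kernel_2(__constant unsigned int *nr_of_loops,
                          __global float *out) {
    const unsigned int n = *nr_of_loops; /* read once, before the loop */
    float acc = 0.0f;
    for (unsigned int i = 0; i < n; i++) {
        acc += (float)i; /* placeholder for "do something" */
    }
    out[0] = acc;
}

/* Kernel 3: the loop count is a const local, so its value is visible
   to the compiler at compile time, much like the #define in kernel 1. */
__kernel void my_kernel_3(__global float *out) {
    const size_t nr_of_loops = 256;
    float acc = 0.0f;
    for (size_t i = 0; i < nr_of_loops; i++) {
        acc += (float)i; /* placeholder for "do something" */
    }
    out[0] = acc;
}
```

In kernel 2 the trip count is still only known at run time, which matches the expectation above that it stays slower than kernels 1 and 3.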