
OpenCL kernel optimizing

Question asked by xerxis on Oct 14, 2012
Latest reply on Nov 14, 2012 by binying

So I recently started using OpenCL to do calculations, and while it is already a lot faster than the same code on the CPU, I'm still a bit worried about performance, especially after reading about coalesced memory access.

 

I have a kernel for the following problem: I need to delay the elements in an array. Let's say, for example, I have an array that consists of 5 elements {1, 2, 3, 4, 5} and a delay value of 2; that would produce the resultant array {1, 1, 1, 2, 3}. There are also negative delays: a negative delay of -2 would produce the resultant array {3, 4, 5, 5, 5}.
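The single-array behaviour can be pinned down with a small host-side reference function. This is just a sketch to fix the edge cases (`delay_array` is a made-up name, not part of the kernel); it clamps the source index so a positive delay pads with the first element and a negative delay pads with the last:

```c
/* Reference implementation of the delay operation on one array of n samples.
   A positive delay shifts the signal right, repeating the first element;
   a negative delay shifts it left, repeating the last element. */
void delay_array(const double *input, double *output, int n, int delay)
{
    for (int i = 0; i < n; i++) {
        int src = i - delay;          /* where this output sample comes from */
        if (src < 0)     src = 0;     /* before the array: pad with first */
        if (src > n - 1) src = n - 1; /* past the array: pad with last */
        output[i] = input[src];
    }
}
```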

 

Now, I have a lot of arrays that need this treatment, and instead of sending them one by one and calling the kernel every time, I merge them all together into one long 1D array and call the kernel only once. One snag: each array has a different delay.

So, let's again assume two arrays with 5 elements (stride 5) made into one larger array { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }, where the first array has a delay of 2 and the second a delay of -2. So I have an array of delays that consists of { 2, -2 }. My kernel should thus produce { 1, 1, 1, 2, 3, 8, 9, 10, 10, 10 }.
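The batched layout can likewise be written down as a short host-side reference (again just a sketch; `delay_signal_host` is a made-up name): `count` samples laid out as consecutive arrays of length `stride`, with one delay per array, and the source index clamped to the bounds of the array that sample belongs to.

```c
/* Host-side reference for the batched delay: the input is `count` samples
   laid out as consecutive arrays of length `stride`, one delay per array. */
void delay_signal_host(const double *input, double *output,
                       const int *delays, int stride, int count)
{
    for (int i = 0; i < count; i++) {
        int a     = i / stride;     /* which array this sample belongs to */
        int start = a * stride;     /* first index of that array */
        int end   = start + stride - 1;
        int src   = i - delays[a];
        if (src < start) src = start;   /* pad with the array's first element */
        if (src > end)   src = end;     /* pad with the array's last element  */
        output[i] = input[src];
    }
}
```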

 

I hope this is an adequate explanation of what I want to achieve; in reality the stride is a much larger number (6668) and there are a few dozen arrays. Also, this is only one step in a much larger calculation, so the necessary data is already in device memory. The kernel I wrote for this goes as follows:

 

[code]
__kernel void delay_signal(__global const double* input,
                           __global double* output,
                           __global const int* delays,
                           const unsigned int stride,
                           const unsigned int count)
{
    int i = get_global_id(0);

    if (i < count)
    {
        int delaysindex = i / stride;
        int delay = delays[delaysindex];

        if (delay == 0)
        {
            output[i] = input[i];
        }
        else
        {
            /* Clamp an out-of-range delay to the array length,
               preserving its sign (a plain delay = stride - 1 would
               turn a large negative delay into a positive one). */
            if (delay >= (int)stride)  delay = (int)stride - 1;
            if (delay <= -(int)stride) delay = -((int)stride - 1);

            int startindex = delaysindex * stride;

            if (delay > 0)
            {
                int delaystart = startindex + delay;

                if (i < delaystart)
                    output[i] = input[startindex];
                else
                    output[i] = input[i - delay];
            }
            else
            {
                int endindex = startindex + stride - 1;
                int delaystart = endindex + delay;

                if (i < delaystart)
                    output[i] = input[i - delay];
                else
                    output[i] = input[endindex];
            }
        }
    }
}
[/code]

 

This does the job, but I realize the memory access is probably subpar; I just don't have the experience yet to know how to improve it. Should I first copy the delays to local memory? But if I do, what's the advantage? Every work item still needs to read them from global memory first. Also, how can I improve the way I read from input? I know the way I do it now is not coalesced.
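One observation worth testing: every branch of the kernel reduces to clamping the source index `i - delay` into the bounds of the current array, which would remove the divergent if/else chains entirely (in OpenCL C this could use the built-in integer `clamp()`), and makes a separate clamp of out-of-range delays unnecessary. The plain-C sketch below checks that claim: `branchy` mirrors the kernel's per-element logic (assuming the out-of-range delay is clamped toward its own sign), `clamped` is the branch-free form, and the two are compared. Both function names are made up for illustration.

```c
/* Per-element logic of the branchy kernel, one sample at a time.
   i is a global index into arrays of length `stride` laid end to end. */
double branchy(const double *in, int i, int stride, int delay)
{
    int a = i / stride, start = a * stride, end = start + stride - 1;
    if (delay >= stride)  delay = stride - 1;        /* sign-preserving */
    if (delay <= -stride) delay = -(stride - 1);     /* out-of-range clamp */
    if (delay == 0) return in[i];
    if (delay > 0)
        return (i < start + delay) ? in[start] : in[i - delay];
    return (i < end + delay) ? in[i - delay] : in[end];
}

/* Branch-free equivalent: a single clamp of the source index
   replaces every special case, including out-of-range delays. */
double clamped(const double *in, int i, int stride, int delay)
{
    int start = (i / stride) * stride;
    int end   = start + stride - 1;
    int src   = i - delay;               /* OpenCL: clamp(i - delay, start, end) */
    if (src < start) src = start;
    if (src > end)   src = end;
    return in[src];
}
```

On the coalescing question itself: within one array, consecutive work items read `input[i - delay]` at consecutive addresses (the delay is constant per array), so only the padded edges and array boundaries break the pattern, which suggests the divergence may matter more than the access pattern here.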

 

If somebody could help me with this and give me some tips on how to optimize this kind of kernel, I would be very grateful.
