4 Replies Latest reply on Nov 14, 2012 11:29 AM by binying

    OpenCL kernel optimizing

    xerxis

      So I recently started using OpenCL to do calculations, and while it is already a lot faster than the same code on the CPU, I'm still a bit worried about performance, especially after reading about coalesced memory access.

       

      I have a kernel for the following problem: I need to delay the elements in an array. Say, for example, I have an array of 5 elements {1, 2, 3, 4, 5} and a delay value of 2; that would produce the resultant array {1, 1, 1, 2, 3}. There are also negative delays: a delay of -2 would produce {3, 4, 5, 5, 5}.

       

      Now, I have a lot of arrays that need this treatment, and instead of sending them one by one and calling the kernel every time, I merge them all into one long 1D array and call the kernel only once. One snag: every array has its own delay.

      So, let's again assume two arrays of 5 elements (stride 5) merged into one larger array { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }; the first array has a delay of 2 and the second a delay of -2. So I have an array of delays { 2, -2 }. My kernel should thus produce { 1, 1, 1, 2, 3, 8, 9, 10, 10, 10 }.
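To make the expected result concrete, here is a minimal plain-C reference sketch of the delay step as described above. It is not the kernel itself, only a CPU model that can be used to check results; the function name `delay_signals_ref` and its parameter layout are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reference (CPU) model of the delay step. Each segment of `stride`
 * elements is shifted by its own delay. Positions that would read from
 * outside the segment repeat the segment's first element (positive delay)
 * or its last element (negative delay).
 * The name and signature are hypothetical, chosen for this sketch. */
void delay_signals_ref(const double *input, double *output,
                       const int *delays, size_t stride, size_t segments)
{
    assert(stride > 0);
    for (size_t s = 0; s < segments; ++s) {
        const double *in = input + s * stride;   /* start of this segment */
        double *out = output + s * stride;
        for (size_t k = 0; k < stride; ++k) {
            /* Source position inside the segment, before edge handling. */
            long src = (long)k - delays[s];
            if (src < 0) src = 0;                               /* repeat first element */
            if (src > (long)stride - 1) src = (long)stride - 1; /* repeat last element */
            out[k] = in[src];
        }
    }
}
```

Called with the ten-element input above, delays { 2, -2 }, a stride of 5 and 2 segments, this fills the output with { 1, 1, 1, 2, 3, 8, 9, 10, 10, 10 }.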

       

      I hope this is an adequate explanation of what I want to achieve. In reality the stride is much larger (6668) and there are a few dozen arrays. This is also only one step in a much larger calculation, so the necessary data is already in device memory. The kernel I wrote for this is as follows:

       

      [code]

      __kernel void delay_signal(__global double* input, __global double* output, __global int* delays, const unsigned int stride, const unsigned int count)
      {
          int i = get_global_id(0);

          if (i < count)
          {
              // Each block of `stride` elements has its own delay
              int delaysindex = i / stride;
              int delay = delays[delaysindex];

              if (delay == 0)
              {
                  output[i] = input[i];
              }
              else
              {
                  // Limit |delay| to stride - 1 while keeping its sign
                  if (abs(delay) >= stride) delay = (delay > 0) ? (int)stride - 1 : -((int)stride - 1);

                  int startindex = delaysindex * stride;

                  if (delay > 0)
                  {
                      int delaystart = startindex + delay;

                      // Positions before the delayed start repeat the first element
                      if (i < delaystart)
                          output[i] = input[startindex];
                      else
                          output[i] = input[i - delay];
                  }
                  else
                  {
                      int endindex = startindex + stride - 1;
                      int delaystart = endindex + delay;

                      // Positions past the shifted end repeat the last element
                      if (i < delaystart)
                          output[i] = input[i - delay];
                      else
                          output[i] = input[endindex];
                  }
              }
          }
      }

       

      [/code] (how do i paste code?)

       

      Which does the job, but I realize the memory access is probably subpar. I just don't have the experience yet to know how to improve it. Should I first copy the delays to local memory? But if I do, what's the advantage, given that every work-item still has to read them from global memory first? Also, how can I improve the way I read from input? I know the way I do it now is not coalesced.

       

      If somebody could help me with this and give me some tips on how to optimize this kind of kernel, I would be very grateful.

        • Re: OpenCL kernel optimizing
          notzed

          Looks like your memory accesses are coalescible[sic], so I wouldn't be concerned about that.
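One rough way to see this is to model, in plain C, which input index each work-item reads and count how often two neighbouring work-items read addresses that are neither identical nor adjacent. This is only a sketch: the helper names are made up for the illustration, and it ignores alignment and the actual hardware rules for coalescing.

```c
#include <assert.h>

/* Input index read by work-item i (plain-C model of the kernel's logic,
 * written with clamping; hypothetical helper, not OpenCL code). */
static int source_index(int i, const int *delays, int stride)
{
    int seg = i / stride;
    int start = seg * stride;
    int local = i - start - delays[seg];   /* shifted position in the segment */
    if (local < 0) local = 0;
    if (local > stride - 1) local = stride - 1;
    return start + local;
}

/* Counts pairs of neighbouring work-items (i, i+1) whose reads are neither
 * the same address nor adjacent addresses. */
int count_scattered_pairs(const int *delays, int stride, int count)
{
    assert(stride > 0);
    int scattered = 0;
    for (int i = 0; i + 1 < count; ++i) {
        int step = source_index(i + 1, delays, stride)
                 - source_index(i, delays, stride);
        if (step != 0 && step != 1) ++scattered;
    }
    return scattered;
}
```

For the two-array example (stride 5, delays { 2, -2 }) the only such pair is the one that straddles the boundary between the two arrays. Inside an array, neighbouring work-items always read the same or the next element.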

           

          Your code is pretty poor though!

           

          You should be able to re-arrange it to be much simpler and much more readable, say something like this:

           

          int delay = delays[]; // whatever, as you have now
          int startindex = ...;

          // simpler algorithm to do the same thing:
          // shift the position within the block, then clamp it to the block
          int j = startindex + clamp(i - startindex - delay, 0, (int)stride - 1);

          output[i] = input[j];

           

          (Although in reality I don't know if this would make much difference to execution time, as the compiler will do some optimising, but it will be easier to read and understand.)
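As a sanity check on the simplification, the sketch below models both index calculations in plain C so they can be compared on the host. `clamp_int` stands in for OpenCL's integer `clamp()`, and both function names are hypothetical. The branching version follows the structure of the posted kernel without its delay-limiting line.

```c
#include <assert.h>

/* Plain-C stand-in for OpenCL's integer clamp(). */
static int clamp_int(int x, int lo, int hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Source index for work-item i using a single clamp. */
int source_index_clamped(int i, int delay, int stride)
{
    assert(stride > 0);
    int startindex = (i / stride) * stride;
    /* Shift the position within the block, then keep it inside the block. */
    return startindex + clamp_int(i - startindex - delay, 0, stride - 1);
}

/* Source index following the branch structure of the original kernel. */
int source_index_branchy(int i, int delay, int stride)
{
    assert(stride > 0);
    int startindex = (i / stride) * stride;
    int endindex = startindex + stride - 1;
    if (delay == 0)
        return i;
    if (delay > 0)
        return (i < startindex + delay) ? startindex : i - delay;
    return (i < endindex + delay) ? i - delay : endindex;
}
```

The two functions return the same index for every position and delay, including delays whose magnitude exceeds the stride, so the clamped form needs no separate limit on the delay.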

           

          Actually I'd say the code is so simple you'd be better off ditching the kernel entirely and just doing this calculation where you need to use its result. That'll save you a redundant copy of the data and the overhead of a kernel call.
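A sketch of what that could look like, in plain C for brevity: the next step of the calculation is not shown in this thread, so the consumer below (a sample-by-sample sum over all delayed arrays) is purely a made-up example, as is the name `sum_delayed`. The point is only that the delay becomes an index computation at the place where the data is read, with no intermediate delayed buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical consumer: adds up all delayed segments sample by sample.
 * Instead of reading a pre-delayed copy, it applies the delay while
 * indexing the original input. */
void sum_delayed(const double *input, const int *delays,
                 size_t stride, size_t segments, double *sum)
{
    assert(stride > 0);
    for (size_t k = 0; k < stride; ++k) {
        double acc = 0.0;
        for (size_t s = 0; s < segments; ++s) {
            /* Same clamped index as the delay kernel would have used. */
            long src = (long)k - delays[s];
            if (src < 0) src = 0;
            if (src > (long)stride - 1) src = (long)stride - 1;
            acc += input[s * stride + (size_t)src];  /* delayed read, no extra buffer */
        }
        sum[k] = acc;
    }
}
```

With the two-array example from the question (delays { 2, -2 }, stride 5), this adds { 1, 1, 1, 2, 3 } and { 8, 9, 10, 10, 10 } on the fly and produces { 9, 10, 11, 12, 13 }.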

          • Re: OpenCL kernel optimizing
            binying

            There are many "if-else" branches in your code. Can you reduce their number, say, by using "switch"?