xerxis
Journeyman III

OpenCL kernel optimizing

So I recently started using OpenCL to do calculations, and while it is already a lot faster than the same code on the CPU, I'm still a bit worried about performance, especially after reading about coalesced memory access.

I have a kernel for the following problem: I need to delay elements in an array. Let's say, for example, I have an array that consists of 5 elements {1, 2, 3, 4, 5} and a delay value of 2; that would produce the resultant array {1, 1, 1, 2, 3}. There are also negative delays: the resultant array for a delay of -2 would be {3, 4, 5, 5, 5}.

Now, I have a lot of arrays that need this treatment, and instead of sending them one by one and calling the kernel every time, I merge them all together into one long 1D array and call the kernel only once. One snag: every array has a different delay.

So, let's again assume two arrays with 5 elements each (stride 5) made into one larger array { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }; the first array has a delay of 2 and the second a delay of -2. So I have an array of delays that consists of { 2, -2 }. My kernel should thus produce { 1, 1, 1, 2, 3, 8, 9, 10, 10, 10 }.
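The behaviour described above can be written down as a small CPU reference to check a kernel against. This is only a sketch in plain C, not the kernel itself: the names `delay_reference` and `clamp_int` are hypothetical, and it assumes `count` is a multiple of `stride` and that there is one delay per sub-array.

```c
#include <stddef.h>

/* Hypothetical helper: limit x to the range [lo, hi]. */
static int clamp_int(int x, int lo, int hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Hypothetical CPU reference for the delay operation.
   Each output element reads the element `delay` positions earlier in its
   own sub-array; reads that would fall outside the sub-array are clamped
   to its first or last element. */
void delay_reference(const double *in, double *out, const int *delays,
                     int stride, int count)
{
    for (int i = 0; i < count; ++i) {
        int array_index = i / stride;          /* which sub-array */
        int start = array_index * stride;      /* first element of it */
        int local = i - start;                 /* position inside it */
        int src = clamp_int(local - delays[array_index], 0, stride - 1);
        out[i] = in[start + src];
    }
}
```

Running it on the ten-element example with delays { 2, -2 } and stride 5 gives the expected array from the paragraph above.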

I hope this is an adequate explanation of what I want to achieve. In reality the stride is a much larger number (6668) and there are a few dozen arrays. Also, this is only one step in a much larger calculation, so the necessary data is already in device memory. The kernel I wrote for this goes as follows:

__kernel void delay_signal(__global double* input, __global double* output, __global int* delays, const unsigned int stride, const unsigned int count)
{
    int i = get_global_id(0);

    if (i < count)
    {
        // Which sub-array this element belongs to, and that sub-array's delay
        int delaysindex = i / stride;
        int delay = delays[delaysindex];

        if (delay == 0)
        {
            output[i] = input[i];
        }
        else
        {
            // Limit the magnitude of the delay to stride - 1, keeping its sign
            if (abs(delay) >= stride) delay = (delay > 0) ? (int)stride - 1 : 1 - (int)stride;

            int startindex = delaysindex * stride;

            if (delay > 0)
            {
                // Positive delay: repeat the first element, then shift
                int delaystart = startindex + delay;

                if (i < delaystart)
                    output[i] = input[startindex];
                else
                    output[i] = input[i - delay];
            }
            else
            {
                // Negative delay: shift, then repeat the last element
                int endindex = startindex + stride - 1;
                int delaystart = endindex + delay;

                if (i < delaystart)
                    output[i] = input[i - delay];
                else
                    output[i] = input[endindex];
            }
        }
    }
}

(how do i paste code?)

Which does the job, but I realize the memory access is probably subpar. I just don't have the experience yet to know how to improve it. Should I first copy the delays to local memory? But if I do, what's the advantage if every work item still has to read them from global memory first? Also, how can I improve the way I read from input? I know the way I do it now is not coalesced.

If somebody could help me with this and give me some tips on how to optimize this kind of kernel, I would be very grateful.

0 Likes
4 Replies
notzed
Challenger

Looks like your memory accesses are coalescible[sic], so I wouldn't be concerned about that.
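One way to sanity-check that claim on the CPU is to look at which input element each pair of neighbouring work items would read. The sketch below is plain C with hypothetical names (`src_index`, `count_scattered_neighbours`); it assumes all-`int` arithmetic and one delay per sub-array. Inside a sub-array, neighbouring work items read either adjacent elements or the same clamped element, so the count it returns should be zero. Whether the hardware actually coalesces those reads still depends on the device.

```c
/* Hypothetical helper: input index read by work item i (all-int version). */
static int src_index(int i, int delay, int stride)
{
    int start = (i / stride) * stride;   /* first element of the sub-array */
    int local = i - start - delay;       /* shifted position inside it */
    if (local < 0) local = 0;
    if (local > stride - 1) local = stride - 1;
    return start + local;
}

/* Counts pairs of neighbouring work items (i, i + 1) in the same sub-array
   whose reads are neither the same element nor adjacent elements. */
int count_scattered_neighbours(const int *delays, int stride, int count)
{
    int scattered = 0;
    for (int i = 0; i + 1 < count; ++i) {
        if (i / stride != (i + 1) / stride)
            continue;                    /* pair straddles two sub-arrays */
        int d = delays[i / stride];
        int step = src_index(i + 1, d, stride) - src_index(i, d, stride);
        if (step != 0 && step != 1)
            ++scattered;
    }
    return scattered;
}
```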

Your code is pretty poor though!

You should be able to re-arrange it to be much simpler and much more readable, say something like this:

int delay = delays[]; // whatever, as you have now
int startindex = ...;
// simpler algorithm to do the same thing: clamp the position within the sub-array
int j = startindex + clamp(i - startindex - delay, 0, stride-1);
output[i] = input[j];

(Although in reality I don't know if this would make much difference to execution time, as the compiler will do some optimising, but it will be easier to read and understand.)

Actually, I'd say the code is so simple you'd be better off ditching the kernel entirely and just doing this calculation where you need to use its result. That'll save you a redundant copy of the data and the overhead of a kernel call.

0 Likes

Thanks, but using fmax, fmin or clamp works on my AMD desktop but makes the NVIDIA compiler crash on my laptop.

Also, I tried doing the calculation in place but it was slower. The reason I do this relatively easy calculation on the GPU is because the data is already on there, and I don't want to read from and write to the host more than necessary.

0 Likes

Never mind the crashing of clamp; I guess you can't mix int with unsigned int.
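For scalar arguments, OpenCL's integer clamp expects all three arguments to have the same type, so passing an int value together with a bound computed from the unsigned stride does not match cleanly. A sketch of one way around it, written in plain C with a hypothetical helper name `source_index`: convert the stride to int once, then do all index math in signed arithmetic. It assumes the stride fits in an int. In kernel code, the two `if` statements would correspond to a clamp call with three int arguments.

```c
/* Hypothetical helper mirroring the kernel arguments: stride stays unsigned
   in the signature, but is converted to int before any signed arithmetic. */
int source_index(int i, int delay, unsigned int stride)
{
    int s = (int)stride;             /* assumption: stride fits in an int */
    int start = (i / s) * s;         /* first element of the sub-array */
    int local = i - start - delay;   /* may be negative or past the end */

    /* Same effect as clamping local to [0, s - 1] with all-int arguments. */
    if (local < 0) local = 0;
    if (local > s - 1) local = s - 1;

    return start + local;
}
```

With stride 5, work item 5 and delay -2 map to input index 7, and work item 0 with delay 2 maps to input index 0, matching the example in the original question.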

0 Likes
binying
Challenger

There are many "if-else" statements in your code. Can you decrease the number of those, say, by using "switch"?

0 Likes