
Archives Discussions

Journeyman III

Kernel works on cpu but not on gpu

Hi there. A few days ago I started writing some code in OpenCL (I'd never used it before) and I have the following problem. I have a kernel which works fine when executed on the CPU, but when it runs on the GPU it gives me false results and leads to a segmentation fault after a while.

You can see the kernel here:

What especially seems to cause the problem is the last small loop which updates the values. Why can this happen? What am I doing wrong?

8 Replies
Adept II

Compile your kernel with -g and run valgrind on it. See if you're getting any memory errors in your kernel when you run it on the CPU. You'll get a ton from AMD's OpenCL itself, but you can ignore those. Looking at the kernel alone, it appears correct. Then run it with the GPU and see if you get any errors.

You might want to check that the sizeof() of your structs is the same on the CPU and GPU. If they aren't, you might not be allocating enough memory for them on the GPU.

As a rule of thumb, arrays of structs are a big no-no on GPUs (especially older ones). I would change your array of t_speed structs into a flat array of 9*count doubles. Lay out the memory so that count is the leading dimension and 9 is the trailing. This way, you get coalescing within a workgroup.

Also, unroll your d_equ and u arrays into 9 variables each. It looks like the compiler could move these out of memory and into registers due to the known trip counts on the loops, but I don't like to put my faith in such things. You'll have to fully unroll the for loops too, but at only 9 iterations that shouldn't be too bad.

I don't think these suggestions directly address the problem, because your code looks correct as far as I can tell, but they do make your code friendlier to the compiler and the GPU architecture, which might fix your problem.

First of all, thank you for your reply. It gave me some things to think about.

I'm trying to figure out if valgrind is available on the system (since it is not under my supervision). Meanwhile, I tested the sizes of the structs as you suggested, but they are the same on the CPU and GPU. I have also tried making the u and d_equ arrays separate variables, but it didn't make any difference (I know that alone wouldn't; I was just experimenting).

The hardware I'm trying to run the code on is a Tesla M2050. I don't think it's old enough to give me trouble (correct me if I'm wrong).

As for spreading out the memory, I know it would help a lot in terms of speed and efficiency, but to be honest I don't have the time right now to alter the whole code to accommodate such a change, and I want that to be my last resort. The thing is that there are more kernels executed before this one, and they need to take into consideration the values of some of the speeds[] in a t_speed object in order to update some of them again. Here that would probably require memory fences, to avoid updating some values before they are consumed by another work unit.


The 2050 is a Fermi card and I don't think its architecture should give you problems from a correctness standpoint.

You might have a bug in your frontend code that does the allocation and transfers. I've had programs act very strangely because I copied too much data from the GPU to CPU host buffer and clobbered stuff.

Finally, I've seen compiler bugs in both AMD's and Nvidia's OpenCL compilers, and you might be hitting one. They're far rarer than they used to be, though. Still, I recall many of those bugs centering around the use of structs. That, combined with coalescing issues and potential host/device packing/alignment discrepancies, is why I almost always stick with arrays unless it's something simple like an overlay on a bunch of floats or unsigned ints used for information about a dataset.

As sad a state of affairs as it is, commenting things out and putting clFinish() statements after everything are among the most effective debugging tools for OpenCL. Running the kernel through gdb on the host works, but if it's giving you correct answers, I doubt it will help you much.

Do you have access to an AMD GPU to try it on?

If this kernel is just a piece of the overall application, then I probably wouldn't bother shifting memory around either. I don't really see why this would introduce the need for memory fences though...


Sadly I don't have access to an AMD GPU. About the bug in the host code, shouldn't it appear when the OpenCL code runs on the CPU as well though?

As we speak, I'm trying to change the code to make it work without structs. About the memory fences, what I mean is how to process something like this. If I got it right, it should be changed to this. (Don't mind the t_param struct; I will change that one as well.)

If it is a compiler bug then I will be very unlucky...


I tried to spread out the memory as you suggested. Now the only struct I have left is a single t_param (just one struct, not an array of them). I will remove it as well.

What happened when I tried to execute the code like this is the following:

  • On the GPU I again got a segmentation fault, at iteration 1644, whereas with structs I got it at iteration 501 (total iterations are 10,000).
  • On the CPU I got an INVALID_WORK_GROUP_SIZE error. I'm looking into it right now. The way I initialized the group sizes is the following:

    // n is the number of cells in the array as it was with the struct.

    // So the new array of doubles is of size n * 9

    size_t   global[2] = {n, 9};

    size_t   local[2]  = {1, 9};


Something is wrong now that I've done it that way.

First of all, no values get updated: I tried setting a constant value in all the cells, and when I read them back to the host they were unchanged...

I guess it has to do with the group dimensions and how I read/use them(?). Is the way I'm doing it correct? (A sample kernel is in the post above this in the second link)

What is weird, though, is that I still get a segmentation fault even though the values are not updated.


Just noticed that I forgot to multiply pos by 9 to get to the right cell. I changed it, but nothing changed; the problem from my previous edit still exists.


In your second kernel, you're using the barrier incorrectly. First, a local memory fence is a fence on local memory, which you aren't using. Second, it looks possible that some threads will hit the barrier while others won't, because you (a) put it inside a branch and (b) can have threads return early and never reach the barrier due to your if() return;. All threads in a workgroup must hit the barrier; otherwise, you can get a deadlock.
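One way to satisfy the all-threads-reach-the-barrier rule is to replace the early return with a predicate, so inactive work-items still execute the barrier. A sketch in OpenCL C (identifiers are illustrative, not from the original code), using a global fence since the data here lives in global memory:

```c
__kernel void step(__global double *data, const unsigned int count)
{
    int pos = get_global_id(0);
    int active = (pos < count);      /* no early return before the barrier */

    if (active) {
        /* phase 1: write data[] */
    }

    barrier(CLK_GLOBAL_MEM_FENCE);   /* every work-item reaches this */

    if (active) {
        /* phase 2: read what phase 1 wrote */
    }
}
```

Note that barrier() only synchronizes work-items within one work-group; ordering across work-groups generally requires separate kernel launches.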

Have you put clFinish() statements after every transfer and kernel? That will let you see exactly where your seg fault is happening when you call backtrace in gdb.

Finally, you have fairly wonky work dimensions. I would do:

//Round up to the nearest multiple of 64

size_t global = n % 64 == 0 ? n : (n / 64 + 1) * 64;

size_t local = 64;

This makes each thread responsible for updating all 9 values of exactly 1 position.

__kernel void collision(__global const t_param      *params,
                        __global       double       *cells,
                        __global const double       *tmp_cells,
                        __global const int          *obstacles,
                                 const unsigned int  count)
{
    int pos = get_global_id(0);

    if (pos >= count) return;

    int kk;

    const double c_sq = 1.0 / 3.0;
    const double w0 = 4.0 / 9.0;
    const double w1 = 1.0 / 9.0;
    const double w2 = 1.0 / 36.0;

    double u_x, u_y, u[9], d_equ[9], u_sq, local_density;
    double t1, t2, t3;

    /* speed kk of cell pos lives at [pos + count * kk] */
    cells[pos + count]     = tmp_cells[pos + 3 * count];
    cells[pos + 2 * count] = tmp_cells[pos + 4 * count];
    cells[pos + 3 * count] = tmp_cells[pos + count];
    cells[pos + 4 * count] = tmp_cells[pos + 2 * count];
    cells[pos + 5 * count] = tmp_cells[pos + 7 * count];
    cells[pos + 6 * count] = tmp_cells[pos + 8 * count];
    cells[pos + 7 * count] = tmp_cells[pos + 5 * count];
    cells[pos + 8 * count] = tmp_cells[pos + 6 * count];

    local_density = 0.0;
    for (kk = 0; kk < 9; kk++)
        local_density += tmp_cells[pos + count * kk];

    u_x = (tmp_cells[pos + count]     + tmp_cells[pos + 5 * count] +
           tmp_cells[pos + 8 * count] - (tmp_cells[pos + 3 * count] +
           tmp_cells[pos + 6 * count] + tmp_cells[pos + 7 * count]))
          / local_density;

    u_y = (tmp_cells[pos + 2 * count] + tmp_cells[pos + 5 * count] +
           tmp_cells[pos + 6 * count] - (tmp_cells[pos + 4 * count] +
           tmp_cells[pos + 7 * count] + tmp_cells[pos + 8 * count]))
          / local_density;

    u_sq = u_x * u_x + u_y * u_y;

    u[1] =   u_x;
    u[2] =         u_y;
    u[3] = - u_x;
    u[4] =       - u_y;
    u[5] =   u_x + u_y;
    u[6] = - u_x + u_y;
    u[7] = - u_x - u_y;
    u[8] =   u_x - u_y;

    t1 = 2.0 * c_sq;
    d_equ[0] = w0 * local_density * (1.0 - u_sq / t1);

    t3 = w1 * local_density;
    t2 = t1 * c_sq;
    t1 = u_sq / t1;

    d_equ[1] = t3 * (1.0 + u[1] / c_sq + (u[1] * u[1]) / t2 - t1);
    d_equ[2] = t3 * (1.0 + u[2] / c_sq + (u[2] * u[2]) / t2 - t1);
    d_equ[3] = t3 * (1.0 + u[3] / c_sq + (u[3] * u[3]) / t2 - t1);
    d_equ[4] = t3 * (1.0 + u[4] / c_sq + (u[4] * u[4]) / t2 - t1);

    t3 = w2 * local_density;
    d_equ[5] = t3 * (1.0 + u[5] / c_sq + (u[5] * u[5]) / t2 - t1);
    d_equ[6] = t3 * (1.0 + u[6] / c_sq + (u[6] * u[6]) / t2 - t1);
    d_equ[7] = t3 * (1.0 + u[7] / c_sq + (u[7] * u[7]) / t2 - t1);
    d_equ[8] = t3 * (1.0 + u[8] / c_sq + (u[8] * u[8]) / t2 - t1);

    for (kk = 0; kk < 9; kk++)
        cells[pos + count * kk] = tmp_cells[pos + count * kk] + params->omega *
            (d_equ[kk] - tmp_cells[pos + count * kk]);
}




In the code you wrote, I can't understand how you get the current index. count, as I use it, is the number of cells in the array of t_speed structs; maybe the name is a little misleading and you used it incorrectly, or I just don't understand how your solution works.

I do have clFinish statements after each kernel and after the data transfers, as you say, and I'm trying to locate the problem.


As you had things laid out initially using structs, your data was laid out like this column major matrix (assuming you didn't have additional data fields):


Where 1...9 are contiguous in memory. When you call get_global_id(0), it gives the column index of this 9 x count matrix. The performance issue is that threads within a work group do not get coalesced accesses.

If you instead do this:


You have a count x 9 column-major matrix. get_global_id(0) indexes the row of this matrix. Since each column is contiguous in memory, you get coalescing when the threads within a work group each access an element of the same column.

2D arrays aren't real things in memory, so you have to map the 2D index to a 1D index. For a column-major layout, the formula is row + col * numRows.