Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Journeyman III

Inner looping structure with OpenCL


I am new to OpenCL and want to parallelize some looping code thats doing lu factorization with the looping structure showed by exact code as below:

    for(int k = 0; k < N-1; k++)
        for(int i = k+1; i < N; i++)
            S[i*N + k] = S[i*N + k] / S[k*N + k];

        for(int j = k+1; j < N; j++)
            for(int i = k+1; i < N; i++)
                S[i*N + j] -= S[i*N + k] * S[k*N + j];

I have done with the simple opencl kernel with single work items (no groping). Thats following:

      int IDx = get_global_id(0);
      int IDy = get_global_id(1);

      for(int k = 0; k < n-1; k++)

        if(IDy > k && IDx == k)
            matrix[IDy*n + IDx] = matrix[IDy*n + IDx] / matrix[IDx*n + IDx];


        for(int j = k+1; j < n; j++)
            if(IDy > k && IDx == j)
                matrix[IDy*n + IDx] -= matrix[IDy*n + k] * matrix[k*n + IDx];


But I dont get correct results when compared to the serial code, this is my personal try for OpenCL kernel and I am still learning how this data parallel scheme in OpenCL works, Can you point out what I am doing wrong in the kernel?

4 Replies
Adept II

The factorization algorithms you're using are hard to do on a GPU because you can't synchronize across work groups. Barriers only synchronize among elements within a work group. IIRC with Cholesky, the main problem is the race condition between updating a column and doing the divides on that column. You need to make sure all updates on a column have completed before you do the division. The problem should be the same with LU, given that they're very similar.

You'll want to look at using recursive algorithms that use BLAS primitives. This comes to mind:


For this synchronization problem, what if I use another matrix to put division step values into and then pick out values from there while doing update step like;

(division step)

Temp[IDy*n + IDx] = matrix[IDy*n + IDx] / matrix[IDx*n + IDx];

(update step)

matrix[IDy*n + IDx] -= Temp[IDy*n + k] * matrix[k*n + IDx];

This too gives wrong output though, but here it shouldnt be a synchronization problem then? I jus want to know if anything else is wrong in my code because I am very new to this kind of code...


Or what if I make a single work group whose size is equal to whole global size domain, then barrier must be able to synchronize between division and update stages?


yes only one workgroup is possible. but then you utilize only 1/10-1/20 GPU power.