Hello

I am new to OpenCL and want to parallelize some looping code thats doing lu factorization with the looping structure showed by exact code as below:

for(int k = 0; k < N-1; k++)

{

for(int i = k+1; i < N; i++)

S[i*N + k] = S[i*N + k] / S[k*N + k];

for(int j = k+1; j < N; j++)

for(int i = k+1; i < N; i++)

S[i*N + j] -= S[i*N + k] * S[k*N + j];

}

I have done with the simple opencl kernel with single work items (no groping). Thats following:

int IDx = get_global_id(0);

int IDy = get_global_id(1);

for(int k = 0; k < n-1; k++)

{

barrier(CLK_GLOBAL_MEM_FENCE);

if(IDy > k && IDx == k)

matrix[IDy*n + IDx] = matrix[IDy*n + IDx] / matrix[IDx*n + IDx];

barrier(CLK_GLOBAL_MEM_FENCE);

for(int j = k+1; j < n; j++)

{

if(IDy > k && IDx == j)

matrix[IDy*n + IDx] -= matrix[IDy*n + k] * matrix[k*n + IDx];

}

}

But I dont get correct results when compared to the serial code, this is my personal try for OpenCL kernel and I am still learning how this data parallel scheme in OpenCL works, Can you point out what I am doing wrong in the kernel?

The factorization algorithms you're using are hard to do on a GPU because you can't synchronize across work groups. Barriers only synchronize among elements within a work group. IIRC with Cholesky, the main problem is the race condition between updating a column and doing the divides on that column. You need to make sure all updates on a column have completed before you do the division. The problem should be the same with LU, given that they're very similar.

You'll want to look at using recursive algorithms that use BLAS primitives. This comes to mind:

http://www.hpca.uji.es/ficheros/figual/ppopp09.pdf