Parallelizing nested loops

Discussion created by Otterz on Feb 15, 2011
Latest reply on Feb 16, 2011 by Otterz

Edit: Wow, this forum completely mauled my post.



I am new to OpenCL, and I want to port some MPI code I have, as I am hoping to see a benefit from using the GPU.


The portion of the code I am having trouble with updates a 2D array, but it does so inside a five-deep nested loop.

My first reaction was to parallelize the outer two loops (i, j), because then each thread could work on a unique blah(i,j). But that is still too much work per thread. I am doing this on Windows with an ATI 5870, so I want each batch of kernels to complete within the TDR limit; otherwise Windows will kill the kernel.

In the code, the some_conditionals are based on the indices i, j, k, l, and m (e.g., i != m).

To parallelize i,j I just use a 2D NDRangeKernel like so:


    err = queue.enqueueNDRangeKernel(
        kernel,
        cl::NullRange,                 // no offset
        cl::NDRange(L + 1, L + 1),     // global size: one work-item per (i,j)
        cl::NullRange);                // let the runtime choose the local size
    checkErr(err, "CommandQueue::enqueueNDRangeKernel()");


I would have liked to use a 3D NDRange kernel (parallelizing i, j, k), but if I do that, I need to perform some type of reduction on blah(i,j), which I don't know how to do yet. Am I on the right track? Any suggestions?

I am learning OpenCL as I go, and my background is MPI.



    for(int i = 0; i < L + 1; i++){
        for(int j = 0; j < L + 1; j++){
            for(int k = 0; k < L + 1; k++){
                if (some_conditionals) {
                    for(int l = 0; l < L + 1; l++){
                        if (some_conditionals) {
                            G = 1.0;
                            for(int m = 0; m < L + 1; m++){
                                if (some_conditionals)
                                    G = some_math;
                            } // end m loop
                            blah(i,j) += some_math;
                        }
                    } // end l loop
                }
            } // end k loop
        } // end j loop
    } // end i loop