13 Replies Latest reply on Feb 16, 2011 10:24 AM by Otterz

    Parallelizing nested loops

    Otterz

      Edit: Wow, this forum completely mauled my post.

      Hi,

       

      I am new to OpenCL, and I want to port some MPI code I have, as I am hoping to see a benefit from using the GPU.

       

      The portion of the code I am having trouble with updates a 2D array, but it does so using a five-deep nested loop.

      My first reaction was to parallelize the outer two loops, (i,j), because then each thread could work on a unique blah(i,j). But that is still too much work for each thread. I am doing this on Windows with an ATI 5870, so I want the batches of kernels to complete within the TDR limit, otherwise Windows will kill the kernel.

      In the code, the some_conditionals are based on the indexes i, j, k, l, m (e.g., i != m).

      To parallelize i,j I just use a 2D NDRangeKernel like so:

      [code]

          err = queue.enqueueNDRangeKernel(
              kernel,
              cl::NullRange,
              cl::NDRange((L+1), (L+1)),
              cl::NDRange(1,1),
              NULL,
              &event);
          checkErr(err, "CommandQueue::enqueueNDRangeKernel()");

      [/code]

      I would have liked to be able to use a 3D NDRange kernel (parallelizing i,j,k), but if I do that, I need to perform some type of reduction on blah(i,j), which I don't know how to do yet. I'm wondering, am I on the right track? Any suggestions?

      I am learning OpenCL as I go, and my background is MPI.

       

      Thanks!


      [code]
      for(int i = 0; i < L + 1; i++){
          for(int j = 0; j < L + 1; j++){
              for(int k = 0; k < L + 1; k++){
                  if (some_conditionals)
                  for(int l = 0; l < L + 1; l++){
                      if (some_conditionals)
                          G = 1.0;
                      for(int m = 0; m < L + 1; m++){
                          if (some_conditionals)
                              G = some_math;
                      } // end m loop
                      blah(i,j) += some_math;
                  } // end l loop
              } // end k loop
          } // end j loop
      } // end i loop
      [/code]
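      A plain C++ sketch of the reduction idea in question: each work-item computes the partial result for one k (what a 3D NDRange over (i,j,k) would give you), and the partials are then summed in a separate step. The integrand here is a made-up stand-in for some_math, just for illustration:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Serial version: one (i,j) cell accumulates over all k in a loop.
double cell_serial(int i, int j, int K) {
    double acc = 0.0;
    for (int k = 0; k < K; ++k)
        acc += (i + 1) * 0.5 + (j + 1) * 0.25 + k;  // stand-in for some_math
    return acc;
}

// "Parallel" version: each of K work-items computes one partial
// (what a 3D NDRange over (i,j,k) would give you), then a single
// reduction adds the partials, the step done by one thread per
// work-group in a kernel.
double cell_reduced(int i, int j, int K) {
    std::vector<double> partials(K);
    for (int k = 0; k < K; ++k)  // each k = one work-item
        partials[k] = (i + 1) * 0.5 + (j + 1) * 0.25 + k;
    return std::accumulate(partials.begin(), partials.end(), 0.0);
}
```

      Because the additions happen in the same order, the two versions agree exactly for this toy integrand.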

        • Parallelizing nested loops
          genaganna

           

          Originally posted by: Otterz I would have liked to be able to use a 3D NDRange kernel (parallelizing i,j,k), but if I do that, I need to perform some type of reduction on blah(i,j), which I don't know how to do yet. I'm wondering, am I on the right track? Any suggestions? [...]

          Going for 3D is a good idea. Make sure you are using a local work-group size of 256 to handle the l, m loops.

          What is the value of L? If L is very big and there are too many memory accesses in the l, m loops, you might need to go for more than one kernel.

          It would be good if you posted the original MPI code.

            • Parallelizing nested loops
              Otterz

              The MPI code is just the above nested loops, with the outer loop only looping over a portion. In MPI I have the 2D array partitioned column-wise, where each proc owns L rows but only n columns.

              As of now, I am queuing L+1 kernels, using NDRangeKernel to spawn off all combos of i and j, and fixing k as a parameter.

              I am testing with L around 40-100, but I am targeting solutions with L around 2000.

              I don't understand what you mean about work groups, though. I really don't get the thread-block concept. In my codes I usually use row or column partitioning, since I work mainly with sparse linear solvers (I've only ever needed tiling for dense matrix operations).

                • Parallelizing nested loops
                  Otterz

                  On a side note, I see a huge spike in CPU utilization (and system lag) if I call queue.finish or queue.enqueueWaitForEvents. This is observed in all tests with L=50, and is very noticeable with L=100.

                  To try to reduce the CPU spike, I added logic in the kernel-queuing loop to queue the next kernel and then wait on the previously queued one, so it always waits on the i-1 kernel before queuing the i+1 kernel. But all that seems to do is spread out the CPU spike... why does it spike when waiting on 100 events? Is it unusual to queue 100 kernels?

                  • Parallelizing nested loops
                    genaganna

                     

                    Originally posted by: Otterz I am testing with L around 40-100, but I am targeting solutions with L around 2000. [...] I don't understand what you mean about work groups though. I really don't get the thread block concept.

                     

                    I am not sure whether your application will still run when L is increased.

                    Please read at least chapter 1 of the AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf to understand the work-group concept.
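                    For reference, the work-group (thread block) mapping is just a tiling of the global index space: in each dimension, every global id decomposes into a group id and a local id, and only work-items sharing a group id can use the same __local memory and barrier(). A plain C++ illustration of the arithmetic the runtime does:

```cpp
#include <cassert>
#include <cstddef>

// In OpenCL, per dimension:
//   get_global_id() == get_group_id() * get_local_size() + get_local_id()
// So a global range of 1024 with a local size of 256 runs as 4 work-groups.
struct WorkItem {
    std::size_t group;     // which work-group this item belongs to
    std::size_t local_id;  // position inside that work-group
};

WorkItem decompose(std::size_t global_id, std::size_t local_size) {
    return {global_id / local_size, global_id % local_size};
}
```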

                      • Parallelizing nested loops
                        Otterz

                        Well, I know it runs. To test it, I just have it save the row-major index in the blah(i,j) address (i*NCOLS + j) on the device (GPU or CPU both work fine), and then I read the buffer on the host. The output is correct.

                        I'm also seeing weird floating-point results. CPU or GPU, the results are completely wrong; they are orders of magnitude too small or too large, e.g., 11.347567 vs 2.0599841e-017, where the first is the correct solution. I'm guessing I am hitting some type of overflow/underflow, but I am not sure why.

                        The kernel I am using is attached. It runs using a 2D NDRangeKernel call (in a loop). I know I am calling the enqueue correctly because I get the expected results from the check described at the start of the post.

                        [code]
                        #pragma OPENCL EXTENSION cl_amd_fp64 : enable

                        // Kernel for kchunks algorithm
                        __kernel void calc_ter_p_kchunks(int LENGTH, int kstart, int kstop,
                                                         __global double* ter_p)
                        {
                            size_t j = get_global_id(0);
                            size_t i = get_global_id(1);
                            double jm, im, ik, il;
                            int ncols = LENGTH + 1;
                            double G;
                            double tmp = 0.0;

                            for (int k = kstart; k < kstop; k++) {
                                if (k != i) {
                                    for (int l = 0; l < ncols; l++) {
                                        if (l != i && l != k) {
                                            G = 1;
                                            for (int m = 0; m < ncols; m++) {
                                                if (m != i && m != k && m != l) {
                                                    jm = j - m;
                                                    im = i - m;
                                                    G = G * jm / im;
                                                }
                                            } // end m loop
                                            ik = i - k;
                                            il = i - l;
                                            tmp = tmp + ((1.0 / ik) / il) * G;
                                        } // end if l != i
                                    } // end l loop
                                } // end if k != i
                            } // end k loop

                            ter_p[i * (LENGTH + 1) + j] = tmp;
                        }
                        [/code]

                          • Parallelizing nested loops
                            fpaboim

                            Conditionals in OpenCL should be avoided when possible: instructions are executed in lockstep, so while one thread is executing the conditional's body, any thread that doesn't meet the conditional's criteria just sits idle. With a bunch of nested conditionals like that you're probably going to have a lot of idle compute units, which won't help your performance.

                            Furthermore, it would be good to read through most of the literature AMD provides, and then reread the sections you find pertinent to your problem. From your code excerpt this looks like some sort of factorization, so I would advise you to look at the matrix multiplication example to get some grasp on tiling for OpenCL/GPUs, try to remove the conditionals (maybe it's possible to separate the task?), and use images if you're writing to a global matrix. The gain there will depend on how much time your scatter operation actually consumes, but I mention it because it can be a potential bottleneck with a straightforward solution.
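                            One concrete way to cut that divergence is to turn each guard into a select: every lane executes the same multiply, and excluded iterations contribute a factor of 1.0. A scalar C++ sketch of the transformation, mirroring the m-loop from the posted kernel (in C++ the ternary also avoids the division when the index is excluded; on a GPU this compiles to a select rather than a divergent branch):

```cpp
#include <cassert>

// Branchy m-loop, as in the posted kernel.
double g_branchy(int i, int j, int k, int l, int n) {
    double G = 1.0;
    for (int m = 0; m < n; ++m)
        if (m != i && m != k && m != l)
            G *= double(j - m) / double(i - m);
    return G;
}

// Branchless form: every iteration multiplies, but excluded m's
// multiply by 1.0, which is exact, so the result is identical.
double g_branchless(int i, int j, int k, int l, int n) {
    double G = 1.0;
    for (int m = 0; m < n; ++m) {
        bool excluded = (m == i) || (m == k) || (m == l);
        double factor = excluded ? 1.0 : double(j - m) / double(i - m);
        G *= factor;
    }
    return G;
}
```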

                            • Parallelizing nested loops
                              genaganna

                               

                              Originally posted by: Otterz Well I know it runs, to test it, I just have it save the row major index in the blah(i,j) address (i*NCOLS + j) on the device (GPU or CPU both work fine), and then I read the buffer on the host. And the output is correct.

                               

                              After looking at your kernel code, I am sure it always runs, as the computation in your code is very simple.

                              I'm also seeing weird floating point results. CPU or GPU, the results are completely wrong, they are orders of magnitude too small or too large e.g., 11.347567 vs 2.0599841e-017, where the first is the correct solution. I'm guessing I am hitting some type of overflow/underflow but I am not sure why.

                               

                              Are you getting wrong results on both CPU and GPU? But in your first statement you said "I read the buffer on the host. And the output is correct." I do not understand where you are getting this precision issue.

                               

                                • Parallelizing nested loops
                                  Otterz

                                  The results are wrong with both CL_DEVICE_TYPE_GPU and _CPU.

                                  I tried changing everything to float, and I still get the same problem. My intuition tells me that I am doing something wrong with memory - but if that is true, then why am I able to change the last line to "ter_p[i*(LENGTH+1) + j] = i*(LENGTH+1) + j;" and get correctly numbered results on the host?

                                  The numerical problem would make sense if the kernel were running more times than it should (small numbers become smaller, large numbers become larger), but it is still odd.

                                   

                                  On a side note:

                                  In regards to cl::Kernel objects, if I queue a Kernel object, can I reuse the same Kernel object (changing args) and queue it again? e.g.:

                                   

                                  [code]
                                  cl::Kernel kernel(program, "func", &err);
                                  for (int i = 0; i < N; i++) {
                                      kernel.setArg(0, sizeof(int), &a);
                                      queue.enqueueNDRangeKernel(kernel, ...);
                                      a = <some math>;
                                  }
                                  [/code]

                                    • Parallelizing nested loops
                                      genaganna

                                       

                                      Originally posted by: Otterz The results are wrong on both CL_DEVICE_TYPE_GPU or _CPU. I tried changing everything to float, and I still get the same problem. [...]



                                      Generally you should not face precision issues when you run on the CPU, so there could be a problem with your implementation. Make sure it works on the CPU first, and then look at the GPU.

                                       

                                      On a side note:

                                       

                                      In regards to cl::Kernel objects, if I queue a Kernel object, can I reuse the same Kernel object (changing args) and queue it again? e.g.:

                                       

                                       

                                      You can do this. Are you waiting for the event after enqueueNDRange? If not, add event-waiting code and run.
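                                      On the "make sure it works on the CPU first" point, a cheap way to isolate the bug is a host-side reference that runs the same arithmetic as one work-item serially and compares the device buffer element-wise. A sketch using the math from the posted calc_ter_p_kchunks kernel (the harness shape is the point; the tolerance is an assumption):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Same arithmetic as one work-item of the posted kernel, run on the host.
double ter_p_cell(int i, int j, int kstart, int kstop, int ncols) {
    double tmp = 0.0;
    for (int k = kstart; k < kstop; ++k) {
        if (k == i) continue;
        for (int l = 0; l < ncols; ++l) {
            if (l == i || l == k) continue;
            double G = 1.0;
            for (int m = 0; m < ncols; ++m)
                if (m != i && m != k && m != l)
                    G *= double(j - m) / double(i - m);
            tmp += ((1.0 / double(i - k)) / double(i - l)) * G;
        }
    }
    return tmp;
}

// Compare a device result buffer against the serial reference.
bool matches_reference(const std::vector<double>& device, int L,
                       int kstart, int kstop, double tol = 1e-12) {
    int ncols = L + 1;
    for (int i = 0; i < ncols; ++i)
        for (int j = 0; j < ncols; ++j)
            if (std::fabs(device[i * ncols + j] -
                          ter_p_cell(i, j, kstart, kstop, ncols)) > tol)
                return false;
    return true;
}
```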

                                       

                                        • Parallelizing nested loops
                                          Otterz

                                          I do not want to wait, so do I need to allocate a new Kernel object then? I didn't see this addressed in the spec; it does say that the Kernel object copies the args into the object, but I didn't see whether the queue copies the kernel object or just retains a reference.

                                          If I flush the queue, can I then modify the Kernel object?

                                          • Parallelizing nested loops
                                            genaganna

                                             

                                             

                                            You can do this.  Are you waiting for event after enqueueNDRange?  If not, add waiting event code and run.

                                             

                                             

                                            Otterz,

                                                     Ideally it should not be required to add event-waiting code after enqueueNDRange. Let us try it that way and see whether you get correct results or not. If you do get correct results with the wait, then there is a bug in the runtime.

                                              • Parallelizing nested loops
                                                Otterz

                                                I rewrote it using tiling and thread blocks, and still had the numerical issue.

                                                I stumbled upon the printf extension in the PDFs, so I added it - my numerical problem was caused by using the wrong local id. My code had every thread updating the same shared memory, causing problems.

                                                That printf extension is awesome! Also, the results are now exact, matching across the OpenCL CPU and GPU versions and the non-OpenCL Fortran 90 and C++ test codes.

                                                 

                                                I don't know if I need the extra variables (stuff like im, jm, etc.). I was worried my problem was due to casting, so I added them in originally. I'll probably leave them in, unless someone has a good reason not to.

                                                [code]
                                                // Kernel for 3D NDRange algorithm
                                                __kernel void calc_ter_i_3DNDRange(
                                                    int LENGTH,
                                                    __global double* ter_i,        // pointer to global shared ter_i array
                                                    __local double* contributions) // pointer to num_threads size local array
                                                {
                                                    size_t k = get_global_id(2);
                                                    size_t j = get_global_id(1);
                                                    size_t i = get_global_id(0);

                                                    // Number of threads in thread block (work group)
                                                    size_t num_threads = get_local_size(2);
                                                    size_t n = get_local_id(2);

                                                    int jm, im, ik, il;
                                                    int ncols = LENGTH + 1;
                                                    double G;
                                                    double tmp = 0.0;

                                                    // This memory is shared by the thread block
                                                    contributions[n] = 0.0; // No contribution yet

                                                    if (k != i) {
                                                        for (int l = 0; l < ncols; l++) {
                                                            if (l != i && l != k) {
                                                                G = 1.0;
                                                                for (int m = 0; m < ncols; m++) {
                                                                    if (m != i && m != k && m != l) {
                                                                        jm = j - m;
                                                                        im = i - m;
                                                                        G = G * ((double) jm / (double) im);
                                                                    }
                                                                } // end m loop
                                                                ik = i - k;
                                                                il = i - l;
                                                                tmp = tmp + (((double) 1.0 / (double) ik) / (double) il) * G;
                                                            } // end if l != i
                                                        } // end l loop
                                                    } // end if k != i

                                                    /*
                                                    if (i == 0 && j == 0)
                                                        printf("n=%d, [%d][%d][%d] contribution, %f\n", n, i, j, k, tmp);
                                                    */

                                                    // Store contribution in local shared memory
                                                    contributions[n] = tmp;

                                                    // Wait for all threads to reach this point
                                                    barrier(CLK_LOCAL_MEM_FENCE);

                                                    // n is like rank in MPI
                                                    if (n == 0) {
                                                        double tmp2 = ter_i[i * ncols + j];
                                                        tmp = 0.0;
                                                        // Sum the contributions of all threads
                                                        // (loop index renamed from i to t to avoid shadowing
                                                        // the work-item index i)
                                                        for (int t = 0; t < num_threads; t++) {
                                                            tmp += contributions[t];
                                                        }
                                                        ter_i[i * ncols + j] = tmp2 + tmp;
                                                    }
                                                }
                                                [/code]
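                                                The local-id bug described above is easy to reproduce off-device. This plain C++ simulation (a plausible reconstruction for illustration, not the actual buggy kernel) runs one work-group's worth of contributions: indexing the local buffer by each thread's own id preserves every contribution, while indexing by a constant slot loses all but the last write:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulate one work-group: num_threads contributions are staged in a
// local buffer and then reduced by "thread 0" after the barrier.
double reduce_correct(const std::vector<double>& contrib) {
    std::vector<double> local(contrib.size(), 0.0);
    for (std::size_t n = 0; n < contrib.size(); ++n)
        local[n] = contrib[n];       // each thread writes its own slot
    // -- barrier(CLK_LOCAL_MEM_FENCE) would go here --
    double sum = 0.0;
    for (double v : local) sum += v; // thread 0 sums the slots
    return sum;
}

double reduce_buggy(const std::vector<double>& contrib) {
    std::vector<double> local(contrib.size(), 0.0);
    for (std::size_t n = 0; n < contrib.size(); ++n)
        local[0] = contrib[n];       // wrong id: everyone clobbers slot 0
    double sum = 0.0;
    for (double v : local) sum += v; // only the last write survives
    return sum;
}
```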