10 Replies Latest reply on Mar 29, 2009 5:50 PM by ryta1203

    Is this parallelizable? (Brook+ limitations?)

    bjang

      I know that, in Brook+ programming model, each element in output stream becomes a thread automatically and a thread should be communication-free.

      The following code is communcation-free with respect to outer loop (index i) but it seems it can't be parallelizable in Brook+ because inner loop is still not communication free. Is there any way to implement this code in Brook+ with respect to outer loop or in any way?

      for(int i=0; i<N; i++) {

         for(int j=1; j<N; j++) {

            A[ i ][ j ] = A[ i ][ j-1 ] + 1;

         }

      }

       

        • Is this parallelizable? (Brook+ limitations?)
          the729

          The outer loop can not be easily parallelized, since each iteration uses the same array A. In other words, each iteration depends on the result of the previous iteration. So that the outer loop is a serial loop.

          To parallelize your program, you need to develop new algorithm. For example, afther a quick examine, you will find the inner loop can be parallized, since A[j] = A[-1]+j+1. So that A[j] only depends on A[-1], but not the last result of the inner iteration.

            • Is this parallelizable? (Brook+ limitations?)
              bjang

              the729,

              Message editor in this forum screwed up my code. I fixed the code above. I think your message is corrupted as well. I can't understand your reply.

                • Is this parallelizable? (Brook+ limitations?)
                  ryta1203

                  bjang,

                   The most straightforward way of doing this:

                  kernel void foo(float A_in[], float A_out<> )

                  {

                      int i, j;

                      i = instance().y;

                      j = instance().x;

                      A_out = A [ i ] [j-1]+1;

                  }

                    • Is this parallelizable? (Brook+ limitations?)
                      bjang

                      ryta1203,

                      Your code is not correct since output is undeterministic as the729 mentioned.

                       

                       

                      Originally posted by: ryta1203 bjang,

                       

                       The most straightforward way of doing this:

                       

                      kernel void foo(float A_in[], float A_out<> )

                       

                      {

                       

                          int i, j;

                       

                          i = instance().y;

                       

                          j = instance().x;

                       

                          A_out = A [ i ] [j-1]+1;

                       

                      }

                       

                        • Is this parallelizable? (Brook+ limitations?)
                          ryta1203

                           

                          Originally posted by: bjang ryta1203,

                          Your code is not correct since output is undeterministic as the729 mentioned.

                           

                           

                          Originally posted by: ryta1203 bjang,

                           

                           The most straightforward way of doing this:

                           

                          kernel void foo(float A_in[], float A_out<> )

                           

                          {

                           

                              int i, j;

                           

                              i = instance().y;

                           

                              j = instance().x;

                           

                              A_out = A [ i ] [j-1]+1;

                           

                          }

                           

                           

                          bjang, you are correct I really wasn't paying attention, sorry. So yes, the outside is parallelizable; however, local arrays are not supported yet so unrolling the inner loop is what you will ahve to do, along with changing your data structure or rearranging data possibly.

                      • Is this parallelizable? (Brook+ limitations?)
                        the729

                        hi, bjang

                        I replied before you edited the original message. Now it is not hard to parallelize it. For the most straightforward way, the inner loop can not be parallelized, since A[j] depends on A[j-1].

                        However, this algorithm can be improved, since A[j]=A[0]+j actually. Therefore, both loops can be parallelized with GPU.

                        I guess you just use this code to demonstrate a serial inner loop and a parallelizable outer loop. In this case, the serial inner loop can be done in GPU kernel, while the outer loop is done parallel. Remember that some algorithms can be improved to be fully parallelized like your example.

                          • Is this parallelizable? (Brook+ limitations?)
                            bjang

                            Thanks, the729.

                            You mentioned "the serial inner loop can be done in GPU kernel, while the outer loop is done parallel". I agree and it is possible in CUDA but this is impossible in Brook+ since each element of A[][] becomes thread and we have dependences with respect to "j". My original question was "Is there any way to make only outer loop ("i") become thread?

                             

                             

                            Originally posted by: the729 hi, bjang

                             

                            I replied before you edited the original message. Now it is not hard to parallelize it. For the most straightforward way, the inner loop can not be parallelized, since A[j] depends on A[j-1].

                             

                            However, this algorithm can be improved, since A[j]=A[0]+j actually. Therefore, both loops can be parallelized with GPU.

                             

                            I guess you just use this code to demonstrate a serial inner loop and a parallelizable outer loop. In this case, the serial inner loop can be done in GPU kernel, while the outer loop is done parallel. Remember that some algorithms can be improved to be fully parallelized like your example.

                             

                              • Is this parallelizable? (Brook+ limitations?)
                                the729

                                 

                                ...each element of A[][] becomes thread...


                                This is not necessary. You can specify the domain of execution of a brook kernel. But it is a problem that ATI limits the number of output stream to 8, unless you use scatter output.

                                Um.. I thinks a scatter kernel meets your requirement. But usually, scatter output is slower than stream output.