cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

bjang
Journeyman III

Is this parallelizable? (Brook+ limitations?)

I know that, in Brook+ programming model, each element in output stream becomes a thread automatically and a thread should be communication-free.

The following code is communcation-free with respect to outer loop (index i) but it seems it can't be parallelizable in Brook+ because inner loop is still not communication free. Is there any way to implement this code in Brook+ with respect to outer loop or in any way?

for(int i=0; i<N; i++) {

   for(int j=1; j<N; j++) {

      A[ i ][ j ] = A[ i ][ j-1 ] + 1;

   }

}

0 Likes
10 Replies
the729
Journeyman III

The outer loop can not be easily parallelized, since each iteration uses the same array A. In other words, each iteration depends on the result of the previous iteration. So that the outer loop is a serial loop.

To parallelize your program, you need to develop new algorithm. For example, afther a quick examine, you will find the inner loop can be parallized, since A = A[-1]+j+1. So that A only depends on A[-1], but not the last result of the inner iteration.

0 Likes

the729,

Message editor in this forum screwed up my code. I fixed the code above. I think your message is corrupted as well. I can't understand your reply.

0 Likes

bjang,

 The most straightforward way of doing this:

kernel void foo(float A_in[], float A_out<> )

{

    int i, j;

    i = instance().y;

    j = instance().x;

    A_out = A [ i ] [j-1]+1;

}

0 Likes

ryta1203,

Your code is not correct since output is undeterministic as the729 mentioned.

 

Originally posted by: ryta1203 bjang,

 

 The most straightforward way of doing this:

 

kernel void foo(float A_in[], float A_out<> )

 

{

 

    int i, j;

 

    i = instance().y;

 

    j = instance().x;

 

    A_out = A [ i ] [j-1]+1;

 

}

 

0 Likes

Originally posted by: bjang ryta1203,

Your code is not correct since output is undeterministic as the729 mentioned.

 

Originally posted by: ryta1203 bjang,

 

 The most straightforward way of doing this:

 

kernel void foo(float A_in[], float A_out<> )

 

{

 

    int i, j;

 

    i = instance().y;

 

    j = instance().x;

 

    A_out = A [ i ] [j-1]+1;

 

}

 

 

bjang, you are correct I really wasn't paying attention, sorry. So yes, the outside is parallelizable; however, local arrays are not supported yet so unrolling the inner loop is what you will ahve to do, along with changing your data structure or rearranging data possibly.

0 Likes

Dup.

0 Likes

Dup

0 Likes

hi, bjang

I replied before you edited the original message. Now it is not hard to parallelize it. For the most straightforward way, the inner loop can not be parallelized, since A depends on A[j-1].

However, this algorithm can be improved, since A=A[0]+j actually. Therefore, both loops can be parallelized with GPU.

I guess you just use this code to demonstrate a serial inner loop and a parallelizable outer loop. In this case, the serial inner loop can be done in GPU kernel, while the outer loop is done parallel. Remember that some algorithms can be improved to be fully parallelized like your example.

0 Likes

Thanks, the729.

You mentioned "the serial inner loop can be done in GPU kernel, while the outer loop is done parallel". I agree and it is possible in CUDA but this is impossible in Brook+ since each element of A[][] becomes thread and we have dependences with respect to "j". My original question was "Is there any way to make only outer loop ("i") become thread?

 

Originally posted by: the729 hi, bjang

 

I replied before you edited the original message. Now it is not hard to parallelize it. For the most straightforward way, the inner loop can not be parallelized, since A depends on A[j-1].

 

However, this algorithm can be improved, since A=A[0]+j actually. Therefore, both loops can be parallelized with GPU.

 

I guess you just use this code to demonstrate a serial inner loop and a parallelizable outer loop. In this case, the serial inner loop can be done in GPU kernel, while the outer loop is done parallel. Remember that some algorithms can be improved to be fully parallelized like your example.

 

0 Likes

...each element of A[][] becomes thread...


This is not necessary. You can specify the domain of execution of a brook kernel. But it is a problem that ATI limits the number of output stream to 8, unless you use scatter output.

Um.. I thinks a scatter kernel meets your requirement. But usually, scatter output is slower than stream output.

0 Likes