I know that, in Brook+ programming model, each element in output stream becomes a thread automatically and a thread should be communication-free.
The following code is communcation-free with respect to outer loop (index i) but it seems it can't be parallelizable in Brook+ because inner loop is still not communication free. Is there any way to implement this code in Brook+ with respect to outer loop or in any way?
for(int i=0; i<N; i++) {
for(int j=1; j<N; j++) {
A[ i ][ j ] = A[ i ][ j-1 ] + 1;
}
}
The outer loop can not be easily parallelized, since each iteration uses the same array A. In other words, each iteration depends on the result of the previous iteration. So that the outer loop is a serial loop.
To parallelize your program, you need to develop new algorithm. For example, afther a quick examine, you will find the inner loop can be parallized, since A
the729,
Message editor in this forum screwed up my code. I fixed the code above. I think your message is corrupted as well. I can't understand your reply.
bjang,
The most straightforward way of doing this:
kernel void foo(float A_in[], float A_out<> )
{
int i, j;
i = instance().y;
j = instance().x;
A_out = A [ i ] [j-1]+1;
}
ryta1203,
Your code is not correct since output is undeterministic as the729 mentioned.
Originally posted by: ryta1203 bjang,
The most straightforward way of doing this:
kernel void foo(float A_in[], float A_out<> )
{
int i, j;
i = instance().y;
j = instance().x;
A_out = A [ i ] [j-1]+1;
}
Originally posted by: bjang ryta1203,
Your code is not correct since output is undeterministic as the729 mentioned.
Originally posted by: ryta1203 bjang,
The most straightforward way of doing this:
kernel void foo(float A_in[], float A_out<> )
{
int i, j;
i = instance().y;
j = instance().x;
A_out = A [ i ] [j-1]+1;
}
bjang, you are correct I really wasn't paying attention, sorry. So yes, the outside is parallelizable; however, local arrays are not supported yet so unrolling the inner loop is what you will ahve to do, along with changing your data structure or rearranging data possibly.
Dup.
Dup
hi, bjang
I replied before you edited the original message. Now it is not hard to parallelize it. For the most straightforward way, the inner loop can not be parallelized, since A
However, this algorithm can be improved, since A
I guess you just use this code to demonstrate a serial inner loop and a parallelizable outer loop. In this case, the serial inner loop can be done in GPU kernel, while the outer loop is done parallel. Remember that some algorithms can be improved to be fully parallelized like your example.
Thanks, the729.
You mentioned "the serial inner loop can be done in GPU kernel, while the outer loop is done parallel". I agree and it is possible in CUDA but this is impossible in Brook+ since each element of A[][] becomes thread and we have dependences with respect to "j". My original question was "Is there any way to make only outer loop ("i") become thread?
Originally posted by: the729 hi, bjang
I replied before you edited the original message. Now it is not hard to parallelize it. For the most straightforward way, the inner loop can not be parallelized, since A
depends on A[j-1].
However, this algorithm can be improved, since A
=A[0]+j actually. Therefore, both loops can be parallelized with GPU.
I guess you just use this code to demonstrate a serial inner loop and a parallelizable outer loop. In this case, the serial inner loop can be done in GPU kernel, while the outer loop is done parallel. Remember that some algorithms can be improved to be fully parallelized like your example.
...each element of A[][] becomes thread...
This is not necessary. You can specify the domain of execution of a brook kernel. But it is a problem that ATI limits the number of output stream to 8, unless you use scatter output.
Um.. I thinks a scatter kernel meets your requirement. But usually, scatter output is slower than stream output.