rexiaoyu
Journeyman III

How to use compute shader in Brook+

To use the compute shader, I know I should use Attribute to specify the thread group size and then do data sharing. But I still get confused when I get to the actual code. For example, I want to use a compute shader to do simple array addition, with each thread processing multiple elements. I wrote the following code, but it seems to have an index-out-of-range problem. Can anyone fix it?

Code:

Attribute[GroupSize(64, 1, 1)]
kernel void blockAdd(float a[], float b[], out float c[])
{
    int tid = instance().x;

    // every thread processes len elements, len = 1, 2, 4, ...
    int len = 2;

    int start = tid * len;
    int i;
    for (i = 0; i < len; i++)
    {
        c[start + i] = a[start + i] + b[start + i];
    }
}



 

0 Likes
19 Replies
rexiaoyu
Journeyman III

Besides the out-of-range problem, I think there may be other errors.

Section 1.2.5.5 (Streaming Stores) of the user guide says, "The streaming writes occur only once per kernel invocation, only one write is allowed per segment....". Does that mean that for each out stream parameter in the kernel, we can only write one element back, even if the parameter is an array?

0 Likes

On a scatter output stream, you can write to multiple places. That statement applies to regular output streams (declared with <>).

0 Likes

What does Attribute[GroupSize(64, 1, 1)] mean?

0 Likes

It organizes 64 threads into a group. Frankly speaking, though, I don't know how to write compute shader code in Brook+. Can you fix the code?

0 Likes

I cannot tell what causes the out-of-range problem until I know the sizes of a, b and c and the domain of execution for which this kernel is invoked.

0 Likes

Originally posted by: gaurav.garg I cannot tell what causes the out-of-range problem until I know the sizes of a, b and c and the domain of execution for which this kernel is invoked.

 

Assume a, b and c each have size 256, and the domain size is also 256.

0 Likes

Then it is obvious that you are going out of range. The value of instance().x is determined by the domain of execution, so it runs from 0 to 255 in your case. With len = 2, your index will therefore vary between 0 and 511, while the arrays only have valid indices 0 to 255.
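The arithmetic behind this reply can be checked with a small CPU-side sketch in plain C. This is not Brook+ code; the function names max_index_touched and threads_needed are hypothetical helpers introduced only for illustration, and the second one assumes the array size is a multiple of len.

```c
#include <assert.h>

/* Largest array index touched when `domain` threads each process `len`
   consecutive elements starting at tid * len (the indexing used in blockAdd). */
int max_index_touched(int domain, int len)
{
    assert(domain > 0 && len > 0);
    int last_tid = domain - 1;          /* instance().x runs from 0 to domain - 1 */
    return last_tid * len + (len - 1);  /* start + i for the last thread and last i */
}

/* Number of threads needed so that n elements are each covered exactly once.
   Assumption: n is a multiple of len. */
int threads_needed(int n, int len)
{
    assert(n > 0 && len > 0 && n % len == 0);
    return n / len;
}
```

With 256 elements and len = 2, max_index_touched(256, 2) gives 511, whereas launching threads_needed(256, 2) = 128 threads keeps the largest index at 255.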

0 Likes

Originally posted by: gaurav.garg Then it is obvious that you are going out of range. The value of instance().x is determined by the domain of execution, so it runs from 0 to 255 in your case. With len = 2, your index will therefore vary between 0 and 511.

 

Thank you. The cause of the out-of-range problem is very clear now. Can you tell me what a compute shader in Brook+ looks like? Take the array addition as an example.

 

0 Likes

Hi, Gaurav,

Here is another basic question.

In Brook+, we use Attribute to set the thread group size, for example 64. Which 64 threads are then grouped into a thread group?

In CAL, a compute shader is specified through things like the program grid and block. Taking the array addition example again, assume the size is 256 and each thread in the kernel performs 2 element additions and produces 2 outputs. How should I specify the program grid and block?

 

0 Likes

Can you tell me what a compute shader in Brook+ looks like? Take the array addition as an example.


You can take a look at the Brook+ LDS tutorial that comes with the SDK under CPP\tutorials\LDS, and at section 2.17 of the stream computing user guide.

In Brook+, we use Attribute to set the thread group size, for example 64. Which 64 threads are then grouped into a thread group?


In compute shader mode, threads are invoked in a linear fashion. That means consecutive threads form a group. So threads with instance() values 0-63 are part of one group, and threads 64-127 are part of another group.
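As a rough illustration of this linear grouping, the mapping from a thread's linear ID to its group can be written as integer division and remainder. The sketch below is plain C run on the CPU, not Brook+; GROUP_SIZE, group_of and local_id are hypothetical names, and it assumes a one-dimensional group size of 64 as in GroupSize(64, 1, 1).

```c
#include <assert.h>

enum { GROUP_SIZE = 64 };  /* assumed to match GroupSize(64, 1, 1) */

/* Index of the thread group that a linear thread ID falls into. */
int group_of(int tid)
{
    assert(tid >= 0);
    return tid / GROUP_SIZE;
}

/* Position of the thread inside its group (0 .. GROUP_SIZE - 1). */
int local_id(int tid)
{
    assert(tid >= 0);
    return tid % GROUP_SIZE;
}
```

Under these assumptions, thread IDs 0-63 map to group 0 and IDs 64-127 map to group 1, which matches the description above.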

0 Likes

Originally posted by: gaurav.garg  

 

In compute shader mode, threads are invoked in a linear fashion. That means consecutive threads form a group. So threads with instance() values 0-63 are part of one group, and threads 64-127 are part of another group.

 

Section 1.2.4 (Thread Creation) of the user guide says that every 2x2 block of threads is sent to the thread queue and processed together, which is not a linear fashion, and consecutive threads may not be on the same SIMD. In a compute shader, however, the threads in a thread group must be processed on the same SIMD. So in compute shader mode, threads are processed differently from the quad fashion, right?

0 Likes

I think the user guide has not been updated for compute shader mode. In compute shader mode, a group of 4 threads is still processed together, but these 4 threads have consecutive thread IDs.

0 Likes

Hi, Gaurav,

I don't quite understand the difference between a pixel shader and a compute shader. For image processing, a pixel shader launches a thread for every pixel, each thread can only write to its own destination, and there is no communication between threads. Does a compute shader also launch a thread for every pixel? And is the difference from a pixel shader that the threads in a compute shader can be grouped together and communicate?

I have read the CAL IDCT sample, which is done as a compute shader, but I still cannot understand it. For a 64*64 matrix, it specifies a domain width of 64 and a domain height of 64, then a gridblock size of (64, 1, 1) and a grid size of (64, 1, 1), and in the kernel each thread processes and writes back 64 elements. Why doesn't that cause an index-out-of-range problem?

0 Likes

In compute shader mode (or in scatter output mode in a pixel shader), there is no relation between the launched threads (domain of execution) and the pixels (output domain).

I have not looked at the CAL IDCT sample in detail, but it seems that each thread writes to 8 places, and these 8 places are shared by 8 consecutive threads. That means threads 0-7 write to memory locations 0-7.
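To make the first point concrete, here is a CPU-side sketch in plain C (not Brook+ or CAL code, and not taken from the IDCT sample) in which the number of launched threads differs from the number of output elements. The names block_add_thread and block_add are hypothetical, and the sketch assumes the array size is a multiple of len.

```c
#include <assert.h>

/* CPU stand-in for one kernel instance: thread `tid` adds `len`
   consecutive elements starting at index tid * len. */
static void block_add_thread(int tid, int len,
                             const float *a, const float *b, float *c)
{
    int start = tid * len;
    for (int i = 0; i < len; i++)
        c[start + i] = a[start + i] + b[start + i];
}

/* Emulates a launch whose domain is n / len threads, so that the n
   output elements are each written exactly once.
   Assumption: n is a multiple of len. */
void block_add(int n, int len, const float *a, const float *b, float *c)
{
    assert(n > 0 && len > 0 && n % len == 0);
    int domain = n / len;  /* launched threads, smaller than the output size */
    for (int tid = 0; tid < domain; tid++)
        block_add_thread(tid, len, a, b, c);
}
```

Calling block_add(256, 2, a, b, c) emulates 128 threads filling 256 outputs: the domain of execution is chosen from how much work each thread does, not from the output size.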

0 Likes

After trying the so-called CS mode, I get a slowdown. Just adding the attribute makes the whole thing slower than PS mode.

I have four kernels that run in 0.2 seconds in total, and after adding the attribute to just one kernel, the total running time goes up to 0.48 seconds. Weird: that one kernel now runs much slower than the whole thing did before.

0 Likes

Originally posted by: ryta1203 riza,

  This thread may help: http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=117364&enterthread=y

Yeah, but the sample is in CAL. I don't understand how to do it in Brook+.

I tried to expand to a 2D group size but encountered an error.

0 Likes

Originally posted by: gaurav.garg In compute shader mode (or in scatter output mode in a pixel shader), there is no relation between the launched threads (domain of execution) and the pixels (output domain).

 

I have not looked at the CAL IDCT sample in detail, but it seems that each thread writes to 8 places, and these 8 places are shared by 8 consecutive threads. That means threads 0-7 write to memory locations 0-7.

 

Gaurav, do you know why, in a CAL compute shader, the program block width needs to be equal to the thread group size given in the IL kernel?

0 Likes

Thank you. Then how about my first problem, the compute shader in Brook+?

0 Likes