03-10-2010
03:28 PM

Multithreaded Brook+ algorithm from a nested for loop

Hi,

I am new to Brook+ programming. I read the Brook+ programming guide and could not figure out how to create multiple threads in the kernel to take advantage of the GPU.

The goal is to convert an algorithm with 4 nested for loops into a multithreaded Brook+ program so as to improve performance.

Is there anywhere I can learn how to do this?

Thanks,

Nirav

54 Replies

03-10-2010
04:34 PM

03-11-2010
05:15 PM

Gaurav,

I really appreciate your help.

Thanks a lot

03-13-2010
12:35 AM

The samples aren't very helpful

03-18-2010
01:04 AM

None of the examples teach you how to code in Brook+, and the documentation available with the SDK is not for beginners.

I wonder how people start working with Brook+.

I am guessing there must be some resources which I am not aware of.

03-18-2010
11:00 AM

Have you looked at the Brook+ tutorials that ship with the SDK? Those are pretty basic.

03-19-2010
12:53 AM

Gaurav,

I looked at the tutorials; most of them are pretty basic and are comparisons of GPU and CPU.

None of them tell me how to write multithreaded Brook+ code.

03-19-2010
07:36 AM

Are you talking about creating multiple threads inside a kernel or writing a multi-threaded host program with Brook+?

The number of threads that execute on the GPU is implicit and is equal to the number of elements in your output stream.

You can explicitly control this by using the domainOffset and domainSize methods on your kernel.

For a multi-threaded host program, you can take a look at the MultiGPU tutorial.

03-19-2010
01:19 PM

I want to create threads on the GPU to take full advantage of its multiple processors.

I have a sequential Java program with 6 nested for loops for the 6 variables of an equation. So I want to create a thread for each combination of these variables.

For example, suppose the variables are a, b, c, x, y, z;

then there would be a thread for a=3,b=3,c=3,x=3,y=3,z=3

a=3,b=3,c=3,x=3,y=3,z=4

a=3,b=3,c=3,x=3,y=3,z=5

a=3,b=3,c=3,x=3,y=3,z=6

and so on

And then each thread does the math to check a condition.

I read about Attribute[GroupSize(GROUP_SIZE, 1, 1)]. As far as I understand, this will create groups of GROUP_SIZE threads, with a maximum of 1024 per group. But I would need more than that, as in multiple groups.

I am not sure how to create multiple groups.

Also, can a kernel function call another kernel function, like this?

Attribute[GroupSize(GROUP_SIZE, 1, 1)]
kernel void function1()
{
    .
    .
    function2();
    .
}

Attribute[GroupSize(GROUP_SIZE, 1, 1)]
kernel void function2()
{
    .
    .
}

03-19-2010
01:29 PM

The total number of threads is decided by your output stream size. The number of groups is decided automatically from the total number of threads and the group size that you specify.

If you don't want to use LDS, there is no need to specify a group size in your kernel.

Let's say you want to add two matrices:

for (int i = 0; i < H; ++i)
    for (int j = 0; j < W; ++j)
    {
        c[i][j] = a[i][j] + b[i][j];
    }

This is similar to:

// kernel code
kernel void sum(float a[][], float b[][], out float c<>)
{
    int i = instance().y;
    int j = instance().x;
    c = a[i][j] + b[i][j];
}

// host code
unsigned int dim[] = {W, H};
brook::Stream<float> a(2, dim);
brook::Stream<float> b(2, dim);
brook::Stream<float> c(2, dim);

// initialize input streams using Stream::read()

// call kernel - number of threads = W * H (output stream dimension)
sum(a, b, c);
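To make the loop-to-kernel mapping concrete, here is a minimal CPU-only sketch in plain C++ (not Brook+) of what the runtime conceptually does: the kernel body runs once per output element, and the loop indices play the role of instance(). The names sumKernel and launchOverDomain are hypothetical, and a real GPU would run these invocations in parallel rather than in a loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the kernel body: computes ONE output element.
// (x, y) plays the role of instance().x / instance().y in Brook+.
float sumKernel(const std::vector<float>& a, const std::vector<float>& b,
                std::size_t width, std::size_t x, std::size_t y)
{
    return a[y * width + x] + b[y * width + x];
}

// Hypothetical stand-in for the runtime: one kernel invocation per element
// of the output, so the "thread count" is width * height.
std::vector<float> launchOverDomain(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t width, std::size_t height)
{
    std::vector<float> c(width * height);
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            c[y * width + x] = sumKernel(a, b, width, x, y);
    return c;
}
```

The point of the split is that sumKernel never loops over the data itself; the nested loops of the sequential version move into the launcher, which on the GPU is replaced by the output stream's dimensions.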

03-19-2010
04:42 PM

Gaurav,

I have seen these matrix examples and also understood your point,

but in the matrix case there is a one-to-one mapping between the elements, and what I want is all the possible permutations:

a 1000 - 10000

b 1000 - 10000

c 1000 - 10000

x 3-10

y 3-10

z 3-10

So I don't think the matrix example would work here.

I did not understand how exactly you are trying to relate my example to the matrices.

If you want, I can show you my nested loops here.

03-19-2010
06:00 PM

10000x10000x10000 is a lot of loops! In any event, what you want to do is something like:

- Compute amount of work you want per thread.

- In the matrix example, only 1 item was handled per thread

- Compute the number of threads to do the work

So if you were adding two matrices of dimension 1000x1000, you could submit a 2D workgroup size of 1000x1000 threads where each thread computes a single addition.

In your case, you may find it easier to handle the small inner loops for each thread, then submit a 3D workgroup size of, say, 1000x1000x1000 threads. (I don't know if Brook+ supports 3D workgroups, if not, you could try OpenCL )

So your kernel would be something like:

for (i = 0; i < H; i++)
    for (j = 0; j < W; j++)
        for (k = 0; k < D; k++)
        {
            // Some work here depending on i,j,k plus the 3D work group id
        }
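As a rough illustration of "small inner loops per thread", the following plain C++ sketch (CPU-only, not Brook+) shows the work one thread could do for the equation A^x + B^y = C^z discussed in this thread. The function names, the base offset, and the near-integer test on z are assumptions made for the example; double is used because A^x overflows float for the larger ranges mentioned here, and C is assumed to be greater than 1.

```cpp
#include <cmath>

// Solve A^x + B^y = C^z for z using logarithms. Assumes C > 1.
double solveZ(double A, int x, double B, int y, double C)
{
    return std::log(std::pow(A, x) + std::pow(B, y)) / std::log(C);
}

// True if z is within eps of a whole number.
bool isNearInteger(double z, double eps)
{
    return std::fabs(z - std::round(z)) < eps;
}

// Hypothetical per-thread work: (i, j, k) is the 3D thread index and 'base'
// shifts it into the range of interest. The loops over x and y stay inside
// the thread. Returns how many (x, y) pairs give a near-integer z.
int workForThread(int i, int j, int k, int base,
                  int expMin, int expMax, double eps)
{
    const double A = i + base, B = j + base, C = k + base;
    int hits = 0;
    for (int x = expMin; x <= expMax; ++x)
        for (int y = expMin; y <= expMax; ++y)
            if (isNearInteger(solveZ(A, x, B, y, C), eps))
                ++hits;
    return hits;
}
```

For instance, 3^3 + 6^3 = 243 = 3^5, so the thread whose index maps to A=3, B=6, C=3 reports one hit for x = y = 3. A logarithm-based test like this is only a filter; any candidate it finds would still need an exact integer check.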

03-20-2010
01:38 AM

Jeff,

I did not understand what you are trying to tell me.

Well, my equation is something like this:

A^x + B^y = C^z

And I am solving for 'z'.

My idea was that if Brook+ supported 3D work groups, I would create a 3D array so that its index would give me the values for A, B, C, and another 2D array would give me the values for x and y, and then I would solve for z.

03-22-2010
02:31 PM

I tried 3D matrices in Brook+ and it works; I don't know the maximum size limit on them.

I tried 10x10x10 and it worked, but 100x100x100 gave me an unhandled exception: stack overflow.

03-22-2010
02:36 PM

A stack overflow occurs if you try to allocate too much data on the stack.

If you are allocating your matrices on the stack, allocate them on the heap instead.

03-22-2010
09:06 PM

I still have not figured out the threads thing.

How do you want me to use the matrices for my equation?

Also, let me know if my idea of using the matrix index as the value of my variables is correct.

03-23-2010
12:15 AM

What I think is that it is a waste of memory, because I am not going to use the matrices.

And what if I want to control the number of threads rather than have it based on the matrices?

This is the area where I am stuck.

thanks,

Nirav

03-23-2010
04:05 AM

03-23-2010
04:51 AM

It does not say anything about how it works or what it is used for.

Just tell me how I can create a larger number of threads, like a*b*c threads where a, b, c are large numbers, with each of these threads in turn creating x*y threads to calculate z.

03-23-2010
09:41 PM

I get this error when I try to call a kernel function from another kernel:

ERROR--2: Problem with call expression in kernel: callee unknown

03-24-2010
03:10 AM

Can you post your code?

03-24-2010
03:22 AM

Also, if you look at main, I can create streams only of size 10,10,10.

What do I have to do to scale the 3D stream to a huge size?

When I increase the size and execute, it terminates. I remember you suggesting allocation on the heap; how do I do it?

kernel void threadABC(out int a<>)
{
    int X, Y, Z;
    int A, B, C;
    int gcdAB, gcdAC, gcdBC;
    A = instance().x + 1000;
    B = instance().y + 1000;
    C = instance().z + 1000;
    gcdAB = findGcd(A, B);
    //gcdAC = gcd(A,C);
    //gcdBC = gcd(B,C);
    //if(gcdAB==1 && gcdAC==1 && gcdBC==1){
    //threadXY(instance().x+1000,instance().y+1000,instance()+1000.z,a);
    for (X = 3; X < 10; X++)
    {
        for (Y = 3; Y < 10; Y++)
        {
            float sum = pow((float)A, (float)X) + pow((float)B, (float)Y);
            float Z = (log((float)sum) / log((float)C));
            float epsillon = 10E-4f;
        }
    }
    //}
}

kernel int findGcd(int u, int v)
{
    int gcd = 1;
    int r;
    int num1 = u;
    int num2 = v;
    while (1)
    {
        if (num2 == 0)
        {
            gcd = num1;
            break;
        }
        else
        {
            r = num1 % num2;
            num1 = num2;
            num2 = r;
        }
    }
    return gcd;
}

int main(int argc, char ** argv)
{
    // int i,j;
    int a<10,10,10>;
    //float input_a[10][10][10];
    threadABC(a);
    return 0;
}

03-24-2010
03:36 AM

You have defined the findGcd function below threadABC. As in C, parsing threadABC generates an error because the compiler has not yet seen any findGcd symbol. Just place findGcd above threadABC.

Stack exception might be coming when you allocate your matrix like this-

float input_a[100][100][100];

instead of this use

float* input_a = new float[100*100*100];
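As an alternative sketch of the same idea in standard C++, std::vector also places the data on the heap and releases it automatically, so no matching delete[] is needed. The helper names index3 and makeVolume are hypothetical, and the row-major layout is an assumption chosen for the example.

```cpp
#include <cstddef>
#include <vector>

// Flat offset of element (i, j, k) in a row-major dim0 x dim1 x dim2 volume.
std::size_t index3(std::size_t i, std::size_t j, std::size_t k,
                   std::size_t dim1, std::size_t dim2)
{
    return (i * dim1 + j) * dim2 + k;
}

// Allocates the volume on the heap; std::vector frees it automatically.
std::vector<float> makeVolume(std::size_t dim0, std::size_t dim1, std::size_t dim2)
{
    return std::vector<float>(dim0 * dim1 * dim2, 0.0f);
}
```

With this layout, what used to be input_a[i][j][k] becomes volume[index3(i, j, k, 100, 100)] for a 100x100x100 volume.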

03-24-2010
03:48 AM

Thanks Gaurav,

Thanks a ton.

03-24-2010
05:13 AM

What is the limit on the stream size?

I am asking because my range should be scalable. If there is a limit, do I have to cap it and call the kernel function again?

03-24-2010
05:22 AM

Brook+ is implemented on top of CAL and it uses CAL textures internally.

The size limit is 8192*8192.

03-24-2010
04:28 PM

03-24-2010
04:33 PM

You might want to partition your data into multiple tiles and process one tile after another.
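One way to picture the tiling, as a plain C++ sketch on the host side: split each large range into consecutive chunks no bigger than the stream limit, then launch the kernel once per chunk with the chunk's offset. The Tile struct and makeTiles function are hypothetical names, and treating 8192 as a per-dimension limit is an assumption based on the 8192*8192 figure quoted earlier.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One chunk of a 1D range: where it starts and how many values it covers.
struct Tile {
    std::size_t offset;
    std::size_t size;
};

// Splits 'total' values into consecutive tiles of at most 'maxTile' values.
std::vector<Tile> makeTiles(std::size_t total, std::size_t maxTile)
{
    std::vector<Tile> tiles;
    if (maxTile == 0)
        return tiles;  // nothing sensible to do with a zero-sized tile
    for (std::size_t off = 0; off < total; off += maxTile)
        tiles.push_back(Tile{off, std::min(maxTile, total - off)});
    return tiles;
}
```

For a range of 1000 to 10000 (9001 values) and a limit of 8192, this gives two tiles: one of 8192 values and one of 809. Inside the kernel, the tile offset would be added to the thread index to recover the actual value of the variable.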

03-24-2010
06:13 PM

By tiles you mean

first a<8192,90,90> (since 90*90 is 8100)

then a<8192,90,90>

.

.

.

.

.

till a<8192,rangeB,rangeC>

assuming that range of A is 8192

But then this would require a for loop which calls the kernel function repeatedly, and that would be sequential; I mean each call to the kernel would have to wait until the previous call returns.

Can't I create multiple groups of size 8192*8192? I know it is a lot of threads, but the aim is to parallelize the whole range and utilize the GPU to the maximum.

03-25-2010
03:43 AM

On current GPUs, you can run only one kernel at a time. Even if you use multiple groups, the kernel call will have to wait for the previous call.

But multiple tiles can help you hide data transfer overhead. You can overlap the data transfer and the kernel call. FYI, both streamRead and the kernel call are asynchronous.

03-25-2010
03:58 AM

So,

Can I use two functions, one with a 2D stream for A and B,

and the other with a 3D stream for C, X, Y?

kernel function1(out int abstream<8192,8192>)
{
    .
    .
    function2(cxyStream);
    .
}

kernel function2(out int cxyStream<8192,10,10>)
{
}

So will each thread in function1 call function2, which will in turn create 8192*10*10 threads?

03-25-2010
09:13 AM

When you use regular streams (declared with <>), you never define dimensions like this.

Also, when you call another kernel from the main kernel, it is called only for the element on which the main kernel is working. You cannot launch multiple threads from inside a kernel. function2 will get inlined in function1.

03-25-2010
05:11 PM

Is there no other way to have more threads running at the same time?

03-26-2010
10:28 AM

03-26-2010
03:18 PM

I looked at the examples in the SDK but could not understand what scatter streams are.

My code would need more threads than 8192*8192.

Any help?

03-26-2010
04:14 PM

For regular streams, the domain of execution is decided by the output stream size. But for scatter streams this can be modified, e.g.

kernel void scatter(out float4 a[][])
{
    int i = instance().x;
    int j = instance().y;
    a[j][2*i] = 0;
    a[j][2*i+1] = 1;
}

//host code
unsigned int dim[] = {width, height};
brook::Stream<float4> scatterStream(2, dim);
scatter.domainOffset(uint4(0,0,0,0));
scatter.domainSize(uint4(2*width, height)); // number of threads is double the stream size
scatter(scatterStream);

03-26-2010
08:41 PM

So can scatter work for 3-dimensional streams as well?

03-29-2010
07:38 PM

Originally posted by: gaurav.garg

For regular streams, the domain of execution is decided by the output stream size. But for scatter streams this can be modified, e.g.

kernel void scatter(out float4 a[][])
{
    int i = instance().x;
    int j = instance().y;
    a[j][2*i] = 0;
    a[j][2*i+1] = 1;
}

//host code
unsigned int dim[] = {width, height};
brook::Stream<float4> scatterStream(2, dim);
scatter.domainOffset(uint4(0,0,0,0));
scatter.domainSize(uint4(2*width, height)); // number of threads is double the stream size
scatter(scatterStream);

I tried creating threads like this, but for some threads the values of instance().x and instance().y come out negative.

Also, can I do something like scatter.domainSize(uint4(2*width, 2*height, 2*depth));?

03-31-2010
07:07 AM

Scatter works for 3-dimensional streams.

In the above code, please change scatter.domainSize(uint4(2*width, height)); to scatter.domainSize(uint4(width / 2, height));

Please paste complete code here.
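One way to read this correction is with a small CPU model in plain C++ (not Brook+): each thread writes two neighbouring elements of its row, so width / 2 threads per row cover the row exactly, while more threads than that would address columns beyond the stream. The function runScatter is hypothetical, int stands in for float4 to keep the sketch short, and what the hardware actually does with out-of-range writes is not stated in this thread.

```cpp
#include <cstddef>
#include <vector>

// Emulates the scatter kernel on the CPU: each "thread" (i, j) writes
// columns 2*i and 2*i+1 of row j. domainWidth is the number of threads
// along x. Returns how many writes would land outside the stream.
std::size_t runScatter(std::vector<int>& stream,
                       std::size_t width, std::size_t height,
                       std::size_t domainWidth)
{
    std::size_t outOfBounds = 0;
    for (std::size_t j = 0; j < height; ++j) {
        for (std::size_t i = 0; i < domainWidth; ++i) {
            for (std::size_t n = 0; n < 2; ++n) {
                const std::size_t col = 2 * i + n;
                if (col < width)
                    stream[j * width + col] = static_cast<int>(n);
                else
                    ++outOfBounds;  // this thread addresses a missing column
            }
        }
    }
    return outOfBounds;
}
```

In this model, a domain width of width / 2 touches every element exactly once, whereas a domain width of 2*width makes most threads address columns that do not exist.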

03-31-2010
05:11 PM

Well, I haven't written any concrete code; I was just trying to learn and understand how scatter works.

A brief history of what led me to use scatter:

The problem is to solve an equation which has 6 variables: A, B, C, x, y, z.

The range for A, B, C will be very large, like 1000 to 10,000. The initial solution I thought of was to use a 3D stream, using the index as the values of A, B, C. But since there is a limit on the stream size, I learned from this forum, especially from Gaurav, that there is something called a scatter stream to get more threads.

But I couldn't figure out how I would get all the permutations of A, B, C using a scatter stream.

Where can I find out how this domain size actually works?
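Independently of how scatter streams behave, enumerating every (A, B, C) combination from a single running index can be sketched in plain C++ as a mixed-radix decode. The names Triple and decodeIndex are hypothetical, and the idea of combining a tile offset with the thread index to form the flat index is an assumption, not something confirmed in this thread.

```cpp
#include <cstdint>

// One combination of the three large-range variables.
struct Triple {
    int A;
    int B;
    int C;
};

// Decodes a flat work-item index into one (A, B, C) combination.
// Each variable takes 'count' consecutive values starting at 'first';
// C varies fastest, then B, then A.
Triple decodeIndex(std::uint64_t flat, int first, std::uint64_t count)
{
    Triple t;
    t.C = first + static_cast<int>(flat % count);
    flat /= count;
    t.B = first + static_cast<int>(flat % count);
    flat /= count;
    t.A = first + static_cast<int>(flat % count);
    return t;
}
```

For the range 1000 to 10000 there are 9001 values per variable, so about 7.3e11 combinations in total, far more than 8192*8192 threads. Under the assumptions above, the work would be issued in many kernel calls, with each thread computing its flat index as the call's starting offset plus its own position.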