Archives Discussions

niravshah00 · ‎03-10-2010

Hi,

I am new to Brook+ programming. I read the Brook+ programming guide but could not figure out how to create multiple threads in the kernel to take advantage of the GPU.

The goal is to convert an algorithm with 4 nested for loops into a multi-threaded Brook+ program to improve performance.

Is there anywhere I can learn how to do this?

Thanks,

Nirav

gaurav_garg · ‎03-10-2010

Specific sections of http://developer.amd.com/gpu_assets/Stream_Computing_User_Guide.pdf might help you. Also, looking at the Brook+ samples shipped with the SDK should be a good start.

niravshah00 · ‎03-11-2010

Gaurav,

I really appreciate your help.

Thanks a lot

niravshah00 · ‎03-13-2010

The samples aren't very helpful

niravshah00 · ‎03-18-2010

The examples don't teach you how to code in Brook+, and the documentation available with the SDK is not for beginners.

I am wondering how people start working with Brook+.

I am guessing there must be some resources which I am not aware of.

gaurav_garg · ‎03-18-2010

Have you looked at the Brook+ tutorials shipping with the SDK? Those are pretty basic.

niravshah00 · ‎03-19-2010

Gaurav,

I looked at the tutorials; most of them are pretty basic and are comparisons between GPU and CPU.

None of them tells me how to write multi-threaded Brook+ code.

gaurav_garg · ‎03-19-2010

Are you talking about creating multiple threads inside a kernel, or about writing a multi-threaded host program with Brook+?

The number of threads that execute on the GPU is implicit and is equal to the number of elements in your output stream.

You can explicitly control this by using the domainOffset and domainSize methods on your kernel.

For a multi-threaded host program, you can take a look at the MultiGPU tutorial.

niravshah00 · ‎03-19-2010

I want to create threads on the GPU to take full advantage of its multiple processors.

I have a sequential Java program with 6 nested for loops for the 6 variables of an equation. So I want to create a thread for each combination of these variables.

For example, suppose the variables are a,b,c,x,y,z

then a thread for each of
a=3,b=3,c=3,x=3,y=3,z=3
a=3,b=3,c=3,x=3,y=3,z=4
a=3,b=3,c=3,x=3,y=3,z=5
a=3,b=3,c=3,x=3,y=3,z=6
and so on.

Each thread then does the math to check whether a condition is satisfied.

I read about Attribute[GroupSize(GROUP_SIZE, 1, 1)]. As far as I understood, this will create groups of GROUP_SIZE threads, with a maximum of 1024 per group. But I would require more than that, as in multiple groups.

I am not sure how to create multiple groups.

Also, can a kernel function call another kernel function, like this?

Attribute[GroupSize(GROUP_SIZE, 1, 1)]
kernel void function1()
{
.
.
function2();
.
}

Attribute[GroupSize(GROUP_SIZE, 1, 1)]
kernel void function2()
{
.
.
}

gaurav_garg · ‎03-19-2010

The total number of threads is decided by your output stream size. The number of groups is decided automatically from the total number of threads and the group size that you specify.

If you don't want to use LDS, there is no need to specify a group size in your kernel.

Let's say you want to add two matrices:

for(int i = 0; i < H; ++i)
    for(int j = 0; j < W; ++j)
    {
        c[i][j] = a[i][j] + b[i][j];
    }

This is similar to

// kernel code
kernel void sum(float a[][], float b[][], out float c<>)
{
    int i = instance().y;
    int j = instance().x;
    c = a[i][j] + b[i][j];
}

// host code
unsigned int dim[] = {W, H};
brook::Stream<float> a(2, dim);
brook::Stream<float> b(2, dim);
brook::Stream<float> c(2, dim);

// initialize input streams using Stream::read()

// call kernel - number of threads = W * H (output stream dimension)
sum(a, b, c);
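To make the loop-to-kernel correspondence concrete, here is a minimal CPU-only sketch in plain C++ (not Brook+). The names sumElement and launchSum are hypothetical; the host loop only stands in for the implicit launch, where each call to sumElement would be one GPU thread.

```cpp
#include <cstddef>
#include <vector>

// Body of the "kernel": computes exactly one output element.
// (i, j) plays the role of instance().y / instance().x.
inline float sumElement(const std::vector<float>& a,
                        const std::vector<float>& b,
                        std::size_t W, std::size_t i, std::size_t j) {
    return a[i * W + j] + b[i * W + j];
}

// CPU stand-in for the launch: one call per output element.
// On the GPU these calls would run as W * H independent threads.
std::vector<float> launchSum(const std::vector<float>& a,
                             const std::vector<float>& b,
                             std::size_t W, std::size_t H) {
    std::vector<float> c(W * H);
    for (std::size_t i = 0; i < H; ++i)
        for (std::size_t j = 0; j < W; ++j)
            c[i * W + j] = sumElement(a, b, W, i, j);
    return c;
}
```

The point of the split is that the loop nest disappears from the kernel: the size of the output decides how many times the body runs.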

niravshah00 · ‎03-19-2010

Gaurav,

I have seen these matrix examples and also understood your point,

but in the matrices there is a one-to-one mapping between the elements, and what I want is all the possible permutations:

a 1000 - 10000

b 1000 - 10000

c 1000 - 10000

x 3-10

y 3-10

z 3-10

So I don't think the matrix example would work here.

I did not understand how exactly you are trying to relate my example to the matrices.

If you want, I can show you my nested loops here.

jeff_golds · ‎03-19-2010

Originally posted by: niravshah00 Gaurav,

I have seen these matrix examples and also understood your point, but in the matrices there is a one-to-one mapping between the elements, and what I want is all the possible permutations
a 1000 - 10000
b 1000 - 10000
c 1000 - 10000
x 3-10
y 3-10
z 3-10
So I don't think the matrix example would work here.
I did not understand how exactly you are trying to relate my example to the matrices.

10000x10000x10000 is a lot of loops! In any event, what you want to do is something like:

- Compute the amount of work you want per thread.

- In the matrix example, only 1 item was handled per thread.

- Compute the number of threads needed to do the work.

So if you were adding two matrices of dimension 1000x1000, you could submit a 2D workgroup size of 1000x1000 threads where each thread computes a single addition.

In your case, you may find it easier to handle the small inner loops in each thread, then submit a 3D workgroup size of, say, 1000x1000x1000 threads. (I don't know if Brook+ supports 3D workgroups; if not, you could try OpenCL.)

So your kernel would be something like:

for (i = 0; i < H; i++)
    for (j = 0; j < W; j++)
        for (k = 0; k < D; k++)
        {
            // Some work here depending on i,j,k plus the 3D work group id
        }
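The decomposition described above can be sketched on the CPU in plain C++ (not Brook+). This assumes the equation A^x + B^y = C^z from later in the thread, the exponent range 3..9 used in the posted kernel, and exact 64-bit integer arithmetic, which only holds for small bases (values near 10000 raised to the 9th power would overflow). The names ipow, workItem and launch3D are hypothetical.

```cpp
#include <cstdint>

// Integer power; only valid while the result fits in 64 bits.
std::uint64_t ipow(std::uint64_t base, int exp) {
    std::uint64_t r = 1;
    for (int e = 0; e < exp; ++e) r *= base;
    return r;
}

// Work for ONE thread: the bases (A, B, C) come from the thread index,
// the small exponent loops run sequentially inside the thread.
// Returns how many (x, y, z) in [3, 10) satisfy A^x + B^y == C^z.
int workItem(std::uint64_t A, std::uint64_t B, std::uint64_t C) {
    int hits = 0;
    for (int x = 3; x < 10; ++x)
        for (int y = 3; y < 10; ++y) {
            std::uint64_t sum = ipow(A, x) + ipow(B, y);
            for (int z = 3; z < 10; ++z)
                if (ipow(C, z) == sum) ++hits;
        }
    return hits;
}

// CPU stand-in for a 3D launch over bases in [lo, hi):
// each (a, b, c) iteration would be one GPU thread.
int launch3D(std::uint64_t lo, std::uint64_t hi) {
    int total = 0;
    for (std::uint64_t a = lo; a < hi; ++a)
        for (std::uint64_t b = lo; b < hi; ++b)
            for (std::uint64_t c = lo; c < hi; ++c)
                total += workItem(a, b, c);
    return total;
}
```

For example, workItem(3, 6, 3) finds 3^3 + 6^3 = 3^5. Only the three outer loops turn into threads; the inner exponent loops stay as ordinary loops inside each thread.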

niravshah00 · ‎03-20-2010

Jeff,

I did not understand what you are trying to tell me.

Well, my equation is something like this:

A^x + B^y = C^z

And I am solving for 'z'.

Now my idea was: if Brook+ supported 3D work groups, I would create a 3D array so that its index would give me the values for a, b, c, and another 2D array would give me the values for x and y, and then I would solve for z.

niravshah00 · ‎03-22-2010

I tried 3D matrices in Brook+ and it works. I don't know the maximum size limit on them.

I tried 10x10x10 and it worked, but 100x100x100 gave me an unhandled exception: stack overflow.

gaurav_garg · ‎03-22-2010

A stack overflow occurs if you try to allocate too much data on the stack.

If you are allocating your matrices on the stack, allocate them on the heap instead.

niravshah00 · ‎03-22-2010

I still have not figured out the threads thing.

How do you want me to use the matrices for my equation?

Also, let me know if my idea of using the index of the matrix as the value of my variables is correct.

niravshah00 · ‎03-23-2010

What I think is that it is a waste of memory, because I am not going to use the matrices.

And what if I want to control the number of threads directly, rather than have it based on the matrices?

This is the area where I am stuck.

thanks,

Nirav

gaurav_garg · ‎03-23-2010

You can use the domainOffset and domainSize operators on your kernel. Take a look at the ExecDomain sample shipping with the SDK.

niravshah00 · ‎03-23-2010

It does not say anything about how it works or what it is used for.

Just tell me how I can create a larger number of threads, like a*b*c threads where a, b, c are large numbers, with each of these threads in turn creating x*y threads to calculate z.

niravshah00 · ‎03-23-2010

I get this error when I try to call a kernel function from another kernel:

ERROR--2: Problem with call expression in kernel: callee unknown

gaurav_garg · ‎03-24-2010

Can you post your code?

niravshah00 · ‎03-24-2010

Also, if you look at main, I can create streams only of size 10,10,10.

What do I have to do to scale the 3D stream to a huge size?

When I increase it and execute, the program terminates. I remember you suggesting allocation on the heap; how do I do that?

kernel void threadABC(out int a<>)
{
    int X,Y,Z;
    int A,B,C;
    int gcdAB,gcdAC,gcdBC;

    A = instance().x+1000;
    B = instance().y+1000;
    C = instance().z+1000;

    gcdAB = findGcd(A,B);
    //gcdAC = gcd(A,C);
    //gcdBC = gcd(B,C);
    //if(gcdAB==1 && gcdAC==1 && gcdBC==1){
    //threadXY(instance().x+1000,instance().y+1000,instance()+1000.z,a);
    for( X = 3; X < 10; X++)
    {
        for( Y = 3; Y < 10; Y++)
        {
            float sum = pow((float)A, (float)X)+pow((float)B, (float)Y);
            float Z = (log((float)sum)/log((float)C));
            float epsillon = 10E-4f;
        }
    }
    //}
}

kernel int findGcd(int u,int v)
{
    int gcd = 1;
    int r ;
    int num1=u;
    int num2 =v;
    while (1)
    {
        if (num2 == 0)
        {
            gcd = num1;
            break;
        }
        else
        {
            r = num1 % num2;
            num1 = num2;
            num2 = r;
        }
    }
    return gcd;
}

int main(int argc, char ** argv)
{
    // int i,j;
    int a<10,10,10>;
    //float input_a[10][10][10];
    threadABC(a);
    return 0;
}

gaurav_garg · ‎03-24-2010

You have defined the findGcd function below threadABC. As in C, parsing threadABC generates an error because the compiler has not yet seen any findGcd symbol. Just place findGcd above threadABC.

The stack exception might be coming from allocating your matrix like this:

float input_a[100][100][100];

Instead of this, use

float* input_a = new float[100*100*100];
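As a hedged alternative to a raw new[], a heap-backed buffer can also be held in a std::vector, which frees itself when it goes out of scope. The sketch below is plain host-side C++; makeVolume and flatIndex are hypothetical helper names, and the x-fastest layout is an assumption made for the illustration.

```cpp
#include <cstddef>
#include <vector>

// Flat index of element (x, y, z) in a W x H x D volume, x varying fastest.
inline std::size_t flatIndex(std::size_t x, std::size_t y, std::size_t z,
                             std::size_t W, std::size_t H) {
    return (z * H + y) * W + x;
}

// Heap-backed, zero-initialized 3D buffer stored as one contiguous block.
std::vector<float> makeVolume(std::size_t W, std::size_t H, std::size_t D) {
    return std::vector<float>(W * H * D, 0.0f);
}
```

A 100x100x100 volume created this way holds one million floats on the heap, and element (x, y, z) is read as volume[flatIndex(x, y, z, 100, 100)].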

niravshah00 · ‎03-24-2010

Thanks Gaurav ,

Thanks a ton

niravshah00 · ‎03-24-2010

What is the limit on the stream size?

I am asking because my range should be scalable. If there is a limit, do I have to cap the range and call the kernel function again?

gaurav_garg · ‎03-24-2010

Brook+ is implemented on top of CAL and it uses CAL textures internally.

The size limit is 8192*8192.

niravshah00 · ‎03-24-2010

Then what do I have to do to get past the limit? I am pretty sure that I will have a range greater than 8192*8192.

gaurav_garg · ‎03-24-2010

You might want to partition your data into multiple tiles and process one tile after another.
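One way to picture the tiling, as a sketch in plain C++ rather than Brook+: split a one-dimensional range into consecutive pieces no larger than the stream limit and launch the kernel once per piece. Tile and makeTiles are hypothetical names, and maxTile is assumed to be greater than zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One tile: where it starts in the full range and how many elements it has.
struct Tile { std::size_t offset; std::size_t size; };

// Split [0, total) into consecutive tiles no larger than maxTile.
// Assumes maxTile > 0.
std::vector<Tile> makeTiles(std::size_t total, std::size_t maxTile) {
    std::vector<Tile> tiles;
    for (std::size_t off = 0; off < total; off += maxTile)
        tiles.push_back({off, std::min(maxTile, total - off)});
    return tiles;
}
```

For a range of 20000 values and a limit of 8192 this yields three tiles. Inside the kernel, the tile offset would be added to the thread index to recover the global value.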

niravshah00 · ‎03-24-2010

By tiles, do you mean

first a<8192,90,90> (since 90*90 is 8100)
then a<8192,90,90>
.
.
.
.
.
until a<8192,rangeB,rangeC>

assuming that the range of A is 8192?

But this would require a for loop that calls the kernel function repeatedly, and that would be sequential: each call to the kernel would have to wait until the previous call returns.
Can't I create multiple groups of size 8192*8192? I know it is a lot of threads, but the aim is to parallelize the whole range and utilize the GPU to the maximum.

gaurav_garg · ‎03-25-2010

On current GPUs, you can run only one kernel at a time. Even if you use multiple groups, the kernel call will have to wait for previous call.

But, multiple tiles can help you in hiding data transfer overhead. You can overlap data-transfer and kernel call. FYI, both streamRead and kernel call are asynchronous.

niravshah00 · ‎03-25-2010

So,

can I have two functions, one with a 2D stream for A and B

and the other with a 3D stream for C, X, Y?

kernel function1(out int abstream<8192,8192>){
.
.
function2( cxyStream);
.
}

kernel function2 (out int cxyStream<8192,10,10>){

}

So will each thread in function1 call function2, which will in turn create 8192*10*10 threads?

gaurav_garg · ‎03-25-2010

When you use regular streams (using <>), you never define dimensions like this.

Also, when you call another kernel from the main kernel, it is called only for the element on which the main kernel is working. You cannot launch multiple threads from inside a kernel. function2 will get inlined in function1.

niravshah00 · ‎03-25-2010

So that means we can have only 8192*8192 threads running in Brook+ at a time.
Is there no other way to have more threads running at one point in time?

gaurav_garg · ‎03-26-2010

The limit of 8192*8192 is on stream size. You can create more than 8192*8192 threads if you use scatter stream with domainSize operator. But, scatter streams have some performance overhead.

niravshah00 · ‎03-26-2010

I looked at the examples in the SDK but could not understand what scatter streams are.

My code would need more threads than 8192*8192.

Any help?

gaurav_garg · ‎03-26-2010

For regular streams, the domain of execution is decided by the output stream size. But for scatter streams this can be modified, e.g.

kernel void scatter(out float4 a[][])
{
    int i = instance().x;
    int j = instance().y;
    a[2*i] = 0;
    a[2*i+1] = 1;
}

//host code
unsigned int dim[] = {width, height};
brook::Stream<float4> scatterStream(2, dim);
scatter.domainOffset(uint4(0,0,0,0));
scatter.domainSize(uint4(2*width, height)); // number of threads is double the stream size
scatter(scatterStream);

niravshah00 · ‎03-26-2010

So can scatter work for 3-dimensional streams as well?

niravshah00 · ‎03-29-2010

Originally posted by: gaurav.garg For regular streams, the domain of execution is decided by the output stream size. But for scatter streams this can be modified, e.g.

kernel void scatter(out float4 a[][])
{
    int i = instance().x;
    int j = instance().y;
    a[2*i] = 0;
    a[2*i+1] = 1;
}

//host code
unsigned int dim[] = {width, height};
brook::Stream<float4> scatterStream(2, dim);
scatter.domainOffset(uint4(0,0,0,0));
scatter.domainSize(uint4(2*width, height)); // number of threads is double the stream size
scatter(scatterStream);

I tried creating threads like this, but for some threads the values of instance().x and instance().y come out negative.

Also, can I do something like scatter.domainSize(uint4(2*width, 2*height, 2*depth));

genaganna · ‎03-31-2010


Scatter works for 3-dimensional streams.

In the above code, please change scatter.domainSize(uint4(2*width, height)); to scatter.domainSize(uint4(width / 2, height));

Please paste your complete code here.

niravshah00 · ‎03-31-2010

Well, I haven't written any concrete code; I was just trying to learn and understand how scatter works.

A brief history of what led me to use scatter:

The problem is to solve an equation which has 6 variables: A, B, C, x, y, z.

The range for A, B, C will be very large, like 1000 to 10,000. My initial solution was to use a 3D stream and use the index as the values of A, B, C. But since there is a limit on the size of a stream, I learned from this forum, especially from Gaurav, that there is something called a scatter stream that gives more threads.

But I couldn't figure out how I would get all the permutations of A, B, C using a scatter stream.

Where can I find out how this domain size actually works?
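Independent of how the threads are launched, one common way to enumerate every (A, B, C) combination is to decode a single linear thread index into three values. The sketch below is plain C++ and says nothing about the Brook+ scatter API itself; Triple and decode are hypothetical names, and the layout (C varying fastest) is an assumption.

```cpp
#include <cstdint>

struct Triple { std::uint64_t a, b, c; };

// Decode a linear work-item id into one (A, B, C) combination.
// Each variable ranges over [lo, lo + n); c varies fastest.
// Valid ids are 0 .. n*n*n - 1.
Triple decode(std::uint64_t id, std::uint64_t lo, std::uint64_t n) {
    Triple t;
    t.c = lo + id % n;
    t.b = lo + (id / n) % n;
    t.a = lo + id / (n * n);
    return t;
}
```

With lo = 1000 and n = 9001 (the values 1000 through 10000 inclusive), ids 0 to 9001^3 - 1 cover every combination exactly once, so the ids can be split across as many kernel launches as the stream limit requires.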


Multithreaded Brook+ algorithm from a nested for loop