54 Replies Latest reply on Apr 26, 2010 5:09 PM by niravshah00

    Multithreaded Brook+ algorithm from a nested for loop

    niravshah00

      Hi ,

      I am new to Brook+ programming. I read the Brook+ programming guide but could not figure out how to create multiple threads in the kernel to take advantage of the GPU.

      The goal is to convert an algorithm with 4 nested for loops into a multithreaded Brook+ program in order to improve performance.

      Is there anywhere I can learn how to do this?

       

      Thanks,

      Nirav

        • Multithreaded Brook+ algorithm from a nested for loop
          gaurav.garg

          Specific sections of http://developer.amd.com/gpu_assets/Stream_Computing_User_Guide.pdf might help you. Also, looking at the Brook+ samples shipped with the SDK should be a good start.

            • Multithreaded Brook+ algorithm from a nested for loop
              niravshah00

              Gaurav,

              I really appreciate your help .

              Thanks a lot

              • Multithreaded Brook+ algorithm from a nested for loop
                niravshah00

                The samples aren't very helpful

                  • Multithreaded Brook+ algorithm from a nested for loop
                    niravshah00

                    None of the examples teach you how to code in Brook+, and the documentation available with the SDK is not for beginners.

                    I am wondering how people start working with Brook+.

                    I am guessing there must be some resources that I am not aware of.

                     

                      • Multithreaded Brook+ algorithm from a nested for loop
                        gaurav.garg

                        Have you looked at the Brook+ tutorials shipping with the SDK? Those are pretty basic.

                          • Multithreaded Brook+ algorithm from a nested for loop
                            niravshah00

                            Gaurav ,

                            I looked at the tutorials; most of them are pretty basic and are comparisons of GPU and CPU.

                            None of them tell me how to write multi-threaded Brook+ code.

                              • Multithreaded Brook+ algorithm from a nested for loop
                                gaurav.garg

                                Are you talking about creating multiple threads inside a kernel, or writing a multi-threaded host program with Brook+?

                                The number of threads that execute on the GPU is implicit and equals the number of elements in your output stream.

                                You can explicitly control this by using the domainOffset and domainSize methods on your kernel.

                                For a multi-threaded host program, take a look at the MultiGPU tutorial.

                                  • Multithreaded Brook+ algorithm from a nested for loop
                                    niravshah00

                                    I want to create threads on the GPU to take full advantage of its multiple processors.

                                    I have a sequential Java program with 6 nested for loops for the 6 variables of an equation. So I want to create a thread for each combination of these variables.

                                    For ex suppose the variables are a,b,c,x,y,z

                                    then a thread for a=3,b=3,c=3,x=3,y=3,z=3
                                                                a=3,b=3,c=3,x=3,y=3,z=4
                                                                a=3,b=3,c=3,x=3,y=3,z=5
                                                                a=3,b=3,c=3,x=3,y=3,z=6
                                    and so on

                                    And then each thread does the math to check whether a condition is satisfied.

                                    I read about Attribute[GroupSize(GROUP_SIZE, 1, 1)]. As far as I understand, this creates a group of GROUP_SIZE threads, with a maximum of 1024 threads per group. But I would need more than that, i.e. multiple groups.

                                    I am not sure how to create multiple groups.

                                    Also, can a kernel function call another kernel function, like this:

                                    Attribute[GroupSize(GROUP_SIZE, 1, 1)]
                                    kernel void function1()
                                    {
                                    .
                                    .
                                    function2();
                                    .
                                    }


                                    Attribute[GroupSize(GROUP_SIZE, 1, 1)]
                                    kernel void function2()
                                    {
                                    .
                                    .
                                    }

                                     

                                      • Multithreaded Brook+ algorithm from a nested for loop
                                        gaurav.garg

                                        The total number of threads is determined by your output stream size. The number of groups is decided automatically from the total number of threads and the group size that you specify.

                                        If you don't want to use LDS, there is no need to specify a group size in your kernel.

                                        Let's say you want to add two matrices:

                                         

                                         

                                        for(int i = 0; i < H; ++i)
                                            for(int j = 0; j < W; ++j)
                                            {
                                                c[i][j] = a[i][j] + b[i][j];
                                            }

                                        This is similar to:

                                        // kernel code
                                        kernel void sum(float a[][], float b[][], out float c<>)
                                        {
                                            int i = instance().y;
                                            int j = instance().x;
                                            c = a[i][j] + b[i][j];
                                        }

                                        // host code
                                        unsigned int dim[] = {W, H};
                                        brook::Stream<float> a(2, dim);
                                        brook::Stream<float> b(2, dim);
                                        brook::Stream<float> c(2, dim);

                                        // initialize input streams using Stream::read()

                                        // call kernel - number of threads = W * H (output stream dimension)
                                        sum(a, b, c);

                                          • Multithreaded Brook+ algorithm from a nested for loop
                                            niravshah00

                                            Gaurav,

                                            I have seen these matrix examples and also understood your point,

                                            but with the matrices there is a one-to-one mapping between the elements, and what I want is all the possible combinations of:

                                            a  1000 - 10000

                                            b  1000 - 10000

                                            c   1000 - 10000

                                            x   3-10

                                            y  3-10

                                            z  3-10

                                            So I don't think the matrix example would work here.

                                            I did not understand how exactly you are trying to relate my example to the matrices.

                                            If you want, I can show you my nested loops here.

                                             

                                              • Multithreaded Brook+ algorithm from a nested for loop
                                                jeff_golds

                                                 

                                                Originally posted by: niravshah00 Gaurav,

                                                I have seen these matrices example ans also understood your point but in the matrices there is a one to one mapping between the elements and what i want is all the possible permutations

                                                a  1000 - 10000

                                                b  1000 - 10000

                                                c   1000 - 10000

                                                x   3-10

                                                y  3-10

                                                z  3-10

                                                So i dont think the matrices example would work here

                                                I did not understand how exactly are u trying to relate my example with the matrices .



                                                10000x10000x10000 is a lot of loops!  In any event, what you want to do is something like:

                                                - Compute amount of work you want per thread.

                                                   - In the matrix example, only 1 item was handled per thread

                                                - Compute the number of threads to do the work

                                                 

                                                So if you were adding two matrices of dimension 1000x1000, you could submit a 2D workgroup size of 1000x1000 threads where each thread computes a single addition.

                                                In your case, you may find it easier to handle the small inner loops within each thread, then submit a 3D workgroup size of, say, 1000x1000x1000 threads. (I don't know if Brook+ supports 3D workgroups; if not, you could try OpenCL.)

                                                So your kernel would be something like:

                                                for (i = 0; i < H; i++)

                                                   for (j = 0; j < W; j++)

                                                      for (k = 0; k < D; k++)

                                                         {

                                                              // Some work here depending on i,j,k plus the 3D work group id

                                                         }
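A minimal CPU-side sketch of the split described above, written in plain C++ rather than Brook+: the three large ranges become the thread index, and the two small ranges stay as loops inside each thread. The names `Triple`, `decodeThread`, and `runThread`, the base value 1000, and the inclusive 3 to 10 bounds are assumptions for illustration only; on the GPU the 3D index would come from the runtime instead of being passed in.

```cpp
// One point in the (a, b, c) search space handled by a single thread.
struct Triple { int a, b, c; };

// Map a 3D thread index to the outer-loop variables.
// 'base' is a hypothetical lower bound of the a/b/c ranges.
Triple decodeThread(int ix, int iy, int iz, int base = 1000) {
    return Triple{ix + base, iy + base, iz + base};
}

// Work done by ONE thread: iterate the small inner ranges (assumed 3..10
// inclusive) and hand each (x, y) pair to 'body'. Returns the number of
// pairs visited so the per-thread workload is easy to check.
template <typename Body>
int runThread(const Triple& t, Body body) {
    int visited = 0;
    for (int x = 3; x <= 10; ++x) {
        for (int y = 3; y <= 10; ++y) {
            body(t, x, y);
            ++visited;
        }
    }
    return visited;
}
```

With these bounds each thread handles 8 * 8 = 64 (x, y) pairs, so the thread count depends only on the sizes of the a, b, and c ranges.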

                                                  • Multithreaded Brook+ algorithm from a nested for loop
                                                    niravshah00

                                                    Jeff ,

                                                    I did not understand what you are trying to tell me.

                                                    Well, my equation is something like this:

                                                    A^x + B^y = C^z

                                                    And I am solving for 'z'.

                                                    My idea was that, if Brook+ supported 3D work groups, I would create a 3D array whose indices give me the values of a, b, and c, and a second 2D array whose indices give me the values of x and y, and then I would solve for z.
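For reference, solving A^x + B^y = C^z for z gives z = log(A^x + B^y) / log(C). The C++ sketch below shows that step on the CPU; `solveZ`, `isNearInteger`, and the tolerance value are hypothetical names and choices, not part of Brook+.

```cpp
#include <cmath>

// Solve A^x + B^y = C^z for z using logarithms:
//   z = log(A^x + B^y) / log(C)
double solveZ(double A, int x, double B, int y, double C) {
    return std::log(std::pow(A, x) + std::pow(B, y)) / std::log(C);
}

// True when z is within 'eps' of a whole number. The tolerance is an
// assumed value; floating-point error makes an exact comparison unsafe.
bool isNearInteger(double z, double eps = 1e-9) {
    return std::fabs(z - std::round(z)) < eps;
}
```

For example, 3^3 + 6^3 = 243 = 3^5, so `solveZ(3, 3, 6, 3, 3)` comes out at 5 up to rounding. With A up to 10000 and x up to 10, A^x reaches 1e40, which is beyond the range of a 32-bit float, so this sketch uses double.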

                                                      • Multithreaded Brook+ algorithm from a nested for loop
                                                        niravshah00

                                                        I tried 3D matrices in Brook+ and they work, but I don't know the maximum size limit.

                                                        I tried 10x10x10 and it worked, but 100x100x100 gave me an unhandled exception: stack overflow.

                                                          • Multithreaded Brook+ algorithm from a nested for loop
                                                            gaurav.garg

                                                            A stack overflow occurs if you try to allocate too much data on the stack.

                                                            If you are allocating your matrices on the stack, allocate them on the heap instead.

                                                              • Multithreaded Brook+ algorithm from a nested for loop
                                                                niravshah00

                                                                I still have not figured out the thread part.

                                                                How do you want me to use the matrices for my equation?

                                                                Also, let me know whether my idea of using the matrix indices as the values of my variables is correct.

                                                                 

                                                                  • Multithreaded Brook+ algorithm from a nested for loop
                                                                    niravshah00

                                                                    I think it is a waste of memory, because I am not going to use the matrices themselves.

                                                                    And what if I want to control the number of threads directly, rather than have it follow from the matrix size?

                                                                    This is where I am stuck.

                                                                     

                                                                    thanks,

                                                                    Nirav

                                                                      • Multithreaded Brook+ algorithm from a nested for loop
                                                                        gaurav.garg

                                                                        You can use the domainOffset and domainSize operators on your kernel. Take a look at the ExecDomain sample shipping with the SDK.

                                                                          • Multithreaded Brook+ algorithm from a nested for loop
                                                                            niravshah00

                                                                            It does not say anything about how it works or what it is used for.

                                                                            Just tell me how I can create a larger number of threads, e.g. a*b*c threads where a, b, and c are large numbers, with each of these threads in turn creating x*y threads to calculate z.

                                                                              • Multithreaded Brook+ algorithm from a nested for loop
                                                                                niravshah00

                                                                                I get this error when I try to call a kernel function from another kernel:

                                                                                 

                                                                                ERROR--2: Problem with call expression in kernel: callee unknown

                                                                                    • Multithreaded Brook+ algorithm from a nested for loop
                                                                                      niravshah00

                                                                                      Also, if you look at main, I can only create streams of size 10,10,10.

                                                                                      What do I have to do to scale the 3D stream to a much larger size?

                                                                                      When I increase the size and execute, it terminates. I remember you suggesting allocation on the heap; how do I do that?

                                                                                      kernel void threadABC(out int a<>)
                                                                                      {
                                                                                          int X,Y,Z;
                                                                                          int A,B,C;
                                                                                          int gcdAB,gcdAC,gcdBC;
                                                                                          A = instance().x+1000;
                                                                                          B = instance().y+1000;
                                                                                          C = instance().z+1000;
                                                                                          gcdAB = findGcd(A,B);
                                                                                          //gcdAC = gcd(A,C);
                                                                                          //gcdBC = gcd(B,C);
                                                                                          //if(gcdAB==1 && gcdAC==1 && gcdBC==1){
                                                                                          //threadXY(instance().x+1000,instance().y+1000,instance()+1000.z,a);
                                                                                          for( X = 3; X < 10; X++)
                                                                                          {
                                                                                              for( Y = 3; Y < 10; Y++)
                                                                                              {
                                                                                                  float sum = pow((float)A, (float)X)+pow((float)B, (float)Y);
                                                                                                  float Z = (log((float)sum)/log((float)C));
                                                                                                  float epsillon = 10E-4f;
                                                                                              }
                                                                                          }
                                                                                          //}
                                                                                      }

                                                                                      kernel int findGcd(int u,int v)
                                                                                      {
                                                                                          int gcd = 1;
                                                                                          int r ;
                                                                                          int num1=u;
                                                                                          int num2 =v;
                                                                                          while (1)
                                                                                          {
                                                                                              if (num2 == 0)
                                                                                              {
                                                                                                  gcd = num1;
                                                                                                  break;
                                                                                              }
                                                                                              else
                                                                                              {
                                                                                                  r = num1 % num2;
                                                                                                  num1 = num2;
                                                                                                  num2 = r;
                                                                                              }
                                                                                          }
                                                                                          return gcd;
                                                                                      }

                                                                                      int main(int argc, char ** argv)
                                                                                      {
                                                                                          // int i,j;
                                                                                          int a<10,10,10>;
                                                                                          //float input_a[10][10][10];
                                                                                          threadABC(a);
                                                                                          return 0;
                                                                                      }

                                                                                        • Multithreaded Brook+ algorithm from a nested for loop
                                                                                          gaurav.garg

                                                                                          You have defined the findGcd function below threadABC. As in C, parsing threadABC generates an error because the compiler has not yet seen any findGcd symbol. Just place findGcd above threadABC.

                                                                                          The stack exception might occur when you allocate your matrix like this:

                                                                                          float input_a[100][100][100];

                                                                                          instead of this use

                                                                                          float* input_a = new float[100*100*100];
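As a hedged alternative to raw `new[]`, the sketch below keeps the volume on the heap in a `std::vector`, which frees the memory automatically, and uses a flat row-major index in place of `input_a[i][j][k]`. The helper names `makeVolume` and `flatIndex` are made up for this example.

```cpp
#include <cstddef>
#include <vector>

// Allocate a D x H x W volume as one contiguous heap block, zero-initialized.
// std::vector stores its elements on the heap and releases them when it goes
// out of scope, so no delete[] is needed.
std::vector<float> makeVolume(std::size_t D, std::size_t H, std::size_t W) {
    return std::vector<float>(D * H * W, 0.0f);
}

// Row-major flat index equivalent to volume[i][j][k] for a D x H x W volume.
std::size_t flatIndex(std::size_t i, std::size_t j, std::size_t k,
                      std::size_t H, std::size_t W) {
    return (i * H + j) * W + k;
}
```

The pointer returned by the vector's `data()` member can then be passed wherever a plain `float*` buffer is expected.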

                                                                                            • Multithreaded Brook+ algorithm from a nested for loop
                                                                                              niravshah00

                                                                                              Thanks Gaurav ,

                                                                                              Thanks a ton

                                                                                                • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                  niravshah00

                                                                                                  What is the limit on the stream size?

                                                                                                  I am asking because my range needs to be scalable. If there is a limit, do I have to cap the range and call the kernel function again?

                                                                                                    • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                      gaurav.garg

                                                                                                      Brook+ is implemented on top of CAL and it uses CAL textures internally.

                                                                                                      The size limit is 8192*8192.

                                                                                                       

                                                                                                        • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                          niravshah00

                                                                                                          Then what do I have to do to get past the limit? I am pretty sure that I will have a range greater than 8192*8192.

                                                                                                            • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                              gaurav.garg

                                                                                                              You might want to partition your data into multiple tiles and process one tile after another.
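One way to picture the tiling, sketched in plain C++ for a single dimension: split a range of `total` elements into consecutive tiles no larger than the per-dimension limit, then launch the kernel once per tile. `Tile`, `makeTiles`, and the default of 8192 (taken from the limit mentioned in this thread) are illustrative assumptions; the actual kernel launch is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A contiguous slice of the full range: it starts at 'offset' and covers
// 'size' elements.
struct Tile {
    std::size_t offset;
    std::size_t size;
};

// Split [0, total) into tiles of at most 'maxTile' elements each.
// Assumes maxTile > 0. The last tile may be smaller than the others.
std::vector<Tile> makeTiles(std::size_t total, std::size_t maxTile = 8192) {
    std::vector<Tile> tiles;
    for (std::size_t off = 0; off < total; off += maxTile) {
        tiles.push_back(Tile{off, std::min(maxTile, total - off)});
    }
    return tiles;
}
```

The host loop would then visit each tile in turn, adding `tile.offset` to the index inside the kernel so that each launch covers its own part of the range.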

                                                                                                                • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                  niravshah00

                                                                                                                  By tiles, do you mean:

                                                                                                                  first  a<8192,90,90>   (since 90*90 is  8100)
                                                                                                                  then a<8192,90,90>
                                                                                                                  .
                                                                                                                  .
                                                                                                                  .
                                                                                                                  .
                                                                                                                  .
                                                                                                                  till  a<8192,rangeB,rangeC>

                                                                                                                  assuming that the range of A is 8192?

                                                                                                                  But this would require a for loop that calls the kernel function repeatedly, and that would be sequential; I mean each call to the kernel would have to wait until the previous call returns.
                                                                                                                  Can't I create multiple groups of size 8192*8192? I know it is a lot of threads, but that is the aim: to parallelize the whole range and utilize the GPU to the maximum.

                                                                                                                    • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                      gaurav.garg

                                                                                                                      On current GPUs, you can run only one kernel at a time. Even if you use multiple groups, each kernel call has to wait for the previous call.

                                                                                                                      However, multiple tiles can help you hide data transfer overhead: you can overlap the data transfer with the kernel call. FYI, both streamRead and the kernel call are asynchronous.

                                                                                                                        • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                          niravshah00

                                                                                                                          So,

                                                                                                                          can I use two functions, one with a 2D stream for A and B

                                                                                                                          and the other with a 3D stream for C, X, Y?

                                                                                                                          kernel function1(out int abstream<8192,8192>) {
                                                                                                                          .
                                                                                                                          .
                                                                                                                          function2( cxyStream);
                                                                                                                          .
                                                                                                                          }

                                                                                                                           

                                                                                                                          kernel function2 (out int cxyStream<8192,10,10>) {

                                                                                                                          }

                                                                                                                          So will each thread in function1 call function2, which in turn will create 8192*10*10 threads?

                                                                                                                           

                                                                                                                            • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                              gaurav.garg

                                                                                                                              When you use regular streams (declared with <>), you never define dimensions like this.

                                                                                                                              Also, when you call another kernel from the main kernel, the called kernel runs only for the element on which the main kernel is working. You cannot launch multiple threads from inside a kernel. function2 will get inlined into function1.

                                                                                                                                • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                  niravshah00

                                                                                                                                  So that means we can have only 8192*8192 threads running in Brook+ at a time.
                                                                                                                                  Is there no other way to have more threads running at one point in time?

                                                                                                                                    • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                      gaurav.garg

                                                                                                                                      The limit of 8192*8192 is on stream size. You can create more than 8192*8192 threads if you use scatter stream with domainSize operator. But, scatter streams have some performance overhead.

                                                                                                                                        • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                          niravshah00

                                                                                                                                          I looked at the examples in the SDK but could not understand what scatter streams are.

                                                                                                                                          My code would need more threads than 8192*8192.

                                                                                                                                          Any help?

                                                                                                                                            • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                              gaurav.garg

                                                                                                                                              For regular streams, the domain of execution is determined by the output stream size. For scatter streams, however, this can be modified. For example:

                                                                                                                                              kernel void scatter(out float4 a[][])

                                                                                                                                              {

                                                                                                                                                 int i = instance().x; int j = instance().y;

                                                                                                                                                  a[j][2*i] = 0;

                                                                                                                                                 a[j][2*i+1] = 1;

                                                                                                                                              }

                                                                                                                                              //host code

                                                                                                                                              unsigned int dim[] = {width, height};

                                                                                                                                              brook::Stream<float4> scatterStream(2, dim);

                                                                                                                                              scatter.domainOffset(uint4(0, 0, 0, 0));

                                                                                                                                              scatter.domainSize(uint4(2*width, height, 1, 1)); // number of threads is double the stream size

                                                                                                                                              scatter(scatterStream);

                                                                                                                                                • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                  niravshah00

                                                                                                                                                  So can scatter work for a 3-dimensional stream as well?

                                                                                                                                                  • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                    niravshah00

                                                                                                                                                     

                                                                                                                                                    Originally posted by: gaurav.garg
                                                                                                                                                    For regular streams, the domain of execution is decided by the output stream size. But for scatter streams this can be modified, e.g.

                                                                                                                                                    kernel void scatter(out float4 a[][])
                                                                                                                                                    {
                                                                                                                                                        int i = instance().x; int j = instance().y;
                                                                                                                                                        a[j][2*i] = 0;
                                                                                                                                                        a[j][2*i+1] = 1;
                                                                                                                                                    }

                                                                                                                                                    //host code
                                                                                                                                                    unsigned int dim[] = {width, height};
                                                                                                                                                    brook::Stream<float4> scatterStream(2, dim);
                                                                                                                                                    scatter.domainOffset(uint4(0,0,0,0));
                                                                                                                                                    scatter.domainSize(uint4(2*width, height)); // number of threads is double the stream size
                                                                                                                                                    scatter(scatterStream);

                                                                                                                                                     

                                                                                                                                                     

                                                                                                                                                    I tried creating threads like this, but for some threads the values of instance().x and instance().y come out negative.

                                                                                                                                                    Also, can I do something like scatter.domainSize(uint4(2*width, 2*height, 2*depth)); ?

                                                                                                                                                      • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                        genaganna

                                                                                                                                                         


                                                                                                                                                        Scatter works for 3-dimensional streams.

                                                                                                                                                        In the above code, please change scatter.domainSize(uint4(2*width, height)); to scatter.domainSize(uint4(width / 2, height)); since each thread writes two elements of a row.

                                                                                                                                                        Please paste your complete code here.
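To see why the domain should be width / 2 rather than 2*width, the scatter kernel can be emulated on the CPU. The following is a minimal C++ sketch, not Brook+ code: the function name emulateScatter and the flat std::vector layout are assumptions made for illustration, and how the real runtime treats out-of-range writes is not shown here. Each emulated thread (i, j) writes columns 2*i and 2*i+1, so a row of width elements is covered by width / 2 threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the scatter kernel: every "thread" (i, j) writes two
// neighbouring elements, a[j][2*i] = 0 and a[j][2*i+1] = 1.
// Returns the number of writes that would land outside a width x height stream.
int emulateScatter(int width, int height, int domainX, int domainY,
                   std::vector<int>& a) {
    assert(width > 0 && height > 0);
    a.assign(static_cast<std::size_t>(width) * height, -1);
    int outOfBounds = 0;
    for (int j = 0; j < domainY; ++j) {
        for (int i = 0; i < domainX; ++i) {
            for (int k = 0; k < 2; ++k) {
                int col = 2 * i + k;
                if (col < width && j < height) {
                    a[static_cast<std::size_t>(j) * width + col] = k;
                } else {
                    ++outOfBounds;  // this thread would write past the stream
                }
            }
        }
    }
    return outOfBounds;
}
```

With an 8 x 2 stream, a domain of (4, 2) fills every element and nothing falls outside, while a domain of (16, 2) produces writes beyond the end of each row.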

                                                                                                                                                          • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                            niravshah00

                                                                                                                                                            Well, I haven't written any concrete code; I was just trying to learn and understand how scatter works.

                                                                                                                                                            A brief history of what led me to use scatter:

                                                                                                                                                            The problem is to solve an equation that has 6 variables: A, B, C, x, y, z.

                                                                                                                                                            The range for A, B, and C will be very large, like 1000 to 10,000. My initial idea was to use a 3D stream and use the indices as the values of A, B, C. But there is a limit on the size of a stream, and I learned from this forum, especially from Gaurav, that a scatter stream can be used to get more threads.

                                                                                                                                                            But I couldn't figure out how to get all the permutations of A, B, C using a scatter stream.

                                                                                                                                                            Where can I find out how this domain size actually works?

                                                                                                                                                              • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                                niravshah00

                                                                                                                                                                Also, can anyone tell me how to return results from my kernel? I don't want to use the output stream, because barely 1 or 2 threads would give me a result, and I don't want to filter the whole output stream in the host code, since the number of threads could scale up (at least I am trying to make it do so).

                                                                                                                                                                Here is my kernel:

                                                                                                                                                                kernel void threadABC(int startRange, out int a<>)
                                                                                                                                                                {
                                                                                                                                                                    int X,Y,Z;
                                                                                                                                                                    int A,B,C;
                                                                                                                                                                    int gcdAB,gcdAC,gcdBC;
                                                                                                                                                                    A = instance().x+startRange;
                                                                                                                                                                    B = instance().y+startRange;
                                                                                                                                                                    C = instance().z+startRange;
                                                                                                                                                                    gcdAB = findGcd(A,B);
                                                                                                                                                                    gcdAC = findGcd(A,C);
                                                                                                                                                                    gcdBC = findGcd(B,C);
                                                                                                                                                                    if(gcdAB==1 && gcdAC==1 && gcdBC==1){
                                                                                                                                                                        //threadXY(instance().x+1000,instance().y+1000,instance()+1000.z,a);
                                                                                                                                                                        for( X = 3; X < 10; X++) {
                                                                                                                                                                            for( Y = 3; Y < 10; Y++) {
                                                                                                                                                                                float sum = pow((float)A, (float)X)+pow((float)B, (float)Y);
                                                                                                                                                                                float Z = (log((float)sum)/log((float)C));
                                                                                                                                                                                float epsillon = 10E-4f;
                                                                                                                                                                                if(isWithinRange(Z,epsillon)){
                                                                                                                                                                                }
                                                                                                                                                                            }
                                                                                                                                                                        }
                                                                                                                                                                    }
                                                                                                                                                                }

                                                                                                                                                                // and here is my host code
                                                                                                                                                                int main(int argc, char ** argv)
                                                                                                                                                                {
                                                                                                                                                                    // int i,j;
                                                                                                                                                                    unsigned int dim[] = {2,1,1};
                                                                                                                                                                    brook::Stream<int> aStream(2,dim);
                                                                                                                                                                    //int a<10,10,10>;
                                                                                                                                                                    threadABC.domainOffset(uint4(0,0,0,0));
                                                                                                                                                                    threadABC.domainSize(uint4(2*2,3*1,2*1));
                                                                                                                                                                    threadABC(1000,aStream);
                                                                                                                                                                    return 0;
                                                                                                                                                                }
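The kernel above calls findGcd and isWithinRange, whose definitions are not shown in the post. The following is a plain C++ sketch of what such helpers might look like, so that the per-thread test can be tried on the CPU first. Both bodies are assumptions: in particular, isWithinRange is guessed to mean "Z is within epsilon of a whole number", and exponentZ is a hypothetical wrapper around the two lines of arithmetic from the inner loop.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for findGcd: Euclid's algorithm on positive ints.
int findGcd(int a, int b) {
    assert(a > 0 && b > 0);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Hypothetical stand-in for isWithinRange: true when z lies within
// epsilon of the nearest whole number.
bool isWithinRange(float z, float epsilon) {
    return std::fabs(z - std::round(z)) < epsilon;
}

// Same arithmetic as the kernel's inner loop: Z = log(A^X + B^Y) / log(C).
float exponentZ(int A, int X, int B, int Y, int C) {
    float sum = std::pow((float)A, (float)X) + std::pow((float)B, (float)Y);
    return std::log(sum) / std::log((float)C);
}
```

For example, 2^3 + 2^3 = 16 = 2^4, so exponentZ(2, 3, 2, 3, 2) comes out close to 4 and passes the check. Note that single-precision floats carry only about 24 bits of mantissa, so for bases in the thousands and exponents up to 9 this test can report false positives; any hit would need to be confirmed with exact integer arithmetic.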

                                                                                                                                                                  • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                                    niravshah00

                                                                                                                                                                    I am very close to closing this .

                                                                                                                                                                     

                                                                                                                                                                      • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                                        gaurav.garg

                                                                                                                                                                        Domain of execution (calling domainOffset and domainSize) is not supported for 3D streams as of now. If you check error or errorLog on your stream, you should get an error saying this feature is unsupported

                                                                                                                                                                          • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                                            niravshah00

                                                                                                                                                                            Well, when I execute the project I don't get any error as such, only "indexof called on bogus address" at the command prompt.

                                                                                                                                                                            Is there any other way I can get what I want?

                                                                                                                                                                            Also, can you tell me how I can return results from these threads?
                                                                                                                                                                            There is a possibility that none of these threads would give an answer, or that only a few of them would.

                                                                                                                                                                            I don't want to filter a huge array when I know only a few of its elements would actually hold a solution.
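One simple way to handle sparse results, sketched here in plain C++ rather than Brook+, is to let every thread write a sentinel value when it finds nothing and then scan the output once on the host. The sentinel value 0 and the names collectHits and unflatten are assumptions for illustration, not part of the Brook+ API. The scan is a single linear pass, which is likely to be small next to the work each thread does in its nested loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the flat indices of all "threads" that reported a hit.
// Assumption: each thread writes 0 for "no solution", non-zero otherwise.
std::vector<std::size_t> collectHits(const std::vector<int>& out) {
    std::vector<std::size_t> hits;
    for (std::size_t idx = 0; idx < out.size(); ++idx) {
        if (out[idx] != 0) {
            hits.push_back(idx);
        }
    }
    return hits;
}

// Recovers the (x, y) thread coordinates from a flat index into a
// row-major 2D stream of the given width.
void unflatten(std::size_t idx, std::size_t width,
               std::size_t& x, std::size_t& y) {
    assert(width > 0);
    x = idx % width;
    y = idx / width;
}
```

Adding the start of the range to the recovered x and y then gives back the A and B values that thread tested.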

                                                                                                                                                                              • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                                                hazeman

                                                                                                                                                                                I have a question for niravshah00. Why do you use Brook+? It's unsupported, it's slow, and the idea of using streams for parallel programming simply failed. Except for some simple cases, it's much harder to write efficient code in Brook+ than in CUDA or OpenCL (both are almost the same).

                                                                                                                                                                                If you don't have hardware for OpenCL, you can use CAL++ (it's quite similar to OpenCL and works on all cards supported by CAL).

                                                                                                                                                                                 

                                                                                                                                                                                  • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                                                    niravshah00

                                                                                                                                                                                     


                                                                                                                                                                                    Hi,

                                                                                                                                                                                    Well, I don't have the hardware on my laptop, but in the lab I have an AMD FireStream 9170.
                                                                                                                                                                                    The reason I started using Brook+ is that I began this project in May 2009, when there was no support for OpenCL, and now I am stuck with Brook+ because I want to finish this by May 2010 in order to graduate by August 2010.
                                                                                                                                                                                    Also, OpenCL does not support the FireStream 9170!
                                                                                                                                                                                    I know my questions might make it look like I don't know anything about programming, but there were very limited resources (in fact, no material) that could help me learn Brook+.
                                                                                                                                                                                    Secondly, the equation I am solving is fairly trivial, so I thought it would be simpler to do with Brook+.

                                                                                                                                                                                    I tried reading the CAL tutorials in the SDK, but it all looked like Latin to me.
                                                                                                                                                                                    I am open to suggestions and guidance.

                                                                                                                                                                                  • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                                                    gaurav.garg

                                                                                                                                                                                     

                                                                                                                                                                                    Well when i am executing the project i dont get any error as such but "indexof called on bogus address" on the command prompt .

                                                                                                                                                                                     

                                                                                                                                                                                    Is there any other way i can get what i want ?



                                                                                                                                                                                    You will not get these errors on the command line. You need to check for an error on your stream, something like:

                                                                                                                                                                                    if(outputStream.error())

                                                                                                                                                                                    {

                                                                                                                                                                                        std::cout << outputStream.errorLog();

                                                                                                                                                                                    }

                                                                                                                                                                                      • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                                                        niravshah00

                                                                                                                                                                                         


                                                                                                                                                                                        Gaurav,

                                                                                                                                                                                        There is no other way for me to create more threads.
                                                                                                                                                                                        So can I argue that the best way to solve my problem is to use a 2D stream with a domain size, where each thread in turn runs a sequential loop over the parameters C, x and y? (If you remember, I have 5 variables: A, B, C with a large range, and x, y with a smaller range.)
                                                                                                                                                                                        As of now, that seems to be the only solution to me.
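The decomposition proposed above (one thread per (A, B) pair, with C, x and y looped sequentially inside the thread) can be prototyped on the CPU before porting it to a kernel. The following is a minimal C++17 sketch under stated assumptions: the names ipow, isPowerOf, threadAB and countHits are hypothetical, exact 64-bit integer arithmetic replaces the pow/log test, and the bases are assumed small enough that A^x fits in 64 bits (which does not hold for the full 1000 to 10,000 range).

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>  // std::gcd (C++17)

// Integer power; assumes the result fits in 64 bits (small bases only).
static std::uint64_t ipow(std::uint64_t base, int exp) {
    std::uint64_t r = 1;
    for (int i = 0; i < exp; ++i) r *= base;
    return r;
}

// True if sum == c^z for some integer z >= 1 (c must be at least 2).
static bool isPowerOf(std::uint64_t sum, std::uint64_t c) {
    assert(c >= 2);
    std::uint64_t p = c;
    while (p < sum && p <= sum / c) p *= c;  // the division guards overflow
    return p == sum;
}

// Work of one emulated thread for a fixed (a, b): loop over C, x, y.
static int threadAB(unsigned a, unsigned b, unsigned lo, unsigned hi,
                    bool requireCoprime) {
    if (requireCoprime && std::gcd(a, b) != 1) return 0;
    int hits = 0;
    for (unsigned c = lo; c <= hi; ++c) {
        if (requireCoprime && (std::gcd(a, c) != 1 || std::gcd(b, c) != 1))
            continue;
        for (int x = 3; x < 10; ++x)
            for (int y = 3; y < 10; ++y)
                if (isPowerOf(ipow(a, x) + ipow(b, y), c)) ++hits;
    }
    return hits;
}

// Emulates the 2D domain of execution: one threadAB call per (a, b) pair.
int countHits(unsigned lo, unsigned hi, bool requireCoprime) {
    int total = 0;
    for (unsigned a = lo; a <= hi; ++a)
        for (unsigned b = lo; b <= hi; ++b)
            total += threadAB(a, b, lo, hi, requireCoprime);
    return total;
}
```

With the coprimality filter switched off, the range 2 to 4 already contains a hit (2^3 + 2^3 = 2^4), which gives a quick sanity check of the loop structure. In a kernel, the two outer loops in countHits would become the domain of execution and threadAB would become the kernel body.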

                                                                                                                                      • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                        huafeihua116

                                                                                                                                        The samples aren't very helpful

                                                                                                                                            • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                              niravshah00

                                                                                                                                              Hi Gaurav,

                                                                                                                                              So do you think I should change from Brook+ to OpenCL or CAL++?
                                                                                                                                              You know my requirements, so do you think I can accomplish what I want in Brook+, or should I switch?
                                                                                                                                              I want to finish this ASAP; your help would mean a lot.

                                                                                                                                                • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                  gaurav.garg

                                                                                                                                                  Sorry for the delay in answering. I was busy with something else and was not checking my mail.

                                                                                                                                                  I am not sure I understand your algorithm very well. It would be good if you could post your host algorithm.

                                                                                                                                                  You need to understand that you have to adapt your algorithm to the GPU architecture and its limitations.

                                                                                                                                                  I would suggest that you first try a basic Brook+ implementation and then go for optimizations.

                                                                                                                                                  IIUC, you are doing something like this:

                                                                                                                                                  for a 1000:10000
                                                                                                                                                    for b 1000:10000
                                                                                                                                                      for c 1000:10000
                                                                                                                                                        for x 3:10
                                                                                                                                                          for y 3:10
                                                                                                                                                            for z 3:10

                                                                                                                                                  First, you can try to write a kernel that encapsulates the last 3 loops (for 'x', 'y', and 'z'). You can create a 2D stream for the implicit loop over 'b' and 'c' (if there are size limitations, you can do the processing in tiles), and you can keep the loop over 'a' on the host side.
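The split described above can be sketched in plain Java (not Brook+), purely as an illustration of which loops go where. The names `TiledSearchSketch`, `kernelBody`, and `runForA`, and the small ranges in `main`, are made up for this example. `BigInteger` is used only to keep the comparison exact; a real kernel would need GPU-friendly arithmetic instead.

```java
import java.math.BigInteger;

// Illustrative only: plain Java standing in for the host/kernel split.
class TiledSearchSketch {

    // Stand-in for the kernel body: the three short loops (x, y, z) for one (a, b, c).
    // Returns how many exponent triples satisfy a^x + b^y == c^z exactly.
    static int kernelBody(int a, int b, int c, int powMin, int powMax) {
        BigInteger bigA = BigInteger.valueOf(a);
        BigInteger bigB = BigInteger.valueOf(b);
        BigInteger bigC = BigInteger.valueOf(c);
        int hits = 0;
        for (int x = powMin; x <= powMax; x++) {
            for (int y = powMin; y <= powMax; y++) {
                BigInteger sum = bigA.pow(x).add(bigB.pow(y));
                for (int z = powMin; z <= powMax; z++) {
                    if (sum.equals(bigC.pow(z))) {
                        hits++;
                    }
                }
            }
        }
        return hits;
    }

    // Stand-in for one kernel launch: on the GPU the (b, c) grid would be the
    // 2D output stream, with one thread per element instead of these two loops.
    static int[][] runForA(int a, int baseMin, int baseMax, int powMin, int powMax) {
        int n = baseMax - baseMin + 1;
        int[][] out = new int[n][n];
        for (int b = baseMin; b <= baseMax; b++) {
            for (int c = baseMin; c <= baseMax; c++) {
                out[b - baseMin][c - baseMin] = kernelBody(a, b, c, powMin, powMax);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int baseMin = 2, baseMax = 6, powMin = 3, powMax = 10;
        // The loop over 'a' stays on the host; each iteration is one "launch".
        for (int a = baseMin; a <= baseMax; a++) {
            int[][] out = runForA(a, baseMin, baseMax, powMin, powMax);
            for (int b = baseMin; b <= baseMax; b++) {
                for (int c = baseMin; c <= baseMax; c++) {
                    int hits = out[b - baseMin][c - baseMin];
                    if (hits > 0) {
                        System.out.println("a=" + a + " b=" + b + " c=" + c + " hits=" + hits);
                    }
                }
            }
        }
    }
}
```

If the (b, c) grid were too large for a single stream, a tiled variant of `runForA` would take separate sub-ranges for b and c and be called once per tile.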

                                                                                                                                                    • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                      niravshah00

                                                                                                                                                      Hi,

                                                                                                                                                      Thanks for your reply.
                                                                                                                                                      I can send you the code that I have written in Java.
                                                                                                                                                      So far your understanding has been correct. My equation is A^x + B^y = C^z.

                                                                                                                                                      I am solving for z, so there are basically 5 variables. Since the range has to be flexible, I want to utilize the GPU as much as I can.

                                                                                                                                                      In my lab I have a machine with four FireStream 9170 cards (and dual quad-core processors).

                                                                                                                                                      Secondly, I need to figure out a way to send back a result (i.e. all 6 variables) only if z is within 10^-8 of an integer. I don't want to scan the entire stream on the host, since only a few of the threads would return results.

                                                                                                                                                      Let me know if you would like to see my Java (serial) code.

                                                                                                                                                       

                                                                                                                                                      Thanks very much

                                                                                                                                                        • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                          niravshah00

                                                                                                                                                          Any suggestions on how I can return my results from the kernel code to the host code?

                                                                                                                                                            • Multithreaded Brook+ algorithm from a nested for loop
                                                                                                                                                              niravshah00

                                                                                                                                                              Here is the Java code for my algorithm:

                                                                                                                                                              import java.io.File;
                                                                                                                                                              import java.io.PrintStream;



                                                                                                                                                              public class FindPossibleCounterExamples
                                                                                                                                                              {

                                                                                                                                                                  /**
                                                                                                                                                                   * @param args
                                                                                                                                                                   */
                                                                                                                                                                  public static void main(String[] args)
                                                                                                                                                                  {
                                                                                                                                                                      FindPossibleCounterExamples finder = new FindPossibleCounterExamples();
                                                                                                                                                                      try
                                                                                                                                                                      {
                                                                                                                                                                          pOStream = new PrintStream(file);
                                                                                                                                                                      }
                                                                                                                                                                      catch(Exception e)
                                                                                                                                                                      {
                                                                                                                                                                          // Without the output file pOStream stays null, so stop instead of failing later.
                                                                                                                                                                          System.out.println(e.getMessage());
                                                                                                                                                                          return;
                                                                                                                                                                      }
                                                                                                                                                                      finder.findSuitableC();
                                                                                                                                                                  }
                                                                                                                                                                  // Bases are searched in [BASE_MIN, BASE_MAX) and exponents in [POW_MIN, POW_MAX).
                                                                                                                                                                  private float BASE_MIN = 1000;
                                                                                                                                                                  private float BASE_MAX = 1006;
                                                                                                                                                                  private int POW_MAX = 10;
                                                                                                                                                                  private int POW_MIN = 3;
                                                                                                                                                                  private static final File file = new File("BealsPossibleCounterExamples.txt");
                                                                                                                                                                  private static PrintStream pOStream=null;

                                                                                                                                                                  private void findSuitableC()
                                                                                                                                                                  {

                                                                                                                                                                      for(float iA=BASE_MIN; iA<BASE_MAX; iA++)
                                                                                                                                                                      {
                                                                                                                                                                          for(float iB=BASE_MIN; iB<BASE_MAX; iB++)
                                                                                                                                                                          {
                                                                                                                                                                              // If A and B share a factor, the coprimality test below can never pass, so skip this B.
                                                                                                                                                                              if(gcd(iA,iB)>1.0)
                                                                                                                                                                                  continue;
                                                                                                                                                                              for(float iC =BASE_MIN; iC < BASE_MAX; iC++)
                                                                                                                                                                              {
                                                                                                                                                                                  // Beal says if A^X+B^Y = C^Z then A,B,C have a common prime factor.
                                                                                                                                                                                  // if the gcd is one for each, it means they dont have a common prime factor.
                                                                                                                                                                                  if(gcd(iB,iC)==1 && gcd(iC,iA)==1 && gcd(iA,iB)==1)
                                                                                                                                                                                  {
                                                                                                                                                                                      // for all C's that dont have a common factor with A and B,
                                                                                                                                                                                      // run through values of X,Y and find a value for Z.
                                                                                                                                                                                      findZ(iA,iB,iC);
                                                                                                                                                                                  }

                                                                                                                                                                              }
                                                                                                                                                                          }
                                                                                                                                                                      }
                                                                                                                                                                      pOStream.flush();
                                                                                                                                                                      pOStream.close();

                                                                                                                                                                  }

                                                                                                                                                                  private void findZ(float A, float B, float C)
                                                                                                                                                                  {
                                                                                                                                                                      for(int X = POW_MIN; X<POW_MAX; X++)
                                                                                                                                                                      {
                                                                                                                                                                          for(int Y = POW_MIN; Y<POW_MAX; Y++)
                                                                                                                                                                          {
                                                                                                                                                                              double sum = Math.pow(A, X) + Math.pow(B, Y);
                                                                                                                                                                              // Solve A^X + B^Y = C^Z for Z: Z = log(sum) / log(C).
                                                                                                                                                                              double Z = Math.log(sum) / Math.log(C);
                                                                                                                                                                              double epsillon = 10E-4f; // note: 10E-4 is 1e-3, not 1e-4
                                                                                                                                                                              if(isWithinRange(Z,epsillon))
                                                                                                                                                                              {   
                                                                                                                                                                                  String toPrint = ""+A+"^"+X+" + "+B+"^"+Y+" = "+C+"^"+Z+"\n";
                                                                                                                                                                                  pOStream.append(toPrint);
                                                                                                                                                                                  System.out.print(toPrint);
                                                                                                                                                                              }
                                                                                                                                                                          }
                                                                                                                                                                      }
                                                                                                                                                                  }
                                                                                                                                                                  // Euclid's algorithm; expects whole-number values stored as doubles.
                                                                                                                                                                  private double gcd(double x, double y)
                                                                                                                                                                  {
                                                                                                                                                                      if (y==0) return x;
                                                                                                                                                                      return gcd(y,x%y);
                                                                                                                                                                  }

                                                                                                                                                                  private boolean isWithinRange(double z, double epsillon)
                                                                                                                                                                  {
                                                                                                                                                                      // Accepts z only when it lies at most epsillon below the next integer.
                                                                                                                                                                      double ceil = Math.ceil(z);
                                                                                                                                                                      return (ceil - z) <= epsillon;
                                                                                                                                                                  }

                                                                                                                                                              }
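One possible variation on the helper methods in the listing above, offered as a sketch rather than a drop-in replacement. The posted `isWithinRange` only accepts values just below an integer, so a Z such as 5.0000001 would be rejected; the hypothetical `isNearInteger` below measures the distance to the nearest integer on either side. The names `BealHelpersSketch`, `isNearInteger`, and `solveZ` are invented for this example, and using `long` for the bases is an assumption that the bases are always whole numbers.

```java
// Illustrative only: alternative helpers for the serial search.
class BealHelpersSketch {

    // Iterative Euclid on longs; assumes non-negative whole-number inputs.
    static long gcd(long x, long y) {
        while (y != 0) {
            long t = x % y;
            x = y;
            y = t;
        }
        return x;
    }

    // True when z lies within epsilon of the nearest integer, above or below.
    static boolean isNearInteger(double z, double epsilon) {
        return Math.abs(z - Math.rint(z)) <= epsilon;
    }

    // Solves a^x + b^y = c^z for z using logarithms, as the original findZ does.
    static double solveZ(long a, int x, long b, int y, long c) {
        double sum = Math.pow(a, x) + Math.pow(b, y);
        return Math.log(sum) / Math.log(c);
    }

    public static void main(String[] args) {
        // 3^3 + 6^3 = 27 + 216 = 243 = 3^5, so z should come out very close to 5.
        double z = solveZ(3, 3, 6, 3, 3);
        System.out.println("z = " + z + ", near integer: " + isNearInteger(z, 1e-9));
        System.out.println("gcd(1000, 1005) = " + gcd(1000, 1005));
    }
}
```

Whether hits above an integer should count is a decision for the search itself; the sketch only shows how both sides could be tested with one comparison.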