19 Replies Latest reply on Jun 23, 2010 2:32 AM by niravshah00

    How to return results from kernel?

    niravshah00

      I have a kernel running and want to return a set of six integers back to the host code.

      I don't know in advance how many sets I will get. It is possible that I might not get any.

      Any thoughts on how I can do this? I am attaching the kernel code here:

      kernel void threadABC(int startRange,out int a<> )
      {
          int X,Y,Z;
          int A,B,C;
          int gcdAB,gcdAC,gcdBC;
          float N = 4093.0f;   
         
          //using the index of the output stream as the values for A,B,C
          A = instance().x+startRange;
          B = instance().y+startRange;
          C = instance().z+startRange;
         
          // initialising a to 0 so that when we filter the results we know that 0 means that
          // location does not have a result.
          a=0;

           gcdAB = findGcd(A,B);
           gcdAC = findGcd(A,C);
           gcdBC = findGcd(B,C);
          if(gcdAB==1 && gcdAC==1 && gcdBC==1){
              for( X = 3; X < 10; X++)
              {
                  for( Y = 3; Y < 10; Y++)
                  {
                      for( Z = 3; Z < 10; Z++)
                      {
                          float sum =  modulusPower((float)A,X)+modulusPower((float)B, Y);
                          float cpowerZ = modulusPower((float)C,Z);
                          sum = fmod(sum,N);
                          if(cpowerZ == sum){
                              // here the possible solution should be stored and returned to host code
                              //have to figure out the way to return the values of A,B,C,X,Y,Z to host
                            
                          }
                         
                      }
                  }
              }

          }
      }

        • How to return results from kernel?
          niravshah00

          Is this question so difficult or is it so stupid ?
          I don't get it

            • How to return results from kernel?
              Ceq

              Hi niravshah00, I would like to help you, but I'm afraid that there is no way to directly implement what you want in Brook+. Even in OpenCL your problem looks complicated (that's why it is an unproved theorem).

              Computing the values can be easy, but returning them requires some kind of variable length queue or list, which is not possible in Brook+. You can at most return a few solutions per thread (up to eight float4 values). Maybe if you could use RV870 append/consume buffers there would be a way, but I think you should look for another way.

               

              There is something that may be helpful though:

              - Create a big 2D stream, for example 4096x4096. Each thread of your kernel will be assigned a unique identifier and an output address.

              - Using the thread identifier, assign to each thread a sub-domain of the original problem (not a single element), proportional to the problem domain divided by the output stream size (you can use kernel scalar parameters to configure the offset and size of the execution).

              - Instead of storing the solutions in the thread's output, store whether or not each sub-domain has a solution. Now you have a stream that records whether a particular sub-domain has a solution, which you can save or graph.

              - If you want more detail about a sub-domain that has solutions according to the output stream, you can use the CPU or launch your GPU kernel again on that sub-domain, which will in turn be divided into smaller portions.

               

              I think that using this workaround you could have a GPU implementation that can be used to analyze large domains in much less time than using the CPU alone.
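The sub-domain scheme in the list above can be sketched on the CPU in plain C. This is only an illustration, not Brook+ code: `is_solution`, `flag_subdomains` and `collect_solutions` are hypothetical names, the domain is 1D for brevity, and the solution test is a placeholder (multiples of 1000) standing in for the real check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical placeholder for the real test: n "is a solution"
   when it is a multiple of 1000. */
static bool is_solution(long n) { return n % 1000 == 0; }

/* One "kernel launch": split [start, start + cells * cell_size) into
   `cells` sub-domains and record in flags[i] whether sub-domain i holds
   at least one solution. On the GPU, flags[i] would be the single output
   value written by thread i. */
static void flag_subdomains(long start, long cell_size, size_t cells, int *flags)
{
    assert(cell_size > 0);
    for (size_t i = 0; i < cells; i++) {
        long lo = start + (long)i * cell_size;
        flags[i] = 0;
        for (long n = lo; n < lo + cell_size; n++) {
            if (is_solution(n)) { flags[i] = 1; break; }
        }
    }
}

/* Refinement pass: re-scan only the flagged sub-domains and collect the
   actual solutions (at most max_out). Returns how many were stored. */
static size_t collect_solutions(long start, long cell_size, size_t cells,
                                const int *flags, long *out, size_t max_out)
{
    size_t count = 0;
    for (size_t i = 0; i < cells; i++) {
        if (!flags[i]) continue;            /* empty sub-domain: skip it */
        long lo = start + (long)i * cell_size;
        for (long n = lo; n < lo + cell_size; n++) {
            if (is_solution(n) && count < max_out) out[count++] = n;
        }
    }
    return count;
}
```

Calling `flag_subdomains(1, 500, 8, flags)` and then `collect_solutions` with the same arguments re-scans only four of the eight cells. The refinement pass could equally be another GPU launch restricted to the flagged sub-domains.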

                • How to return results from kernel?
                  Jawed

                  Create 6 output buffers instead of one, each holding a single integer. See attached code.

                  When you want to get fancy you can pack these outputs. e.g. make one output stream, called XYZ and another output stream called ABC and define both as int3. Then you can do this

                  XYZ.x = X ;
                  XYZ.y = Y ;
                  XYZ.z = Z ;
                  ABC.x = A ;
                  ABC.y = B ;
                  ABC.z = C ;

                  The advantage of this style of packing is it allows you to return more data (should you need to). Brook+ has a limit of 32 integers or floats, packed into 8 output streams. So in this alternative code I have set up 2 output streams, each containing 3 values.

                  Jawed

                  kernel void threadABC(int startRange, out int x<>, out int y<>, out int z<>, out int a<>, out int b<>, out int c<>)
                  {
                      int X,Y,Z;
                      int A,B,C;
                      int gcdAB,gcdAC,gcdBC;
                      float N = 4093.0f;

                      //using the index of the output stream as the values for A,B,C
                      A = instance().x+startRange;
                      B = instance().y+startRange;
                      C = instance().z+startRange;

                      // initialising a to 0 so that when we filter the results we can know that 0 means that
                      //location does not have a result.
                      x=y=z=a=b=c=0;

                      gcdAB = findGcd(A,B);
                      gcdAC = findGcd(A,C);
                      gcdBC = findGcd(B,C);
                      if(gcdAB==1 && gcdAC==1 && gcdBC==1){
                          for( X = 3; X < 10; X++)
                          {
                              for( Y = 3; Y < 10; Y++)
                              {
                                  for( Z = 3; Z < 10; Z++)
                                  {
                                      float sum = modulusPower((float)A,X)+modulusPower((float)B, Y);
                                      float cpowerZ = modulusPower((float)C,Z);
                                      sum = fmod(sum,N);
                                      if(cpowerZ == sum){
                                          // here the possible solution should be stored and returned to host code
                                          //have to figure out the way to return the values of A,B,C,X,Y,Z to host
                                          x = X ;
                                          y = Y ;
                                          z = Z ;
                                          a = A ;
                                          b = B ;
                                          c = C ;
                                      }
                                  }
                              }
                          }
                      }
                  }

                    • How to return results from kernel?
                      niravshah00

                      Hi  Jawed,

                      Thanks for the reply.

                      The solution you suggested has a problem.

                      In my kernel code the stream 'a' is a 3D stream whose indices serve as the values for A, B, C, and for the other three variables I use nested loops.

                      Basically each instance of stream 'a' is a thread that loops over the values of X, Y, Z.

                      So each possible instance of A, B, C could have many possible solutions.

                      For example, say for A=1000, B=1001, C=1002

                      there might be solutions like X=2,Y=3,Z=4
                                                and X=3,Y=4,Z=5

                      so these solutions will be written to the same location.

                      for the same instance of A,B,C we have 2 solutions.

                      I hope i have explained this clearly

                      There has to be a work around for this.

                      Thanks very much for the reply.

                      Hopefully you will reply again

                       

                        • How to return results from kernel?
                          Jawed

                          If you're expecting multiple solutions for each element in the domain of execution then one approach is to launch the kernel multiple times.

                          On the first launch, the kernel returns the first solution it finds for each element in the domain of execution.

                          On the second launch the kernel returns the second solution, and so on. Simply pass in a constant to the kernel which tells it how many solutions it should skip before finally returning.

                          So the first kernel launch would have skip=0, the second call would have skip=1, etc.

                          If you want to get clever then you make the domain of execution index into a stream which contains the explicit values of A, B, C that you want to evaluate (i.e. an input stream: uint3 evaluateABC<>). Use this instead of the instance() based technique in that earlier kernel.

                          This way the second kernel launch runs on a smaller domain (presuming that some elements in the first kernel's domain had no solution). And if a third launch is required, that will have an even smaller domain. etc. Just keep going until there are no more solutions.
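The skip-based relaunch described above can be sketched in plain C as follows. This is not Brook+ code and not the poster's kernel: `is_match`, `kernel_nth_solution` and `collect_all` are hypothetical names, and the match test (x + y + z == a) is a placeholder for the real modular check. Each call to `kernel_nth_solution` plays the role of one thread in one launch.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int x, y, z; } Triple;

/* Hypothetical placeholder for the real test on (A, X, Y, Z). */
static int is_match(int a, int x, int y, int z) { return x + y + z == a; }

/* One thread of one launch: walk X, Y, Z in a fixed order, ignore the
   first `skip` matches and write the next one to *out.
   Returns 1 if a match was written, 0 if fewer than skip + 1 exist. */
static int kernel_nth_solution(int a, int skip, Triple *out)
{
    int seen = 0;
    assert(skip >= 0 && out != NULL);
    for (int x = 3; x < 10; x++)
        for (int y = 3; y < 10; y++)
            for (int z = 3; z < 10; z++) {
                if (!is_match(a, x, y, z)) continue;
                if (seen++ == skip) {
                    out->x = x; out->y = y; out->z = z;
                    return 1;
                }
            }
    return 0;
}

/* Host loop: launch with skip = 0, 1, 2, ... until nothing comes back
   (or the output array is full). Returns the number of solutions. */
static size_t collect_all(int a, Triple *out, size_t max_out)
{
    size_t n = 0;
    while (n < max_out && kernel_nth_solution(a, (int)n, &out[n]))
        n++;
    return n;
}
```

The scan order must be identical on every launch, otherwise "the n-th solution" is not well defined. The cost is that launch n repeats the work of all earlier launches for the elements that are still active.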

                          Jawed

                            • How to return results from kernel?
                              niravshah00

                              Hi Jawed ,

                              Can you give me an example? I could not find any such examples in the samples in the SDK.

                              It would be great if I get this working .

                               

                              Thanks

                              Nirav

                                • How to return results from kernel?
                                  Jawed

                                  My first suggestion is so simple it's trivial and the second suggestion has no effect on the kernel.

                                    • How to return results from kernel?
                                      niravshah00

                                      By the first solution, do you mean the one with 6 output streams?

                                      I did not understand the second solution with multiple kernel launches.

                                      I will send you my host code as well so that you have a better idea.

                                      host code:

                                      #include "brookgenfiles\beals.h"
                                      #include "conio.h"
                                      #include "brook\stream.h"
                                      using namespace brook;

                                      int main(int argc, char ** argv)
                                      {
                                         
                                          int i,j,k,range;   
                                          int startRange =1000;
                                          int endRange = 10000;

                                          unsigned int dim[] = {10,10,10};
                                         

                                         
                                          for(i=0;i<(endRange - startRange);)
                                          {
                                              if((endRange - startRange-i)<8192)
                                                          dim[0] = endRange - startRange-i;
                                                      else
                                                          dim[0] = 8192;
                                              for(j=0;j<(endRange - startRange);)
                                              {
                                                  if((endRange - startRange-j)<90)
                                                          dim[1] = endRange - startRange-j;
                                                      else
                                                          dim[1] = 90;
                                                  for(k=0;k<(endRange - startRange);)
                                                  {
                                                     
                                                      if((endRange - startRange-k)<90)
                                                          dim[2] = endRange - startRange-k;
                                                      else
                                                          dim[2] = 90;           
                                                      Stream<int>  aStream(3,dim);
                                                      threadABC(startRange+i,startRange+j,startRange+k,aStream);
                                                      // results from the kernel to be written to a file
                                                      // want to do this in parallel
                                                      k+=90;
                                                  }
                                                  j+=90;
                                              }
                                              i+=8192;
                                          }


                                         
                                          //display the result
                                          //streamWrite(aStream,solution);
                                         
                                          /*for(i=0;i<10;i++)
                                              for(j=0;j<10;j++)
                                                  for(k=0;k<10;k++)
                                                  {
                                                      //check for non zero values since the stream is intialized to zero.
                                                      if(solution[i][j][k]!=0)
                                                      printf("a =%d,b =%d,c =%d,z =%d\n" ,i+1000,j+1000,k+1000,solution[i][j][k]);
                                                  }*/
                                          getch();
                                          return 0;
                                      }

                                        • How to return results from kernel?
                                          niravshah00

                                          Hi Jawed ,

                                          There is one more way to deal with my problem but I am not sure is that possible and if it is how to do it.

                                          Instead of creating a thread for each value of A, B, C, if I could somehow create a thread for each combination of A, B, C, X, Y, Z, then I would not have to deal with multiple solutions within a single thread.

                                          Hoping that you would reply.

                                           

                                            • How to return results from kernel?
                                              Ceq

                                              Hi niravshah00, I don't think that creating a thread for each variable would solve anything because you still have to perform a recombination step. If you want to do it anyway you could do something like this:

                                              kernel void
                                              thread6D(int3 baseABC, int3 baseXYZ, ... )
                                              {
                                                  int2 pos2D = instance( ).xy;


                                                  // Identifiers for ABC
                                                  // (the mask needs its own parentheses: >> binds tighter than &)
                                                  int idA = baseABC.x + (pos2D.x & 0x000f );
                                                  int idB = baseABC.y + ((pos2D.x & 0x00f0) >> 4 );
                                                  int idC = baseABC.z + ((pos2D.x & 0x0f00) >> 8 );


                                                  // Identifiers for XYZ
                                                  int idX = baseXYZ.x + (pos2D.y & 0x000f);
                                                  int idY = baseXYZ.y + ((pos2D.y & 0x00f0) >> 4 );
                                                  int idZ = baseXYZ.z + ((pos2D.y & 0x0f00) >> 8 );
                                                 
                                                  // Now, if you use a 4096x4096 texture you have a small 6D domain
                                                  // of 16A x 16B x 16C x 16X x 16Y x 16Z threads
                                                  // You can decompose the domain in several subdomains, you just have
                                                  // to launch several executions changing baseABC and baseXYZ

                                              ...   

                                              }
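The 4-bit field decoding in the kernel above can be checked on the CPU with a small C sketch. `Index6`, `pack3` and `decode` are hypothetical names introduced for illustration only. Note that in C-family languages `>>` binds more tightly than `&`, so an expression such as `pos & 0x00f0 >> 4` actually masks with `0x000f`; the sketch avoids the issue by shifting first and masking afterwards.

```c
#include <assert.h>

typedef struct { int a, b, c, x, y, z; } Index6;

/* Pack three 4-bit fields (each 0..15) into one 12-bit coordinate,
   i.e. a position in 0..4095 along one axis of a 4096x4096 stream. */
static int pack3(int f0, int f1, int f2)
{
    assert(f0 >= 0 && f0 < 16 && f1 >= 0 && f1 < 16 && f2 >= 0 && f2 < 16);
    return f0 | (f1 << 4) | (f2 << 8);
}

/* Recover the six 4-bit identifiers from a 2D thread position:
   pos_x carries (a, b, c) and pos_y carries (x, y, z). */
static Index6 decode(int pos_x, int pos_y)
{
    Index6 id;
    id.a =  pos_x       & 0xf;
    id.b = (pos_x >> 4) & 0xf;
    id.c = (pos_x >> 8) & 0xf;
    id.x =  pos_y       & 0xf;
    id.y = (pos_y >> 4) & 0xf;
    id.z = (pos_y >> 8) & 0xf;
    return id;
}
```

In the kernel, the base offsets (`baseABC`, `baseXYZ`) would then be added to the decoded fields, and the host would recover the six values for any flagged output position with the same decoding.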

                                                • How to return results from kernel?
                                                  niravshah00

                                                  Hi Ceq,

                                                  What I was suggesting was making a 6D stream instead of the 3D stream I am using right now, so that each thread could do something like

                                                     a = 1 ;     (a is my 6D stream and then the position would give me the values of the variables)

                                                  i.e.  if( a[i][j][k][x][y][z]  == 1) then i,j,k,x,y,z is the solution


                                                  Then please tell me what the best solution for this problem is; I have run out of options and time.

                                                  please tell me

                                                    • How to return results from kernel?
                                                      Ceq

                                                      I think the dimension of the stream isn't that relevant. As you know, you can store a 2D matrix using the memory allocated with a single malloc, all you have to do are the proper index translations.

                                                      I think that Brook+ does not support 6D streams, but even using 3D streams may result in a performance penalty as slow automatic address translation code will be generated. However, manipulating the index values, you can perfectly store 6D data on a 2D texture.

                                                      Note that as the whole domain is too big, I think you should divide it into small blocks that can be processed independently. That's probably why you'll need a parameter with some kind of offset in your kernel.
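The index translation mentioned above can be sketched in C for extents that are not powers of two. `flatten6`, `unflatten6` and `to_2d` are hypothetical helper names; the sketch assumes row-major order (last dimension varies fastest) and a 2D stream of a given width.

```c
#include <assert.h>
#include <stddef.h>

enum { DIMS = 6 };

/* Flatten a 6D index into one linear offset (row-major order). */
static size_t flatten6(const size_t idx[DIMS], const size_t extent[DIMS])
{
    size_t linear = 0;
    for (int d = 0; d < DIMS; d++) {
        assert(idx[d] < extent[d]);
        linear = linear * extent[d] + idx[d];
    }
    return linear;
}

/* Inverse of flatten6: recover the 6D index from a linear offset. */
static void unflatten6(size_t linear, const size_t extent[DIMS], size_t idx[DIMS])
{
    for (int d = DIMS - 1; d >= 0; d--) {
        idx[d] = linear % extent[d];
        linear /= extent[d];
    }
}

/* Map a linear offset to (row, col) in a 2D stream of the given width. */
static void to_2d(size_t linear, size_t width, size_t *row, size_t *col)
{
    assert(width > 0);
    *row = linear / width;
    *col = linear % width;
}
```

With extents {10, 10, 10, 7, 7, 7} (a 10x10x10 block of A, B, C and seven exponent values each for X, Y, Z) one block needs 343,000 elements, which fits in a single 4096x4096 stream. A per-block offset added to the decoded A, B, C values would select which part of the full domain a launch covers.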

                                                      I would like to be more helpful, but I'm currently quite busy with work for a deadline, so for the moment this is all I can do, sorry.

                                    • How to return results from kernel?
                                      niravshah00

                                      Thanks Ceq

                                      I don't completely understand your suggestion, but let me spend some time on what you have suggested and then get back to you.

                                      I really appreciate your help.

                                      I will get back to you in 2 days.

                                       

                                       

                                  • How to return results from kernel?
                                    MicahVillmow
                                    Variable-length arrays are not supported in OpenCL. This is specified in section 6.8.d of the OpenCL spec.
                                      • How to return results from kernel?
                                        niravshah00

                                        What I meant was that I need the kernel to give me an array of results, but I would also like to know how many results there are in the array,

                                        so that I need not scan the whole array in the host code to get the results.

                                        Some counter which each thread could access and update when it finds a solution, and of course acquire a lock on the counter as well.

                                        If you have this in Brook+, that would be great. Something like global memory.
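The shared counter described above can be illustrated with C11 atomics on the CPU. This only shows the idea and says nothing about what Brook+ offers: `Solution`, `ResultList` and the `result_list_*` functions are hypothetical names. An atomic fetch-and-add reserves a unique slot for each writer, so no separate lock is needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct { int a, b, c, x, y, z; } Solution;

typedef struct {
    Solution     *items;     /* caller-provided storage */
    size_t        capacity;  /* number of slots in items */
    atomic_size_t count;     /* total number of append attempts */
} ResultList;

static void result_list_init(ResultList *list, Solution *storage, size_t capacity)
{
    assert(storage != NULL || capacity == 0);
    list->items = storage;
    list->capacity = capacity;
    atomic_init(&list->count, 0);
}

/* Reserve a slot with an atomic increment and store the solution there.
   Returns 1 if stored, 0 if the list was already full. */
static int result_list_append(ResultList *list, Solution s)
{
    size_t slot = atomic_fetch_add(&list->count, 1);
    if (slot >= list->capacity)
        return 0;            /* no room: the solution is dropped */
    list->items[slot] = s;
    return 1;
}

/* Number of valid entries the host has to read back. */
static size_t result_list_size(ResultList *list)
{
    size_t n = atomic_load(&list->count);
    return n < list->capacity ? n : list->capacity;
}
```

After all writers finish, the host reads only the first `result_list_size(&list)` entries instead of scanning the whole output.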