10 Replies Latest reply on Jul 16, 2009 7:02 AM by titanius

    Need help converting to a brook+ kernel

    titanius
      The output stream is written at arbitrary indices, not sequential ones

      I am having some trouble converting simple code to Brook+. The problem is that the output stream is not written at consecutive indices (say, 1 to 100) but at scattered indices. Is there a type of kernel that I can use directly?

       

      idmove, idmove_mapped and jdex_index are all 1D arrays

      CODE:

       for (j = ndstart; j<=ndend; ++j){
              if(idmove[j]==1){
                  idmove_mapped[jdex_index[j]]++;
              }
          }
       for (j = ndstart; j<=ndend; ++j){
              if(idmove[j]==0){
                  idmove_mapped[jdex_index[j]]--;
              }
          }

       

      I have the above code and i guess i can convert it to

      PROBABLE KERNEL:

      kernel void idmove_mapped_jdex(int jdex_index<>, int idmove<>, out int idmove_mapped[])
      {
         
          if(idmove == 1)
              idmove_mapped[jdex_index] = idmove_mapped[jdex_index] + 1 ;
         
          if(idmove == 0)
              idmove_mapped[jdex_index] = idmove_mapped[jdex_index] - 1 ;
         
      }

       

      And use domain execution to get it working

          idmove_mapped_jdex.domainOffset(ndstart);
          idmove_mapped_jdex.domainSize(ndend-ndstart+1);
          idmove_mapped_jdex(s_jdex_index, s_idmove, s_idmove_mapped);

      Is this right? It doesn't seem to work.

       

      Thanks for reading!

       

       

       

       

        • Need help converting to a brook+ kernel
          ryta1203

          Out streams are uninitialized, so idmove_mapped[..] has nothing in it; you are essentially moving nothing to nothing. Also, it appears that you are using the same index for the read and the write. Is there any reason you feel the need to use scatter and gather?

          Essentially, you need two streams for what you want to do: an input and an output (this is a STREAMING program model, the water doesn't flow backwards).

          so something more like:

          kernel void idmove_mapped_jdex(int jdex_index<>, int idmove<>, int idmove_mapped_gather[], out int idmove_mapped_scatter[])

          {

          if (idmove==1)

          idmove_mapped_scatter[jdex_index] = idmove_mapped_gather[jdex_index]+1;

          if (idmove==0)

          idmove_mapped_scatter[jdex_index] = idmove_mapped_gather[jdex_index]-1;

          }

           

          BUT better yet why not:

          kernel void idmove_mapped_jdex(int jdex_index<>, int idmove<>, int idmove_mapped_in<>, out int idmove_mapped_out<>)

          {

          if (idmove==1)

          idmove_mapped_out= idmove_mapped_in+1;

          if (idmove==0)

          idmove_mapped_out= idmove_mapped_in-1;

          }

          ?? You are using a one-to-one mapping (i.e., the same index variable for every access), so I'm not sure why you can't use streams here.

            • Need help converting to a brook+ kernel
              titanius

               

              Originally posted by: ryta1203 out streams are uninitialized so idmove_mapped[..] has nothing in it, so you are essentially moving nothing to nothing. ALSO, it appears that you are using the same index, is there any reason you feel the need to use scatter and gather??

               

               

              thanks much for the quick reply. I totally forgot about that (took a break from brook+ and forgot everything).

               

              Well the algorithm does something like this, when you have the below indices:

              idmove    jdex_index
              1         10
              1         20
              0         30
              1         20
              0         30

              And the end result in idmove_mapped is

              idmove_mapped[10]=1

              idmove_mapped[20]=2

              idmove_mapped[30]=-2  (entries with idmove 0 count negative, entries with idmove 1 count positive)

              everything else set to 0.

               

              Basically it counts the occurrences of each value in jdex_index into idmove_mapped: negative if idmove is 0 and positive if idmove is 1.

              In its original form I'll probably need some type of syncthreads() so that the values are incremented correctly if I want to do this in parallel.

              So would this still be possible with streams, or do I have to rethink the algorithm?

                • Need help converting to a brook+ kernel
                  titanius

                  Well, I got the kernel working after setting the correct domain of execution. Why doesn't the compiler complain if I pass an int instead of a uint4 as domainSize and domainOffset? And the guide is sketchy in explaining domain of execution!

                  Anyhow the "wrong" kernel:

                  kernel void idmove_mapped_jdex(int jdex_index<>, int idmove<>, out int idmove_mapped_out[])
                  {
                      if(idmove == 1)
                          idmove_mapped_out[jdex_index]++;
                      if(idmove == 0)
                          idmove_mapped_out[jdex_index]--;
                  }

                  Is there some way to make the above and below work? All the ways I can think of to separate the input and output streams are way too complicated.

                  kernel void do_something(int index<>, out int idmove_mapped_out[])

                  {

                      idmove_mapped_out[index]++;

                  }

                   

                    • Need help converting to a brook+ kernel
                      ryta1203

                      Just like I said before...

                      You don't have to separate them, I think you just need another copy (if I'm understanding correctly).

                        • Need help converting to a brook+ kernel
                          titanius

                          Thanks for the reply ryta1203.

                          The problem is that a separate copy would not help, I think.

                          First, the code I am using to call the kernel:

                              int nsample = 5;
                              int idmove[5]={1,0,0,1,0};
                              int jdex_index[5]={1,2,2,3,4};
                              int idmove_mapped_in[5]={0,0,0,0,0};
                              int ndstart=1;
                              int ndend=3;
                             
                              for(int i=ndstart;i<=ndend;i++)
                                  printf("%d ",idmove_mapped_in[i]);
                              printf("\n");
                             
                              unsigned int u_nsample = nsample;
                              Stream <int> s_idmove(1, &u_nsample);
                              Stream <int> s_jdex_index(1, &u_nsample);
                              Stream <int> s_idmove_mapped_in(1, &u_nsample);
                             
                              streamRead(s_idmove, idmove);
                              streamRead(s_jdex_index, jdex_index);
                              streamRead(s_idmove_mapped_in, idmove_mapped_in);
                             
                              idmove_mapped_jdex.domainOffset(uint4(ndstart,0,0,0));
                              idmove_mapped_jdex.domainSize(uint4(ndend-ndstart+1,1,1,1));
                              idmove_mapped_jdex(s_jdex_index, s_idmove, s_idmove_mapped_in);
                                
                              streamWrite(s_idmove, idmove);
                              streamWrite(s_jdex_index, jdex_index);
                              streamWrite(s_idmove_mapped_in, idmove_mapped_in);

                           

                          I am not calling anything in a loop (I am depending on the kernel itself), so the older value (another copy) cannot be reused again and again in the kernel.

                          On pg. 2-22 of the Stream Guide, there is an example of a random-access write to a scatter stream. Is there any way to both read and write the same scatter stream, probably with some syncthreads so that the read-writes are atomic?

                          One example I can think of where reading and writing the same stream is the simplest way is counting the number of times a particular number is present in an array.

                            • Need help converting to a brook+ kernel
                              ryta1203

                              1. Someone else pointed out that you can use a scatter stream to both write and read. I don't know if that's correct, and it's not recommended if it is (from my understanding)... I believe there is some env var that needs to be set, though.

                              2. You don't need a whole other copy, just one for your kernel, and you can pass in the same var twice. But again, you have to set that env var in order to do that, otherwise you will get errors.

                                • Need help converting to a brook+ kernel
                                  titanius

                                   

                                  Originally posted by: ryta1203 1. Someone else pointed out that you can use a scatter stream to write and read, I don't know if that's correct and it's not suggested if it is (from my understanding)... I believe there is some env var that needs to be set tough.

                                   

                                  2. You don't need a whole other copy, just one for your kernel and you can pass in the same var twice, but again you have to set that env var in order to do that, otherwise you will get errors.

                                   

                                  Oh yeah, I found that some BRT_ something write-aliasing setting can be used to allow both reading and writing to the same place. Though it's highly advised not to use it, as it might break stuff (but people say it works for 1D and 2D arrays).

                                   

                                  Originally posted by: hagen Yes, you should be able to write to and read from a scatter stream.  (Sometimes, I use this to creat a global buffer, since brook+ does not support local arrays.) 

                                   

                                  Cool, this is a simple, efficient and, more importantly, uncomplicated way: expand to multiple streams and then probably use another kernel to sum up the rows.

                                  All other ways i thought about earlier were way more complicated!

                                   

                                  Thanks a lot ryta1203 and hagen ! Now i understand things much better.

                                   

                                   

                                   

                            • Need help converting to a brook+ kernel
                              hagen

                              Yes, you should be able to write to and read from a scatter stream.  (Sometimes, I use this to create a global buffer, since brook+ does not support local arrays.)  See the following:

                               

                              kernel void idmove_mapped_jdex(int jdex_index<>, int idmove<>, out int idmove_mapped[][])
                              {
                                  int id=instance().x;
                                  if(idmove == 1) idmove_mapped[jdex_index][id]++;
                                  if(idmove == 0) idmove_mapped[jdex_index][id]--;
                              }

                              main(){
                                int idmove<5>;
                                int _idmove[5];
                                int jdex_index<5>;
                                int _jdex_index[5];
                                int idmove_mapped<40,5>;
                                int _idmove_mapped[40][5];
                                int _idmove_mapped_reduced[40];
                                int i,j;

                                _idmove[0]=1; _jdex_index[0]=10;
                                _idmove[1]=1; _jdex_index[1]=20;
                                _idmove[2]=0; _jdex_index[2]=30;
                                _idmove[3]=1; _jdex_index[3]=20;
                                _idmove[4]=0; _jdex_index[4]=30;

                                for (i=0; i<40; i++) {
                                for (j=0; j<5; j++) {
                                  _idmove_mapped[j]=0;
                                }
                                }

                                streamRead(idmove_mapped,_idmove_mapped);
                                streamRead(idmove,_idmove);
                                streamRead(jdex_index,_jdex_index);

                                idmove_mapped_jdex(jdex_index, idmove, idmove_mapped);

                                streamWrite(idmove_mapped,_idmove_mapped);

                                for (i=0; i<40; i++) {
                                  _idmove_mapped_reduced
                              =0;
                                  for (j=0; j<5; j++) {
                                    _idmove_mapped_reduced+=_idmove_mapped[j];
                                  }
                                  printf ("%10d %10d \n",i,_idmove_mapped_reduced);
                                }

                              }

                               

                              Notice a few things in particular:

                              1. Obviously, you must initialize the accumulator array with zeros.

                              2. Multiple threads must not write to the same address in the global array.  This is why your original kernel didn't work.  (I guess writes to the same address aren't queued in the hardware?)

                              3. ryta1203 is right.  To do what you want, you need to recode it using streams.  The easiest way is shown above.  Create a separate output array for each thread by expanding the dimensionality of idmove_mapped to 2, with the second dimension equal to the number of threads, and now there are no write collisions.

                              4. The last loop in main() may be replaced by a reduce kernel.

                                • Need help converting to a brook+ kernel
                                  hagen

                                  And one more thing... there is no need to define domains.

                                    • Need help converting to a brook+ kernel
                                      hagen

                                      In my code above, occurrences of [i] were stripped by the forum and interpreted as italics markup.  So here is the code again...

                                       

                                      kernel void idmove_mapped_jdex(int jdex_index<>, int idmove<>, out int idmove_mapped[][])
                                      {
                                          int id=instance().x;
                                          if(idmove == 1) idmove_mapped[jdex_index][id]++;
                                          if(idmove == 0) idmove_mapped[jdex_index][id]--;
                                      }

                                      main(){
                                        int idmove<5>;
                                        int _idmove[5];
                                        int jdex_index<5>;
                                        int _jdex_index[5];
                                        int idmove_mapped<40,5>;
                                        int _idmove_mapped[40][5];
                                        int _idmove_mapped_reduced[40];
                                        int i,j;

                                        _idmove[0]=1; _jdex_index[0]=10;
                                        _idmove[1]=1; _jdex_index[1]=20;
                                        _idmove[2]=0; _jdex_index[2]=30;
                                        _idmove[3]=1; _jdex_index[3]=20;
                                        _idmove[4]=0; _jdex_index[4]=30;

                                        for (i=0; i<40; i++) {
                                          for (j=0; j<5; j++) {
                                            _idmove_mapped[i][j]=0;
                                          }
                                        }

                                        streamRead(idmove_mapped,_idmove_mapped);
                                        streamRead(idmove,_idmove);
                                        streamRead(jdex_index,_jdex_index);

                                        idmove_mapped_jdex(jdex_index, idmove, idmove_mapped);

                                        streamWrite(idmove_mapped,_idmove_mapped);

                                        for (i=0; i<40; i++) {
                                          _idmove_mapped_reduced[i]=0;
                                          for (j=0; j<5; j++) {
                                            _idmove_mapped_reduced[i]+=_idmove_mapped[i][j];
                                          }
                                          printf ("%10d %10d \n",i,_idmove_mapped_reduced[i]);
                                        }
                                      }