6 Replies Latest reply on Mar 25, 2009 1:27 PM by gaurav.garg

    brook question

    rveldema
      brook+ kernel with different stream sizes

      Given a kernel with different parameter stream sizes, I'd like my kernel to be called once per element of the 1st (smaller) stream. It looks almost as if it's currently called once per element of the largest stream?


      For example, given:
            kernel void   gpgpu_1_foo(float     iter_space<>,
                    int       looplen,
                    int       arr[],
                    out int   arrCopy[]) {   arrCopy = arr + 3;  }

      where iter_space contains 32 elements and arr and arrCopy contain 64 elements, the kernel currently gets called 64 times (where I want it to be 32 times). I can see this by printing arrCopy in C++ afterwards.

      I found that you can write (in C++) "gpgpu_1_foo.domainSize(32)", but that seems to have no effect, or it does something else...

      Cheers, Ronald.

       

       

        • brook question
          ryta1203

          Ronald.

          1. You don't even use iter_space in your kernel, so I'm not sure where the problem is. What exactly are you trying to do? Are you trying to copy the first 32 elements only? The last 32 elements? Some middle 32 elements?

          I'm going to assume the first 32 elements:

          1:

          kernel void foo(int looplen, int arr<>, out int arrCopy<>) { if (instance().x < 32) {arrCopy = arr+3;} }

          2:

          You should be able to call domain() when you call the kernel instead.

           

            • brook question
              rveldema

              OK, I threw away a bit too much context. What I'm trying to do is automatically recognize brook+-able loops in my (C-like) compiler and generate proper brook+ code from them.

              For example, when I see "for (int i = 0; i<32;i++) { outarr[i] = arr[i] + 3; }" I can generate brook+ code fairly easily. For performance reasons I allocate arr/outarr on the GPU and only do stream.read/write if the outputs are really needed by the CPU. So far, so good. However, I allocate outarr/arr on the GPU in the size that the programmer wanted, for example, 1000 elements. Now the brook+ kernel should execute 32 times instead of 1000 times. I can't allocate the streams smaller, as later on there can/will/should be a loop that iterates over a larger/different/smaller range of the same array.

              My 1st try was a dummy stream with 32 elements. This didn't work as brook+ somehow chooses the largest stream.

              My 2nd try was what you proposed in your alternative (1). This works fine, except that we waste (1000-32) GPU kernel invocations/threads. This is unacceptable and *should* be easily avoidable.
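To make the cost of the guarded kernel concrete, here is a minimal plain C++ emulation. It is not Brook+ code: it assumes one instance is launched per element of the output array, and `LaunchStats` and `run_guarded` are hypothetical names introduced only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bookkeeping for the emulation; not part of Brook+.
struct LaunchStats {
    std::size_t launched;  // instances started (one per output element)
    std::size_t useful;    // instances that passed the guard
};

// Emulates a kernel whose body is guarded by `if (instance().x < looplen)`.
// Assumption: one instance is launched per element of the output array.
LaunchStats run_guarded(const std::vector<int>& arr,
                        std::vector<int>& arr_copy,
                        std::size_t looplen) {
    assert(arr.size() == arr_copy.size());
    LaunchStats stats{0, 0};
    for (std::size_t x = 0; x < arr_copy.size(); ++x) {  // x stands in for instance().x
        ++stats.launched;
        if (x < looplen) {
            arr_copy[x] = arr[x] + 3;
            ++stats.useful;
        }
    }
    return stats;
}
```

With 1000-element arrays and a loop length of 32, this reports 1000 launched instances of which only 32 do any work, which is the waste described above.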

              My 3rd try was, when generating a kernel 'foo', to call "foo.domainSize(32)" before calling 'foo'. This doesn't work; it still executes the kernel 1000 times. Streams also have a 'domain' method, but calling 'domain(0,32)' on all streams has no effect, just as 'foo.domainSize(32)' has no effect.

              For completeness, I've appended my test program. Whatever I do, I get 64 instead of 32 kernel calls.

              ----------------------------------- runner.br


              kernel void   gpgpu_1_foo(int     iter_space<>,
                            int       looplen,
                            int       arr[],
                            out int   arrCoPy[])
              {
                int i = (int) indexof(iter_space).x;
                arrCoPy[i] = arr[i] + 3;
              }


              ---------------------------- main.cc


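              #include <cstdio>   // printf
              #include <cstring>  // memset
              // The Brook+ runtime header and the header generated from runner.br must also be included.

              int main(int argc,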
                   char **argv)
              {
                unsigned rank = 1;

                unsigned elts1 = 32;
                unsigned streamSize1[] = {elts1};
                brook::Stream<int> inputStream1(rank, streamSize1);

                unsigned elts2 = 64;
                unsigned streamSize2[] = {elts2};
                brook::Stream<int> inputStream2(rank, streamSize2);
                brook::Stream<int> outputStream1(rank, streamSize2);

                int input[elts2];
                memset(input, 0, sizeof(input)); 

                inputStream1.read(input);
                inputStream2.read(input);

              #if 0
                gpgpu_1_foo.domainSize(32);
              #endif

              #if 0
                inputStream1.domain(0, 32);
                inputStream2.domain(0, 32);
                outputStream1.domain(0, 32);
              #endif

                gpgpu_1_foo(inputStream1, // iterspace
                        elts1,//int looplen,
                        inputStream2,//int arr[],
                        outputStream1);//out int arr__CoPy[],
               
                int result[elts2];
                outputStream1.write(result);
               
                // let's see how many kernel invocations we did:
                int count = 0;
                for (unsigned i=0;i<elts2;i++) {
                  if (result[i]) {
                    count++;
                  }
                }
                printf("we did %d kernel invocations\n", count);
                return 0;
              }
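The check in the test program counts every nonzero element of the result, so any stale nonzero value in an element the kernel never wrote would be counted as an invocation. A slightly stricter check, sketched here in plain C++ under the assumption that every input element holds the same known value (0 in the test program), counts only elements equal to that value plus 3. `count_written` is a hypothetical helper, not part of Brook+.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Counts the elements that hold exactly the value an executed kernel
// instance would have written (input value + 3).
std::size_t count_written(const std::vector<int>& result, int input_value) {
    const int expected = input_value + 3;
    return static_cast<std::size_t>(
        std::count(result.begin(), result.end(), expected));
}
```

For the test program this would be called as `count_written(result_vector, 0)` after copying the output stream into a `std::vector<int>`.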

                • brook question
                  rveldema

                  PS: a shame that the [ and ] characters (ASCII octal 133 and 135) are filtered out by the web system in replies, so the example looks a little strange.

                  • brook question
                    Gipsel

                     

                    Originally posted by: rveldema

                    My 1st try was a dummy stream with 32 elements. This didn't work as brook+ somehow chooses the largest stream.


                    As I understand the documentation, Brook should use the size of the (largest) output stream and automatically scale all other streams accordingly.

                     

                    Originally posted by: rveldema

                    My 3rd try was, when generating a kernel 'foo', before calling 'foo', do "foo.domainSize(32)", this doesn't work, it still executes the kernel 1000 times. Streams have a 'domain' method, Calling 'domain(0,32)' on all streams has no effect, just as doing 'foo.domainSize(32)' has no effect.


                     

                    Both methods work perfectly well here.

                    But you have to use uint4 values for the domainOffset and domainSize methods of the kernel (I have no idea why it even compiles otherwise). So you should call it like this:

                    foo.domainOffset(uint4(0,0,0,0)); // would be optional for this simple case

                    foo.domainSize(uint4(32,1,1,1)); // unused dimensions have to be 1 not zero!

                    foo(arguments);
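As an illustration of what restricting the execution domain is meant to achieve, the following plain C++ sketch emulates a one-dimensional launch. It does not call Brook+; `Domain1D` and `run_in_domain` are hypothetical names, and the assumption is that only instances inside [offset, offset + size) run.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D execution domain for the emulation; not a Brook+ type.
struct Domain1D {
    std::size_t offset;  // first instance index that runs
    std::size_t size;    // number of instances that run
};

// Runs the "copy plus 3" body only for instances inside the domain and
// returns how many instances were executed.
std::size_t run_in_domain(const std::vector<int>& arr,
                          std::vector<int>& arr_copy,
                          Domain1D d) {
    assert(d.offset + d.size <= arr_copy.size());
    assert(d.offset + d.size <= arr.size());
    std::size_t invocations = 0;
    for (std::size_t x = d.offset; x < d.offset + d.size; ++x) {
        arr_copy[x] = arr[x] + 3;
        ++invocations;
    }
    return invocations;
}
```

With 64-element arrays and a domain of offset 0 and size 32, the emulation executes 32 instances and leaves elements 32 to 63 of the output untouched.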



                      • brook question
                        rveldema

                        Just tried your suggestion; it still doesn't have any effect. I'm guessing this is a bug in 1.4 (assuming you're not also running 1.4, that is).

                        R.

                         

                          • brook question
                            gaurav.garg

                            If you don't set the domain of execution using domainOffset and domainSize, the number of threads is determined by the first output stream (or scatter stream). Also, if the dimensions of your input streams (not gather arrays) don't match those of the first output stream (not scatter), the inputs are resized to match the output stream's dimensions.
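The rule for the number of threads can be written down as a tiny plain C++ function. This is only a sketch of the rule as stated in this reply, restricted to one dimension; `instance_count` is a hypothetical name, and passing 0 for the domain size is this sketch's own way of saying that domainSize was never set.

```cpp
#include <cstddef>

// Number of kernel instances for a 1D launch under the rule stated above.
// domain_size == 0 is this sketch's own convention for "domainSize was
// never set"; it is not a Brook+ convention.
std::size_t instance_count(std::size_t first_output_elems,
                           std::size_t domain_size = 0) {
    return domain_size == 0 ? first_output_elems : domain_size;
}
```

Applied to the posted test program, where the first output (scatter) stream has 64 elements, `instance_count(64)` gives 64, matching the observed behaviour, while `instance_count(64, 32)` gives the desired 32.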

                            Also, the domain operator creates a new stream, but you are not storing this new stream in a variable. I suggest you use the domainOffset and domainSize methods instead.
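The pitfall with a discarded return value can be shown with a small stand-in class. `MockStream` below is hypothetical and is not the brook::Stream class; it only mimics the behaviour described here, namely that `domain` returns a new, narrower stream and leaves the original unchanged.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a stream; NOT the brook::Stream class.
class MockStream {
public:
    explicit MockStream(std::size_t elems) : begin_(0), end_(elems) {}

    // Returns a new, narrower view; *this is left unchanged.
    MockStream domain(std::size_t begin, std::size_t end) const {
        assert(begin <= end && begin_ + end <= end_);
        MockStream sub(*this);
        sub.begin_ = begin_ + begin;
        sub.end_ = begin_ + end;
        return sub;
    }

    std::size_t size() const { return end_ - begin_; }

private:
    std::size_t begin_;
    std::size_t end_;
};
```

Calling `s.domain(0, 32);` on a 64-element `MockStream s` and ignoring the result leaves `s.size()` at 64; only `MockStream sub = s.domain(0, 32);` yields a 32-element stream that could then be passed to a kernel.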

                            As a side note, scatter is very slow compared to a regular output stream, so I suggest you change your code to this:

                            kernel void   gpgpu_1_foo(int     iter_space[],
                                          int       looplen,
                                          int       arr[],
                                          out int   arrCoPy<>)
                            {
                              int i = (int) instance().x;
                              arrCoPy = arr[i] + 3;
                            }