3 Replies Latest reply on Aug 4, 2008 2:35 PM by ryta1203

    number of threads in scatter kernels

    josopait

      Hi,

       

      I noticed that the number of threads is always determined by the size of the output stream, even if the output stream is a scatter stream.

       

      For instance, consider the following:

       

      kernel void foo(float4 a<>, out float4 b[])
      {
          ...
      }

      int main()
      {
          float4 a<10>;
          float4 b<100>;

          foo(a, b);
      }

       

      The kernel foo is called with 100 threads, one for each element of b. I find this rather odd behavior. It would seem much more natural for the number of threads to be determined by the size of the input streams in such cases.

       

      I want to perform operations on a large matrix (say, a 10x10 matrix that is provided as scatter stream b, if we stick to the above example). Because the matrix elements partially depend on one another, I cannot make use of 100 threads, but I only want 10 threads, one for every row. The way I am doing this now is to get the thread number from the output stream and return immediately if the number is too large, like so:

      kernel void foo(float4 a<>, out float4 b[])
      {
          int task = indexof(b);
          if (task >= 10)
          {
              return;
          }

          /* ... perform calculations on row 'task' ... */

      }
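
      To make the cost of this workaround concrete, here is a minimal plain C emulation of the pattern above. It is not Brook+ code: the names guarded_thread and launch_over_output and the running-sum row operation are invented for the sketch, and plain float elements stand in for float4. One call is made per element of the output, and only the first 'rows' calls do any work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU stand-in for the guarded kernel: called once per element
   of the output domain, mirroring "one thread per element of b". */
static int guarded_thread(int task, int rows, int cols, float *b)
{
    if (task >= rows) {
        return 0;   /* surplus thread: exits without touching b */
    }
    /* Placeholder row work: a running sum along the row. Each element
       depends on its left neighbour, so the row is handled by one thread. */
    for (int c = 1; c < cols; ++c) {
        b[task * cols + c] += b[task * cols + c - 1];
    }
    return 1;
}

/* Launches one "thread" per output element and returns how many did work. */
static int launch_over_output(int rows, int cols, float *b)
{
    assert(rows > 0 && cols > 0 && b != NULL);
    int active = 0;
    for (int task = 0; task < rows * cols; ++task) {
        active += guarded_thread(task, rows, cols, b);
    }
    return active;
}
```

      For a 10x10 matrix, launch_over_output(10, 10, b) makes 100 calls, of which 90 return immediately. That is the overhead the question is about.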

       

      This seems a bit silly. Is there a better way to specify the number of threads?

       

      Ingo

        • number of threads in scatter kernels
          ryta1203
          I agree that this is rather limiting. It would be great if you could specify an index input stream whose size sets the number of threads, so that all threads run over that stream rather than over the output stream (for indexof purposes).
            • number of threads in scatter kernels
              ryta1203
              Micah,

              So in that example, the index used for "a" is the index of c, not a's own, since the domain of execution runs over c, correct?

              So if c were running 100 threads, then kernel invocations 0 to 99 would read a[0] to a[99], respectively, right?

              Does this present a problem when the output stream is larger than the input stream and you want to write from the input stream to output indices much larger than the input's index?

              For example, in CPU code:

              for (i = 0; i < 128; i++)
                  for (j = 0; j < 128; j++)
                      for (k = 0; k < 9; k++)
                          out[i + 128*j + 128*128*k] = in[i + 128*j];
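
              The same copy can also be written from the output side, with one iteration per output element and the input index computed from the output index. The sketch below is plain C, not Brook+. The names source_index and replicate are made up for illustration, and it assumes the input can be read at an arbitrary computed index (a gather).

```c
#include <assert.h>
#include <stddef.h>

enum { W = 128, H = 128, COPIES = 9 };

/* Input index read by output element `o`. The output is COPIES back-to-back
   copies of the W*H input, so the copy number (k in the loop) is dropped. */
static size_t source_index(size_t o)
{
    assert(o < (size_t)W * H * COPIES);
    return o % ((size_t)W * H);
}

/* One iteration per output element, as in a kernel whose domain of
   execution is the output stream. */
static void replicate(const float *in, float *out)
{
    for (size_t o = 0; o < (size_t)W * H * COPIES; ++o) {
        out[o] = in[source_index(o)];
    }
}
```

              Written this way, every output element is produced by exactly one iteration, so the larger output index never has to be derived from the smaller input index.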