4 Replies Latest reply on Sep 29, 2008 1:59 PM by MicahVillmow

    about scatter/gather.

    Tomy

      Hi, all.

      I am writing kernels that work like gatherOP/scatterOP,

      but the scatter kernel doesn't work correctly.

       

      example program:

      kernel void sub(float a<>, out float b<>) {

                 b = a - 100;

      }

      kernel void gather(float test_index<>, float input1D[], out float gather_data<>) {

                gather_data = input1D[test_index];

      }

      kernel void scatter(float test_index<>, float gather_data<>, out float4 output1D[]) {

                output1D[test_index] = gather_data;

      }

      ......

                for(i=0;i<4;i++){

                           int index = i * Width;

                           test_index[i] = index;

                }

      ......

                float stream_index<4>;

                float gather<4>;

                float temp<4>;

                float input1D<Length>;

               

                streamRead(input1D, input_data);

                streamRead(stream_index, test_index);

                gather(stream_index, input1D, gather);

                sub(gather, temp);

                scatter(stream_index,temp,input1D);

                streamWrite(input1D, input_data);

      Running this program produces the following result:

      ./test -x 4 -y 4

      input1D:

      0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00

      test_index:

      0 4 8 12

      gather:

      0.00 4.00 8.00 12.00

      temp:

      -100.00 -96.00 -92.00 -88.00

      input_data:

      -100.00 -100.00 -100.00 -100.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00

       

      The scatter kernel behaves strangely.

      I expected the following result:

      input_data:

      -100.00 1.00 2.00 3.00 -96.00 5.00 6.00 7.00 -92.00 9.00 10.00 11.00 -88.00 13.00 14.00 15.00

      But I don't know the cause of this problem.

      Do you know what is causing it?

      Sorry for my poor English.

      Thank you.

        • about scatter/gather.
          Ceq
          Hi Tomy, the results you get aren't wrong.

          You have defined input1D as a float stream (float1), but the scatter kernel treats it as a float4 stream.
          As a float4 stream, input1D becomes { ( 0 1 2 3), ( 4 5 6 7), ( 8 9 10 11), (12 13 14 15) }
          Now the scatter kernel performs:
          input1D[0] = temp[0] // Assigns -100 to the first 4 floats
          input1D[4] // There is no such element. I'm not sure what happens internally with this assignment; it looks like it is omitted.
          input1D[8] = ... // same case

          Scatter only works on 128-bit data types, so you can't just change the kernel to float1.
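
The reinterpretation described above can be modelled on the CPU. This is only a rough sketch in plain C, not Brook+ code: the function name `scatter_as_float4` is made up for illustration, and dropping out-of-range writes is an assumption based on the output reported in the thread, not on documented behaviour.

```c
#include <stddef.h>

/*
 * CPU model of `output1D[index] = value` when the output is declared float4[].
 * buf holds n_floats floats but is addressed as n_floats / 4 four-float elements.
 * ASSUMPTION: a write to an out-of-range element is silently dropped.
 */
void scatter_as_float4(float *buf, size_t n_floats,
                       const float *index, const float *value, size_t n)
{
    size_t n_elems = n_floats / 4;
    for (size_t t = 0; t < n; ++t) {          /* one iteration per "thread" */
        if (index[t] < 0.0f) continue;        /* negative index: skipped */
        size_t e = (size_t)index[t];          /* element index, not float index */
        if (e >= n_elems) continue;           /* assumed: write is dropped */
        for (size_t c = 0; c < 4; ++c)
            buf[4 * e + c] = value[t];        /* scalar lands in x, y, z and w */
    }
}
```

With 16 floats, indices 0 4 8 12 and values -100 -96 -92 -88, only index 0 hits an existing element, so the first four floats become -100 and the rest stay unchanged, which is the output reported in the first post.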

          You could convert your input streams to float4. A kernel like this should work (don't forget to change the allocation size):

          kernel void f1tof4(float a<>, out float4 b<> ) { b = a; }

          You can also use masked writes:

          kernel void scatter(float i<>, float a<>, out float4 b[]) {
          float m = i % 4.0f;
          float d = i / 4.0f;
          if(m == 0) b[d].x = a; else
          if(m == 1) b[d].y = a; else
          if(m == 2) b[d].z = a; else
          if(m == 3) b[d].w = a;
          }
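
The index arithmetic behind the masked-write kernel above can be checked on the CPU. This is a minimal sketch in plain C rather than Brook+; `scatter_masked` is a hypothetical name, and it assumes the element index is truncated to an integer (the kernel above leaves `i / 4.0f` as a float).

```c
#include <stddef.h>

/*
 * CPU model of a masked scatter: flat index i selects four-float element
 * i / 4 and component i % 4, and only that one component is written.
 * ASSUMPTION: the element index is truncated to an integer.
 */
void scatter_masked(float *buf, size_t n_floats,
                    const float *index, const float *value, size_t n)
{
    for (size_t t = 0; t < n; ++t) {
        if (index[t] < 0.0f) continue;     /* negative index: skipped */
        size_t i = (size_t)index[t];
        if (i >= n_floats) continue;       /* out of range: skipped */
        size_t e = i / 4;                  /* which float4 element */
        size_t c = i % 4;                  /* component: 0=x, 1=y, 2=z, 3=w */
        buf[4 * e + c] = value[t];         /* other three components untouched */
    }
}
```

For the 16-float input with indices 0 4 8 12, this writes only floats 0, 4, 8 and 12, giving the result expected in the first post. It runs sequentially, so it says nothing about what happens when several GPU threads target the same float4.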

          -------------------------------------------------------------

          EDIT: To AMD: the switch statement isn't supported. If this is intended, please don't forget to document it.


            • about scatter/gather.
              Tomy

              Hi Ceq, sorry for the late reply.

               

              Thank you very much for your advice!

              The program worked after I followed it.

              However, it did not work correctly when an index value is less than 4

              or when the number of indices is less than 4.

              So I worked around this problem with the following kernel.

               

              kernel void scatter(float index<>, float input<>, out float4 out[]){

                                 float m = index%4.0f;

                                 float d = index/4.0f;

                                if(m==0 || d==0)           out[d].x=input;

                                else if(m==1 || d==0.25)  out[d].y=input;

                                else if(m==2 || d==0.5)    out[d].z=input;

                                else if(m==3 || d==0.75)   out[d].w=input;

              }

              However, it still does not work correctly when the index value is 0.

              Example:

              ./test -x 8 -y 8

              input1D:

              0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00

              test_index:

              0.00 1.00 2.00 3.00 4.00

              gather_data:

              0.00 1.00 2.00 3.00 4.00 -2.00

              temp:

              -100.0 -101.0 -102.0 -103.0 -104.0 -2.00

              input_data:

              -2.00 -101.0 -102.0 -103.0 -104.0 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00

               

              I can't solve this problem.

              Do you know what is causing it?

              Thank you.

                • about scatter/gather.
                  Ceq
                  Well, I'm not sure, but I think you should avoid having several processors write fields of the same float4
                  location. Depending on the implementation, it could lead to a race condition and undefined behaviour.
                  It would be better to change your streams and kernels so that only one thread writes each location.

                  On the other hand, it looks like the number of threads isn't determined by the index
                  or the data stream, but by the output (both input streams are resized). You can find
                  more about this topic here:

                  http://forums.amd.com/forum/me...id=328&threadid=98317
              • about scatter/gather.
                MicahVillmow
                Tomy,
                The problem here is that scatter/gather are 128-bit operations, i.e. every thread writes out four 32-bit values with each index. So, as Ceq mentioned, you can possibly have multiple threads writing to the same location. There is a way around this, but it requires write masking, which is shown in the samples/language/IL/[gather|scatter]_IL samples. This, however, requires CAL instead of Brook+.
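
To illustrate why unmasked 128-bit writes to the same location are a problem, here is a rough CPU model in plain C. It is a sketch under stated assumptions, not a description of the hardware: `scatter_unmasked_model` and `MAX_THREADS` are hypothetical names, and the "all threads read, then all threads write" ordering is just one possible interleaving. It does not try to reproduce the exact output posted earlier in the thread.

```c
#include <stddef.h>

#define MAX_THREADS 64  /* hypothetical cap, only for this sketch */

/*
 * Each "thread" reads its four-float element, patches one component in a
 * private copy, then writes all four components back (an unmasked write).
 * ASSUMPTION: every read happens before any write, to mimic concurrency.
 * Precondition: n_floats is a multiple of 4.
 */
void scatter_unmasked_model(float *buf, size_t n_floats,
                            const float *index, const float *value, size_t n)
{
    float copy[MAX_THREADS][4];
    size_t elem[MAX_THREADS];
    int valid[MAX_THREADS];
    if (n > MAX_THREADS) n = MAX_THREADS;

    /* Phase 1: read the whole element and patch a single component. */
    for (size_t t = 0; t < n; ++t) {
        valid[t] = index[t] >= 0.0f && (size_t)index[t] < n_floats;
        if (!valid[t]) continue;
        size_t i = (size_t)index[t];
        elem[t] = i / 4;
        for (size_t c = 0; c < 4; ++c)
            copy[t][c] = buf[4 * elem[t] + c];
        copy[t][i % 4] = value[t];
    }
    /* Phase 2: write back all four components; later writers win. */
    for (size_t t = 0; t < n; ++t) {
        if (!valid[t]) continue;
        for (size_t c = 0; c < 4; ++c)
            buf[4 * elem[t] + c] = copy[t][c];
    }
}
```

With indices 0 1 2 3 all landing in the first float4, only the last writer's component survives in this model and the other three updates are lost. A masked write touches only its own component, so the updates no longer overwrite each other.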