7 Replies Latest reply on Apr 10, 2009 4:40 AM by arros123

    performance on gather&scatter

    tgm@ncic.ac.cn

      The following MD function uses both gather and scatter. I found that performance on the HD4870 is extremely poor; the keyboard and mouse even become unresponsive for several seconds. Why?

      In one test case, N_vec4 = int4(10000, 2500, 96, 100000) and ng_vec4 = int4(5, 5, 17, 0); the streams pos<> and tag[] have 10000 elements, bucket[] has 40800, ne[] has 26, and nnlist has 25920000.

      kernel void streamNeigh(
          int4 N_vec4, int4 ng_vec4,
          float4 pos<>,
          int4 tag[], int bucket[], int4 ne[],
          out int nnlist[]
          )
      {
        int i, j, k;
        int ix, iy, iz;
        int id;
        int x, y, z;
        int a, na, o;
        int pnt, boff;
        int ind = instance().x;
        int stride = N_vec4.y+1;
        int offset = ind * stride;
        float4 p = pos;

        pnt = 0;
        a = tag[ind].y;
        iy = a%ng_vec4.y;
        ix = (a/ng_vec4.y)%ng_vec4.x;
        iz = a/ng_vec4.x/ng_vec4.y;
        k = tag[ind].z;
        boff = a*N_vec4.z;
        if (k < N_vec4.z && tag[ind].x != -1) {
          for (j = 0; j < N_vec4.z; j+=1) {
            if (j != k && (id = bucket[boff+j]) != -1) {
              if (tag[id].x != -1) {
                nnlist[offset+1+pnt] = id;
                pnt+=1;
              }
            }
          }
          for (j = 0; j < 26; j+=1) {
            x = ix + ne[j].x;
            y = iy + ne[j].y;
            z = iz + ne[j].z;
            na = (z+ng_vec4.z)%ng_vec4.z*ng_vec4.x*ng_vec4.y+(x+ng_vec4.x)%ng_vec4.x*ng_vec4.y+(y+ng_vec4.y)%ng_vec4.y;
            boff = na * N_vec4.z;
            for (o = 0; o < N_vec4.z; o+=1) {
              if ((id = bucket[boff+o]) != -1) {
                if (tag[id].x != -1) {
                  nnlist[offset+1+pnt] = id;
                  pnt+=1;
                }
              }
            }
          }
        }
        nnlist[offset] = pnt;
      }
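The cell arithmetic in the kernel above can be checked on the CPU. The following is a minimal C sketch, assuming the flattened cell index is laid out as a = (iz * nx + ix) * ny + iy, which is what the kernel's decode lines imply. The names Grid, cell_index, cell_coords and neighbour_cell are hypothetical and are not part of the original post.

```c
/* Grid dimensions, mirroring ng_vec4.x/.y/.z in the kernel (assumed mapping). */
typedef struct { int nx, ny, nz; } Grid;

/* Flatten (ix, iy, iz) the way the kernel's decode implies:
   a = (iz * nx + ix) * ny + iy. */
int cell_index(Grid g, int ix, int iy, int iz)
{
    return (iz * g.nx + ix) * g.ny + iy;
}

/* Inverse of cell_index, using the same arithmetic as the kernel. */
void cell_coords(Grid g, int a, int *ix, int *iy, int *iz)
{
    *iy = a % g.ny;
    *ix = (a / g.ny) % g.nx;
    *iz = a / g.nx / g.ny;
}

/* Neighbour cell with periodic wrap. dx, dy, dz are expected to be in
   {-1, 0, 1}, so adding the dimension once keeps the % operand non-negative. */
int neighbour_cell(Grid g, int a, int dx, int dy, int dz)
{
    int ix, iy, iz;
    cell_coords(g, a, &ix, &iy, &iz);
    int x = (ix + dx + g.nx) % g.nx;
    int y = (iy + dy + g.ny) % g.ny;
    int z = (iz + dz + g.nz) % g.nz;
    return cell_index(g, x, y, z);
}
```

With the 5 x 5 x 17 grid from the test case there are 425 cells, and 425 cells times 96 slots per cell gives the stated bucket[] size of 40800.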

        • performance on gather&scatter
          farukh

          I have written a scatter kernel in Brook+ and its performance is very poor: I only get about a 2x speedup.

          Here is my kernel

           

          kernel void RemoveUnshockedNameKernel(
              int total_timesteps,            // = 2
              int total_gauss,                // = 16
              int total_names,                // = 128
              int result_size_max_total_loss, // = 780
              int gpayout_0[32][128],
              int gpayout_1[32][128],
              float gprobs_1[32][128],
              float Fprobs[32][780],
              out float Hprobs[8192][780])
          {
            int2 index = instance().xy;
            int hindex = index.y;
            int findex = ....;
            for (unsigned int i = 0; i < 780; i++) {
              // some FP operations here ...
              Hprobs[hindex] = Fprobs[findex] * ....;
            }
          }

          I am calling this kernel with the following execution domain:

          RemoveUnshockedNameKernel.domainOffset(uint4(0, 0, 0, 0));
          RemoveUnshockedNameKernel.domainSize(uint4(1, 8192, 1, 1));

          I am using SDK 1.4 and a 4870X2. Result:

          CPU time = 1.5 and GPU time = 0.82 with a speedup of 1.9

          This speedup is extremely low for a 4870X2. If anybody from the AMD team could help me figure out whether I am doing anything wrong, I would really appreciate it.

          I don't want to regret my decision of going with AMD ATI.



            • performance on gather&scatter
              Ceq

              You're using gather and scatter at the same time; that looks like the traditional programming model, not streaming.

              In Brook+ you should try to use streaming as much as possible, because it has a performance advantage over both gather and scatter.

              In my experience, using gather is about two to four times slower than streams (depending on the number of gather streams and the access pattern). Scatter is much slower because it performs uncached writes, so I try to avoid it as much as possible.

              Usually you can rewrite a scatter kernel to use gather streams only; even if that takes several kernels, it can be faster. You can even use a reorder kernel if you need to.
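The scatter-to-gather rewrite can be illustrated on the CPU. This is a minimal C sketch rather than Brook+ code, and it assumes the destination map is a fixed permutation so it can be inverted once up front. The function names scatter, invert_map and gather are hypothetical.

```c
#include <stddef.h>

/* Scatter form: each input element i chooses where it lands. */
void scatter(const float *in, const int *dst, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[dst[i]] = in[i];
}

/* Invert the destination map: src[j] is the input that lands at output j.
   Assumes dst is a permutation of 0..n-1. This step is itself a scatter,
   but it only has to run once for a fixed map. */
void invert_map(const int *dst, int *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        src[dst[i]] = (int)i;
}

/* Gather form: each output element j fetches its own input, so the
   writes are sequential and only the reads are indirect. */
void gather(const float *in, const int *src, float *out, size_t n)
{
    for (size_t j = 0; j < n; ++j)
        out[j] = in[src[j]];
}
```

Both forms produce the same output array. The difference is that in the gather form every output position is written exactly once and in order, which is the shape a streaming output expects.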

              Also, try to avoid branching where possible: if threads in the same block diverge, execution can become twice as slow.

              For more information you can read the first chapter of the user guide in your BROOKDIR/doc.

              Hope that helps.

                • performance on gather&scatter
                  farukh

                  Hi Ceq and ryta1203

                   

                  Thanks for your replies and suggestions.

                  I have a couple of questions, though:

                  1. What is a reorder kernel?

                  2. I simplified the code in my kernel a little. In reality there is a do-while loop around the for loop, which makes it difficult to restructure the logic to use stream output:

                  do {
                    error = 0;
                    for (unsigned int i = 0; i < 780; i++) {
                      // some FP operations here ...
                      Hprobs[hindex] = Fprobs[findex] * ....;
                      error += ....;
                    }
                  } while (error < some_number);

                  Do you think that if I turn the body of the for loop into an out-stream kernel and run the do-while loop on the CPU, it would perform better? (I am going to give it a shot anyway and post my results.)

                   

                  Thanks again to both for your replies.

                   

                  Regards.

                    • performance on gather&scatter
                      ryta1203

                       

                      Originally posted by: farukh

                      Hi Ceq and ryta1203

                      Do you think that if I turn the body of the for loop into an out-stream kernel and run the do-while loop on the CPU, it would perform better? (I am going to give it a shot anyway and post my results.)

                      Thanks again to both for your replies.

                      Regards.


                      Yes. Getting rid of the scatter is going to increase performance. Since your array indices differ, you will probably still have to use gather OR reorder your Fprobs before calling the kernels. I'm not sure what your kernel code looks like; you haven't posted much of it.

                      Also, as far as I am aware, "instance()" returns the position within the domain of the first output, so using it to index Fprobs might give odd results.

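The split discussed here, a per-element pass on the device with the do-while loop kept on the host, can be sketched in plain C. The update rule below (a damped step toward a target) is purely a hypothetical stand-in, since the real floating-point work was not posted. The names relax_pass and iterate_on_host, the tolerance, and the iteration cap are likewise assumptions.

```c
#include <stddef.h>

/* One streaming-style pass: output element i depends only on input
   element i, so nothing is scattered. Stands in for the GPU kernel. */
void relax_pass(const float *x, const float *target,
                float *x_next, float *err, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float d;
        x_next[i] = x[i] + 0.5f * (target[i] - x[i]); /* hypothetical update */
        d = target[i] - x_next[i];
        err[i] = d < 0.0f ? -d : d;                   /* per-element error */
    }
}

/* Host-side loop: launch a pass, reduce the error, decide whether to repeat.
   Returns the number of passes that were run. */
int iterate_on_host(float *x, float *scratch, float *err,
                    const float *target, size_t n,
                    float tol, int max_iters)
{
    int iters = 0;
    float total;
    do {
        relax_pass(x, target, scratch, err, n);
        total = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            x[i] = scratch[i];  /* result of this pass feeds the next one */
            total += err[i];    /* reduction done on the host in this sketch */
        }
        ++iters;
    } while (total > tol && iters < max_iters);
    return iters;
}
```

The trade-off in this arrangement is that every iteration pays for a kernel launch and an error read-back, so it tends to pay off only when each pass does enough work to amortize that overhead.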
                        • performance on gather&scatter
                          Ceq

                          1. It isn't a special kind of kernel, just a gather kernel that changes the order of your elements. For example, you can run it after a streaming kernel, and the result is similar to performing a scatter.

                          2. As ryta1203 said, it is hard to tell from that code. Note that to use the hardware efficiently, your input vectors should have about 10000 elements. Otherwise the hardware may not be able to hide your gather latencies properly.

                          • performance on gather&scatter
                            arros123

                            I assume you are using XP for your development. Brook+ uses one thread-local variable in the generated C++ code, and XP (unlike Vista or Linux) has a limitation on using thread-local variables in a DLL that is loaded with LoadLibrary() (more information at http://msdn.microsoft.com/en-us/library/ms684175(VS.85).aspx).

                            There are two possible solutions to this issue; choose either one:

                            1. Instead of loading the libraries at runtime, make them link-time dependencies.
                            2. Change line 48 of KernelInterface.h (under $(BROOKROOT)\sdk\include\brook) from #define __THREAD__ __declspec(thread) to #define __THREAD__ and rebuild your application or DLL (there is no need to rebuild the Brook+ source). This change should not affect your application unless it calls the same kernel from multiple threads.

                             


                    • performance on gather&scatter
                      ryta1203

                       

                      Originally posted by: tgm@ncic.ac.cn The following MD function uses both gather and scatter. I found that performance on the HD4870 is extremely poor; the keyboard and mouse even become unresponsive for several seconds. Why?


                       

                      This looks like you took your CPU code, wrapped it in a kernel, and expected it to work well on the GPU.

                      Ceq is right: to get good performance from Brook+ you need to use a streaming model as much as possible. You also use a lot of branching, which is generally bad because it's a SIMD (Single Instruction Multiple Data) engine: if one thread in the wavefront takes a path, every thread in the wavefront has to step through that path.
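The cost of divergence on a SIMD engine can be shown with a deliberately simplified model in C. It assumes a wavefront pays the full length of a branch path as soon as at least one lane takes it, and it ignores any fixed branch overhead. The function name wavefront_cost and the per-path costs are hypothetical.

```c
#include <stddef.h>

/* Issue slots a SIMD wavefront spends on an if/else under the simplified
   model: a path costs its full length once any lane takes it. */
int wavefront_cost(const int *takes_then, size_t lanes,
                   int then_cost, int else_cost)
{
    int any_then = 0, any_else = 0;
    for (size_t i = 0; i < lanes; ++i) {
        if (takes_then[i])
            any_then = 1;
        else
            any_else = 1;
    }
    return (any_then ? then_cost : 0) + (any_else ? else_cost : 0);
}
```

Under this model, a wavefront in which every lane takes the same 10-slot path costs 10 slots, while a single lane choosing the other 10-slot path raises the cost to 20. That matches the "twice as slow" figure mentioned earlier in the thread for a two-way branch.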