5 Replies Latest reply on Nov 8, 2009 7:07 PM by eduardoschardong

    How to optimize the kernel with Brook+

    licoah

      I have optimized this kernel, but the performance is still not very good.

      Are there any special tricks in Brook+ that I have not used for this kernel?

      kernel void
      kernel_brook1(int nCha, int pSize, int gSize, int AF, int nPatLin, int nCol, int nLines, int halbpSize, int firstToSkip, int oWidth, int iWidth, int wWidth, int nChapSize, int SkipLines, float2 dataIn[][], float2 WsI[][], out float2 dataOut<>) {

          float2 res = float2(0.0f,0.0f);
          int2 pos = instance().xy;
          float2 w1,w2,w3,w4,x1,x2,x3,x4;
          int Y = pos.y / 4;
          int X = pos.y%4*oWidth + pos.x;//(pos.y - Y * 4)*oWidth + pos.x;
          int cntG = Y / gSize;
          int cntAF = Y - gSize * cntG;
          int cntCha = X / nCol;
          int cntP = X%nCol; //X - cntCha*nCol;
          int dataN = nChapSize; // number of source samples
          int Widx, Inputidx;
          int k = 0;

          //compute start index in weights matrix
          Widx = nChapSize *gSize * cntCha + nChapSize * cntAF;//vvvvv*******

          //compute start index in input matrix
          if(cntG >= firstToSkip)cntG = cntG + SkipLines;
          Inputidx = nCha * (cntG - halbpSize + 1);


          //scalar product
          while(k < dataN){
              w1 = WsI[cntP][Widx];
              Widx += 1;
              w2 = WsI[cntP][Widx];
              Widx += 1;
              w3 = WsI[cntP][Widx];
              Widx += 1;
              w4 = WsI[cntP][Widx];
              Widx += 1;
              x1 = dataIn[cntP][Inputidx];
              Inputidx += 1;
              x2 = dataIn[cntP][Inputidx];
              Inputidx += 1;
              x3 = dataIn[cntP][Inputidx];
              Inputidx += 1;
              x4 = dataIn[cntP][Inputidx];
              Inputidx += 1;
              res.y += w1.y * x1.x + w1.x * x1.y + w2.y * x2.x + w2.x * x2.y + w3.y * x3.x + w3.x * x3.y + w4.y * x4.x + w4.x * x4.y;
              res.x += w1.x * x1.x - w1.y * x1.y + w2.x * x2.x - w2.y * x2.y + w3.x * x3.x - w3.y * x3.y + w4.x * x4.x - w4.y * x4.y;
              k += 4;
          }

          dataOut =  res;

      }

        • How to optimize the kernel with Brook+
          MicahVillmow
          First, use float4 instead of float2 for the gather inputs (dataIn and WsI); this halves the number of reads you need.
          Second, use vector math and swizzles where possible instead of a long run of scalar math.
          w1 = WsI[cntP][Widx];
          Widx += 1;
          w2 = WsI[cntP][Widx];
          Widx += 1;
          w3 = WsI[cntP][Widx];
          Widx += 1;
          w4 = WsI[cntP][Widx];
          Widx += 1;
          should be:
          w1 = WsI[cntP][Widx];
          w2 = WsI[cntP][Widx + 1];
          w3 = WsI[cntP][Widx + 2];
          w4 = WsI[cntP][Widx + 3];
          Widx += 4;

          //compute start index in input matrix
          if(cntG >= firstToSkip)cntG = cntG + SkipLines;

          can be written without a branch as:
          cntG = cntG + (SkipLines * (int)(cntG >= firstToSkip));
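
The equivalence of the two forms can be checked on the host. This is a minimal sketch in plain C, not Brook+; the helper names skip_branch and skip_branchless are hypothetical, and it assumes the comparison evaluates to 0 or 1 as it does in C.

```c
/* Branching form, as in the original kernel. */
static int skip_branch(int cntG, int firstToSkip, int SkipLines) {
    if (cntG >= firstToSkip) cntG = cntG + SkipLines;
    return cntG;
}

/* Branch-free form: (cntG >= firstToSkip) is 0 or 1 in C,
   so SkipLines is added only when the condition holds. */
static int skip_branchless(int cntG, int firstToSkip, int SkipLines) {
    return cntG + SkipLines * (int)(cntG >= firstToSkip);
}
```

Whether the branch-free form is actually faster depends on what the Brook+ compiler emits for the comparison, so it is worth inspecting the generated ISA.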

          Finally, don't use division/modulus unless you absolutely have to.
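
For the fixed divisor 4 in the kernel (pos.y / 4 and pos.y % 4), one possible replacement is a shift and a mask. The sketch below is plain C with hypothetical helper names; it is valid only for non-negative values, and whether the Brook+ compiler accepts shifts and masks on int is an assumption to verify. For a runtime divisor such as gSize or nCol, the remainder can be derived from the quotient, which is what the commented-out alternatives in the original kernel already do.

```c
/* Quotient and remainder by 4 without '/' or '%'.
   Valid only for non-negative y and a power-of-two divisor. */
static int quot4(int y) { return y >> 2; }
static int rem4(int y)  { return y & 3; }

/* General divisor d: reuse an already computed quotient q = x / d,
   so only one division is needed for both results. */
static int rem_from_quot(int x, int d, int q) { return x - q * d; }
```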
          • How to optimize the kernel with Brook+
            eduardoschardong

            Hi licoah,

            I played a little with your code, focusing mostly on the main loop. Besides the tips Micah already gave, I have a few more:

            1) When Brook+ runs in PS (pixel shader) mode, which is whenever you don't put an Attribute[GroupSize()] on the kernel, it fetches by sampling textures. Sampling expects float coordinates (in fact, a float2), so if you pass an int it has to be converted from int to float, and only the T unit does that. CS (compute shader) mode expects int.

            2) Because the values are float2, the compiler generates extra MOVs when they are not in the same register, but that is easy to solve.

            3) Is it possible for you to change the data layout? To me, using a pair {X, Y} of float4 instead of four float2 seems fine.
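
The layout change in point 3 can be illustrated on the host. This is a sketch in plain C, not Brook+: the float2/float4 structs and the helpers dot4, cmac_interleaved and cmac_split are hypothetical stand-ins. It assumes the real parts of four consecutive samples are packed into one float4 (X) and the imaginary parts into another (Y), and shows that the complex multiply-accumulate then reduces to four dot products.

```c
typedef struct { float x, y; } float2;        /* stand-in for Brook+ float2 */
typedef struct { float x, y, z, w; } float4;  /* stand-in for Brook+ float4 */

static float dot4(float4 a, float4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Interleaved layout: four (re, im) pairs, as in the original kernel. */
static float2 cmac_interleaved(const float2 w[4], const float2 x[4]) {
    float2 r = {0.0f, 0.0f};
    for (int i = 0; i < 4; ++i) {
        r.x += w[i].x * x[i].x - w[i].y * x[i].y;  /* real part */
        r.y += w[i].y * x[i].x + w[i].x * x[i].y;  /* imaginary part */
    }
    return r;
}

/* Split layout: real parts in wX/xX, imaginary parts in wY/xY. */
static float2 cmac_split(float4 wX, float4 wY, float4 xX, float4 xY) {
    float2 r;
    r.x = dot4(wX, xX) - dot4(wY, xY);
    r.y = dot4(wY, xX) + dot4(wX, xY);
    return r;
}
```

The split form issues two float4 loads per operand instead of four float2 loads, at the cost of repacking the input data once on the host.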

            One last thing: how slow is it? What does the input data look like? How large are the streams?

            Here is a piece of my code; if everything works, it should perform about twice as fast:

             

            kernel void
            kernel_brook1(int nCha, int pSize, int gSize, int AF, int nPatLin, int nCol, int nLines, int halbpSize, int firstToSkip, int oWidth, int iWidth, int wWidth, int nChapSize, int SkipLines, float4 dataInX[][], float4 dataInY[][], float4 WsIX[][], float4 WsIY[][], out float2 dataOut<>) {

                float2 res = float2(0.0f,0.0f);
                int2 pos = instance().xy;
                float4 xX,xY,wX,wY;
                int Y = pos.y / 4;
                int X = pos.y%4*oWidth + pos.x;//(pos.y - Y * 4)*oWidth + pos.x;
                int cntG = Y / gSize;
                int cntAF = Y - gSize * cntG;
                int cntCha = X / nCol;
                float cntP = X%nCol; //X - cntCha*nCol;
                float dataN = nChapSize/4; // number of float4 loads (four source samples each)
                float4 k = 0; // k.x = loop counter, k.y = weights index, k.z = input index, k.w = row (cntP)

                //compute start index in weights matrix
                //(offsets are still in samples; divide by 4 if each float4 packs four consecutive samples)
                k.y = nChapSize *gSize * cntCha + nChapSize * cntAF;//vvvvv*******

                //compute start index in input matrix
                if(cntG >= firstToSkip)cntG = cntG + SkipLines;
                k.z = nCha * (cntG - halbpSize + 1);
                k.w = cntP;

                //scalar product, four complex samples per iteration
                while(k.x < dataN){ // loop on the counter k.x; k.w holds the row and never changes
                    wX = WsIX[k.wy];
                    wY = WsIY[k.wy];
                    xX = dataInX[k.wz];
                    xY = dataInY[k.wz];
                    res.y += dot(wY, xX) + dot(wX, xY);
                    res.x += dot(wX, xX) - dot(wY, xY);
                    k.xyz += 1;
                }

                dataOut = res;
            }