licoah
Journeyman III

How to optimize the kernel with Brook+

I have optimized this kernel, but the performance is still not very good.

Are there any special tricks in Brook+ that I have not used for this kernel?

kernel void
kernel_brook1(int nCha, int pSize, int gSize, int AF, int nPatLin, int nCol, int nLines, int halbpSize, int firstToSkip, int oWidth, int iWidth, int wWidth, int nChapSize, int SkipLines, float2 dataIn[][], float2 WsI[][], out float2 dataOut<>) {

    float2 res = float2(0.0f,0.0f);
    int2 pos = instance().xy;
    float2 w1,w2,w3,w4,x1,x2,x3,x4;
    int Y = pos.y / 4;
    int X = pos.y%4*oWidth + pos.x;//(pos.y - Y * 4)*oWidth + pos.x;
    int cntG = Y / gSize;
    int cntAF = Y - gSize * cntG;
    int cntCha = X / nCol;
    int cntP = X%nCol; //X - cntCha*nCol;
    int dataN = nChapSize; // number of source samples
    int Widx, Inputidx;
    int k = 0;

    //compute start index in weights matrix
    Widx = nChapSize *gSize * cntCha + nChapSize * cntAF;//vvvvv*******

    //compute start index in input matrix
    if(cntG >= firstToSkip)cntG = cntG + SkipLines;
    Inputidx = nCha * (cntG - halbpSize + 1);


    //scalar product
    while(k < dataN){
        w1 = WsI[cntP][Widx];
        Widx += 1;
        w2 = WsI[cntP][Widx];
        Widx += 1;
        w3 = WsI[cntP][Widx];
        Widx += 1;
        w4 = WsI[cntP][Widx];
        Widx += 1;
        x1 = dataIn[cntP][Inputidx];
        Inputidx += 1;
        x2 = dataIn[cntP][Inputidx];
        Inputidx += 1;
        x3 = dataIn[cntP][Inputidx];
        Inputidx += 1;
        x4 = dataIn[cntP][Inputidx];
        Inputidx += 1;
        res.y += w1.y * x1.x + w1.x * x1.y + w2.y * x2.x + w2.x * x2.y + w3.y * x3.x + w3.x * x3.y + w4.y * x4.x + w4.x * x4.y;
        res.x += w1.x * x1.x - w1.y * x1.y + w2.x * x2.x - w2.y * x2.y + w3.x * x3.x - w3.y * x3.y + w4.x * x4.x - w4.y * x4.y;
        k += 4;
    }

    dataOut =  res;

}

0 Likes
5 Replies

First, use a float4 gather instead of a float2; this reduces the number of reads that you need by a factor of two.
Second, use vector math and swizzles where possible instead of a bunch of scalar math.
w1 = WsI[cntP][Widx];
Widx += 1;
w2 = WsI[cntP][Widx];
Widx += 1;
w3 = WsI[cntP][Widx];
Widx += 1;
w4 = WsI[cntP][Widx];
Widx += 1;
should be:
w1 = WsI[cntP][Widx];
w2 = WsI[cntP][Widx + 1];
w3 = WsI[cntP][Widx + 2];
w4 = WsI[cntP][Widx + 3];
Widx += 4;

//compute start index in input matrix
if(cntG >= firstToSkip)cntG = cntG + SkipLines;

can be written without a branch as:
cntG = cntG + (SkipLines * (int)(cntG >= firstToSkip));
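The equivalence of the two forms can be checked on the CPU. The following is a minimal sketch in plain C rather than Brook+; the function names skip_branch and skip_select are hypothetical, and it assumes that a comparison cast to int yields 0 or 1, as it does in C.

```c
/* Branching form, as written in the original kernel. */
static int skip_branch(int cntG, int firstToSkip, int SkipLines) {
    if (cntG >= firstToSkip) cntG = cntG + SkipLines;
    return cntG;
}

/* Branch-free form: the comparison evaluates to 0 or 1,
   so SkipLines is added either zero times or once. */
static int skip_select(int cntG, int firstToSkip, int SkipLines) {
    return cntG + SkipLines * (int)(cntG >= firstToSkip);
}
```

Both functions return the same value for every input. Whether the branch-free form is actually faster depends on what the Brook+ compiler generates for each, so it is worth timing both.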

Finally, don't use division/modulus unless you absolutely have to.
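As a rough illustration of that last point, here is a sketch in plain C (not Brook+) of two common ways to avoid a division or modulus. The helper names are hypothetical. It assumes non-negative indices, which holds for values derived from instance(), and the shift and mask apply only because the divisor 4 is a power of two.

```c
/* pos.y / 4 and pos.y % 4 for non-negative y, using shift and mask. */
static int div4(int y) { return y >> 2; }
static int mod4(int y) { return y & 3; }

/* Remainder recovered from an already computed quotient q = x / n,
   as in the commented-out "X - cntCha*nCol" of the original kernel:
   one multiply and one subtract instead of a second division. */
static int rem_from_quot(int x, int q, int n) { return x - q * n; }
```

In the kernel this would correspond to computing cntCha = X / nCol once and then cntP = X - cntCha * nCol, which is the variant the original code already has in a comment.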
0 Likes
eduardoschardong
Journeyman III

Hi licoah,

I played a little with your code, focusing mostly on the main loop. Besides the tips Micah already gave, I have a few more:

1) When Brook+ runs in PS (pixel shader) mode, which is whenever you don't put a GroupSize attribute on the kernel, it does its fetches by sampling textures. Sampling expects floats as coordinates (in fact, a float2); if you pass an int, it has to be converted from int to float, and only the T unit does that. CS (compute shader) mode expects int.

2) Because the coordinate is a float2, the compiler generates MOVs when the two components are not in the same register, but that is easy to solve.

3) Is it possible for you to change the data layout? To me, using a pair {X, Y} of float4 instead of 4 float2 seems OK.

One last thing: how slow is it? What does the input data look like? How large are the streams?

Here is a piece of my code. If it all works, it should perform twice as fast:

 

kernel void kernel_brook1(int nCha, int pSize, int gSize, int AF, int nPatLin, int nCol, int nLines, int halbpSize, int firstToSkip, int oWidth, int iWidth, int wWidth, int nChapSize, int SkipLines, float4 dataInX[][], float4 dataInY[][], float4 WsIX[][], float4 WsIY[][], out float2 dataOut<>) {

    float2 res = float2(0.0f,0.0f);
    int2 pos = instance().xy;
    float4 xX,xY,wX,wY;
    int Y = pos.y / 4;
    int X = pos.y%4*oWidth + pos.x;//(pos.y - Y * 4)*oWidth + pos.x;
    int cntG = Y / gSize;
    int cntAF = Y - gSize * cntG;
    int cntCha = X / nCol;
    float cntP = X%nCol; //X - cntCha*nCol;
    float dataN = nChapSize/4; // number of float4 groups of source samples
    float4 k = 0; // k.x: loop counter, k.y: weights index, k.z: input index, k.w: cntP

    //compute start index in weights matrix
    k.y = nChapSize *gSize * cntCha + nChapSize * cntAF;//vvvvv*******

    //compute start index in input matrix
    if(cntG >= firstToSkip)cntG = cntG + SkipLines;
    k.z = nCha * (cntG - halbpSize + 1);
    k.w = cntP;

    //scalar product; k.x counts iterations, k.w stays fixed at cntP
    while(k.x < dataN){
        wX = WsIX[k.wy];
        wY = WsIY[k.wy];
        xX = dataInX[k.wz];
        xY = dataInY[k.wz];
        res.y += dot(wY, xX) + dot(wX, xY);
        res.x += dot(wX, xX) - dot(wY, xY);
        k.xyz += 1;
    }

    dataOut = res;
}
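To see why the split {X, Y} layout gives the same result as the interleaved float2 layout, the following plain C sketch (not Brook+) computes the complex scalar product both ways. The type cfloat, the function names, and the small data arrays are all hypothetical; the sample count is assumed to be a multiple of 4.

```c
/* A complex number as a (re, im) pair, standing in for a Brook+ float2. */
typedef struct { float re, im; } cfloat;

/* Hypothetical demo data: 4 complex weights and 4 complex samples. */
static const float W_INTERLEAVED[8] = {1, 2, 3, 4, 5, 6, 7, 8};
static const float X_INTERLEAVED[8] = {1, -1, 2, 0, 0, 1, -1, 2};
/* The same values with real and imaginary parts in separate arrays. */
static const float W_RE[4] = {1, 3, 5, 7};
static const float W_IM[4] = {2, 4, 6, 8};
static const float X_RE[4] = {1, 2, 0, -1};
static const float X_IM[4] = {-1, 0, 1, 2};

/* Interleaved layout: one (re, im) pair per element, as in the first kernel. */
static cfloat cdot_interleaved(const float *w, const float *x, int n) {
    cfloat res = {0.0f, 0.0f};
    for (int i = 0; i < n; ++i) {
        float wr = w[2 * i], wi = w[2 * i + 1];
        float xr = x[2 * i], xi = x[2 * i + 1];
        res.re += wr * xr - wi * xi;
        res.im += wi * xr + wr * xi;
    }
    return res;
}

/* Four-element dot product, standing in for dot() on float4 values. */
static float dot4(const float *a, const float *b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Split layout: four real parts and four imaginary parts per step. */
static cfloat cdot_split(const float *wX, const float *wY,
                         const float *xX, const float *xY, int n) {
    cfloat res = {0.0f, 0.0f};
    for (int i = 0; i < n; i += 4) {
        res.im += dot4(wY + i, xX + i) + dot4(wX + i, xY + i);
        res.re += dot4(wX + i, xX + i) - dot4(wY + i, xY + i);
    }
    return res;
}
```

With the demo data, both functions give the same sum. On the host side, the split layout means repacking each float2 stream into one stream of real parts and one stream of imaginary parts before the kernel call.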

0 Likes

Thank you very much for your help.

I get only 17 GFLOPS. The card is an HD 4870.

float2 dataIn{1664, 256}

float2 WsI{6144, 256}

float2 dataOut{2048, 440}

I use float2 because the data are complex numbers.

 

0 Likes

I have tried your code. It works nicely.

 

But when I set k.y = (nChapSize *gSize * cntCha + nChapSize * cntAF)/4, the execution time increases again.

Why?

0 Likes

Try:

k.y = (nChapSize *gSize * cntCha + nChapSize * cntAF)/4.0f;

 

As a general note, there are too many integer divisions, and they are the slowest type of operation. I improved the code by replacing all integers with floats and using floor for mod.
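The floor-based replacement for mod can be sketched in plain C (not Brook+) as follows. The helper names are hypothetical, and floor_nonneg is a stand-in for the floor intrinsic that is only valid for non-negative values. The sketch also assumes that the indices are small enough to be exact in a float and that the division is accurate enough not to round up across an integer boundary, which should be verified on the actual hardware.

```c
/* Stand-in for floor(), valid for non-negative values only:
   truncation toward zero equals floor when v >= 0. */
static float floor_nonneg(float v) { return (float)(int)v; }

/* Integer-style quotient computed entirely with floats. */
static float fdiv_floor(float x, float y) { return floor_nonneg(x / y); }

/* Integer-style remainder: x - y * floor(x / y). */
static float fmod_floor(float x, float y) {
    return x - y * floor_nonneg(x / y);
}
```

In the kernel, an expression such as X % nCol would then become X - nCol * floor(X / nCol) with X and nCol held as floats, and the quotient floor(X / nCol) can be reused for cntCha.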

0 Likes