I have optimized this kernel, but the performance is still not very good.
Are there any special tricks in Brook+ that I have not used for this kernel?
kernel void
kernel_brook1(int nCha, int pSize, int gSize, int AF, int nPatLin, int nCol, int nLines, int halbpSize, int firstToSkip, int oWidth, int iWidth, int wWidth, int nChapSize, int SkipLines, float2 dataIn[][], float2 WsI[][], out float2 dataOut<>) {
    float2 res = float2(0.0f,0.0f);
    int2 pos = instance().xy;
    float2 w1,w2,w3,w4,x1,x2,x3,x4;
    int Y = pos.y / 4;
    int X = pos.y%4*oWidth + pos.x;//(pos.y - Y * 4)*oWidth + pos.x;
    int cntG = Y / gSize;
    int cntAF = Y - gSize * cntG;
    int cntCha = X / nCol;
    int cntP = X%nCol; //X - cntCha*nCol;
    int dataN = nChapSize; // number of source samples
    int Widx, Inputidx;
    int k = 0;
    //compute start index in weights matrix
    Widx = nChapSize *gSize * cntCha + nChapSize * cntAF;//vvvvv*******
    //compute start index in input matrix
    if(cntG >= firstToSkip)cntG = cntG + SkipLines;
    Inputidx = nCha * (cntG - halbpSize + 1);
    //scalar product
    while(k < dataN){
        w1 = WsI[cntP][Widx];
        Widx += 1;
        w2 = WsI[cntP][Widx];
        Widx += 1;
        w3 = WsI[cntP][Widx];
        Widx += 1;
        w4 = WsI[cntP][Widx];
        Widx += 1;
        x1 = dataIn[cntP][Inputidx];
        Inputidx += 1;
        x2 = dataIn[cntP][Inputidx];
        Inputidx += 1;
        x3 = dataIn[cntP][Inputidx];
        Inputidx += 1;
        x4 = dataIn[cntP][Inputidx];
        Inputidx += 1;
        res.y += w1.y * x1.x + w1.x * x1.y + w2.y * x2.x + w2.x * x2.y + w3.y * x3.x + w3.x * x3.y + w4.y * x4.x + w4.x * x4.y;
        res.x += w1.x * x1.x - w1.y * x1.y + w2.x * x2.x - w2.y * x2.y + w3.x * x3.x - w3.y * x3.y + w4.x * x4.x - w4.y * x4.y;
        k += 4;
    }
    dataOut = res;
}
Hi licoah,
I played a little with your code, focusing mostly on the main loop. Besides the tips Micah already gave, I have a few more:
1) When Brook+ runs in PS (pixel shader) mode, which is whenever you don't put an Attribute[GroupSize()] on the kernel, it fetches by sampling textures. Sampling expects floats as coordinates (a float2, in fact), so if you pass an int it has to be converted from int to float, and only the T unit does that. CS (compute shader) mode expects ints.
2) Because the operands are float2, the compiler will generate MOVs when they are not in the same register, but that is easy to solve.
3) Is it possible for you to change the data layout? To me, using a pair {X, Y} of float4 instead of four float2 seems OK.
One last thing: how slow is it? What does the input data look like? How large are the streams?
Here is a piece of my code. If everything works, it should perform about twice as fast:
kernel void kernel_brook1(int nCha, int pSize, int gSize, int AF, int nPatLin, int nCol, int nLines, int halbpSize, int firstToSkip, int oWidth, int iWidth, int wWidth, int nChapSize, int SkipLines, float4 dataInX[][], float4 dataInY[][], float4 WsIX[][], float4 WsIY[][], out float2 dataOut<>) {
    float2 res = float2(0.0f,0.0f);
    int2 pos = instance().xy;
    float4 xX,xY,wX,wY;
    int Y = pos.y / 4;
    int X = pos.y%4*oWidth + pos.x;//(pos.y - Y * 4)*oWidth + pos.x;
    int cntG = Y / gSize;
    int cntAF = Y - gSize * cntG;
    int cntCha = X / nCol;
    float cntP = X%nCol; //X - cntCha*nCol;
    float dataN = nChapSize/4; // number of float4 loop iterations (source samples / 4)
    float4 k = 0; // k.x = loop counter, k.y = weights index, k.z = input index, k.w = row
    //compute start index in weights matrix
    k.y = nChapSize *gSize * cntCha + nChapSize * cntAF;//vvvvv*******
    //compute start index in input matrix
    if(cntG >= firstToSkip)cntG = cntG + SkipLines;
    k.z = nCha * (cntG - halbpSize + 1);
    k.w = cntP;
    // loop on k.x: k.w holds the row index and never changes, so it cannot be the loop counter
    while(k.x < dataN){
        wX = WsIX[k.wy];
        wY = WsIY[k.wy];
        xX = dataInX[k.wz];
        xY = dataInY[k.wz];
        res.y += dot(wY, xX) + dot(wX, xY);
        res.x += dot(wX, xX) - dot(wY, xY);
        k.xyz += 1;
    }
    dataOut = res;
}
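The layout change from point 3 can be checked on the CPU before touching the GPU code. The following is a minimal C sketch, not Brook+ code: the names cfloat, split_planes, dot4, cdot_interleaved and cdot_planar are made up for this illustration, and dot4 merely stands in for the kernel's dot() on float4. It assumes the sample count is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved complex sample, mirroring a Brook+ float2 (x = real, y = imaginary). */
typedef struct { float x, y; } cfloat;

/* Reference: complex dot product over interleaved float2-style data. */
static cfloat cdot_interleaved(const cfloat *w, const cfloat *d, size_t n)
{
    cfloat res = {0.0f, 0.0f};
    for (size_t i = 0; i < n; ++i) {
        res.x += w[i].x * d[i].x - w[i].y * d[i].y;
        res.y += w[i].y * d[i].x + w[i].x * d[i].y;
    }
    return res;
}

/* Split interleaved samples into separate real (X) and imaginary (Y) planes. */
static void split_planes(const cfloat *in, float *re, float *im, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        re[i] = in[i].x;
        im[i] = in[i].y;
    }
}

/* 4-wide dot product; hypothetical stand-in for dot(float4, float4). */
static float dot4(const float *a, const float *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Same complex dot product on the planar layout, four samples per step. */
static cfloat cdot_planar(const float *wX, const float *wY,
                          const float *xX, const float *xY, size_t n)
{
    cfloat res = {0.0f, 0.0f};
    assert(n % 4 == 0); /* assumption: length is a multiple of 4 */
    for (size_t i = 0; i < n; i += 4) {
        res.y += dot4(wY + i, xX + i) + dot4(wX + i, xY + i);
        res.x += dot4(wX + i, xX + i) - dot4(wY + i, xY + i);
    }
    return res;
}
```

Both functions compute the same sums in a different order, so on real data the results may differ in the last bits. Note also that one float4 element now covers four samples, so any start index computed in float2 units has to be divided by 4.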
Thank you very much for your help.
I get only 17 GFLOPS. The card is an HD 4870.
float2 dataIn {1664,256}
float2 WsI {6144,256}
float2 dataOut{2048,440}
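For reference, a throughput figure like the one above can be derived from the output size and the loop length. The sketch below is only an assumption about how such a number might be computed: it counts 8 floating-point operations per complex multiply-accumulate (4 multiplies plus 4 adds or subtracts, as in the res.x and res.y updates) and ignores the index arithmetic. The function name complex_mac_gflops and its parameters are hypothetical, and the loop length (nChapSize) and kernel time are not given in the thread.

```c
#include <assert.h>

/* Estimated GFLOPS for a kernel that performs n_mac complex
 * multiply-accumulates for each of out_w * out_h output elements.
 * Assumes 8 flops per complex MAC; index math is not counted. */
static double complex_mac_gflops(double out_w, double out_h,
                                 double n_mac, double seconds)
{
    assert(seconds > 0.0);
    return out_w * out_h * n_mac * 8.0 / seconds / 1e9;
}
```

For the 2048 x 440 output stream, this would be called as complex_mac_gflops(2048, 440, nChapSize, t) with the measured kernel time t in seconds.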
I use float2 because the data are complex numbers.
I have tried your code. It works nicely.
But when I set k.y = (nChapSize *gSize * cntCha + nChapSize * cntAF)/4, the execution time increased again.
Why?
Try:
k.y = (nChapSize *gSize * cntCha + nChapSize * cntAF)/4.0f;
As a general note, there are too many integer divisions, and they are the slowest type of operation. Here the code got faster after replacing all the integers with floats and using floor to compute the mod.
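The floor-based replacement for integer division and mod can be sketched in plain C as follows. This is an illustration under stated assumptions, not the code that was actually run: fdiv_floor, fmod_floor and floor_nonneg are hypothetical names, floor_nonneg stands in for the shader floor() and is only valid for non-negative values, and the results match the integer versions only while the operands are small enough to be exact in single precision. GPU float division may be less accurate than the CPU's, so the valid range should be checked on the target hardware.

```c
#include <assert.h>

/* floor() for non-negative inputs, done by truncation;
 * hypothetical stand-in for the shader's floor(). */
static float floor_nonneg(float v)
{
    assert(v >= 0.0f);
    return (float)(int)v;
}

/* Float replacement for integer division a / b (a >= 0, b > 0). */
static float fdiv_floor(float a, float b)
{
    assert(b > 0.0f);
    return floor_nonneg(a / b);
}

/* Float replacement for a % b: a - b * floor(a / b) (a >= 0, b > 0). */
static float fmod_floor(float a, float b)
{
    return a - b * fdiv_floor(a, b);
}
```

With these, an expression such as X % nCol becomes fmod_floor(X, nCol) on float operands. Dividing by 4.0f instead of 4 likewise keeps the whole expression in float, but without a floor the result is only an integer when the index is a multiple of 4.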