
Archives Discussions

riza_guntur
Journeyman III

Help, how to optimize this code?

I have completed the code and it works correctly, but it runs very slowly compared to the CPU backend or CPU-only code. If I can make it run faster, I might go for bigger problems with a 3D stream for storing images.

The following is my code (I really need help). Thanks for your cooperation.

#include "brookgenfiles/all_functions_first.h"
#include <ctime>
#include <iostream>
#include <iomanip>
#include <fstream>

using namespace std;
using namespace brook;

unsigned int jumlahData = 480;
unsigned int jumlahDataSatuOutput = 80;
unsigned int jumlahDiSatuGrup = 5;
unsigned int jumlahDimensi = 16;
unsigned int jumlahOutput = jumlahData / jumlahDataSatuOutput;
unsigned int yA = jumlahData;
unsigned int yB = yA / jumlahDiSatuGrup;
unsigned int yC = 1; // how many last columns to be ignored in the input file
unsigned int streamSize[] = {jumlahDimensi, yA};
unsigned int streamSizeReduce[] = {jumlahDimensi, yB};
unsigned int streamSizeReduceRef[] = {jumlahDimensi, jumlahOutput};
unsigned int streamSizeMinOfVecCluster[] = {1, jumlahOutput};
unsigned int streamSizeMaxOfMin[] = {1, 1};
float alpha = 0.05f;
float beta = 0.05f;
float gamma = 0.05f;
unsigned short rank[3] = {0, 1, 2};
int num_of_epoch = 1000;

Stream<float4> *input_to_fuzzy(float2 *input_array, unsigned int rank,
                               unsigned int *input_stream_size, unsigned int *fuzzy_stream_size)
{
    Stream<float2> input(rank, input_stream_size);                              // stream of training input
    Stream<float4> *fuzzy_number = new Stream<float4>(rank, fuzzy_stream_size); // x for mean, y for max, z for min
    input.read(input_array);
    max_min_mean(input, *fuzzy_number);
    return fuzzy_number;
}

int main(int argc, char* argv[])
{
    printf("FNLVQ Program\n");
    float2 *temporary_input_container = new float2[jumlahDimensi * yA];
    float4 *fuzzy_number_array = new float4[jumlahDimensi * yB];
    float4 *vec_ref_array = new float4[jumlahDimensi * jumlahOutput];
    float4 *myu_array = new float4[jumlahDimensi * jumlahOutput];
    float4 *min_of_cluster_array = new float4[1 * jumlahOutput];
    float4 *winner_array = new float4[1 * 1];

    memset(temporary_input_container, 0, jumlahDimensi * yA * sizeof(float2));
    memset(fuzzy_number_array, 0, jumlahDimensi * yB * sizeof(float4));
    memset(vec_ref_array, 0, jumlahDimensi * jumlahOutput * sizeof(float4));
    memset(myu_array, 0, jumlahDimensi * jumlahOutput * sizeof(float4));
    memset(min_of_cluster_array, 0, 1 * jumlahOutput * sizeof(float4));
    memset(winner_array, 0, 1 * 1 * sizeof(float4));

    ifstream inFile;
    inFile.open("480x16.txt");
    if (!inFile)
    {
        cout << "Unable to open file";
        exit(1); // terminate with error
    }

    for (unsigned int i = 0; i < streamSize[1]; i++) // reading from file
    {
        for (unsigned int j = 0; j < streamSize[0] + yC; j++)
        {
            unsigned int index = i * streamSize[0] + j;
            unsigned int target = i / jumlahDataSatuOutput;
            float temp;
            if ((inFile >> temp) && (j < streamSize[0]))
            {
                temporary_input_container[index].x = temp; // read input
                temporary_input_container[index].y = (float) target;
            }
        }
    }
    inFile.close();

    Stream<float4> *fuzzy_number = input_to_fuzzy(temporary_input_container, rank[2], streamSize, streamSizeReduce);
    Stream<float4> vec_ref(rank[2], streamSizeReduceRef);       // stream of reference-vector clusters
    Stream<float4> myu(rank[2], streamSizeReduceRef);           // myu streams
    Stream<float4> vec_ref_next(rank[2], streamSizeReduceRef);
    Stream<float4> myu_min(rank[2], streamSizeMinOfVecCluster); // smallest myu in each reference-vector cluster against fuzzy_number
    Stream<float4> winner(rank[2], streamSizeMaxOfMin);         // biggest of the smallest myu

    copy4(*fuzzy_number, vec_ref);

    for (int epoch = 0; epoch < 1000; epoch++)
    {
        for (int row = 0; row < (int) yB; row++)
        {
            myufy(row, alpha, *fuzzy_number, vec_ref, myu);
            myu_min_all(myu, myu_min);
            myu_max_min(myu_min, winner);
            calc_vec_ref_next(row, alpha, winner, *fuzzy_number, vec_ref, vec_ref_next);
        }
        alpha = 0.9999f * alpha;
    }

    delete fuzzy_number;
    delete[] temporary_input_container;
    delete[] vec_ref_array;
    delete[] min_of_cluster_array;
    delete[] myu_array;
    delete[] winner_array;
    delete[] fuzzy_number_array;
    return 0;
}

kernel void max_min_mean(float2 input[][], out float4 output<>)
{
    int2 index = instance().xy;
    int i0 = 5 * index.y;
    int i1 = ++i0;
    int i2 = ++i1;
    int i3 = ++i2;
    int i4 = ++i3;
    float mean;
    float temp0 = input[i0][index.x].x;
    float temp1 = input[i1][index.x].x;
    float temp2 = input[i2][index.x].x;
    float temp3 = input[i3][index.x].x;
    float temp4 = input[i4][index.x].x;
    float temp_max = temp0;
    float temp_min = temp0;
    temp_max = (temp_max > temp1) ? temp_max : temp1;
    temp_max = (temp_max > temp2) ? temp_max : temp2;
    temp_max = (temp_max > temp3) ? temp_max : temp3;
    temp_max = (temp_max > temp4) ? temp_max : temp4;
    temp_min = (temp_min < temp1) ? temp_min : temp1;
    temp_min = (temp_min < temp2) ? temp_min : temp2;
    temp_min = (temp_min < temp3) ? temp_min : temp3;
    temp_min = (temp_min < temp4) ? temp_min : temp4;
    mean = 0.2f * (temp0 + temp1 + temp2 + temp3 + temp4);
    output = float4(mean, temp_max, temp_min, input[i0][index.x].y);
}

kernel void max_min_median(float2 input[][], out float4 output<>)
{
    int2 index = instance().xy;
    int i0 = 5 * index.y;
    int i1 = ++i0;
    int i2 = ++i1;
    int i3 = ++i2;
    int i4 = ++i3;
    float mid;
    float temp0 = input[i0][index.x].x;
    float temp1 = input[i1][index.x].x;
    float temp2 = input[i2][index.x].x;
    float temp3 = input[i3][index.x].x;
    float temp4 = input[i4][index.x].x;
    float temp_max = temp0;
    float temp_min = temp0;
    temp_max = (temp_max > temp1) ? temp_max : temp1;
    temp_max = (temp_max > temp2) ? temp_max : temp2;
    temp_max = (temp_max > temp3) ? temp_max : temp3;
    temp_max = (temp_max > temp4) ? temp_max : temp4;
    temp_min = (temp_min < temp1) ? temp_min : temp1;
    temp_min = (temp_min < temp2) ? temp_min : temp2;
    temp_min = (temp_min < temp3) ? temp_min : temp3;
    temp_min = (temp_min < temp4) ? temp_min : temp4;
    mid = 0.5f * (temp_max + temp_min);
    output = float4(mid, temp_max, temp_min, input[i0][index.x].y);
}

kernel void copy4(float4 input<>, out float4 output<>)
{
    output = input;
}

kernel void myufy(int row, float a, float4 input_fuzzy_numbers[][], float4 vec_ref<>, out float4 myu<>)
{
    int column = instance().x;
    float4 fuzz1 = input_fuzzy_numbers[row][column];
    float4 fuzz2 = vec_ref;
    if (fuzz1.x <= fuzz2.x)
    {
        myu = float4(clamp((fuzz1.y - fuzz2.z) / (fuzz2.x - fuzz2.z + fuzz1.y - fuzz1.x), 0.0f, 1.0f), fuzz1.w, fuzz2.w, 0.0f);
    }
    else
    {
        myu = float4(clamp((fuzz2.y - fuzz1.z) / (fuzz1.x - fuzz1.z + fuzz2.y - fuzz2.x), 0.0f, 1.0f), fuzz1.w, fuzz2.w, 0.0f);
    }
}

reduce void myu_min_all(float4 myu<>, reduce float4 myu_min<>)
{
    if (myu.x < myu_min.x)
        myu_min = myu;
}

reduce void myu_max_min(float4 myu_min<>, reduce float4 winner<>)
{
    if (myu_min.x > winner.x)
        winner = myu_min;
}

kernel void calc_vec_ref_next(int row, float a, float4 winner[][], float4 input_fuzzy_numbers[][], float4 vec_ref<>, out float4 vec_ref_next<>)
{
    float k = 1.0f - a;
    float n = 0.01f * (1.0f - a);
    float4 temp0 = winner[1][1];
    float4 fuzz1 = input_fuzzy_numbers[row][instance().x];
    float4 fuzz2 = vec_ref;
    if (temp0.z != vec_ref.w)
    {
        vec_ref_next = vec_ref;
    }
    else
    {
        if (temp0.x == 0.0f)
        {
            vec_ref_next.xw = fuzz2.xw;
            vec_ref_next.y = fuzz2.x + 1.1f * (fuzz2.y - fuzz2.x);
            vec_ref_next.z = fuzz2.x - 1.1f * (fuzz2.x - fuzz2.z);
        }
        else
        {
            if (fuzz1.w != fuzz2.w)
            {
                vec_ref_next.x = fuzz2.x - a * (1.0f - temp0.x) * (fuzz1.x - fuzz2.x);
                vec_ref_next.y = fuzz2.y + (1.0f - temp0.x) * (1.0f - k) * (fuzz2.y - fuzz2.x);
                vec_ref_next.z = fuzz2.z - (1.0f - temp0.x) * (1.0f - k) * (fuzz2.x - fuzz2.z);
                vec_ref_next.w = fuzz2.w;
            }
            else
            {
                vec_ref_next.x = fuzz2.x + a * (1.0f - temp0.x) * (fuzz1.x - fuzz2.x);
                vec_ref_next.y = fuzz2.y + (1.0f + temp0.x) * (1.0f + n) * (fuzz2.y - fuzz2.x);
                vec_ref_next.z = fuzz2.z - (1.0f - temp0.x) * (1.0f - n) * (fuzz2.x - fuzz2.z);
                vec_ref_next.w = fuzz2.w;
            }
        }
    }
}

29 Replies
riza_guntur
Journeyman III

I wonder: why, if I add more iterations to the examples in Samples, are those kernels called faster than mine?


I think it needs a bump

Help


Can you tell which kernel is slow?


I'm sorry, I can't measure it precisely; I'm just using ctime.

From my measurements, myu_min_all and calc_vec_ref_next consistently eat some time, while myufy and myu_max_min rarely do.

copy4 doesn't have that problem.


Can you comment out all but one kernel and time each one individually? Then at least we know which one to work on first.
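A minimal host-side harness for that kind of per-kernel timing could look like the sketch below (plain C++; the kernel calls are placeholders you would fill in, and with Brook+ you would also fetch something back from the output stream inside the timed body, since kernel calls are queued and only forced to finish when their results are needed):

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: time a single kernel (or any callable) over many
// repetitions and return the total wall-clock time in milliseconds.
// The body you pass in should call one kernel and then force it to
// complete, e.g. by reading one element of its output stream back.
double time_ms(const std::function<void()>& body, int calls)
{
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < calls; ++i)
        body();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage idea (placeholders, not real calls):
//   double t = time_ms([&]{ myufy(row, alpha, fn, vec_ref, myu);
//                           myu.write(myu_array); }, 1000);
```

Doing one untimed warm-up call first also keeps the one-time runtime compilation of the kernel out of the measurement.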


I ran your code, made a blank input file, and got the following timing:

myufy: 14%

myu_min_all: 47%

myu_max_min: 25%

calc_vec_ref_next: 15%


Want to make sure I am reading your code right...

The intention of myu_min_all is to reduce a 2d stream <16,60> to another 2d stream <1,60>?


Yes. I want to find the smallest element using a reduction in myu_min_all, then reduce further in myu_max_min to find the largest of the smallest. Is there another way to do that in a single pass?

How do you profile it?


This probably doesn't affect the timing...  But if you are reducing a 1x16 strip, don't you want to change the order of the dimensions so you are reducing from <60,16> to <60,1> instead?


Originally posted by: hagen This probably doesn't affect the timing...  But if you are reducing a 1x16 strip, don't you want to change the order of the dimensions so you are reducing from <60,16> to <60,1> instead?

Because the 16 is a single cluster. I need to find the smallest myu element in that cluster. After that I need the vector whose minimum element is the biggest among the other clusters' smallest elements.

Each row in vec_ref denotes a different cluster. That's why I do it that way. One thing I wonder: is reducing in the x dimension the same speed as reducing in the y dimension?


Originally posted by: hagen Want to make sure I am reading your code right...

The intention of myu_min_all is to reduce a 2d stream <16,60> to another 2d stream <1,60>?

I'm sorry, I told you wrong.

copy4(*fuzzy_number,vec_ref) takes the first elements of the 2d stream <16,60> into a stream that has <16,6> elements.

myu_min_all reduces the 2d stream <16,6> to <1,6>.

myu_max_min reduces <1,6> to <1,1>.

After calc_vec_ref_next I need to copy ONLY the updated row of vec_ref_next to vec_ref. But so far what I do is copy all the elements of vec_ref_next to vec_ref. The result is already correct, but I wonder if I can update ONLY the changed row and save the unneeded bandwidth.
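For reference, the intent of those two reductions has a direct CPU equivalent. The sketch below (a hypothetical helper in plain C++) treats myu as one row of membership values per cluster and returns the winning cluster index together with its value:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// CPU reference for the two-stage reduction:
//   myu_min_all : per cluster (row), keep the smallest membership value
//   myu_max_min : across clusters, keep the largest of those minima
std::pair<std::size_t, float>
max_of_row_mins(const std::vector<std::vector<float>>& myu)
{
    std::size_t winner = 0;
    float best = -1.0f;                  // memberships are clamped to [0,1]
    for (std::size_t c = 0; c < myu.size(); ++c) {
        float row_min = myu[c][0];
        for (float v : myu[c])           // stage 1: min inside one cluster
            if (v < row_min) row_min = v;
        if (row_min > best) {            // stage 2: max over the row minima
            best = row_min;
            winner = c;
        }
    }
    return {winner, best};
}
```

For the <16,6> case in this thread that is fewer than a hundred comparisons in total, which is why per-launch overhead, not arithmetic, dominates on the GPU.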


I see.  I did the division wrong.  480/80 = 6, not 60!

So that's the reason: the matrix you are reducing (<16,6> to <1,6> in myu_min_all and <1,6> to <1,1> in myu_max_min) is just too small.  The time for memory fetch and data read/write can't be offset by the compute time.

You will certainly see better performance with higher reduction ratios (160:1 or even 1600:1).  If you really only want to reduce 16:1, you don't want to do it with a reduction kernel.


Originally posted by: hagen I see.  I did the division wrong.  480/80 = 6, not 60!

So that's the reason: the matrix you are reducing (<16,6> to <1,6> in myu_min_all and <1,6> to <1,1> in myu_max_min) is just too small.  The time for memory fetch and data read/write can't be offset by the compute time.

You will certainly see better performance with higher reduction ratios (160:1 or even 1600:1).  If you really only want to reduce 16:1, you don't want to do it with a reduction kernel.

I am more likely to be reducing <8192,100> to <1,100> in myu_min_all, so I think it doesn't need to be replaced.

For anything less than <1600,100> I think I will let CPU code with OpenMP regain the crown.

For myu_max_min, I think I need to replace it with a gather kernel. Does taking a constant as a function parameter make the kernel call slower? If I replace it with a gather kernel, I need a constant parameter controlling the loop.


If myu_max_min reduces <1,100> to <1,1>, it may be faster to do it on the CPU.  But reading the stream back to the CPU takes time, so you may just as well call a reduction kernel and do it on the GPU.  Even a 100:1 reduction might be OK (it shouldn't be slower than the CPU).  A gather kernel (you are thinking of one thread looping through 100 elements?) will certainly take more time.


Originally posted by: hagen If myu_max_min reduces <1,100> to <1,1>, it may be faster to do it on the CPU.  But reading the stream back to the CPU takes time, so you may just as well call a reduction kernel and do it on the GPU.  Even a 100:1 reduction might be OK (it shouldn't be slower than the CPU).  A gather kernel (you are thinking of one thread looping through 100 elements?) will certainly take more time.

OK, I'll keep it then. Thanks a lot.


I took a quick look at it; some notes:

1) Are the mean and median correct? Remember that ++i0 is different from i0+1 in the sense that it has side effects;

2) Is all the logic correct?

I have done a small optimization on mean, median and myufy, mostly for the branches; probably not enough to solve all your performance problems (note the "i + 1" where you had i0, and the two "i + 4": is this desired?):

kernel void max_min_mean(float2 input[][], out float4 output<>)
{
   int2 index = instance().xy;
   int i = index.y * 5;
   float temp0 = input[i + 1][index.x].x;
   float temp1 = input[i + 2][index.x].x;
   float temp2 = input[i + 3][index.x].x;
   float temp3 = input[i + 4][index.x].x;
   float temp4 = input[i + 4][index.x].x;
   float temp_max = max(max(max(temp0, temp1), max(temp2, temp3)), temp4);
   float temp_min = min(min(min(temp0, temp1), min(temp2, temp3)), temp4);
   float mean = 0.2f*(temp0+temp1+temp2+temp3+temp4);
   output = float4(mean,temp_max,temp_min,input[i + 1][index.x].y);
}

kernel void max_min_median(float2 input[][], out float4 output<>)
{
   int2 index = instance().xy;
   int i = index.y * 5;
   float temp0 = input[i + 1][index.x].x;
   float temp1 = input[i + 2][index.x].x;
   float temp2 = input[i + 3][index.x].x;
   float temp3 = input[i + 4][index.x].x;
   float temp4 = input[i + 4][index.x].x;
   float temp_max = max(max(max(temp0, temp1), max(temp2, temp3)), temp4);
   float temp_min = min(min(min(temp0, temp1), min(temp2, temp3)), temp4);
   float mean = 0.2f*(temp0+temp1+temp2+temp3+temp4);
   float mid = 0.5f*(temp_max+temp_min);
   output = float4(mid,temp_max,temp_min,input[i + 1][index.x].y);
}

kernel void myufy(int row, float a, float4 input_fuzzy_numbers[][], float4 vec_ref<>, out float4 myu<>)
{
   int column = instance().x;
   float4 fuzz1 = input_fuzzy_numbers[row][column];
   float4 fuzz2 = vec_ref;
   float diffx = fuzz2.x - fuzz1.x;
   float diffyz = diffx > 0.0f ? fuzz1.y - fuzz2.z : fuzz2.y - fuzz1.z;
   myu = float4(clamp(diffyz / abs(diffx) + 1.0f,0.0f,1.0f),fuzz1.w,fuzz2.w,0.0f);
}

I like the reduction syntax, but reductions are slow; you may try to replace them with normal streams. I will try... if I get some free time...
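Replacing a reduce kernel with normal streams usually means a multi-pass halving scheme. The sketch below shows the idea in plain C++, assuming the stream length is a power of two; on the GPU each pass would be one ordinary (non-reduce) kernel launch over a half-size stream, ping-ponging between two buffers:

```cpp
#include <cstddef>
#include <vector>

// Tree reduction by repeated halving, as a chain of ordinary "map" passes.
// Each pass compares element i with element i + half and keeps the smaller;
// after log2(n) passes the minimum sits in element 0.
float min_by_halving(std::vector<float> v)   // v.size() must be a power of two
{
    for (std::size_t half = v.size() / 2; half >= 1; half /= 2)
        for (std::size_t i = 0; i < half; ++i)
            v[i] = v[i] < v[i + half] ? v[i] : v[i + half];
    return v[0];
}
```

Whether this beats the built-in reduce depends on the size: for tiny streams the extra launches cost more than they save, which matches the launch-overhead discussion above.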

 


Originally posted by: eduardoschardong I took a quick look at it; some notes:

1) Are the mean and median correct? Remember that ++i0 is different from i0+1 in the sense that it has side effects;

2) Is all the logic correct?

I have done a small optimization on mean, median and myufy, mostly for the branches; probably not enough to solve all your performance problems (note the "i + 1" where you had i0, and the two "i + 4": is this desired?):

kernel void max_min_mean(float2 input[][], out float4 output<>)
{
   int2 index = instance().xy;
   int i = index.y * 5;
   float temp0 = input[i + 1][index.x].x;
   float temp1 = input[i + 2][index.x].x;
   float temp2 = input[i + 3][index.x].x;
   float temp3 = input[i + 4][index.x].x;
   float temp4 = input[i + 4][index.x].x;
   float temp_max = max(max(max(temp0, temp1), max(temp2, temp3)), temp4);
   float temp_min = min(min(min(temp0, temp1), min(temp2, temp3)), temp4);
   float mean = 0.2f*(temp0+temp1+temp2+temp3+temp4);
   output = float4(mean,temp_max,temp_min,input[i + 1][index.x].y);
}

kernel void max_min_median(float2 input[][], out float4 output<>)
{
   int2 index = instance().xy;
   int i = index.y * 5;
   float temp0 = input[i + 1][index.x].x;
   float temp1 = input[i + 2][index.x].x;
   float temp2 = input[i + 3][index.x].x;
   float temp3 = input[i + 4][index.x].x;
   float temp4 = input[i + 4][index.x].x;
   float temp_max = max(max(max(temp0, temp1), max(temp2, temp3)), temp4);
   float temp_min = min(min(min(temp0, temp1), min(temp2, temp3)), temp4);
   float mean = 0.2f*(temp0+temp1+temp2+temp3+temp4);
   float mid = 0.5f*(temp_max+temp_min);
   output = float4(mid,temp_max,temp_min,input[i + 1][index.x].y);
}

kernel void myufy(int row, float a, float4 input_fuzzy_numbers[][], float4 vec_ref<>, out float4 myu<>)
{
   int column = instance().x;
   float4 fuzz1 = input_fuzzy_numbers[row][column];
   float4 fuzz2 = vec_ref;
   float diffx = fuzz2.x - fuzz1.x;
   float diffyz = diffx > 0.0f ? fuzz1.y - fuzz2.z : fuzz2.y - fuzz1.z;
   myu = float4(clamp(diffyz / abs(diffx) + 1.0f,0.0f,1.0f),fuzz1.w,fuzz2.w,0.0f);
}

I like the reduction syntax, but reductions are slow; you may try to replace them with normal streams. I will try... if I get some free time...

Thanks for the correction; I didn't consider that those increments have side effects. But your optimization of myufy is incorrect. It should be:

myu = float4(clamp(diffyz / (abs(diffx) + diffyz), 0.0f, 1.0f), fuzz1.w, fuzz2.w, 0.0f);

which in SKA uses the same number of registers.
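For anyone following along, the corrected branchless form is algebraically the same as the original two-branch myufy. Here is a quick scalar check in plain C++ (the struct fields mirror the float4 components, with w omitted):

```cpp
#include <algorithm>
#include <cmath>

struct Fuzzy { float x, y, z; };   // x = mean/center, y = max, z = min

float clamp01(float v) { return std::min(std::max(v, 0.0f), 1.0f); }

// The original two-branch membership function
float myu_branchy(Fuzzy f1, Fuzzy f2)
{
    if (f1.x <= f2.x)
        return clamp01((f1.y - f2.z) / (f2.x - f2.z + f1.y - f1.x));
    return clamp01((f2.y - f1.z) / (f1.x - f1.z + f2.y - f2.x));
}

// The corrected branchless form: the denominator |diffx| + diffyz expands
// to exactly the denominator of whichever branch would have been taken.
float myu_branchless(Fuzzy f1, Fuzzy f2)
{
    float diffx  = f2.x - f1.x;
    float diffyz = diffx > 0.0f ? f1.y - f2.z : f2.y - f1.z;
    return clamp01(diffyz / (std::fabs(diffx) + diffyz));
}
```

(The two can differ in the last bits because the additions happen in a different order, but not beyond normal floating-point rounding.)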

The following is the new code, with corrected mean and median functions, as well as an arbitrary group size (the number of samples per fuzzy number).

kernel void max_min_mean(float2 input[][], out float4 output<>)
{
    int2 index = instance().xy;
    int i0 = 5 * index.y;
    int i1 = i0 + 1;
    int i2 = i1 + 1;
    int i3 = i2 + 1;
    int i4 = i3 + 1;
    float mean;
    float temp0 = input[i0][index.x].x;
    float temp1 = input[i1][index.x].x;
    float temp2 = input[i2][index.x].x;
    float temp3 = input[i3][index.x].x;
    float temp4 = input[i4][index.x].x;
    float temp_max = temp0;
    float temp_min = temp0;
    temp_max = (temp_max > temp1) ? temp_max : temp1;
    temp_max = (temp_max > temp2) ? temp_max : temp2;
    temp_max = (temp_max > temp3) ? temp_max : temp3;
    temp_max = (temp_max > temp4) ? temp_max : temp4;
    temp_min = (temp_min < temp1) ? temp_min : temp1;
    temp_min = (temp_min < temp2) ? temp_min : temp2;
    temp_min = (temp_min < temp3) ? temp_min : temp3;
    temp_min = (temp_min < temp4) ? temp_min : temp4;
    mean = 0.2f * (temp0 + temp1 + temp2 + temp3 + temp4);
    output = float4(mean, temp_max, temp_min, input[index.y][index.x].y);
}

kernel void var_max_min_mean(int group_size, float2 input[][], out float4 output<>)
{
    int2 index = instance().xy;
    int count;
    int i0 = group_size * index.y;
    float mean;
    float temp0 = input[i0][index.x].x;
    float temp_max = temp0;
    float temp_min = temp0;
    float sum = temp0;
    for (count = 1; count < group_size; count++)
    {
        temp0 = input[++i0][index.x].x;
        temp_max = (temp_max > temp0) ? temp_max : temp0;
        temp_min = (temp_min < temp0) ? temp_min : temp0;
        sum += temp0;
    }
    mean = sum / (float) group_size;
    output = float4(mean, temp_max, temp_min, input[index.y][index.x].y);
}

kernel void max_min_median(float2 input[][], out float4 output<>)
{
    int2 index = instance().xy;
    int i0 = 5 * index.y;
    int i1 = i0 + 1;
    int i2 = i1 + 1;
    int i3 = i2 + 1;
    int i4 = i3 + 1;
    float mid;
    float temp0 = input[i0][index.x].x;
    float temp1 = input[i1][index.x].x;
    float temp2 = input[i2][index.x].x;
    float temp3 = input[i3][index.x].x;
    float temp4 = input[i4][index.x].x;
    float temp_max = temp0;
    float temp_min = temp0;
    temp_max = (temp_max > temp1) ? temp_max : temp1;
    temp_max = (temp_max > temp2) ? temp_max : temp2;
    temp_max = (temp_max > temp3) ? temp_max : temp3;
    temp_max = (temp_max > temp4) ? temp_max : temp4;
    temp_min = (temp_min < temp1) ? temp_min : temp1;
    temp_min = (temp_min < temp2) ? temp_min : temp2;
    temp_min = (temp_min < temp3) ? temp_min : temp3;
    temp_min = (temp_min < temp4) ? temp_min : temp4;
    mid = 0.5f * (temp_max + temp_min);
    output = float4(mid, temp_max, temp_min, input[index.y][index.x].y);
}

kernel void var_max_min_median(int group_size, float2 input[][], out float4 output<>)
{
    int2 index = instance().xy;
    int count;
    int i0 = group_size * index.y;
    float mid;
    float temp0 = input[i0][index.x].x;
    float temp_max = temp0;
    float temp_min = temp0;
    for (count = 1; count < group_size; count++)
    {
        temp0 = input[++i0][index.x].x;
        temp_max = (temp_max > temp0) ? temp_max : temp0;
        temp_min = (temp_min < temp0) ? temp_min : temp0;
    }
    mid = 0.5f * (temp_max + temp_min);
    output = float4(mid, temp_max, temp_min, input[index.y][index.x].y);
}

kernel void copy4(float4 input<>, out float4 output<>)
{
    output = input;
}

kernel void myufy(int row, float a, float4 input_fuzzy_numbers[][], float4 vec_ref<>, out float4 myu<>)
{
    int column = instance().x;
    float4 fuzz1 = input_fuzzy_numbers[row][column];
    float4 fuzz2 = vec_ref;
    if (fuzz1.x <= fuzz2.x)
    {
        myu = float4(clamp((fuzz1.y - fuzz2.z) / (fuzz2.x - fuzz2.z + fuzz1.y - fuzz1.x), 0.0f, 1.0f), fuzz1.w, fuzz2.w, 0.0f);
    }
    else
    {
        myu = float4(clamp((fuzz2.y - fuzz1.z) / (fuzz1.x - fuzz1.z + fuzz2.y - fuzz2.x), 0.0f, 1.0f), fuzz1.w, fuzz2.w, 0.0f);
    }
}

reduce void myu_min_all(float4 myu<>, reduce float4 myu_min<>)
{
    if (myu.x < myu_min.x)
        myu_min = myu;
}

reduce void myu_max_min(float4 myu_min<>, reduce float4 winner<>)
{
    if (myu_min.x > winner.x)
        winner = myu_min;
}

kernel void calc_vec_ref_next(int row, float a, float4 winner[][], float4 input_fuzzy_numbers[][], float4 vec_ref<>, out float4 vec_ref_next<>)
{
    float k = 1.0f - a;
    float n = 0.01f * (1.0f - a);
    float4 temp0 = winner[1][1];
    float4 fuzz1 = input_fuzzy_numbers[row][instance().x];
    float4 fuzz2 = vec_ref;
    if (temp0.z != vec_ref.w)
    {
        vec_ref_next = vec_ref;
    }
    else
    {
        if (temp0.x == 0.0f)
        {
            vec_ref_next.xw = fuzz2.xw;
            vec_ref_next.y = fuzz2.x + 1.1f * (fuzz2.y - fuzz2.x);
            vec_ref_next.z = fuzz2.x - 1.1f * (fuzz2.x - fuzz2.z);
        }
        else
        {
            if (fuzz1.w != fuzz2.w)
            {
                vec_ref_next.x = fuzz2.x - a * (1.0f - temp0.x) * (fuzz1.x - fuzz2.x);
                vec_ref_next.y = fuzz2.y + (1.0f - temp0.x) * (1.0f - k) * (fuzz2.y - fuzz2.x);
                vec_ref_next.z = fuzz2.z - (1.0f - temp0.x) * (1.0f - k) * (fuzz2.x - fuzz2.z);
                vec_ref_next.w = fuzz2.w;
            }
            else
            {
                vec_ref_next.x = fuzz2.x + a * (1.0f - temp0.x) * (fuzz1.x - fuzz2.x);
                vec_ref_next.y = fuzz2.y + (1.0f + temp0.x) * (1.0f + n) * (fuzz2.y - fuzz2.x);
                vec_ref_next.z = fuzz2.z - (1.0f - temp0.x) * (1.0f - n) * (fuzz2.x - fuzz2.z);
                vec_ref_next.w = fuzz2.w;
            }
        }
    }
}

kernel void myufy_opt(int row, float a, float4 input_fuzzy_numbers[][], float4 vec_ref<>, out float4 myu<>)
{
    int column = instance().x;
    float4 fuzz1 = input_fuzzy_numbers[row][column];
    float4 fuzz2 = vec_ref;
    float diffx = fuzz2.x - fuzz1.x;
    float diffyz = diffx > 0.0f ? fuzz1.y - fuzz2.z : fuzz2.y - fuzz1.z;
    myu = float4(clamp(diffyz / (abs(diffx) + diffyz), 0.0f, 1.0f), fuzz1.w, fuzz2.w, 0.0f);
}


I'm confused: my myufy code takes 2 GPRs while my so-called optimized myufy_opt takes 3 GPRs. The two deploy the same number of threads.


Sorry for the mistake in myufy, I was too tired yesterday.

I wouldn't care much whether the number of GPRs is 2 or 3; there are 128 of them, and there will be enough threads. The problem I see in the original version is the branch: the opt version takes 11 ALU clauses and that's all, while the original version needs 11 ALU clauses if everything in a wavefront takes the same path, but 18 if they diverge. SKA fails to show that. Using min and max helps to reduce the number of ALU clauses.

Another note from looking at SKA: by changing the vec_ref parameter to a gather stream, although there is an extra ALU op, both SAMPLE instructions go together, which may give some speedup; again, SKA fails to show that.

 

Do you have sample input data?

 


Another note from looking at SKA: by changing the vec_ref parameter to a gather stream, although there is an extra ALU op, both SAMPLE instructions go together, which may give some speedup; again, SKA fails to show that.


Changing vec_ref to a gather stream in which kernel?

Probably in both calc_vec_ref_next and myufy_opt, since they share the same behaviour: they access the same sample, which I think can then be read directly from the cache. Am I correct?

Do you have sample input data?


Here it goes: the training sample. Copy it to Notepad and save it as 480x16.txt (or whatever you wish). Those are normalized values of an odour. The last column is the target, which is ignored by setting yC to 1.


Do you mean changing vec_ref to a gather stream like this:

kernel void calc_vec_ref_next(int row, float a, float4 winner[][], float4 input_fuzzy_numbers[][], float4 vec_ref[][], out float4 vec_ref_next<>)
{
    float k = 1.0f - a;
    float n = 0.01f * (1.0f - a);
    float4 temp0 = winner[1][1];
    float4 fuzz1 = input_fuzzy_numbers[row][instance().x];
    float4 fuzz2 = vec_ref[instance().y][instance().x];
    if (temp0.z != fuzz2.w)
    {
        vec_ref_next = fuzz2;
    }
    else
    {
        if (temp0.x == 0.0f)
        {
            vec_ref_next.xw = fuzz2.xw;
            vec_ref_next.y = fuzz2.x + 1.1f * (fuzz2.y - fuzz2.x);
            vec_ref_next.z = fuzz2.x - 1.1f * (fuzz2.x - fuzz2.z);
        }
        else
        {
            if (fuzz1.w != fuzz2.w)
            {
                vec_ref_next.x = fuzz2.x - a * (1.0f - temp0.x) * (fuzz1.x - fuzz2.x);
                vec_ref_next.y = fuzz2.y + (1.0f - temp0.x) * (1.0f - k) * (fuzz2.y - fuzz2.x);
                vec_ref_next.z = fuzz2.z - (1.0f - temp0.x) * (1.0f - k) * (fuzz2.x - fuzz2.z);
                vec_ref_next.w = fuzz2.w;
            }
            else
            {
                vec_ref_next.x = fuzz2.x + a * (1.0f - temp0.x) * (fuzz1.x - fuzz2.x);
                vec_ref_next.y = fuzz2.y + (1.0f + temp0.x) * (1.0f + n) * (fuzz2.y - fuzz2.x);
                vec_ref_next.z = fuzz2.z - (1.0f - temp0.x) * (1.0f - n) * (fuzz2.x - fuzz2.z);
                vec_ref_next.w = fuzz2.w;
            }
        }
    }
}

kernel void myufy_opt(int row, float a, float4 input_fuzzy_numbers[][], float4 vec_ref[][], out float4 myu<>)
{
    int column = instance().x;
    float4 fuzz1 = input_fuzzy_numbers[row][column];
    float4 fuzz2 = vec_ref[instance().y][column];
    float diffx = fuzz2.x - fuzz1.x;
    float diffyz = diffx > 0.0f ? fuzz1.y - fuzz2.z : fuzz2.y - fuzz1.z;
    myu = float4(clamp(diffyz / (abs(diffx) + diffyz), 0.0f, 1.0f), fuzz1.w, fuzz2.w, 0.0f);
}

Below is the test data. I will create the test phase around the 17th of August, as well as separate the main, training-step, and test-step code. Maybe I will replace calc_vec_ref_next's output with just vec_ref and run the code with BRT_PERMIT_READ_WRITE_ALIASING to allow it to run as a single kernel (maybe a bad thing? I don't know). If we can both read and write one stream, I'm pretty sure myufy and calc_vec_ref_next can merge; maybe the iterative kernel calls could be replaced entirely by one kernel. But there is still a problem with thread access: every wavefront would process it redundantly.


You might look at combining some of your kernels if you can. Apparently they don't use many GPRs? (Register pressure is not as big an issue as it is in CUDA.) You might be able to mask some of the register usage while increasing the ALU ops, reducing kernel overhead and increasing burst writes.

Honestly, I didn't look at your code; this is just a general suggestion.


Originally posted by: ryta1203 You might look at combining some of your kernels if you can. Apparently they don't use many GPRs? (Register pressure is not as big an issue as it is in CUDA.) You might be able to mask some of the register usage while increasing the ALU ops, reducing kernel overhead and increasing burst writes.

Honestly, I didn't look at your code; this is just a general suggestion.

It's possible to some extent using BRT_PERMIT_READ_WRITE_ALIASING. Since Brook+ doesn't support a large LDS, this is hard to overcome.


Is there no other way to speed it up further?

I mean, it's about 10 times faster than the CPU (a single-threaded Intel Pentium E4300), but anything that speeds it up more would be good.

Must I rearrange the data structure, perhaps? I don't know any other way.


I think the way you can get a performance improvement is by reducing the number of kernel calls, putting that whole loop inside a kernel.

I know this isn't so easy. I would try something if I had some spare time... end phase of a project at my work, you know...

 


You must be pretty busy. But here is what I have gotten so far.

In trying to optimize it, I have seen an improvement of about 10 percent by:

1. Using more gathers in myufy_opt. I used your idea of changing vec_ref to a gather stream.

2. Eliminating the copy4 at the end of the loop (which copied vec_ref_next back to vec_ref) by using BRT_PERMIT_READ_WRITE_ALIASING. The results are still correct.

I measured with jumlahOutput (number of outputs) 18 and jumlahDimensi 8192. The CPU time for jumlahOutput 18 and jumlahDimensi 16 is 1.141 s; with a big jumlahDimensi like 8192 it would be at least 512 times slower (around 584 seconds), while the test in my code takes only 35 seconds. Oh, I forgot to mention: num_of_epoch is 100.

So roughly 584 seconds : 35 seconds, about 16:1. I'm pretty sure on an HD 4890 (1 GHz core, 1050 MHz memory) there will be an even bigger improvement, since the 4850 lacks bandwidth.

Further, testing has shown that for a small jumlahDimensi (16), a gather kernel (looping from the first index to 16) is faster than a reduction kernel, unlike in myu_min_all. I'm pretty sure it will give an improvement for myu_max_min too.

The performance improvement against CPU code would, I think, be even greater than that.

That's as far as I can get. Sadly, calc_vec_ref_next suffers from a computational error. It will be fixed soon.


The code is fixed, the test cycle has been built, and it is finished.


I wonder: why, if I add more iterations to the examples in Samples, are those kernels called faster than mine?


Brook+ has to compile the generated IL assembly at runtime. But it does this only once, and uses the cached program information on subsequent iterations.

The intention of myu_min_all is to reduce a 2d stream <16,60> to another 2d stream <1,60>?


You won't see any performance improvement with such a small data set.


I changed the original stream size of riza_guntur's reduction from <16,60> to <1,60> into a reduction from <16,600> to <1,600>.  The timing is almost exactly 10x, so the runtime compilation of IL doesn't seem to account for the difference.

But I tried a third reduction, from <1600,60> to <1,60>.  The timing is not even 2x the original.

What is actually happening at the CAL/IL level in a reduction kernel?
