cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

bjang
Journeyman III

reduction function which finds the index of max value in array

Hello,

I would like to write a reduction kernel which finds (or returns) the index of max value in array.

We have seen a beautiful reduction function which finds max value in array as follows by Ceq and by jski.

reduce void red(double a<>, reduce double b<>{
    if(b < a)   b = a;
}

reduce void red( float value<>, reduce float result<> ){
   result = max( value, value );
}

Now, I would like to find the index of that max value. Do you guys think it is possible with reduction kernel?

0 Likes
9 Replies

bjang,
This is where the vector properties of the graphic card come in very handy. This can actually be handled with very minimal changes to the code.
reduce void red(float a<>, reduce float4 b<> )
{

if(b.x < a)
{
b.x = a;
b.y = indexof(a).x;
b.z = indexof(a).y;
}
}
0 Likes

Micah, thanks very much for helping me.

BTW, I tried your suggestion as follows and got incorrect answer. I am suspecting that indexof() is not valid in reduction kernel.

====================================

reduce void red(float a<>, reduce float4 b<> {
    if(b.x > a) {
        b.x = a;
        b.y = indexof(a).x;
        b.z = indexof(a).y;
    }
}

int main(int argc, char* argv[])
{
    float input[8] = { 5, 2, 8, 6, 3, 7, 9, 4} ;
    float4 result;

    {
        float inputStream<8>;
        streamRead(inputStream, input);
        red(inputStream, result);
    }

    printf("result.x is %f \n", result.x);
    printf("result.y is %f \n", result.y);
    printf("result.z is %f \n", result.z);

    return 0;
}

========================

result.x is 2.000000

result.y is 2.000000

result.z is 2.000000

========================

0 Likes

Ceq,

I am very positive that you have a correct answer. Thanks.

BTW, do you have any other idea to do it efficiently on GPU, other than reduction kernel?

0 Likes
Ceq
Journeyman III

That would be a nice solution, however looks like there is a issues in the current Brook+ version:

reduce void red1(float a<>, reduce float4 b<> )

This doesn't work, it would just compute the maximum value.
Looks like only b.x value assignation is performed, so the result would be b.y = b.z = b.x = max(a<> );

reduce void red2(float4 a<>, reduce float4 b<> )

Now the assignment works properly, looks like there is a bug and both streams should be equal type.
However the result is wrong again, b.x = max(a<> ); b.y = b.z = 0;
Note that in the documentation section 6.1 states that indexof operator is not valid for reduction

Note: Only tested on Radeon 4850
0 Likes
Ceq
Journeyman III

Well, I have a workaround, however it's better to wait for Micah's answer.
Anyway, if you want to see it:

kernel void setIndex(float a<>, out float2 b<>) {
b.x = a;
b.y = indexof(a);
}

reduce void red1(float2 a<>, reduce float2 b<>) {
if(b < a) b = a;
}


int main(int argc, char* argv[]) {
float input[8] = { 5, 2, 8, 6, 3, 7, 9, 4};
float2 result = float2(0, 0);
{
float inputS1<8>;
float2 inputS2<8>;
streamRead(inputS1, input);
setIndex(inputS1, inputS2);
red1(inputS2, result);
}
printf("Max value is input[%.0f] = %f\n", result.y, result.x);
return 0;
}

Note that it would be better to use a separate index stream, but you can't use more than one input stream in a reduction.

---------------

Without using Brook+ reductions you would have to implement a multistage algorithm with gather, calling a kernel several times using domain operator, it's a bit complex, so I just write a example:

iterative_max_calc(float input[ ], float size, out float pos<>, out float max<>);

input1 : 5, 2, 8, 6, 3, 7, 9, 4
size: 8;
output1max : 5, 8, 7, 9
output1pos : 0, 2, 5, 6

input2 : output1max
size : 4;
output2max : 8, 9
output2pos : 2, 6

input3 : output2max
size : 2;
output3max : 9
output3pos : 6
0 Likes

Ahh yeah, I forgot that indexof cannot be used in a reduction kernel. The reason is because a reduction kernel is just a shortcut for doing a multi-stage algorithm, but the Brook+ Runtime handles the multiple stages for the developer. A seperate pass can be used to encode the index into a seperate component of vector is a solution similiar to what I was attempting to get across.

Hope this solved your problem.
0 Likes

Thanks, Ceq and Micah,

BTW, I keep wondering what is benefit of using reduction kernel in terms of performance. I couldn't find any information on performance of reduction kernel. Of course, it looks pretty just like recursive function in C.

The disassembled binary of reduction kernel looks pretty neat in GSA, seeming very fast. I didn't measure the performance yet though.

Thanks in advance.

0 Likes
nberger
Adept I

Hi!

Performance wise I am a bit disappointed in reductions; I sum over a 8192 element 1D-array. For numeric reasons, the sum has to be performed in doubles, and because reductions require identical types in in- and output, I need another "conversion" kernel. Together, these kernels are ~10x slower than copying the stream to host memory and summing with a simple for loop on the CPU. I also tested just summing floats, and even this is slower than the CPU.
Cheers

Nik
0 Likes

My problem is similar but I have to find minimum of my  3D stream and reduce it to 2D. As a result of result of reduction I sould have index of c-value from input1 stream.

reduce kernel void test2(float input1 <a,b,c> , reduce float output1<a,b> )
{
 
}

I don't really know how to deal with this. Could anyone help how to write in  Brook?

0 Likes