I have the attached kernel. Since the dataset is rather small after processing (the output is about one fifth of the input), I plan to increase the input size and to split the job across more than one GPU.

My questions are:

1. How do I combine the result data from different GPUs? Should I first copy it with the assign() operation to the GPU that will later process the results, and then use a kernel to concatenate the data into a larger stream?

2. Is there a faster way to do something similar?

Thank you

kernel void max_min_mean(float2 input[][], out float4 output<>)
{
    // Each output element reduces five consecutive input rows of one column.
    int2 index = instance().xy;
    int i0 = 5 * index.y;
    int i1 = i0 + 1;
    int i2 = i0 + 2;
    int i3 = i0 + 3;
    int i4 = i0 + 4;
    float mean;
    float temp0 = input[i0][index.x].x;
    float temp1 = input[i1][index.x].x;
    float temp2 = input[i2][index.x].x;
    float temp3 = input[i3][index.x].x;
    float temp4 = input[i4][index.x].x;
    float temp_max = temp0;
    float temp_min = temp0;
    // Running maximum over the five samples.
    temp_max = (temp_max > temp1) ? temp_max : temp1;
    temp_max = (temp_max > temp2) ? temp_max : temp2;
    temp_max = (temp_max > temp3) ? temp_max : temp3;
    temp_max = (temp_max > temp4) ? temp_max : temp4;
    // Running minimum over the five samples.
    temp_min = (temp_min < temp1) ? temp_min : temp1;
    temp_min = (temp_min < temp2) ? temp_min : temp2;
    temp_min = (temp_min < temp3) ? temp_min : temp3;
    temp_min = (temp_min < temp4) ? temp_min : temp4;
    mean = 0.2f * (temp0 + temp1 + temp2 + temp3 + temp4);
    // Pack mean, max, min and the .y value of the first row into the output.
    output = float4(mean, temp_max, temp_min, input[i0][index.x].y);
}

Why don't you combine the results on the CPU?

The "assign" operation copies the source stream to system memory and then copies the data to the destination stream, so it is not an efficient way to do this.
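A minimal sketch of what combining on the CPU could look like, assuming each GPU's output stream has already been read back into its own host buffer and that each GPU handled a contiguous block of rows in order. `Float4` and `concat_results` are hypothetical names used only for illustration; they are not part of Brook+.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for the kernel's float4 output element.
struct Float4 {
    float x, y, z, w;
};

// Concatenate per-GPU result buffers, in GPU order, into one host buffer.
// Assumes per_gpu[k] already holds the data read back from GPU k.
std::vector<Float4> concat_results(const std::vector<std::vector<Float4>>& per_gpu)
{
    // Count the total number of elements so the output is allocated once.
    std::size_t total = 0;
    for (const auto& part : per_gpu) {
        total += part.size();
    }

    std::vector<Float4> combined;
    combined.reserve(total);

    // Append each GPU's block after the previous one.
    for (const auto& part : per_gpu) {
        combined.insert(combined.end(), part.begin(), part.end());
    }
    return combined;
}
```

Because the output is only about one fifth of the input, this host-side copy is small compared with the input upload. If the input is split by rows, each GPU's share should be a multiple of five rows so that no group of five straddles two GPUs.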