Journeyman III

How to split job to more than one gpu and then combine the result?

I have the attached kernel. Since the dataset is rather small after processing (the output is about one fifth of the input), I plan to increase the input size and split the job across more than one GPU.

My question is:

1. How do I combine the result data from different GPUs? Should I first copy it with the assign() operation to the GPU that will later process the results, and then use a kernel to concatenate the data into a larger stream?

2. Is there a faster way to do something similar?

Thank you

kernel void max_min_mean(float2 input[][], out float4 output<>)
{
    int2 index = instance().xy;

    // i0..i4 are five consecutive input rows. Use i0 + k rather than
    // chained ++ so that the earlier indices are not modified.
    int i0 = 5*index.y;
    int i1 = i0 + 1;
    int i2 = i0 + 2;
    int i3 = i0 + 3;
    int i4 = i0 + 4;
    float mean;

    float temp0 = input[i0][index.x].x;
    float temp1 = input[i1][index.x].x;
    float temp2 = input[i2][index.x].x;
    float temp3 = input[i3][index.x].x;
    float temp4 = input[i4][index.x].x;

    float temp_max = temp0;
    float temp_min = temp0;

    temp_max = (temp_max>temp1)?temp_max:temp1;
    temp_max = (temp_max>temp2)?temp_max:temp2;
    temp_max = (temp_max>temp3)?temp_max:temp3;
    temp_max = (temp_max>temp4)?temp_max:temp4;

    temp_min = (temp_min<temp1)?temp_min:temp1;
    temp_min = (temp_min<temp2)?temp_min:temp2;
    temp_min = (temp_min<temp3)?temp_min:temp3;
    temp_min = (temp_min<temp4)?temp_min:temp4;

    mean = 0.2f*(temp0+temp1+temp2+temp3+temp4);

    // Output layout: (mean, max, min, .y of the first row in the group)
    output = float4(mean,temp_max,temp_min,input[i0][index.x].y);
}
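For checking GPU output, a plain CPU reference can be handy. The following is only a sketch, assuming the kernel is intended to reduce input rows 5*y through 5*y+4 into output row y and that the input is stored row-major. The names `Float2`, `Float4`, and `max_min_mean_ref` are made up for illustration and are not part of Brook+.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-ins for the Brook+ float2/float4 types.
struct Float2 { float x, y; };
struct Float4 { float x, y, z, w; };

// CPU reference: each output row summarizes five consecutive input rows.
// `input` is row-major with `width` columns (an assumption of this sketch).
std::vector<Float4> max_min_mean_ref(const std::vector<Float2>& input,
                                     std::size_t width) {
    assert(width > 0 && input.size() % (5 * width) == 0);
    const std::size_t out_rows = input.size() / (5 * width);
    std::vector<Float4> output(out_rows * width);
    for (std::size_t oy = 0; oy < out_rows; ++oy) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t first = 5 * oy;  // first input row of the group
            float hi = input[first * width + x].x;
            float lo = hi;
            float sum = 0.0f;
            for (std::size_t k = 0; k < 5; ++k) {
                const float v = input[(first + k) * width + x].x;
                hi = std::max(hi, v);
                lo = std::min(lo, v);
                sum += v;
            }
            // Same layout as the kernel: (mean, max, min, .y of first row).
            output[oy * width + x] = {0.2f * sum, hi, lo,
                                      input[first * width + x].y};
        }
    }
    return output;
}
```

Comparing this against the stream read back from the GPU (with a small tolerance on the mean) is one way to confirm the row indexing before scaling up to several devices.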

5 Replies
Journeyman III

Why don't you combine the results on the CPU?

The "assign" operation copies the source stream to system memory and then copies the data to the destination stream, so it is not an efficient way to do this.
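Combining on the CPU can be as simple as concatenating the per-GPU results once they are in host memory. A minimal sketch, assuming each GPU's output has already been read back into its own host buffer (the function name `combine_on_host` and the use of plain `float` elements are illustrative assumptions):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Concatenate per-GPU result buffers, already read back to host memory,
// into one contiguous array in device order.
std::vector<float> combine_on_host(const std::vector<std::vector<float>>& parts) {
    std::size_t total = 0;
    for (const auto& p : parts) total += p.size();

    std::vector<float> combined;
    combined.reserve(total);  // one allocation for the whole result
    for (const auto& p : parts)
        combined.insert(combined.end(), p.begin(), p.end());

    assert(combined.size() == total);
    return combined;
}
```

Since the data passes through system memory either way, this costs one host-side copy and no extra kernel launch.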


Is combining results on the GPU faster?

I thought doing everything on the GPU would be faster.

Thanks for the information about the "assign" operation; I will change my code.

At first I thought "assign" worked something like a copy kernel. I did not know that it copies the source stream to system memory first and then copies it to the destination stream.


We are building a cluster with several GPUs on each node.  MPI would be our choice.


Originally posted by: hagen We are building a cluster with several GPUs on each node.  MPI would be our choice.

I just want to make a small desktop application that runs on multiple GPUs, not a cluster.

Anyhow, I am still new to this field and do not really understand MPI, beyond having looked up the Open MPI website.

Could you explain a bit more?


If you run on 1 cpu, then wgbljl's suggestion would be best: divide up the data on the cpu into several streams and distribute each piece to a different gpu (i.e. gpus don't share data).
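One detail when dividing the data: because the kernel consumes five input rows per output row, each piece should contain a whole number of five-row groups. A sketch of such a split, assuming the total row count is a multiple of five (`RowRange` and `split_rows` are hypothetical names, not Brook+ API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct RowRange { std::size_t begin, end; };  // input rows [begin, end)

// Divide `total_rows` among `num_gpus` so that every chunk is a whole number
// of `group`-row blocks (the kernel reads 5 input rows per output row).
std::vector<RowRange> split_rows(std::size_t total_rows, std::size_t num_gpus,
                                 std::size_t group = 5) {
    assert(num_gpus > 0 && group > 0 && total_rows % group == 0);
    const std::size_t groups = total_rows / group;
    const std::size_t base = groups / num_gpus;
    const std::size_t extra = groups % num_gpus;  // first `extra` chunks get one more group

    std::vector<RowRange> ranges;
    std::size_t row = 0;
    for (std::size_t g = 0; g < num_gpus; ++g) {
        const std::size_t n = (base + (g < extra ? 1 : 0)) * group;
        ranges.push_back({row, row + n});
        row += n;
    }
    return ranges;
}
```

Each range would then be copied into its own input stream on one GPU, and the outputs put back together in the same order.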

On the other hand, MPI allows coarse-grained parallelism. You can start several MPI processes on the same CPU and have each one access a different GPU.  This method is easily scalable to a multinode cluster.
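For a single desktop machine, the same one-worker-per-GPU layout can be sketched with plain host threads (`std::thread`) instead of MPI. This is only an illustration: `process_on_device` is a hypothetical stand-in that runs on the CPU, where a real worker would select its GPU, upload its slice, run the kernel, and read the result back.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for "upload slice, run kernel on GPU `device`, read result back".
// Here it just doubles each value on the CPU so the sketch is runnable.
void process_on_device(int device, const float* in, float* out, std::size_t n) {
    (void)device;  // a real worker would select its GPU here
    for (std::size_t i = 0; i < n; ++i) out[i] = 2.0f * in[i];
}

// Launch one host thread per device. Each thread owns a disjoint slice of
// the input and output, so no locking is needed.
std::vector<float> run_on_all_devices(const std::vector<float>& input,
                                      int num_devices) {
    assert(num_devices > 0);
    std::vector<float> output(input.size());
    std::vector<std::thread> workers;
    const std::size_t n = input.size();
    for (int d = 0; d < num_devices; ++d) {
        const std::size_t begin = n * d / num_devices;
        const std::size_t end = n * (d + 1) / num_devices;
        workers.emplace_back(process_on_device, d, input.data() + begin,
                             output.data() + begin, end - begin);
    }
    for (auto& w : workers) w.join();  // all slices are complete after this
    return output;
}
```

Because every worker writes directly into its own part of `output`, the results are already combined once the threads have joined. With MPI the structure is similar, except that each worker is a separate process and the pieces are gathered by message passing.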