
spectral
Adept II

Merging data from several GPUs?

Hi,

I'm running a kernel on several GPUs and a CPU. This kernel simply updates some pixel colors.

The kernel works on a clTask that has 'x,y' coordinates, but several tasks can have the same pixel coordinates (even on the same GPU).

So, I'm searching for an efficient way to merge all these tasks' colors (from several GPUs and even from the CPU) into one buffer.

Do you have an idea how to do this?

 

Thanks

typedef struct { int x; int y; int4 color; } clTask;

laobrasuca
Journeyman III

You mean better than creating all the buffers (for all devices) as pointers to a single host memory buffer?
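Something like this, just as a sketch (I'm reusing your clTask layout and a 1000-task count; CL_MEM_USE_HOST_PTR is the flag I have in mind):

#include <CL/cl.h>

#define TASK_COUNT 1000

/* Host-side mirror of your clTask (the alignment of int4 must match the device side). */
typedef struct { cl_int x; cl_int y; cl_int4 color; } clTask;

/* One buffer created against the (multi-device) context, backed by a single
   host allocation: every device sees the same host memory through it. */
cl_mem create_shared_task_buffer(cl_context ctx, clTask *host_tasks, cl_int *err)
{
    return clCreateBuffer(ctx,
                          CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          TASK_COUNT * sizeof(clTask),
                          host_tasks,
                          err);
}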


Imagine I have 4 GPUs and each one has 1000 clTasks.

So, I have to compute the average of the colors:

for (int i = 0; i < taskCount; i++)
{
    /* average the color of task i across the four GPUs */
    color[i] = (gpu[0][i].color + gpu[1][i].color + gpu[2][i].color + gpu[3][i].color) / 4;
}

 

So, there are several methods:

1 - I retrieve all the tasks into RAM, then use the CPU to average (a sketch of this follows below)
2 - Maybe I can use a "shared" memory mechanism and execute a kernel to average everything
3 - Use an OpenGL texture to merge everything

I have no other idea! Maybe there is some way to transfer memory between GPUs without going through the host?
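For option 1, I imagine something like this (just a sketch; the queue, buffer and array names are placeholders, and it assumes task i refers to the same pixel in every GPU's buffer, as in my loop above):

#include <CL/cl.h>

#define NUM_GPUS   4
#define TASK_COUNT 1000

typedef struct { cl_int x; cl_int y; cl_int4 color; } clTask;   /* host mirror of the struct above */

/* Read every GPU's tasks back to host memory, then average on the CPU. */
void merge_on_cpu(cl_command_queue queue[NUM_GPUS], cl_mem buf[NUM_GPUS],
                  clTask tasks[NUM_GPUS][TASK_COUNT], cl_int4 *avg)
{
    for (int d = 0; d < NUM_GPUS; ++d)
        clEnqueueReadBuffer(queue[d], buf[d], CL_TRUE /* blocking */, 0,
                            TASK_COUNT * sizeof(clTask), tasks[d], 0, NULL, NULL);

    for (int i = 0; i < TASK_COUNT; ++i)
        for (int c = 0; c < 4; ++c)
            avg[i].s[c] = (tasks[0][i].color.s[c] + tasks[1][i].color.s[c] +
                           tasks[2][i].color.s[c] + tasks[3][i].color.s[c]) / 4;
}

The reads could also be enqueued non-blocking and waited on together, so the transfers from the different GPUs overlap.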


Originally posted by: viewon01 "Maybe there is some way to transfer memory between GPUs without going through the host?"

I believe that what I said above is something close to option 2, with the shared buffer being the host memory.

I've been observing the post and expecting someone more expert on this to comment (and maybe they will), but for now I think option 1 is the best one, especially if you have at least a 4-core processor. In that case you can dispatch each GPU job from a different thread and then, once everything is done, do the averaging on the CPU, since the amount of data seems to be quite low (unless it is large, in which case you could upload the data to one of your GPUs and do a reduction such as an average).

Option 3 will maybe require memory sharing between GPUs, and I think that in order to share memory between GPUs you need to pass through host memory, unless you do something like CrossFire. I don't think OpenCL can be used with a CrossFire setup, but OpenGL can, so maybe textures would work, although I don't know how the driver shares memory. Another option might be wglShareLists to share textures or buffer objects, but I don't know whether you can access a given GPU's memory from other GPUs.

By the way, how does clCreateBuffer act when you have more than one device in your context? Does it allocate the same amount of memory on each device? This function doesn't ask for the devices in its argument list, only the context.
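To make the question concrete, a sketch (as far as I understand, the cl_mem belongs to the context, and the runtime decides when and on which device physical memory gets allocated):

#include <CL/cl.h>

/* A context over several GPUs, and a single clCreateBuffer call:
   the buffer is created against the context, not against any one device. */
cl_mem make_context_buffer(cl_platform_id platform, size_t size, cl_int *err)
{
    cl_device_id devices[8];
    cl_uint num_devices = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &num_devices);

    cl_context ctx = clCreateContext(NULL, num_devices, devices, NULL, NULL, err);
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, err);
}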

 


I agree with laobrasuca that it's better to do the final merging on the CPU.

This is what we do in the reduction sample.
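Roughly, the shape of that approach (only a sketch of the general idea, not the actual sample code): each work-group produces one partial result on the GPU, and the few remaining values are combined on the CPU.

/* Each work-group reduces its chunk into one int4 partial sum;
   the (few) partial sums are then combined on the CPU. */
__kernel void partial_sum(__global const int4 *colors,
                          __global int4       *partial,   /* one entry per work-group */
                          __local  int4       *scratch,
                          const uint           n)
{
    uint gid = get_global_id(0);
    uint lid = get_local_id(0);

    scratch[lid] = (gid < n) ? colors[gid] : (int4)(0, 0, 0, 0);
    barrier(CLK_LOCAL_MEM_FENCE);

    /* tree reduction within the work-group */
    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}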

 


Thanks, but I'm not sure, because:

1 - I need to read all the buffers from the GPUs...

2 - I don't use OpenCL to compute the average, so there is no benefit from parallel processing

3 - I have to resend the new buffer

So, why not:

1 - Create a new buffer on the most powerful GPU
2 - Transfer each buffer to it, one by one
3 - Use OpenCL to average (a kernel sketch follows below)

Then I can continue to use this buffer.

The problem is phase (2), because it requires a read and a write! I suppose it is not possible to do this in one operation. Just as we read data from GPU memory to CPU memory, is there a way to read memory from one GPU to another GPU?
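For step 3, the kernel could look roughly like this (only a sketch; the names are placeholders, and it assumes all four task buffers are already on the chosen GPU and that index i means the same pixel in each of them):

typedef struct { int x; int y; int4 color; } clTask;

/* Average the colors of task i across the four per-GPU buffers. */
__kernel void average_tasks(__global const clTask *t0,
                            __global const clTask *t1,
                            __global const clTask *t2,
                            __global const clTask *t3,
                            __global clTask *out)
{
    size_t i = get_global_id(0);

    out[i] = t0[i];                         /* keep x,y from one of the buffers */
    out[i].color = (t0[i].color + t1[i].color +
                    t2[i].color + t3[i].color) / 4;
}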


I would suggest trying out both methods. If there are only a few elements in the final reduction, sequential CPU code can be at least as fast as the parallel approach.

The operation you ask about takes two copies: GPU -> CPU, and then CPU -> the other GPU.
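Something like this, as a sketch (the queue and buffer names are placeholders):

#include <CL/cl.h>
#include <stdlib.h>

/* Move the contents of src (on one GPU) to dst (on another GPU) by
   staging through host memory: one read, then one write. */
void copy_between_gpus(cl_command_queue queue_src, cl_mem src,
                       cl_command_queue queue_dst, cl_mem dst, size_t size)
{
    void *staging = malloc(size);

    clEnqueueReadBuffer (queue_src, src, CL_TRUE, 0, size, staging, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue_dst, dst, CL_TRUE, 0, size, staging, 0, NULL, NULL);

    free(staging);
}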
