
splitting work between multiple GPU devices (image processing, embarrassingly parallel)

Question asked by jason on Jan 15, 2015
Latest reply on Jan 21, 2015 by jason

Hi!

 

I am doing image processing in real-time contexts and I have 2 GPUs in a laptop to work with (R9 M290Xs, 20 CUs each).  I would like to send roughly half of the input rows of each image to each GPU, have both devices write into the same output buffer, and glue the result back together.  The images are 2044x2044 (rows x columns), single channel, int16 or int32, stored in row-major order.

 

I tried to split this with two kernel calls on two queues created from a single shared context, halving global_work_size[1] and adding rows/2 to global_work_offset[1] for the second launch, with both kernels reading from the same cl buffer and writing to the same cl buffer (src != dst).  The output ranges are completely non-overlapping.  The input ranges overlap slightly (like a convolution window's overlap: only 5 rows, for a kernel dimension of 11).
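Stripped down, the shared-context version looks roughly like this (a PyOpenCL-flavoured sketch of what I'm doing; the kernel name half_filter, its copy-only body, and the rounding-up of the global size are placeholders/simplifications standing in for the real 11-tap filter code):

import numpy as np
import pyopencl as cl

ROWS, COLS = 2044, 2044          # single-channel image, row-major
HALF = ROWS // 2                 # 1022 rows per device

KERNEL_SRC = """
__kernel void half_filter(__global const int *src, __global int *dst,
                          const int cols, const int row_end)
{
    int col = get_global_id(0);
    int row = get_global_id(1);
    if (col >= cols || row >= row_end)   /* guard for the rounded-up global size */
        return;
    /* stand-in for the real 11-tap window; a plain copy keeps the sketch runnable */
    dst[row * cols + col] = src[row * cols + col];
}
"""

def round_up(x, m):
    return ((x + m - 1) // m) * m

platform = cl.get_platforms()[0]
gpus = platform.get_devices(device_type=cl.device_type.GPU)   # the two R9 M290X devices
ctx = cl.Context(gpus)                                        # one shared context
q0 = cl.CommandQueue(ctx, gpus[0])
q1 = cl.CommandQueue(ctx, gpus[1])

img = np.zeros((ROWS, COLS), dtype=np.int32)
mf = cl.mem_flags
src = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=img)   # shared input
dst = cl.Buffer(ctx, mf.WRITE_ONLY, size=img.nbytes)                 # shared output

prg = cl.Program(ctx, KERNEL_SRC).build()
lsize = (32, 8)
gsize = (round_up(COLS, lsize[0]), round_up(HALF, lsize[1]))

# Top half on GPU0, bottom half on GPU1; the written row ranges are disjoint.
e0 = prg.half_filter(q0, gsize, lsize, src, dst,
                     np.int32(COLS), np.int32(HALF), global_offset=(0, 0))
e1 = prg.half_filter(q1, gsize, lsize, src, dst,
                     np.int32(COLS), np.int32(ROWS), global_offset=(0, HALF))
cl.wait_for_events([e0, e1])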

 

I observe the following, taking the best of 15 runs of 100 loops (IPython timeit):

Single GPU:

global work dims, global offset, group size:

GPUx: (2044, 2044) (0, 0) (32, 8)


~2 ms with either single GPU, devices[0] or devices[1].

 

Multi GPU:

global work dims, global offset, group size:

GPU0: (2044, 1022) (0, 0) (32, 8)

GPU1: (2044, 1022) (0, 1022) (32, 8)

 

~3 ms with both GPU devices and no shared buffers (for the sake of benchmarking this I create dummy src and dst cl buffers for each individual kernel call).

~10 ms with both GPU devices using the shared input and shared output cl buffers (again, the output ranges are completely non-overlapping and the input ranges are almost completely non-overlapping).
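To be concrete about the "no shared buffers" case: each launch gets its own freshly created, never-read-back dummy buffers (names carried over from the sketch above):

# Dummy per-launch buffers: no buffer object is touched by more than one device.
src0 = cl.Buffer(ctx, mf.READ_ONLY,  size=img.nbytes)
dst0 = cl.Buffer(ctx, mf.WRITE_ONLY, size=img.nbytes)
src1 = cl.Buffer(ctx, mf.READ_ONLY,  size=img.nbytes)
dst1 = cl.Buffer(ctx, mf.WRITE_ONLY, size=img.nbytes)

e0 = prg.half_filter(q0, gsize, lsize, src0, dst0,
                     np.int32(COLS), np.int32(HALF), global_offset=(0, 0))
e1 = prg.half_filter(q1, gsize, lsize, src1, dst1,
                     np.int32(COLS), np.int32(ROWS), global_offset=(0, HALF))
cl.wait_for_events([e0, e1])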

 

I expected a roughly linear speedup; what gives?  Instead I'm getting worse than the single-GPU time: 1.5x and 5x slower in the two experiments above.

 

I googled the topic and mostly found ancient threads, but I did turn up a few bits, mostly relating to NVIDIA's implementation:

 

https://devtalk.nvidia.com/default/topic/473251/cuda-programming-and-performance/single-vs-multiple-contexts-with-multip…

 

I use events and wait on them only after both kernels have been submitted; I wait on both of them before going to the next loop of the benchmark.
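In other words, each benchmark iteration is roughly this (continuing with the names from the first sketch):

def one_pass():
    # enqueue on both queues before blocking on anything
    e0 = prg.half_filter(q0, gsize, lsize, src, dst,
                         np.int32(COLS), np.int32(HALF), global_offset=(0, 0))
    e1 = prg.half_filter(q1, gsize, lsize, src, dst,
                         np.int32(COLS), np.int32(ROWS), global_offset=(0, HALF))
    # only now wait, on both events, before the next iteration
    cl.wait_for_events([e0, e1])

# timed in IPython with roughly: %timeit -n 100 -r 15 one_pass()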

 

The next experiment would be splitting this over multiple contexts, but supporting that in the code looks like a pain to carry through, so I'd rather gain some understanding of why the numbers are what they are before I go off and do it.  Running my benchmark program twice simultaneously, each instance targeting a different single GPU, does indeed show 2 ms for each program individually.
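For reference, my understanding of the minimum the multi-context version would need (again just a sketch, reusing gpus, img, mf and KERNEL_SRC from above): one context, program build and set of buffers per device, plus host-side gluing of the two output halves:

ctxs   = [cl.Context([d]) for d in gpus]                  # one context per device
queues = [cl.CommandQueue(c, c.devices[0]) for c in ctxs]
progs  = [cl.Program(c, KERNEL_SRC).build() for c in ctxs]

# Buffers cannot be shared across contexts, so each device gets its own copy of
# the input and its own half-sized output; the halves get glued back on the host.
srcs = [cl.Buffer(c, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=img) for c in ctxs]
dsts = [cl.Buffer(c, mf.WRITE_ONLY, size=img.nbytes // 2) for c in ctxs]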

 

As noted in another thread, I do have to set the environment variable GPU_NUM_COMPUTE_RINGS=1 to get good timings out of GPU0, on par with GPU1.
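(For completeness, I set it either in the shell that launches the benchmark or from Python before the OpenCL runtime gets loaded, along the lines of:)

import os
os.environ["GPU_NUM_COMPUTE_RINGS"] = "1"   # must be in the environment before the runtime initializes
import pyopencl as cl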
