
jason
Adept III

splitting work between multiple GPU devices (image processing, embarrassingly parallel)

Hi!

I am doing image processing in real-time contexts and I have 2 GPUs in a laptop to work with (R9 M290X's, 20 CUs each).  I would like to send roughly half of the input rows of each image to each GPU, have both write into the same output buffer, and glue the result back together.  The images are 2044x2044 (rows x columns), single-channel int16 or int32, stored in row-major order.

I tried to split this via 2 kernel calls on 2 queues created from a single shared context, shrinking global_work_size[1] /= 2 and setting global_work_offset[1] += rows/2, reading from the same clBuffer and outputting to the same clBuffer (src != dst).  The output ranges are completely non-overlapping.  The input ranges overlap slightly (like a convolution window's overlap - only 5 elements, with a kernel dimension of 11).
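In pseudo-summary, the launch pattern is essentially the following (a minimal pyopencl sketch; ctx, devices and kernel stand in for my real setup, the kernel's arguments are assumed to be set already, and I round the global sizes up to group-size multiples here for simplicity):

import pyopencl as cl

def round_up(x, m):
    return ((x + m - 1) // m) * m

ROWS, COLS = 2044, 2044
half = ROWS // 2
local_size = (32, 8)

# One shared context, one in-order queue per device.
queues = [cl.CommandQueue(ctx, d) for d in devices]

events = []
for i, q in enumerate(queues):
    # Each device gets half the rows: same global size, shifted offset along dim 1.
    global_size = (round_up(COLS, local_size[0]), round_up(half, local_size[1]))
    global_offset = (0, i * half)          # second GPU starts at row 1022
    evt = cl.enqueue_nd_range_kernel(q, kernel, global_size, local_size,
                                     global_work_offset=global_offset)
    events.append(evt)

# Wait only after both launches have been submitted.
cl.wait_for_events(events)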

I observe the following taking the best of 15 runs of 100 loops (ipython timeit):

Single GPU:

global work dims, global offset, group size:

GPUx: (2044, 2044) (0, 0) (32, 8)


~ 2ms with any single GPU - devices[0] or devices[1].

Multi GPU:

global work dims, global offset, group size:

GPU0: (2044, 1022) (0, 0) (32, 8)

GPU1: (2044, 1022) (0, 1022) (32, 8)

~ 3ms with both GPU devices and no shared buffers (I create dummy src and dst clBuffers for each individual kernel call for the sake of benchmarking this)

~10ms with both GPU devices and using the shared input clBuffer and shared output clBuffer (again output ranges are completely non-overlapping and input ranges are almost completely non-overlapping).

I expected close to a linear speedup - what gives?  Instead I'm getting worse than a single GPU: 1.5x and 5x slower in the two experiments above.

I googled the topic and mostly found ancient threads, but I did find a few bits relating mostly to NVIDIA's implementation:

https://devtalk.nvidia.com/default/topic/473251/cuda-programming-and-performance/single-vs-multiple-...

I use events and wait on them only after both kernels have been submitted.  I wait on both of them before going to the next loop of the benchmark.

The next experiment would be splitting this over multiple contexts, but that looks like a pain to support in code - I'd rather gain some understanding of why the numbers are the way they are before I go off and do that.  Running my benchmark program twice, simultaneously, each targeting a single, different GPU does indeed show 2ms for each program individually.

As noted in another thread, I do have to set the environment variable GPU_NUM_COMPUTE_RINGS=1 to get good timings out of GPU0, on par with GPU1.
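For reference, setting that variable from Python before the platform is created would look something like this (it has to be in the environment before the runtime initializes in the process):

import os

# Must be set before the AMD OpenCL runtime initializes in this process.
os.environ["GPU_NUM_COMPUTE_RINGS"] = "1"

import pyopencl as cl
platform = cl.get_platforms()[0]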

0 Likes
5 Replies
nou
Exemplar

I think your problem is that you are writing to the same buffer - or at least that is what I understand from your description. Simultaneous writes to the same buffer from two devices give undefined results. It is possible that the OpenCL runtime tries to prevent that, so it runs the first half on the first GPU, then moves the buffer to the second GPU and runs the second half.
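One way to keep a single logical output image while still giving each device its own cl_mem would be sub-buffers (clCreateSubBuffer); a rough pyopencl sketch, assuming a row-major int16 destination split at the halfway row (ctx is your shared context):

import numpy as np
import pyopencl as cl

rows, cols = 2044, 2044
row_bytes = cols * np.dtype(np.int16).itemsize
split_row = rows // 2   # the byte offset must satisfy the device's
                        # CL_DEVICE_MEM_BASE_ADDR_ALIGN requirement, so the
                        # split row may need rounding

dst = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, rows * row_bytes)

# Each device writes only into its own non-overlapping sub-buffer, so the
# runtime has nothing it might migrate back and forth between the GPUs.
dst_top = dst.get_sub_region(0, split_row * row_bytes)
dst_bottom = dst.get_sub_region(split_row * row_bytes, (rows - split_row) * row_bytes)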

0 Likes

Well, that doesn't quite put the nail in the coffin - writing to separate destination clBuffers per kernel call, 2.62 ms is attained.  I'd expect an increase like that if the cards were communicating with each other (which would be bad for these memory-bandwidth-limited kernels).

0 Likes
jason
Adept III

So I tried sharing the same context and using 2 source buffers, 2 destination buffers, and 2 kernel launches on 2 different queues for 2 different devices, and I've observed that splitting the work between them (output verified correct) still takes 2.x ms.  Each individual computation (say, by not doing the second launch) takes only 960 usec to complete - and those parts are correct.  The operation is clearly linear in pixels, both theoretically and in benchmarking.  I have used clEnqueueMigrateMemObjects on each of the respective source and destination buffers, to the respective queues used in their kernel launches, to help ensure no GPU<->GPU communication is happening.

Here is some pyopencl-based source code detailing how this was done; even if you can only read C, it should still be relatively straightforward to see what's happening:


import pyopencl as cl
import pyopencl.array as clarray
from time import time
import numpy as np
import os
from numpy import uint32, int32
from scipy.misc import imread


# green(), divUp(), roundUpToMultiple() and VerticalSplitPlan are helpers from
# the abridged portion of the source.
frame = imread('frame.png')[:, :, :3].copy()
gframe = green(frame)
gframe = gframe.astype(np.uint8)
img = np.tile(gframe, (1, 1))
img_dtype = img.dtype
dst_dtype = np.dtype(np.int16)


#abridged ....
class Response:

    def __init__(self, img_size, src_dtype, dst_dtype):
        #skip pull in of CL_SOURCE and CL_FLAGS generation, see output below for cflags
        self.img_size = img_size
        self.src_dtype = src_dtype
        self.dst_dtype = dst_dtype
        self.kernel_size = 11   # 11-tap window, per the overlap note above
        self.program = cl.Program(ctx, CL_SOURCE).build(options=CL_FLAGS)
        self.kernel = self.program.response

    def make_input_buffer(self, queue):
        return clarray.empty(queue, self.img_size, dtype=self.src_dtype)

    def make_output_buffer(self, queue):
        return clarray.empty(queue, self.img_size, dtype=self.dst_dtype)

    def make_split_plan(self, queues):
        group_dims = self.get_group_dims()
        plan = VerticalSplitPlan(self.img_size[0], self.img_size[1], queues, group_dims[1], self.kernel_size - 1)
        return plan

    def get_group_dims(self):
        return (32, 8)

    def make_dims(self, vtile_range):
        group_dims = self.get_group_dims()
        h, w = self.img_size
        if vtile_range is None:
            nvert_tiles = divUp(h, group_dims[1])
            vtile_range = (0, nvert_tiles)
        gdims = roundUpToMultiple(w, group_dims[0]), (vtile_range[1] - vtile_range[0]) * group_dims[1]
        global_offset = 0, vtile_range[0] * group_dims[1]
        return gdims, group_dims, global_offset

    def __call__(self, queue, src_img, dst_img = None, wait_for = None, vtile_range = None):
        if dst_img is None:
            dst_img = self.make_output_buffer(queue)
        h, w = self.img_size
        gdims, group_dims, global_offset = self.make_dims(vtile_range)
        self.kernel.set_args(np.int32(h), np.int32(w), src_img.data, np.uint32(src_img.strides[0]), np.uint32(3), dst_img.data, np.uint32(dst_img.strides[0]))
        event = cl.enqueue_nd_range_kernel(queue, self.kernel, gdims, group_dims, global_work_offset = global_offset, wait_for = wait_for)
        return dst_img, event   # (buffer, event) pair consumed by core_loop below


platform = cl.get_platforms()[0]
devices = [device for device in platform.get_devices() if device.type == cl.device_type.GPU]
ctx = cl.Context(devices)
queues = [cl.CommandQueue(ctx, device, properties=cl.command_queue_properties.PROFILING_ENABLE | cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE) for device in devices]

print devices
print queues

response = Response(img.shape, img_dtype, dst_dtype)
plan = response.make_split_plan(queues)

cl_src_img = clarray.empty(ctx, img.shape, dtype=img_dtype)
cl_src_imgs = [response.make_input_buffer(queues[i]) for i in range(len(queues))]
cl_dst_imgs = [response.make_output_buffer(queues[i]) for i in range(len(queues))]
cl_dst_img = response.make_output_buffer(queues[0])

# Migrate the per-device buffers to their respective queues up front so the
# benchmark loop does not measure any implicit GPU<->GPU migration.
event = cl.enqueue_migrate_mem_objects(queues[0], [cl_dst_img.data, cl_src_img.data], cl.mem_migration_flags.HOST)
events = [event]
for i, (q, dst_img, src_img) in enumerate(zip(queues, cl_dst_imgs, cl_src_imgs)):
    event = cl.enqueue_migrate_mem_objects(q, [dst_img.data, src_img.data], cl.mem_migration_flags.CONTENT_UNDEFINED)
    events.append(event)
cl.wait_for_events(events)
events = None


def core_loop(is_blocking = True, wait_for = None):
    results = [response(queues[i], cl_src_imgs[i], cl_dst_imgs[i], vtile_range = plan.tile_ranges[i], wait_for = wait_for) for i in range(len(queues))]
    _, events = zip(*results)
    if is_blocking:
        cl.wait_for_events(events)
    return events


%timeit -n 100 -r 15 core_loop(True);




[<pyopencl.Device 'Pitcairn' on 'AMD Accelerated Parallel Processing' at 0x167a210>, <pyopencl.Device 'Pitcairn' on 'AMD Accelerated Parallel Processing' at 0x1b19a00>]
[<pyopencl._cl.CommandQueue object at 0x7fa454c7c1b0>, <pyopencl._cl.CommandQueue object at 0x7fa454c7c208>]

compile flags: -I includes/ -x clc++ -cl-std=CL1.2 -D AMD_ARCH -D AMD_WAVEFRONT_SIZE=64 -D PIXELT=uchar -D SPIXELT=int -D LDSPIXELT=uint -D TILE_ROWS=8 -D TILE_COLS=32 -D USE_IMAGE2D=False

100 loops, best of 15: 2.03 ms per loop


And it gets weirder.  If I make this already 4 MB (2044x2044) image an 8 MB image (2044x4088 or 4088x2044), or even an 8*4 = 32 MB image, by altering the tile line above, then I start seeing the reductions I'm looking for creeping in (3.xx vs 6.xx ms, 13.xx vs 26.xx ms).  It seems some overhead exists at 4 MB images such that I'm not seeing the advantage of dual GPUs on a problem that is clearly able to use them.  Any thoughts on what's causing this overhead?

Based on concurrently benchmarking the 2 ms kernel in question over 2 GPUs at once as separate programs, I figure this overhead is tied to the context - so if I don't share a context, I should see the speedup I'm looking for, but that adds a lot of programmatic complexity, so I hope I can avoid it.
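For what it's worth, the multi-context setup itself is only a few lines; a minimal sketch (the real cost is that every context then needs its own program build, buffers, uploads and readbacks):

import pyopencl as cl

platform = cl.get_platforms()[0]
gpus = [d for d in platform.get_devices() if d.type == cl.device_type.GPU]

# One context + one queue per device: nothing is shared, so the runtime has
# no cross-device dependencies it could serialize or migrate on.
ctxs = [cl.Context([d]) for d in gpus]
queues = [cl.CommandQueue(c, d, properties=cl.command_queue_properties.PROFILING_ENABLE)
          for c, d in zip(ctxs, gpus)]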

0 Likes

Did you try using CodeXL for profiling? The timeline would really help to see whether the kernels execute at least partially in parallel.

0 Likes

Not a break in the case, but I discovered that the timeit module adds significant overhead compared to doing the timing manually:


import time

iters = 100 * 15
times = np.zeros((iters, 2 + len(queues)), np.double)
iter = 0

def core_loop(is_blocking = True, wait_for = None, timing = False):
    start = time.clock()
    results = [response(queues[i], cl_src_imgs[i], cl_dst_imgs[i], vtile_range = plan.tile_ranges[i], wait_for = wait_for) + (time.clock(),)
               for i in range(len(queues))]
    _, events, ltimes = zip(*results)
    if is_blocking:
        cl.wait_for_events(events)
    if timing:
        global iter
        times[iter, :] = (start,) + ltimes + (time.clock(),)
        iter = iter + 1
    return events


loop_start = time.clock()
for x in range(iters):
    core_loop(True, timing=True)
loop_end = time.clock()

loop_total = loop_end - loop_start
loop_avg = (loop_total / iters) * 1e6

# Per-iteration deltas between successive timestamps, plus total body time, in usec.
timings = np.concatenate((np.diff(times, axis=1), np.diff(times[:, [0, -1]], axis=1)), axis=1) * 1e6
print timings
print loop_avg, np.average(timings[:, -1]), np.std(timings[:, -1])


The output is all the timings: how long each kernel submission took to return, how long wait_for_events blocked, and the total core_loop function body time, in usec.  The final line is the per-iteration average in usec, followed by the mean and standard deviation of the body time.

This produces the following timings for the single-GPU case (the first column is the one kernel submission):

[[  67.  725.  792.]
 [  97.  674.  771.]
 [  99.  682.  781.]
 ...,
 [  98.  679.  777.]
 [ 100.  673.  773.]
 [  98.  673.  771.]]

801.853333333 773.648 22.2983727956


For the dual-GPU case (the first two columns are the two kernel submissions):

[[  63.   62.  1603.  1728.]
 [  47.   59.   623.   729.]
 [  47.   58.   616.   721.]
 ...,
 [  48.   57.  1333.  1438.]
 [  48.   57.  1330.  1435.]
 [  47.   58.  1328.  1433.]]

1431.134 1409.48666667 91.305365791


So you can see that the kernel executions should be overlapping and that the performance of the dual-GPU case is pretty weird. I've performed CodeXL profiling and am attaching the respective cases, along with a screenshot of the default column views.  It's hard to glean what's happening there compared with the figures I gave at the start of the thread.
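Since the queues are created with PROFILING_ENABLE, the per-event device timestamps can also separate device-side execution time from host-side overhead; a small sketch over the events core_loop returns, printing each device's submit->start latency and kernel execution time (the raw counters are per-device, so comparing absolute timestamps across the two GPUs is only approximate):

# Device-side timestamps in nanoseconds, via clGetEventProfilingInfo.
events = core_loop(True)
for i, evt in enumerate(events):
    p = evt.profile
    print "GPU%d: submit->start %6.1f us, start->end %6.1f us" % (
        i, (p.start - p.submit) * 1e-3, (p.end - p.start) * 1e-3)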

0 Likes