5 Replies Latest reply on Jan 21, 2015 10:38 PM by jason

    splitting work between multiple GPU devices (image processing, embarrassingly parallel)




      I am doing image processing in a real-time context and have 2 GPUs in a laptop to work with (R9 M290X, 20 CUs each).  I would like to send roughly half of the input rows of each image to each GPU, have them write to the same output buffer, and then glue the result back together. The images are 2044x2044 (rows x columns) of int16 or int32, single channel, stored in row-major order.


      I tried to split this via 2 kernel calls on 2 queues created with a single shared context, with global_work_size[1] /= 2 and global_work_offset[1] += rows/2 for the second call, reading from the same clBuffer and writing to the same clBuffer (src != dst).  The output ranges are completely non-overlapping.  The input ranges overlap slightly, like the window overlap of a convolution kernel: only 5 elements for a kernel dimension of 11.
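A minimal sketch of how such a row split could be planned, assuming rows are handed out in whole work-group tiles (the code later in the thread appears to round to tiles this way) and that the kernel bounds-checks rows past the end of the image. The names `div_up`, `split_rows` and `input_rows` are hypothetical helpers for illustration, not part of the original program.

```python
def div_up(n, d):
    """Ceiling division for positive integers."""
    return (n + d - 1) // d


def split_rows(rows, n_devices, group_rows):
    """Return one (global_rows, row_offset) pair per device.

    Rows are assigned in whole tiles of `group_rows` rows so that every
    launch size stays a multiple of the work-group height.
    """
    n_tiles = div_up(rows, group_rows)
    plan = []
    first_tile = 0
    for dev in range(n_devices):
        # Spread any remainder tiles over the first devices.
        count = n_tiles // n_devices + (1 if dev < n_tiles % n_devices else 0)
        plan.append((count * group_rows, first_tile * group_rows))
        first_tile += count
    return plan


def input_rows(global_rows, row_offset, rows, halo):
    """Half-open range [lo, hi) of input rows a device reads, halo included."""
    lo = max(0, row_offset - halo)
    hi = min(rows, row_offset + global_rows + halo)
    return lo, hi


if __name__ == "__main__":
    rows, halo = 2044, 5  # kernel dimension 11 gives a 5-row halo
    for global_rows, offset in split_rows(rows, 2, 8):
        print(global_rows, offset, input_rows(global_rows, offset, rows, halo))
```

With a group height of 8 this gives each device 1024 rows instead of the 1022 quoted above, because the last tile is padded; the output ranges stay disjoint and only the halo rows of the input are read by both devices.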


      I observe the following, taking the best of 15 runs of 100 loops (ipython timeit):

      Single GPU:

      global work dims, global offset, group size:

      GPUx: (2044, 2044) (0, 0) (32, 8)

      ~ 2ms with any single GPU - devices[0] or devices[1].


      Multi GPU:

      global work dims, global offset, group size:

      GPU0: (2044, 1022) (0, 0) (32, 8)

      GPU1: (2044, 1022) (0, 1022) (32, 8)


      ~ 3ms with both GPU devices and no shared buffers (I create dummy src and dst clBuffers for each individual kernel call for the sake of benchmarking this)

      ~10ms with both GPU devices and using the shared input clBuffer and shared output clBuffer (again output ranges are completely non-overlapping and input ranges are almost completely non-overlapping).


      I expected a linear speedup, so what gives?  Instead of getting faster, the two experiments above are 1.5x and 5x slower than a single GPU.
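For reference, the arithmetic behind those ratios can be written out as a small sketch. It uses only the timings reported above (2 ms single, 3 ms and 10 ms dual); the function name `slowdown` is a hypothetical helper.

```python
def slowdown(measured_ms, reference_ms):
    """How many times slower `measured_ms` is than `reference_ms`."""
    return measured_ms / reference_ms


if __name__ == "__main__":
    single_ms = 2.0
    ideal_dual_ms = single_ms / 2  # what a linear speedup over 2 GPUs would give
    for label, ms in [("separate buffers", 3.0), ("shared buffers", 10.0)]:
        print(label,
              "vs single GPU:", slowdown(ms, single_ms),
              "vs ideal dual:", slowdown(ms, ideal_dual_ms))
```

Measured against the ideal 1 ms rather than the single-GPU 2 ms, the gap is 3x and 10x.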


      I googled on the topic and mostly got ancient threads but I did find a few bits relating mostly to nvidia's implementation:




      I use events and wait on them only after both kernels have been submitted.  I wait on both of them before going to the next loop of the benchmark.


      The next experiment would be splitting this over multiple contexts, but supporting that in code looks like a pain to carry through, so I'd rather understand why the numbers are what they are before going down that path.  Running my benchmark program twice simultaneously, with each instance targeting a different single GPU, does indeed show 2ms for each program individually.


      As noted in another thread, I do have to set the environment variable GPU_NUM_COMPUTE_RINGS=1 to get timings out of GPU0 that are on par with GPU1.

        • Re: splitting work between multiple GPU devices (image processing, embarrassingly parallel)

          I think your problem is that you are writing to the same buffer, or at least that is what I understand from your description. Simultaneous writes to the same buffer from two devices give undefined results. It is possible that the OpenCL runtime tries to prevent that, so it runs the first half on the first GPU, then moves the buffer to the second GPU and runs the second half.

          • Re: splitting work between multiple GPU devices (image processing, embarrassingly parallel)

            So I tried sharing the same context and using 2 source buffers and 2 destination buffers, with 2 kernel launches on 2 different queues for 2 different devices, and I've observed that splitting the work between them (output verified correct) still takes 2.x ms.  Each individual computation (say, by not doing the second launch) takes only 960usec to complete, and those parts are correct.  The operation is clearly linear in pixels, both theoretically and in benchmarking.  I have used clEnqueueMigrateMemObjects to move each source and destination buffer to the queue used for its kernel launch, to help ensure no GPU<->GPU communication is happening.


            Here is some pyopencl-based source code detailing how this was done. Even if you can only read C, it should still be relatively straightforward to see what's happening.

            import pyopencl as cl

            import pyopencl.array as clarray

            from time import time

            import numpy as np

            import os

            from numpy import uint32, int32

            from scipy.misc import imread



            frame = imread('frame.png')[:, :, :3].copy()

            gframe = green(frame)

            gframe = gframe.astype(np.uint8)

            img = np.tile(gframe, (1, 1))

            img_dtype = img.dtype

            dst_dtype = np.dtype(np.int16)

            print devices

            print queues



            #abridged ....

            class Response:

                def __init__(self, img_size, src_dtype, dst_dtype):

                    #skip pull in of CL_SOURCE and CL_FLAGS generation, see output below for cflags

                    self.program = cl.Program(ctx, CL_SOURCE).build(options=CL_FLAGS)

                    self.kernel = self.program.response

                def make_input_buffer(self, queue):

                    return clarray.empty(queue, self.img_size, dtype=self.src_dtype)

                def make_output_buffer(self, queue):

                    return clarray.empty(queue, self.img_size, dtype=self.dst_dtype)

                def make_split_plan(self, queues):

                    group_dims = self.get_group_dims()

                    plan = VerticalSplitPlan(self.img_size[0], self.img_size[1], queues, group_dims[1], self.kernel_size - 1)

                    return plan

                def get_group_dims(self):

                    return (32, 8)

                def make_dims(self, vtile_range):

                    group_dims = self.get_group_dims()

                    h,w = self.img_size

                    if vtile_range is None:

                        nvert_tiles = divUp(h, group_dims[1])

                        vtile_range = (0, nvert_tiles)

                    gdims = roundUpToMultiple(w, group_dims[0]), (vtile_range[1] - vtile_range[0]) * group_dims[1]

                    global_offset = 0, vtile_range[0] * group_dims[1]

                    return gdims, group_dims, global_offset



                def __call__(self, queue, src_img, dst_img = None, wait_for = None, vtile_range = None):

                    if dst_img is None:

                        dst_img = self.make_output_buffer(queue)

                    h,w = self.img_size

                    gdims, group_dims, global_offset = self.make_dims(vtile_range)

                    event = None

                    self.kernel.set_args(np.int32(h), np.int32(w), src_img.data, np.uint32(src_img.strides[0]), np.uint32(3), dst_img.data, np.uint32(dst_img.strides[0]))

                    event = cl.enqueue_nd_range_kernel(queue, self.kernel, gdims, group_dims, global_work_offset = global_offset, wait_for = wait_for)

                    # return shape inferred from how core_loop unpacks the result

                    return dst_img, event




            platform = cl.get_platforms()[0]

            devices = [device for device in platform.get_devices() if device.type == cl.device_type.GPU]

            ctx = cl.Context(devices)

            queues = [cl.CommandQueue(ctx, device, properties=cl.command_queue_properties.PROFILING_ENABLE | cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE) for device in devices]

            response = Response(img.shape, img_dtype, dst_dtype)

            plan = response.make_split_plan(queues)

            cl_src_img = clarray.empty(ctx, img.shape, dtype=img_dtype)

            cl_src_imgs = [response.make_input_buffer(queues[i]) for i in range(len(queues))]

            cl_dst_imgs = [response.make_output_buffer(queues[i]) for i in range(len(queues))]

            cl_dst_img = response.make_output_buffer(queue)

            event = cl.enqueue_migrate_mem_objects(queues[0], [cl_dst_img.data, cl_src_img.data], cl.mem_migration_flags.HOST)

            events = [event]

            for i, (q, dst_img, src_img) in enumerate(zip(queues, cl_dst_imgs, cl_src_imgs)):

                event = cl.enqueue_migrate_mem_objects(q, [dst_img.data, src_img.data], cl.mem_migration_flags.CONTENT_UNDEFINED)



            events = None



            def core_loop(is_blocking = True, wait_for = None):

                results = [response(queues[i], cl_src_imgs[i], cl_dst_imgs[i], vtile_range = plan.tile_ranges[i], wait_for = wait_for) for i in range(len(queues[:]))]

                _, events = zip(*results)

                if is_blocking:

                    cl.wait_for_events(events)

                return events



            %timeit -n 100 -r 15 core_loop(True);



            [<pyopencl.Device 'Pitcairn' on 'AMD Accelerated Parallel Processing' at 0x167a210>, <pyopencl.Device 'Pitcairn' on 'AMD Accelerated Parallel Processing' at 0x1b19a00>]

            [<pyopencl.Device 'Pitcairn' on 'AMD Accelerated Parallel Processing' at 0x167a210>, <pyopencl.Device 'Pitcairn' on 'AMD Accelerated Parallel Processing' at 0x1b19a00>]

            [<pyopencl._cl.CommandQueue object at 0x7fa454c7c1b0>, <pyopencl._cl.CommandQueue object at 0x7fa454c7c208>]

            compile flags: -I includes/ -x clc++ -cl-std=CL1.2 -D AMD_ARCH -D AMD_WAVEFRONT_SIZE=64 -D PIXELT=uchar -D SPIXELT=int -D LDSPIXELT=uint -D TILE_ROWS=8 -D TILE_COLS=32 -D USE_IMAGE2D=False

            100 loops, best of 15: 2.03 ms per loop

            And it gets weirder.  If I make this 4MB (2044x2044) image an 8MB image (2044x4088 or 4088x2044), or even an 8*4=32MB image, by altering the tile line above, then I start seeing the reductions I'm looking for creeping in (3.xx vs 6.xx, 13.xx vs 26.xx).  It seems there is some overhead at 4MB images that keeps me from seeing the advantage of dual GPUs on a problem that clearly should be able to use them.  Any thoughts on what's causing this overhead?


            Based on concurrently benchmarking the 2ms kernel in question on 2 GPUs at once as separate programs, I figure this overhead is tied to the shared context, so if I don't share the context I should see the speedup I am looking for. But that adds a lot of programmatic complexity, so I hope I can avoid it.
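One way to frame the numbers in this post, as a rough sketch and not a diagnosis: assume the dual-GPU time is a fixed per-iteration overhead plus the single-GPU time divided by the device count. The fixed-overhead model and the function names are assumptions for illustration; only the 2.03 ms (dual) and 960 usec (one half alone) figures come from the measurements above.

```python
def implied_overhead_ms(dual_ms, per_device_ms):
    """Overhead left over once one device's share of the work is subtracted."""
    return dual_ms - per_device_ms


def predicted_dual_ms(single_ms, overhead_ms, n_devices=2):
    """Assumed model: fixed overhead plus an even share of the work."""
    return overhead_ms + single_ms / n_devices


def break_even_single_ms(overhead_ms, n_devices=2):
    """Single-GPU time above which splitting starts to pay off in this model."""
    return overhead_ms * n_devices / (n_devices - 1)


if __name__ == "__main__":
    overhead = implied_overhead_ms(2.03, 0.96)
    print("implied overhead (ms):", round(overhead, 2))
    print("break-even single-GPU time (ms):", round(break_even_single_ms(overhead), 2))
```

Under this assumed model the break-even point sits close to the roughly 2 ms single-GPU time of the 4MB image, which would be consistent with gains only appearing for the larger images. Whether the overhead is really constant is exactly what a profiler timeline would have to confirm.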

              • Re: splitting work between multiple GPU devices (image processing, embarrassingly parallel)

                Did you try using CodeXL for profiling? Seeing the timeline would really help show whether the kernels execute at least partially in parallel.

                  • Re: Re: splitting work between multiple GPU devices (image processing, embarrassingly parallel)

                    Not a break in the case, but I discovered that the timeit module added significant overhead compared with timing it manually:

                    import time

                    iters = 100 * 15

                    times = np.zeros((iters, 2+len(queues)), np.double)

                    iter = 0

                    def core_loop(is_blocking = True, wait_for = None, timing = False):

                        start = time.clock()

                        results = [response(queues[i], cl_src_imgs[i], cl_dst_imgs[i], vtile_range = plan.tile_ranges[i], wait_for = wait_for) + (time.clock(),) for i in range(len(queues[:]))]

                        _, events, ltimes = zip(*results)

                        if is_blocking:

                            cl.wait_for_events(events)

                        if timing:

                            global iter

                            times[iter, :] = (start,) + ltimes + (time.clock(), )

                            iter = iter + 1

                        return events


                    loop_start = time.clock()

                    for x in range(iters):

                        core_loop(True, timing=True)

                    loop_end = time.clock()

                    loop_total = loop_end - loop_start

                    loop_avg = (loop_total / iters)*1e6

                    timings = np.concatenate((np.diff(times, axis=1), np.diff(times[:, [0, -1]], axis=1)), axis=1) * 1e6

                    print timings

                    print loop_avg, np.average(timings[:, -1]), np.std(timings[:, -1])

                    The output is the time between each kernel submission returning, how long wait_for_events blocked, and the total time of the core_loop function body, all in usec.  The final line gives the loop average per iteration, the mean of the per-iteration totals, and their standard deviation, also in usec.
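To make the column layout concrete, here is a small pure-Python sketch of what the `np.diff` expression computes for one iteration. The function name `intervals_us` and the example timestamps are hypothetical; the stamps are assumed to be in seconds, ordered as start, after each submission, after the wait.

```python
def intervals_us(stamps_s):
    """Convert one row of timestamps (seconds) into microsecond intervals.

    Returns the gap between each pair of consecutive stamps, followed by the
    total from the first stamp to the last.
    """
    steps = [(b - a) * 1e6 for a, b in zip(stamps_s, stamps_s[1:])]
    total = (stamps_s[-1] - stamps_s[0]) * 1e6
    return steps + [total]


if __name__ == "__main__":
    # Made-up stamps: start, after one submission, after the wait.
    print(intervals_us([0.0, 0.5, 2.0]))
```

So a single-GPU row has 3 columns (submit, wait, total) and a dual-GPU row has 4 (submit, submit, wait, total), with the last column equal to the sum of the others. Note that `time.clock()` was removed in Python 3.8; `time.perf_counter()` is the usual replacement when running this kind of measurement on a current interpreter.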


                    This produces timings like these for the single GPU case (the first column is the single kernel submission):

                    [[  67.  725.  792.]

                    [  97.  674.  771.]

                    [  99.  682.  781.]


                    [  98.  679.  777.]

                    [ 100.  673.  773.]

                    [  98.  673.  771.]]

                    801.853333333 773.648 22.2983727956


                    For the dual GPU case (the first 2 columns are the 2 kernel submissions):

                    [[   63. 62.  1603.  1728.]
                    [   47. 59.   623.   729.]
                    [   47. 58.   616.   721.]


                    [   48. 57.  1333.  1438.]
                    [   48. 57.  1330.  1435.]
                    [   47. 58.  1328.  1433.]]

                    1431.134 1409.48666667 91.305365791


                    So you can see that the kernel executions should be overlapping and that the performance of the dual case is pretty weird. I've run CodeXL profiling and am attaching the respective cases.  I've also included a screenshot of the default column views.  It's hard to glean what's happening in comparison with the figure I showed at the start of the thread.