Hi!
I am doing real-time image processing and have 2 GPUs in a laptop to work with (R9 M290X, 20 CUs each). I would like to give roughly half of the input rows of each image to each GPU and have both write to the same output buffer, so the halves are glued back together. The images are 2044x2044 (rows x columns), int16 or int32, single channel, stored in row-major order.
I tried to split this via 2 kernel calls to 2 queues created with a single shared context, halving global_work_size[1] and adding rows/2 to global_work_offset[1] for the second call. Both kernels read from the same clBuffer and write to the same clBuffer (src != dst). The output ranges are completely non-overlapping. The input ranges overlap slightly, like the overlap of a convolution kernel window: only 5 elements for a kernel dimension of 11.
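A minimal sketch of the row split described above, in plain Python so it runs without a GPU. The function name `split_rows` and its return layout are hypothetical (they are not from the original code); the halo of 5 rows is taken from the 11-wide window mentioned in the post.

```python
def split_rows(rows, n_devices, halo):
    """Split `rows` image rows into contiguous output ranges, one per device.

    Returns a list of (out_start, out_stop, in_start, in_stop) tuples.
    Output ranges never overlap. Each input range is widened by `halo` rows
    on both sides (clamped to the image) so that a window centred on an edge
    row can still read its neighbours.
    """
    base, extra = divmod(rows, n_devices)
    plan = []
    start = 0
    for d in range(n_devices):
        stop = start + base + (1 if d < extra else 0)
        plan.append((start, stop, max(0, start - halo), min(rows, stop + halo)))
        start = stop
    return plan


plan = split_rows(2044, 2, halo=5)
for out_start, out_stop, in_start, in_stop in plan:
    # out_start would become global_work_offset[1] and
    # out_stop - out_start would become global_work_size[1].
    print(out_start, out_stop, in_start, in_stop)
```

For a real launch the work size would usually still need rounding up to a multiple of the work-group height, which this sketch leaves out.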
I observe the following taking the best of 15 runs of 100 loops (ipython timeit):
Single GPU:
global work dims, global offset, group size:
GPUx: (2044, 2044) (0, 0) (32, 8)
~ 2ms with any single GPU - devices[0] or devices[1].
Multi GPU:
global work dims, global offset, group size:
GPU0: (2044, 1022) (0, 0) (32, 8)
GPU1: (2044, 1022) (0, 1022) (32, 8)
~ 3ms with both GPU devices and no shared buffers (I create dummy src and dst clBuffers for each individual kernel call for the sake of benchmarking this)
~10ms with both GPU devices and using the shared input clBuffer and shared output clBuffer (again output ranges are completely non-overlapping and input ranges are almost completely non-overlapping).
I expected a linear speedup (roughly 1 ms for two GPUs), so what gives? Instead the dual-GPU runs are 1.5x and 5x slower than a single GPU in the above experiments.
I googled the topic and mostly found ancient threads, plus a few bits relating mostly to NVIDIA's implementation.
I use events and wait on them only after both kernels have been submitted. I wait on both of them before going to the next loop of the benchmark.
The next experiment would be splitting this over multiple contexts, but supporting that in code looks like a pain, so I'd rather understand why the numbers are as they are before I go off on that. Running my benchmark program twice, simultaneously, with each instance targeting a different single GPU, does indeed show 2ms for each program individually.
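The arithmetic behind that comparison can be written out as a short sketch. The timings are the approximate figures quoted above; the variable names are only illustrative.

```python
# Approximate timings quoted in the post, in milliseconds.
single_ms = 2.0          # one GPU, full 2044x2044 image
dual_separate_ms = 3.0   # two GPUs, separate dummy src/dst buffers
dual_shared_ms = 10.0    # two GPUs, shared src/dst buffers
n_gpus = 2

# With perfect scaling each GPU does half the work in half the time.
ideal_ms = single_ms / n_gpus


def speedup(t_ms):
    """Speedup relative to the single-GPU run (values below 1 are slowdowns)."""
    return single_ms / t_ms


for label, t in [("ideal", ideal_ms),
                 ("separate buffers", dual_separate_ms),
                 ("shared buffers", dual_shared_ms)]:
    print("%-17s %5.1f ms  speedup %.2fx" % (label, t, speedup(t)))
```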
As noted in another thread, I do have to set the environment variable GPU_NUM_COMPUTE_RINGS=1 to get timings out of GPU0 that are on par with GPU1.
I think your problem is that you are writing to the same buffer, or at least that is what I understand from your description. Writing to the same buffer from two devices simultaneously gives undefined results. It is possible that the OpenCL runtime tries to prevent that, so it runs the first half on the first GPU, then moves the buffer to the second GPU and runs the second half.
Well, that doesn't quite put the nail in the coffin: writing to separate per-kernel destination clBuffers still takes 2.62 ms. I'd expect an increase like that if the cards were communicating with each other (which would be bad for these memory-bandwidth-limited kernels).
So I tried sharing the same context and using 2 source buffers, 2 destination buffers and 2 queues, with one kernel launch per queue and each queue on a different device. Splitting the work between them (output verified correct) still takes 2.x ms. Each half on its own (say, by not doing the second launch) takes only 960 usec, and those parts are correct. The operation is clearly linear in pixels, both theoretically and in benchmarking. I used clEnqueueMigrateMemObjects to migrate each source and destination buffer to the queue used for its kernel launch, to help ensure no GPU<->GPU communication is happening.
Here is some pyopencl-based source code detailing how this was done. Even if you can only read C, it should still be relatively straightforward to see what's happening.
import pyopencl as cl
import pyopencl.array as clarray
from time import time
import numpy as np
import os
from numpy import uint32, int32
from scipy.misc import imread
frame = imread('frame.png')[:, :, :3].copy()
gframe = green(frame)
gframe = gframe.astype(np.uint8)
img = np.tile(gframe, (1, 1))
img_dtype = img.dtype
dst_dtype = np.dtype(np.int16)
print devices
print queues
#abridged ....
class Response:
    def __init__(self, img_size, src_dtype, dst_dtype):
        #skip pull in of CL_SOURCE and CL_FLAGS generation, see output below for cflags
        self.program = cl.Program(ctx, CL_SOURCE).build(options=CL_FLAGS)
        self.kernel = self.program.response

    def make_input_buffer(self, queue):
        return clarray.empty(queue, self.img_size, dtype=self.src_dtype)

    def make_output_buffer(self, queue):
        return clarray.empty(queue, self.img_size, dtype=self.dst_dtype)

    def make_split_plan(self, queues):
        group_dims = self.get_group_dims()
        plan = VerticalSplitPlan(self.img_size[0], self.img_size[1], queues, group_dims[1], self.kernel_size - 1)
        return plan

    def get_group_dims(self):
        return (32, 8)

    def make_dims(self, vtile_range):
        group_dims = self.get_group_dims()
        h,w = self.img_size
        if vtile_range is None:
            nvert_tiles = divUp(h, group_dims[1])
            vtile_range = (0, nvert_tiles)
        gdims = roundUpToMultiple(w, group_dims[0]), (vtile_range[1] - vtile_range[0]) * group_dims[1]
        global_offset = 0, vtile_range[0] * group_dims[1]
        return gdims, group_dims, global_offset

    def __call__(self, queue, src_img, dst_img = None, wait_for = None, vtile_range = None):
        if dst_img is None:
            dst_img = self.make_output_buffer(queue)
        h,w = self.img_size
        gdims, group_dims, global_offset = self.make_dims(vtile_range)
        event = None
        self.kernel.set_args(np.int32(h), np.int32(w), src_img.data, np.uint32(src_img.strides[0]), np.uint32(3), dst_img.data, np.uint32(dst_img.strides[0]))
        event = cl.enqueue_nd_range_kernel(queue, self.kernel, gdims, group_dims, global_work_offset = global_offset, wait_for = wait_for)
        # return value assumed (missing from the abridged listing): core_loop unpacks a (buffer, event) pair
        return dst_img, event
platform = cl.get_platforms()[0]
devices = [device for device in platform.get_devices() if device.type == cl.device_type.GPU]
ctx = cl.Context(devices)
queues = [cl.CommandQueue(ctx, device, properties=cl.command_queue_properties.PROFILING_ENABLE | cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE) for device in devices]
response = Response(img.shape, img_dtype, dst_dtype)
plan = response.make_split_plan(queues)
cl_src_img = clarray.empty(ctx, img.shape, dtype=img_dtype)
cl_src_imgs = [response.make_input_buffer(queues[i]) for i in range(len(queues))]
cl_dst_imgs = [response.make_output_buffer(queues[i]) for i in range(len(queues))]
# the listing had an undefined name `queue` here; queues[0] is assumed
cl_dst_img = response.make_output_buffer(queues[0])
event = cl.enqueue_migrate_mem_objects(queues[0], [cl_dst_img.data, cl_src_img.data], cl.mem_migration_flags.HOST)
events = [event]
for i, (q, dst_img, src_img) in enumerate(zip(queues, cl_dst_imgs, cl_src_imgs)):
    event = cl.enqueue_migrate_mem_objects(q, [dst_img.data, src_img.data], cl.mem_migration_flags.CONTENT_UNDEFINED)
    events.append(event)
cl.wait_for_events(events)
events = None
def core_loop(is_blocking = True, wait_for = None):
    # per-device indexing with [i] restored; it appears to have been stripped from the posted listing
    results = [response(queues[i], cl_src_imgs[i], cl_dst_imgs[i], vtile_range = plan.tile_ranges[i], wait_for = wait_for) for i in range(len(queues))]
    _, events = zip(*results)
    if is_blocking:
        cl.wait_for_events(events)
    return events
%timeit -n 100 -r 15 core_loop(True);
[<pyopencl.Device 'Pitcairn' on 'AMD Accelerated Parallel Processing' at 0x167a210>, <pyopencl.Device 'Pitcairn' on 'AMD Accelerated Parallel Processing' at 0x1b19a00>]
[<pyopencl.Device 'Pitcairn' on 'AMD Accelerated Parallel Processing' at 0x167a210>, <pyopencl.Device 'Pitcairn' on 'AMD Accelerated Parallel Processing' at 0x1b19a00>]
[<pyopencl._cl.CommandQueue object at 0x7fa454c7c1b0>, <pyopencl._cl.CommandQueue object at 0x7fa454c7c208>]
compile flags: -I includes/ -x clc++ -cl-std=CL1.2 -D AMD_ARCH -D AMD_WAVEFRONT_SIZE=64 -D PIXELT=uchar -D SPIXELT=int -D LDSPIXELT=uint -D TILE_ROWS=8 -D TILE_COLS=32 -D USE_IMAGE2D=False
100 loops, best of 15: 2.03 ms per loop
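The abridged listing calls `divUp` and `roundUpToMultiple` without showing them. The sketch below assumes they are ordinary ceiling-division helpers (their real definitions are not in the post) and repeats the `make_dims` arithmetic as a free function, to show which launch dimensions it would produce for a 2044x2044 image with (32, 8) work-groups.

```python
def divUp(n, d):
    """Ceiling division for non-negative integers (assumed behaviour)."""
    return (n + d - 1) // d


def roundUpToMultiple(n, m):
    """Smallest multiple of m that is >= n (assumed behaviour)."""
    return divUp(n, m) * m


def make_dims(img_size, group_dims, vtile_range=None):
    """Same arithmetic as Response.make_dims, written as a free function."""
    h, w = img_size
    if vtile_range is None:
        vtile_range = (0, divUp(h, group_dims[1]))
    gdims = (roundUpToMultiple(w, group_dims[0]),
             (vtile_range[1] - vtile_range[0]) * group_dims[1])
    global_offset = (0, vtile_range[0] * group_dims[1])
    return gdims, group_dims, global_offset


full = make_dims((2044, 2044), (32, 8))
top = make_dims((2044, 2044), (32, 8), vtile_range=(0, 128))
bottom = make_dims((2044, 2044), (32, 8), vtile_range=(128, 256))
print(full)
print(top)
print(bottom)
```

Under these assumptions the launch is padded to 2048 in each dimension, so the kernel would need to bounds-check against the `h` and `w` arguments it receives.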
And it gets weirder. If I turn this 4MB (2044x2044) image into an 8MB image (2044x4088 or 4088x2044), or even an 8*4=32 MB image, by altering the np.tile line above, then I start seeing the reductions I'm looking for creeping in (3.xx vs 6.xx, 13.xx vs 26.xx). It seems that some overhead at the 4MB size keeps me from seeing the advantage of dual GPUs on a problem that is clearly able to use them. Any thoughts on what's causing this overhead?
Based on concurrently benchmarking the 2ms kernel on the 2 GPUs as separate programs, I figured that this overhead is tied to the context. If I don't share that context, I should see the speedup I am looking for, but that adds a lot of programmatic complexity, so I hope I can avoid it.
Have you tried using CodeXL for profiling? Seeing the timeline would really help show whether the kernels execute at least partially in parallel.
Not a break in the case, but I discovered that the timeit module adds significant overhead compared with timing manually:
import time
iters = 100 * 15
# one row per iteration: start, one timestamp after each kernel submission, end
times = np.zeros((iters, 2+len(queues)), np.double)
iter = 0
def core_loop(is_blocking = True, wait_for = None, timing = False):
    start = time.clock()
    # per-device indexing with [i] restored; it appears to have been stripped from the posted listing
    results = [response(queues[i], cl_src_imgs[i], cl_dst_imgs[i], vtile_range = plan.tile_ranges[i], wait_for = wait_for) + (time.clock(),) for i in range(len(queues))]
    _, events, ltimes = zip(*results)
    if is_blocking:
        cl.wait_for_events(events)
    if timing:
        global iter
        times[iter, :] = (start,) + ltimes + (time.clock(), )
        iter = iter + 1
    return events
loop_start = time.clock()
for x in range(iters):
    core_loop(True, timing=True)
loop_end = time.clock()
loop_total = loop_end - loop_start
loop_avg = (loop_total / iters)*1e6
# per-phase durations, then total per iteration, all in usec
timings = np.concatenate((np.diff(times, axis=1), np.diff(times[:, [0, -1]], axis=1)), axis=1) * 1e6
print timings
print loop_avg, np.average(timings[:, -1]), np.std(timings[:, -1])
The first print shows, per iteration, how long each kernel submission took to return, how long wait_for_events blocked, and the total time of the core_loop function body, all in usec. The next line gives the loop average per iteration, followed by the mean and standard deviation of the per-iteration totals, also in usec.
This produces timings such as the following for the single GPU case (the first column is the single kernel submission):
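To make the column layout concrete, here is a plain-Python sketch of what the `np.diff` / `np.concatenate` line does for one iteration. The function name `phase_durations_us` and the example timestamps are made up for illustration; they are not measurements.

```python
def phase_durations_us(timestamps):
    """Turn one row of timestamps (seconds) into per-phase durations (usec).

    timestamps = [start, after_submit_0, ..., after_submit_n, after_wait]
    Returns one gap per consecutive pair, followed by the total
    (last - first), mirroring the columns of the `timings` array.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    total = timestamps[-1] - timestamps[0]
    return [g * 1e6 for g in gaps] + [total * 1e6]


# Hypothetical dual-GPU iteration: two submissions, then the blocking wait.
row = [0.0, 0.000050, 0.000110, 0.000730]
durations = [round(x) for x in phase_durations_us(row)]
print(durations)
```

With one queue a row has three timestamps and yields three columns; with two queues it has four timestamps and yields four columns.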
[[ 67. 725. 792.]
[ 97. 674. 771.]
[ 99. 682. 781.]
...,
[ 98. 679. 777.]
[ 100. 673. 773.]
[ 98. 673. 771.]]
801.853333333 773.648 22.2983727956
For the dual GPU case (the first 2 columns are the 2 kernel submissions):
[[ 63. 62. 1603. 1728.]
[ 47. 59. 623. 729.]
[ 47. 58. 616. 721.]
...,
[ 48. 57. 1333. 1438.]
[ 48. 57. 1330. 1435.]
[ 47. 58. 1328. 1433.]]
1431.134 1409.48666667 91.305365791
So you can see that kernel execution should be overlapping, and that the performance of the dual case is pretty weird. I've profiled both cases with CodeXL and am attaching the results, along with a screenshot of the default column views. It's hard to glean what's happening in comparison with the figure I showed at the start of the thread.