

Combination of Task Parallelism and Data Parallelism.

In my image segmentation code, I have divided the image into 4 parts (i.e., if there are 4000 pixels, each part has 1000 pixels). I have 8 kernels in my code: the first 4 are to be executed in parallel, and the next 4 are again to be executed in parallel, but only after the first 4 have finished. Is this possible if I use the same command queue for all 8 kernels, issue a clEnqueueNDRangeKernel command for each of the first four kernels, and set the OUT_OF_ORDER property (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) when creating the command queue? If so, how do I execute the next four kernels in parallel after the first four? Can I call clWaitForEvents after the first four kernels and then enqueue the next four? Will this guarantee that the first four kernels are executed in parallel and that the next four are executed after them, also in parallel?

I think clEnqueueTask would make my code slow, since I have about 1000 pixels in each kernel and clEnqueueTask fixes both global_work_size and local_work_size at 1.

I am not sure whether all of this can be done, or what is right or wrong, so I just need confirmation. If it cannot be done this way, please suggest an alternative.


12 Replies

OUT_OF_ORDER has no effect on AMD platforms.

On NVIDIA platforms, I think only Kepler cards support simultaneous kernel execution. But knowing NVIDIA and their love for OpenCL, I am not sure whether they have implemented out-of-order processing, multiple kernel execution, etc. in OpenCL.

At least on NVIDIA platforms, I know that for simultaneous processing of multiple kernels, data transfers, etc., you need to use multiple queues. So enqueue the 4 independent kernels in 4 different queues.

Enqueue the next set after all of these have finished (call clFinish() on each command queue),

or construct an event list from all 4 kernel events and pass it as the wait list to clEnqueueNDRangeKernel() for the remaining 4 kernels.

Hope this helps.

If I create a separate command queue for each of the 8 kernels, all of them will execute in parallel. To avoid this and make the other four kernels execute after the first four, where exactly do I call clFinish(): after each kernel, or after the first four kernels? Please explain this in a bit more detail.

Also, I didn't understand the "(or)" part. Did you mean I should use clEnqueueTask? I think an event list is not allowed for clEnqueueNDRangeKernel.

Please reply ASAP.

An event wait list is what you should use to enforce dependencies.

Please check any clEnqueue* API (in your case, clEnqueueNDRangeKernel()).

The last 3 arguments (num_events_in_wait_list, event_wait_list and event) specify the wait list and return an event for the enqueued command.

You can use them to synchronize your operations.
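A minimal host-side sketch of that wait-list pattern, in case it helps. It assumes four already-created queues in the same context and eight already-built kernels with their arguments set. The function name `enqueue_two_stages`, the array names and the 1-D sizes are hypothetical, and error handling is reduced to returning the first failing status.

```c
#include <CL/cl.h>   /* <OpenCL/opencl.h> on macOS */
#include <stddef.h>

#define NUM_PARTS 4

/* Hypothetical helper: queues[i] is one queue per image part,
 * stage1[i] / stage2[i] are kernels whose arguments are already set. */
cl_int enqueue_two_stages(cl_command_queue queues[NUM_PARTS],
                          cl_kernel stage1[NUM_PARTS],
                          cl_kernel stage2[NUM_PARTS],
                          size_t global_size, size_t local_size)
{
    cl_event stage1_done[NUM_PARTS];
    cl_event stage2_done[NUM_PARTS];
    cl_uint n1 = 0, n2 = 0;
    cl_int err = CL_SUCCESS;
    cl_int wait_err = CL_SUCCESS;

    /* Stage 1: empty wait list; keep the event each enqueue returns. */
    for (cl_uint i = 0; i < NUM_PARTS && err == CL_SUCCESS; ++i) {
        err = clEnqueueNDRangeKernel(queues[i], stage1[i], 1, NULL,
                                     &global_size, &local_size,
                                     0, NULL, &stage1_done[n1]);
        if (err == CL_SUCCESS) ++n1;
    }

    /* Stage 2: each kernel waits on every stage-1 event, so none of
     * them can start before all of stage 1 has completed. */
    for (cl_uint i = 0; i < NUM_PARTS && err == CL_SUCCESS; ++i) {
        err = clEnqueueNDRangeKernel(queues[i], stage2[i], 1, NULL,
                                     &global_size, &local_size,
                                     n1, stage1_done, &stage2_done[n2]);
        if (err == CL_SUCCESS) ++n2;
    }

    /* Events waited on from another queue require that the queue
     * producing them has been flushed. */
    for (cl_uint i = 0; i < NUM_PARTS; ++i)
        clFlush(queues[i]);

    /* Block until everything that was successfully enqueued is done. */
    if (n2 > 0)
        wait_err = clWaitForEvents(n2, stage2_done);
    else if (n1 > 0)
        wait_err = clWaitForEvents(n1, stage1_done);

    for (cl_uint i = 0; i < n1; ++i) clReleaseEvent(stage1_done[i]);
    for (cl_uint i = 0; i < n2; ++i) clReleaseEvent(stage2_done[i]);

    return (err != CL_SUCCESS) ? err : wait_err;
}
```

If each second-stage kernel only needs the output of its own first-stage kernel, the wait list could shrink to `1, &stage1_done[i]`. Note that the wait list only guarantees ordering; whether the kernels within a stage actually overlap on the device is up to the implementation.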



Are they all different kernels? I don't see the point of dividing the workload if it all runs on the same device; it only introduces additional overhead. But if you have 4 kernels that can be executed in parallel and then another 4 that can also be executed in parallel, but only after the first 4, pass the events from the first group of clEnqueueNDRangeKernel() calls as the event_wait_list of the second group. Then call clFlush() on all queues, and after that clFinish() or clWaitForEvents().

Hi Nou,

My first 4 kernels do the same work, just on different sets of data. The other 4 kernels likewise do the same work on different data. I can't execute the next four kernels any earlier, because their input data comes from the first 4 kernels (the 1st kernel's data goes to the 5th kernel, the 2nd's to the 6th, and so on).

Let me tell you exactly what I am supposed to do. I have an image of size 512x512. I have divided it so that each kernel gets 256x256 pixels, which creates 4 parts. I do the same processing on all 4 sets of 256x256 pixels (i.e., the code for each set is the same, but I still create a different kernel for each set for more parallelism). Hence I think that if I give each of these sets its own command queue, they will be executed in parallel on different compute units, with a separate clEnqueueNDRangeKernel call for each kernel in its own command queue.

For the other 4 sets I do the same thing. Is such task parallelism possible, or will it just increase the overhead as you say? I mean, will the 4 sets be executed in parallel if they are enqueued in different command queues?
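Since the code for each 256x256 part is the same, one kernel could in principle serve all four parts if each launch is told where its part starts. A small sketch of that index arithmetic, assuming the parts are numbered row by row; the `tile_origin` struct and `tile_origin_for` function are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Pixel coordinates of a tile's top-left corner. */
typedef struct {
    size_t x;
    size_t y;
} tile_origin;

/* Origin of tile `index` when an image `image_width` pixels wide is
 * cut into tiles of tile_width x tile_height, numbered row by row. */
tile_origin tile_origin_for(size_t index, size_t image_width,
                            size_t tile_width, size_t tile_height)
{
    assert(tile_width > 0 && image_width % tile_width == 0);
    size_t tiles_per_row = image_width / tile_width;
    tile_origin origin = {
        (index % tiles_per_row) * tile_width,   /* column offset */
        (index / tiles_per_row) * tile_height   /* row offset */
    };
    return origin;
}
```

For the 512x512 image above, parts 0 to 3 start at (0, 0), (256, 0), (0, 256) and (256, 256). From OpenCL 1.1 onward, such an {x, y} pair can be passed as the global_work_offset argument of clEnqueueNDRangeKernel with work_dim = 2 and a global size of {256, 256}, and get_global_id() inside the kernel then already includes the offset.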


Hey Nou,

Please take the time to go through the post. You've raised a really serious point about "really parallel or just overhead", and I want to get it cleared up.


You don't need to enqueue multiple kernels to utilize more compute units. Each compute unit gets one or more work-groups, and each work-group can consist of up to 256 work-items on current AMD hardware. I don't see why you want to execute it this way. This approach only pays off when you don't have enough work-items to fully utilize the device.
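To put rough numbers on that, here is a small sketch that counts the work-groups a single launch produces. It assumes one work-item per pixel, which the thread does not state, and the function name `work_group_count` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups needed to cover `global_items` work-items
 * with groups of `local_items`, rounding up for a partial last group. */
size_t work_group_count(size_t global_items, size_t local_items)
{
    assert(local_items > 0);
    return (global_items + local_items - 1) / local_items;
}
```

With 256 work-items per group, one launch over the whole 512x512 image yields 1024 work-groups, while each 256x256 part on its own yields 256. Either figure can be compared with the compute-unit count the device reports through clGetDeviceInfo with CL_DEVICE_MAX_COMPUTE_UNITS to judge whether a single launch already keeps the device busy.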


But won't I achieve a speedup if I execute it this way? Each part would contain 256x256x8 edge weights and the whole image contains 512x512x8 edge weights, so I'm thinking of optimizing the code and achieving speedups. So will this approach give a speedup or not?


If you can execute the task with a single kernel, then the best option is to do so. Leave the task parallelization to OpenCL and don't bother with it. Each kernel launch brings a small startup overhead, so four kernels have a higher overhead than a single one. What led you to believe that your idea can bring any speedup?

I actually thought that if I create a separate command queue for each kernel, they would be executed in parallel. My thinking was that a small amount of data is always easier for a kernel to handle than a large amount, which would eventually increase the speed of the code. But my question is: do they really execute in parallel if multiple command queues are created or if an event wait list is used?


1. The amount of compute power in a GPU is constant.

2. You cannot change this by increasing the number of command queues.

3. A GPU loves to process huge amounts of data. It is a myth that breaking the work down into small chunks will help.

     In other words, a GPU is a wholesaler, not a retailer.