
OpenCL

shreedhar_pawar
Adept II

Combination of Task Parallelism and Data Parallelism.


    In my image segmentation code, I have divided the image into 4 parts (i.e. if there are 4000 pixels, each part has 1000 pixels). I have 8 kernels in my code: the first 4 are to be executed in parallel, and the next 4 again in parallel, but only after the first 4 have finished. Is this possible if I use the same command queue for all 8 kernels, issue a clEnqueueNDRangeKernel call for each of the first four kernels, and pass CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when creating the command queue? And if this is possible, how do I execute the next four kernels in parallel, given that they must run after the first four? Can I call clWaitForEvents after the first four kernels and then enqueue the next four? Will this guarantee that the first four kernels execute in parallel, and that the next four execute after them, also in parallel?

    I think clEnqueueTask would make my code slow, since each kernel processes about 1000 pixels and clEnqueueTask forces both the global work size and the local work size to be just 1!

   I am not sure whether all of this can be done, or what is right or wrong, so I just need a confirmation! And if this approach won't work, please suggest an alternative way!

1 Solution

Accepted Solutions
nou
Exemplar

Re: Combination of Task Parallelism and Data Parallelism.


If you can execute the task with a single kernel, the best option is to do so. Leave task parallelization to OpenCL and don't bother with it. Each kernel launch brings a small startup overhead, so four kernel launches have higher overhead than a single one. What leads you to believe that your idea can bring any speedup?


12 Replies
himanshu_gautam
Grandmaster

Re: Combination of Task Parallelism and Data Parallelism.


CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE has no effect on AMD platforms.

On NVIDIA platforms, I think only Kepler cards support simultaneous kernel execution. But knowing NVIDIA and their love of OpenCL, I am not too sure whether they have implemented out-of-order processing, concurrent kernel execution, etc. in OpenCL.

At least on NVIDIA platforms, I know that simultaneous processing of multiple kernels/data transfers requires multiple queues. So enqueue the 4 independent kernels in 4 different queues.

Enqueue the next set after all of these finish (clFinish() on each command queue),

(or) construct an event list from the 4 kernel events and pass it as the event_wait_list of the clEnqueueNDRangeKernel() calls for the remaining 4 kernels.

Hope this helps.

shreedhar_pawar
Adept II

Re: Combination of Task Parallelism and Data Parallelism.


If I create a separate command queue for each of the 8 kernels, all of them will execute in parallel. To avoid this and make the other four kernels execute after the first four, where exactly do I put the clFinish() call: after each kernel, or once after the first four kernels? Please explain this in a bit of detail.

   Also, I didn't understand the "(or)" part. Did you mean I should use clEnqueueTask? Because I think an event list is not allowed for clEnqueueNDRangeKernel!

shreedhar_pawar
Adept II

Re: Combination of Task Parallelism and Data Parallelism.


Please reply ASAP!

nou
Exemplar

Re: Combination of Task Parallelism and Data Parallelism.


Are they all different kernels? Because I don't see the point of dividing the workload if it runs on the same device; that only introduces additional overhead. But if you have four kernels which can be executed in parallel, and then another four which can also be executed in parallel but only after the first group, then pass the events from the first group of clEnqueueNDRangeKernel() calls as the event_wait_list to the second group of clEnqueueNDRangeKernel() calls. Then call clFlush() on all queues, and after that clFinish()/clWaitForEvents().

shreedhar_pawar
Adept II

Re: Combination of Task Parallelism and Data Parallelism.


Hi Nou,

    My first 4 kernels do the same work, just on different sets of data. The other 4 kernels also all do the same work, but on different data. I can't execute the second four kernels any earlier, because their input is the output of the first 4 kernels (the 1st kernel feeds the 5th, the 2nd feeds the 6th, and so on).

   Let me tell you exactly what I am supposed to do. I have a 512×512 image, divided so that each kernel handles 256×256 pixels, which makes 4 parts. I do the same processing on all 4 sets of 256×256 pixels (i.e. the code for each set is the same, but I still create a different kernel for each set for more parallelism). Hence I think that if I give each of these sets its own command queue, they will execute in parallel on different compute units, with a separate clEnqueueNDRangeKernel call for each kernel enqueued in its own command queue!

   For the other 4 sets I do the same thing. Is such task parallelism possible, or will it just increase the overhead, as you say? I mean, will the 4 sets actually be executed in parallel if they are put in different command queues?

shreedhar_pawar
Adept II

Re: Combination of Task Parallelism and Data Parallelism.


Hey Nou,

   Please do take the time to go through the post. You've pointed to a really serious question about "really parallel or just overhead", and I want to get it cleared up!

nou
Exemplar

Re: Combination of Task Parallelism and Data Parallelism.


You don't need to enqueue multiple kernels to utilize more compute units. Each compute unit gets one or more work-groups, and each work-group can consist of up to 256 work-items on current AMD hardware. I don't see why you want to execute it this way; this approach is only a gain when you can't fully utilize the device because you don't have enough work-items.

shreedhar_pawar
Adept II

Re: Combination of Task Parallelism and Data Parallelism.


But won't it achieve a speedup if I execute it this way? Each part contains 256×256×8 edge weights and the whole image contains 512×512×8 edge weights, so I am trying to optimize the code and achieve a speedup! Will this approach give a speedup or not?

nou
Exemplar

Re: Combination of Task Parallelism and Data Parallelism.


If you can execute the task with a single kernel, the best option is to do so. Leave task parallelization to OpenCL and don't bother with it. Each kernel launch brings a small startup overhead, so four kernel launches have higher overhead than a single one. What leads you to believe that your idea can bring any speedup?
