
alariq
Adept I

use of events

Hello, All

I have several questions regarding "best practices" use of events.

Let's consider the following workflow:

kernel A1 -> kernel A2  (kernel A2 depends on the result of kernel A1)

kernel B1 -> kernel B2  (kernel B2 depends on the result of kernel B1)

The A kernels and the B kernels are not connected in any way.

And this code example:

for(i=0...N)
{
   setparams(A1, A1params, i);
   setparams(A2, A2params, i);
   runkernel(A1, i);
   runkernel(A2, i);
   setparams(B1, B1params, i);
   setparams(B2, B2params, i);
   runkernel(B1, i);
   runkernel(B2, i);
}

clFinish(queue);

dump_results(A2buf, B2buf);

I've added "i" as a parameter to setparams() and runkernel() to indicate that, in the general case, the kernel parameters and kernel configuration may differ depending on the iteration number.
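For concreteness, here is roughly how one iteration of that loop could look with the plain OpenCL C API on an in-order queue. This is only an illustrative sketch: the handle names (kA1, kA2, kB1, kB2, the buffers) and the argument layout are made up, the work size is arbitrary, and error checking is omitted.

#include <CL/cl.h>

/* One iteration of the A1->A2 / B1->B2 workflow on an in-order queue.
   The in-order queue already guarantees A2 runs after A1 and B2 after B1,
   so no events are attached to the enqueues. */
void enqueue_iteration(cl_command_queue queue,
                       cl_kernel kA1, cl_kernel kA2,
                       cl_kernel kB1, cl_kernel kB2,
                       cl_mem A1buf, cl_mem A2buf,
                       cl_mem B1buf, cl_mem B2buf,
                       size_t gsize)
{
    /* "setparams(A1, A1params, i)" etc. from the pseudocode */
    clSetKernelArg(kA1, 0, sizeof(cl_mem), &A1buf);
    clSetKernelArg(kA2, 0, sizeof(cl_mem), &A1buf);  /* A2 reads A1's result */
    clSetKernelArg(kA2, 1, sizeof(cl_mem), &A2buf);

    clSetKernelArg(kB1, 0, sizeof(cl_mem), &B1buf);
    clSetKernelArg(kB2, 0, sizeof(cl_mem), &B1buf);  /* B2 reads B1's result */
    clSetKernelArg(kB2, 1, sizeof(cl_mem), &B2buf);

    /* "runkernel(...)" from the pseudocode */
    clEnqueueNDRangeKernel(queue, kA1, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, kA2, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, kB1, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, kB2, 1, NULL, &gsize, NULL, 0, NULL, NULL);
}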

Now I have several assumptions (please tell me if I am wrong):

Assuming In-Order queue:

1) We do not need to use events at all (there is no point): all kernels will be executed sequentially, in exactly the order they were added to the queue. Even if we added dependencies of the form A1 -> A2 -> A1[i+1] -> ... (and likewise for B1, B2) using events, no kernels would be executed in parallel even where that would be possible (the A kernels are completely independent of the B kernels).

2) Instead of clFinish() I could attach one event to the last enqueued kernel (B2 in the Nth iteration) and use:

clFlush(queue);

clWaitForEvents(1, &B2_Nth_iter_event);

to achieve the same effect.
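A minimal sketch of that idea (hypothetical handle names; only the very last enqueue receives an event):

#include <CL/cl.h>

/* Wait for the whole in-order queue by waiting on the last enqueued kernel. */
void flush_and_wait_last(cl_command_queue queue, cl_kernel kB2, size_t gsize)
{
    cl_event last_ev;

    /* Only the very last enqueue (B2 of the Nth iteration) gets an event. */
    clEnqueueNDRangeKernel(queue, kB2, 1, NULL, &gsize, NULL, 0, NULL, &last_ev);

    clFlush(queue);                 /* submit everything to the device */
    clWaitForEvents(1, &last_ev);   /* block until the last kernel completes */
    clReleaseEvent(last_ev);
}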

Assuming Out-Of-Order queue:

1) I would definitely need to use events to explicitly express the A1->A2 and B1->B2 dependencies. Even so, all kernels would still be executed sequentially (just without any ordering guarantee beyond the events) as soon as the dependencies specified by the events are satisfied.

2) The number of events will be proportional to the number of iterations, because of the A1 -> A2 -> A1[i+1] -> A2[i+1] -> ... -> A1[i+k] -> A2[i+k] dependency chain (and the same for the B kernels).

3) Unlike with the in-order queue, if I replace clFinish() with clFlush() I would need to wait for the 2 last events: one for the A1->A2 chain and one for the B1->B2 chain.
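A sketch of one iteration with explicit events on an out-of-order queue (hypothetical handle names; the wait lists express only the A1->A2 and B1->B2 dependencies, so the A chain and the B chain are free to overlap):

#include <CL/cl.h>

/* One iteration on an out-of-order queue: A2 waits on A1, B2 waits on B1. */
void enqueue_iteration_ooo(cl_command_queue queue,
                           cl_kernel kA1, cl_kernel kA2,
                           cl_kernel kB1, cl_kernel kB2,
                           size_t gsize,
                           cl_event *A2_done, cl_event *B2_done)
{
    cl_event evA1, evB1;

    clEnqueueNDRangeKernel(queue, kA1, 1, NULL, &gsize, NULL, 0, NULL, &evA1);
    clEnqueueNDRangeKernel(queue, kA2, 1, NULL, &gsize, NULL, 1, &evA1, A2_done);

    clEnqueueNDRangeKernel(queue, kB1, 1, NULL, &gsize, NULL, 0, NULL, &evB1);
    clEnqueueNDRangeKernel(queue, kB2, 1, NULL, &gsize, NULL, 1, &evB1, B2_done);

    /* The intermediate events are no longer needed by the host. */
    clReleaseEvent(evA1);
    clReleaseEvent(evB1);

    /* The caller waits on *A2_done and *B2_done (the two chain tails) and
       releases them, or passes them into the next iteration's wait lists. */
}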

Also, one question about reusing events (imagine the same scenario, but slightly transposed).

for(i=0...N)
{
   setparams(A1, A1params, i);
   setparams(B1, B1params, i);
   runkernel(A1, i);
   runkernel(B1, i);
}

for(i=0...N)
{
   setparams(A2, A2params, i);
   setparams(B2, B2params, i);
   runkernel(A2, i);
   runkernel(B2, i);
}

Suppose I want to proceed with events. So I have 2*N events for the first loop + 2*N for the second loop.

Now suppose I want to reuse the events from the first loop.

Is it necessary to clReleaseEvent them before reusing them in the second loop? (I assume YES 🙂)

In other words, when passing a cl_event to clEnqueueNDRangeKernel(..., &ev), ev should NOT be an already initialized event, right?
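A minimal sketch of the reuse pattern with the C API (hypothetical names): release your reference to the old event before letting clEnqueueNDRangeKernel overwrite the variable, otherwise that event object is leaked.

#include <CL/cl.h>

void reuse_event_slot(cl_command_queue queue, cl_kernel kA1, cl_kernel kA2, size_t gsize)
{
    cl_event ev;

    /* first cycle: ev receives a brand-new event object */
    clEnqueueNDRangeKernel(queue, kA1, 1, NULL, &gsize, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    /* drop our reference before reusing the variable; the runtime does not
       release the old event for us, it just overwrites the handle */
    clReleaseEvent(ev);

    /* second cycle: ev now receives a different, freshly created event */
    clEnqueueNDRangeKernel(queue, kA2, 1, NULL, &gsize, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);
    clReleaseEvent(ev);
}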

Also, is there much overhead if I have, e.g., N=15, so about 30 events?

Thanks in advance.

0 Likes
10 Replies
german
Staff

  • Your assumptions about in-order and out-of-order execution are correct.
  • The AMD runtime doesn't support out-of-order execution. Is there any particular reason for the out-of-order model in your application? Basically, I would suggest 2 queues instead, or even a single queue. SI supports concurrent execution even on the same queue when possible. The condition is that kernels that follow each other don't utilize all compute units (CUs) and don't have dependencies.
  • The events usage model depends on whether you use the C interface or the C++ bindings. Yes, you have to call clReleaseEvent with the C interface (your example). However, the C++ bindings let you reuse cl::Event objects; basically the implementation (see the cl.hpp file) just hides clReleaseEvent from the application.
  • The overhead of events shouldn't be big, but it may vary with the application logic (say, synchronization across multiple queues). Try to avoid unnecessary events/synchronization.

himanshu.gautam

>> SI supports concurrent execution even on the same queue when possible.

>> The condition is that kernels that follow each other don't utilize all compute units (CUs)

>> and don't have dependencies.

This will force an unnecessary code path on OpenCL applications that want to perform optimally on SI cards.

I will have to cut down my global_size() depending on whether I have an SI card or not.

Also, if SI could support OpenCL "sub-devices", it would be a lot better.

Any goodies on this side?

As of now, I believe only CPU devices can be sliced to create "sub-devices" (on the AMD platform).
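For reference, CPU device fission looks roughly like this with the OpenCL 1.2 API (a sketch with made-up partition sizes; on OpenCL 1.1 the same idea is exposed through the device fission extension):

#include <stdlib.h>
#include <CL/cl.h>

/* Split a CPU device into sub-devices of 4 compute units each (OpenCL 1.2). */
void split_cpu_device(cl_device_id cpu)
{
    cl_device_partition_property props[] = { CL_DEVICE_PARTITION_EQUALLY, 4, 0 };
    cl_uint n = 0;

    clCreateSubDevices(cpu, props, 0, NULL, &n);       /* query how many sub-devices we get */
    cl_device_id *subdevs = malloc(n * sizeof(*subdevs));
    clCreateSubDevices(cpu, props, n, subdevs, NULL);  /* create them */

    /* each subdevs[i] can now get its own context/queue;
       release with clReleaseDevice() and free(subdevs) when done */
}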

0 Likes

himanshu.gautam wrote:

>> SI supports concurrent execution even on the same queue when possible.

>> The condition is that kernels that follow each other don't utilize all compute units (CUs)

>> and don't have dependencies.

This will force an unnecessary code path on OpenCL applications that want to perform optimally on SI cards.

I will have to cut down my global_size() depending on whether I have an SI card or not.

Also, if SI could support OpenCL "sub-devices", it would be a lot better.

Any goodies on this side?

As of now, I believe only CPU devices can be sliced to create "sub-devices" (on the AMD platform).

I don't see any reason to change the application logic. This feature should force nothing on you.

What kind of task is suited to sub-devices and can't be handled with multiple asynchronous queues on SI?

Sub-devices on the GPU could be a "nice" feature. But is there a real issue that can't be solved with the current functionality?

0 Likes

>> The condition is that kernels that follow each other don't utilize all compute units (CUs)

>> and don't have dependencies.

I assumed that "don't utilize all compute units" has a direct bearing on get_global_size() (or the total number of threads spawned for a kernel).

Or have I misunderstood?

0 Likes

That's correct. But why do you need to cut the original global size? You just submit whatever the logic requires and cut nothing. Say the app launches a kernel with only 1024 elements: then the following kernel can start earlier and all CUs will be busy. But that doesn't mean you have to cut 10K elements into 10 groups of 1024 elements each.

The async mode between 2 CPs (command processors) is slightly different. The GPU schedules workgroups itself from both command queues. It even allows graphics submissions to run on a third CP.
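In other words, something along these lines on the host side (a sketch; handle names are made up and error checking is omitted):

#include <CL/cl.h>

/* Two independent in-order queues on the same device. The hardware can schedule
   workgroups from both, so the A chain and the B chain may overlap without an
   out-of-order queue or extra events. */
void make_two_queues(cl_context ctx, cl_device_id dev,
                     cl_command_queue *qA, cl_command_queue *qB)
{
    cl_int err;
    *qA = clCreateCommandQueue(ctx, dev, 0, &err);   /* enqueue A1, A2 here */
    *qB = clCreateCommandQueue(ctx, dev, 0, &err);   /* enqueue B1, B2 here */
}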

0 Likes

Thanks, that sounds great. So scheduling happens at work-group granularity (it does not matter which kernel a work-group belongs to). That is simply efficient! Great to know.

0 Likes

Thanks guys for the replies.

I was also considering using 2 queues; maybe I'll try it later to see the difference.

About:

>> The condition is that kernels that follow each other don't utilize all compute units (CUs) and don't have dependencies.

I had heard about it, but could not find a formal description of what it means to "have no dependency" 🙂

In particular, if I have a buffer created as CL_MEM_READ_WRITE but declare it as __global const char* in both kernels and have no aliases, so it will not be written by those kernels, will the runtime still consider the kernels independent?

Thanks

independent"
Détecter la langue » English
0 Likes

alariq wrote:

Thanks guys for the replies.

I was also considering using 2 queues; maybe I'll try it later to see the difference.

About:

>> The condition is that kernels that follow each other don't utilize all compute units (CUs) and don't have dependencies.

I had heard about it, but could not find a formal description of what it means to "have no dependency" 🙂

In particular, if I have a buffer created as CL_MEM_READ_WRITE but declare it as __global const char* in both kernels and have no aliases, so it will not be written by those kernels, will the runtime still consider the kernels independent?

I would assume that there is no dependency if the buffer is effectively constant. It would be quite awesome if both kernels could use it at the same time. But these are just speculations; probably someone more knowledgeable can help here.

Just keep in mind that to test concurrent kernel execution you will need two kernels that are not very demanding on either memory or CUs. I would suggest (number of CUs / 2) workgroups for each kernel. Having more workgroups than that is definitely recommended for performance, but may not be vital for testing CKE. Also make sure the kernels are heavily compute bound: if a kernel is memory bound, a single kernel might take up the whole global memory bandwidth, effectively serializing the kernels.
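A sketch of that sizing rule (the local size is whatever your kernel uses; the point is only to launch few enough workgroups that one kernel cannot occupy every CU):

#include <CL/cl.h>

/* Pick a global size that occupies roughly half of the device's CUs, so that
   two such kernels can, in principle, run side by side. */
size_t half_device_global_size(cl_device_id dev, size_t local_size)
{
    cl_uint num_cus = 1;
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(num_cus), &num_cus, NULL);

    cl_uint groups = num_cus / 2;          /* (number of CUs / 2) workgroups */
    if (groups == 0)
        groups = 1;
    return (size_t)groups * local_size;
}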

In particular, if I have a buffer created as CL_MEM_READ_WRITE but declare it as __global const char* in both kernels and have no aliases, so it will not be written by those kernels, will the runtime still consider the kernels independent?

That's correct. Any read-only accesses to memory are considered "independent" as far as kernel execution goes.
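For illustration, the situation in question would look something like this on the device side (a sketch; both hypothetical kernels only read src, even though the host created it with CL_MEM_READ_WRITE):

/* kernels.cl -- both kernels only read the shared buffer */
__kernel void kernel_A(__global const char *src, __global int *dstA)
{
    size_t i = get_global_id(0);
    dstA[i] = src[i] + 1;
}

__kernel void kernel_B(__global const char *src, __global int *dstB)
{
    size_t i = get_global_id(0);
    dstB[i] = src[i] * 2;
}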

0 Likes

Will kernels operating on 2 different sub-buffers belonging to the same parent cl_mem object be considered independent?

The spec, in any case, says that overlapping sub-buffers give undefined results.

0 Likes