
abc
Journeyman III

Using Gizmo 2 with OpenCL: DMA, zero copy, and kernel invocation take perplexing amounts of time

Hello,

I'm trying to use the showcased AMD two-step reduction algorithm, and even the first step is slow compared to a naive CPU implementation (a for loop with a swap for every value larger than the current max). When profiling with events the kernel takes a very short time (a matter of nanoseconds), but when using clock()-based timing (start = clock(), end = clock(), runtime = (end - start) / clock_cycles) I find that it takes several microseconds to execute. When testing with an empty kernel, launching still takes the same number of microseconds; looking online it appears there is no way around this, and it simply must take that long to launch?
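For reference, the two timings are measured roughly like this (a sketch only; queue, kernel, and the work sizes are placeholders, and the queue needs CL_QUEUE_PROFILING_ENABLE for the event timestamps to be valid):

// Sketch: clock()-based host timing around a kernel launch
size_t global_size = 16384, local_size = 256;
cl_event evt;
clock_t start = clock();
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, &evt);
clFinish(queue);                                   // block so clock() covers launch + execution
clock_t end = clock();
double cpu_us = 1e6 * (double)(end - start) / CLOCKS_PER_SEC;

// Sketch: event-based device timing for the same launch
cl_ulong t_queued, t_end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(t_queued), &t_queued, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
double event_us = (t_end - t_queued) / 1000.0;     // event timestamps are in nanoseconds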

The problem I've had is that reducing a 1D array of 256 items results in a 24 microsecond run time, whereas the serial version on the Gizmo CPU takes 7 microseconds. Increasing this to 16384 elements (256 x 64 items), it takes 30 microseconds to launch and complete the kernel versus 130-200 microseconds for the serial version, without taking copies to and from the device into account.

Taking copies into account adds an extra 200 microseconds of runtime in either direction, and it eventually adds up to around 2000 microseconds, even if I run a blank kernel.

I've been trying to use DMA and zero copy to reduce this overhead to hopefully nearly zero, but I'm relatively new to this, and none of the techniques I've tried to implement ever seem to actually reduce the run time.

Note that my current understanding is that DMA in this context is the same as zero copy: with zero copy you access data directly from host to device or device to host, and DMA is direct memory access from device to host or host to device.
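The map/unmap pattern I've been trying to follow for zero copy is roughly the following (a sketch only; context and queue are assumed to already exist, and whether the runtime really avoids the copy on this hardware is exactly what I can't confirm):

// Sketch: zero-copy style access via CL_MEM_ALLOC_HOST_PTR + map/unmap
cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                            sizeof(int) * 16384, NULL, &err);

// Map the buffer to get a host pointer, fill it, then unmap before the kernel uses it
int *p = (int *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE, 0,
                                   sizeof(int) * 16384, 0, NULL, NULL, &err);
for (int i = 0; i < 16384; ++i)
    p[i] = 1;
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);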

Here is my current implementation (pseudo-code to keep it simple):

Kernel

__kernel void add_eight(__global const int *buffer_in,
                        __global int *buffer_result)
{
    buffer_result[0] = buffer_in[get_global_id(0)] + 8;
}

My current understanding is that all the work-items will try to write to buffer_result[0], and that shouldn't cause performance hitches since it isn't using atomics; they will all just write at the same time (or 80 of them should), and only buffer_result[0] will be modified.

main

{
    int *host_ptr;
    posix_memalign((void **)&host_ptr, 4096, sizeof(int) * 16384); // 65536 bytes: page-aligned, so also a 256-byte multiple

    fill host_ptr array with all 1s;

    cl_mem buffer_in, buffer_result;
    buffer_in = clCreateBuffer(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
                               sizeof(int) * 16384, host_ptr, &err);

    // 64 is arbitrary; I don't care what those values are, I just want to see them returned
    buffer_result = clCreateBuffer(context, CL_MEM_USE_PERSISTENT_MEM_AMD,
                                   sizeof(int) * 64, NULL, &err);

    set kernel arguments buffer_in, buffer_result;
    launch kernel;

    int result[64];
    clEnqueueReadBuffer(queue, buffer_result, CL_TRUE, 0, sizeof(int) * 64, result, 0, NULL, NULL);
    print(result[0]);
}
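For comparison, here is a rough sketch of reading the result back via map/unmap instead of clEnqueueReadBuffer (same queue and err as above; whether this actually avoids a copy for a CL_MEM_USE_PERSISTENT_MEM_AMD buffer is part of what I'm unsure about):

// Sketch: read the result by mapping instead of clEnqueueReadBuffer
int *res = (int *)clEnqueueMapBuffer(queue, buffer_result, CL_TRUE, CL_MAP_READ, 0,
                                     sizeof(int) * 64, 0, NULL, NULL, &err);
printf("%d\n", res[0]);
clEnqueueUnmapMemObject(queue, buffer_result, res, 0, NULL, NULL);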

(I cannot use GUI profiling tools such as CodeXL and some of AMD's other programs; I have to use code-based profiling since I'm on the Gizmo 2 board, unless I'm missing something.)

5 Replies
jtrudeau
Staff

Welcome! You've been whitelisted, and I moved this over to the OpenCL forum.

dipak
Big Boss

A couple of questions in this regard:

1) How close is the event-based complete kernel execution time (CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED) to the clock()-based CPU-measured time?

2) If you run the same kernel multiple times (i.e. inside a loop), do you see any improvement (especially in launch time) for subsequent runs compared to the first run?

Regards,

abc
Journeyman III

It has rarely been close to the kernel execution time, though increased kernel execution time does seem to correlate with the clock()-based CPU time; they both grow as the kernel execution gets larger. I assumed that the CPU time was capturing kernel invocation costs and sending the code to the GPU, while the event data only captured the real kernel runtime.

If I run the kernel multiple times I DO get a performance increase; in fact I can clearly see the copying happening at some point during the operation, since the CPU-side time for launching the kernel drops consistently from 260 microseconds to 30 microseconds. If I change the data on the host, however (mapping/unmapping), it will again cause a copy operation to be done (despite me never explicitly telling OpenCL to do so).


Yes, it is expected that clock()-based timing may be higher than event-based timing. As completion of kernel execution is a synchronization point, the runtime may need to perform some extra work after the kernel computation to make the memory objects consistent within the context. This adds a little overhead, and the actual amount depends on the particular scenario. However, there should not be a huge disparity.

Normally it is observed that the first kernel invocation takes much longer than subsequent calls. The runtime/driver performs some extra (hidden) steps implicitly during that time. One example is "delayed buffer creation", where the runtime may choose not to create the device buffer until the buffer is actually referenced or used by the kernel or any other command.

In many APUs the host and device memory are logically partitioned even though they share the same physical memory, so this adds a little overhead. In such a system the device can directly access host memory, but the actual data path (data bus) may vary depending on the access pattern and the type of memory.
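As a rough way to observe this (a sketch only; queue, kernel, and the work sizes are placeholders), you can time the launches in a loop and compare the first iteration against the warmed-up ones:

// Sketch: compare the first launch (includes hidden setup) against warmed-up launches
for (int i = 0; i < 10; ++i) {
    clock_t start = clock();
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
    clFinish(queue);
    clock_t end = clock();
    printf("launch %d: %.1f us\n", i, 1e6 * (double)(end - start) / CLOCKS_PER_SEC);
}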


PS:

Usually the kernel invocation overhead for GPU devices is higher than for CPU devices. That is why it is sometimes better to choose the CPU device over the GPU when the kernel is small (not much computation required) and the NDRange size is low.
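For example, a minimal sketch of selecting the CPU device when one is available (single platform assumed, error checks omitted):

// Sketch: fall back to the CPU device for small kernels / small ND ranges
cl_platform_id platform;
cl_device_id device;
clGetPlatformIDs(1, &platform, NULL);
if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL) != CL_SUCCESS)
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);  // no CPU device available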

Regards,
