
Using Gizmo 2 with OpenCL: DMA, zero-copy, and kernel invocation take perplexing amounts of time

Question asked by abc on Jul 24, 2015
Latest reply on Jul 30, 2015 by dipak



I'm trying to use the showcased AMD reduction algorithm (two-step), and even the first step takes a long time compared to a naive CPU implementation (i.e. a for loop with a swap for every value larger than the current max). When profiling with events, the kernel appears to take a very short time (a matter of nanoseconds), but when using clock()-based timing (start = clock(); end = clock(); runtime = (end - start) / CLOCKS_PER_SEC) I find that it takes several microseconds to execute. When launching an empty kernel, it still takes the same number of microseconds. Looking online, it appears there is no way around this? It must take this long to launch?


The problem I've had with this is that reducing a 1D array of 256 items results in a 24 microsecond run time, where the serialized version on the Gizmo CPU takes 7 microseconds. Increasing this to 16384 elements (256 x 64 items), it takes 30 microseconds to launch and complete the kernel, versus 130 - 200 microseconds for the serialized version, without taking copying to and from the device into account.
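For reference, the serialized CPU version I'm comparing against is just the naive scan described above; a minimal sketch in plain C (the function name is mine):

```c
#include <stddef.h>

/* Naive serial reduction: keep the largest value seen so far. */
int serial_max(const int *data, size_t n) {
    int max = data[0];
    for (size_t i = 1; i < n; ++i)
        if (data[i] > max)
            max = data[i];   /* "swap" in any value larger than the current max */
    return max;
}
```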


With copies taken into account, an extra 200 microsecond cost is added in either direction, and it eventually adds up to around 2000 microseconds, even if I make a blank kernel.


I've been trying to use DMA and zero copy to reduce this overhead to hopefully nearly zero, but I'm relatively new to this, and none of the techniques I've tried implementing ever seem to actually reduce the run time.


Note that my current understanding of DMA in this context is that it is essentially the same as zero copy: with zero copy you access data directly, either host to device or device to host, and DMA is the direct memory access mechanism between device and host that makes that possible.
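For what it's worth, the zero-copy pattern I've been attempting follows the usual map/unmap flow: allocate the buffer with CL_MEM_ALLOC_HOST_PTR (or CL_MEM_USE_HOST_PTR), then access it through clEnqueueMapBuffer instead of clEnqueueReadBuffer / clEnqueueWriteBuffer, so no copy is enqueued. A sketch (context/queue setup and error checks omitted; variable names are mine):

```c
/* Zero-copy sketch: the runtime hands back a host pointer into the
 * buffer's storage instead of copying. Assumes ctx, queue, and n
 * already exist; error checking omitted for brevity. */
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                            sizeof(cl_int) * n, NULL, &err);

/* Map for writing, fill in place, unmap -- no clEnqueueWriteBuffer. */
cl_int *p = (cl_int *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                         0, sizeof(cl_int) * n,
                                         0, NULL, NULL, &err);
for (size_t i = 0; i < n; ++i)
    p[i] = 1;
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

/* ... set kernel args, launch kernel, then map again to read results ... */
```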

Here is my current implementation (pseudocode, for KISS):



__kernel void add_eight(__global const int *buffer_in,
                        __global int *buffer_result)
{
    buffer_result[0] = buffer_in[get_global_id(0)] + 8;
}




My current understanding is that all the work-items will try to write to buffer_result[0], and that shouldn't cause performance hitches since it isn't using atomics; they will all just write at the same time (or 80 should), and only buffer_result[0] will be edited.
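For comparison, the usual first step of the two-step reduction (as in the AMD sample) has each work-group reduce its chunk in local memory, so only one work-item per group writes to global memory. A sketch of that kernel written for max, to match my CPU loop (names are mine):

```c
/* First step of a two-step max reduction: each work-group reduces its
 * chunk in local memory; one partial result per group goes to global memory. */
__kernel void reduce_max(__global const int *in,
                         __global int *partial,
                         __local int *scratch)
{
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s && scratch[lid + s] > scratch[lid])
            scratch[lid] = scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];  /* step two reduces these */
}
```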





int *host_ptr;

posix_memalign((void **)&host_ptr, 4096, sizeof(int) * 16384); // should result in 256 byte multiple


for (int i = 0; i < 16384; ++i) host_ptr[i] = 1; // initialize host_ptr array as all 1s


cl_mem buffer_in, buffer_result;


buffer_in = clCreateBuffer(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, sizeof(int) * 16384, host_ptr, &err);

//64 is arbitrary, don't care what those values are I just want to see them returned

buffer_result = clCreateBuffer(context, CL_MEM_USE_PERSISTENT_MEM_AMD, sizeof(int) * 64, NULL, &err);

set kernel arguments buffer_in, buffer_result

launch kernel

int *ptr = malloc(sizeof(int) * 64);


clEnqueueReadBuffer(queue, buffer_result, CL_TRUE, 0, sizeof(int) * 64, ptr, 0, NULL, NULL);








(I cannot use GUI profiling tools like CodeXL and some of AMD's programs; I have to use code-based profiling since I'm using the Gizmo 2 board, unless I'm missing something.)