I am using clEnqueueSVMMap, and one of its arguments is an event, but I can't find how many event types can be used with this API.
Your question is not clear to me. Could you please be more explicit?
[For a detailed description, please refer to clEnqueueSVMMap.]
How can I estimate the GPU execution time, and the transfer (or SVM map) time, when using shared virtual memory?
Each clEnqueue<> API returns an event object that can be used to find out the status of that particular command. You can use clGetEventProfilingInfo for this purpose. However, to enable profiling, the command queue must be created with the CL_QUEUE_PROFILING_ENABLE flag set in the properties argument to clCreateCommandQueueWithProperties.
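For reference, clCreateCommandQueueWithProperties takes a zero-terminated property list. A minimal sketch (assuming `context` and `deviceId` were created earlier; error checks omitted) might look like:

```cpp
// Sketch: enable event profiling on a queue (OpenCL 2.0 API).
// `context` and `deviceId` are assumed to exist already.
cl_queue_properties props[] = {
    CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE,
    0   // the property list is zero-terminated
};
cl_int err;
cl_command_queue queue =
    clCreateCommandQueueWithProperties(context, deviceId, props, &err);
// err == CL_SUCCESS means the queue can report event profiling timestamps.
```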
Another, easier way is to use a profiler such as AMD's CodeXL. For details, please check http://developer.amd.com/tools-and-sdks/opencl-zone/codexl/codexl-benefits-detail/
There are some event profiling parameters we can use:
1. CL_PROFILING_COMMAND_QUEUED
2. CL_PROFILING_COMMAND_SUBMIT
3. CL_PROFILING_COMMAND_START
4. CL_PROFILING_COMMAND_END
When we use the clEnqueueNDRangeKernel API, we can use 3 and 4 (CL_PROFILING_COMMAND_START / CL_PROFILING_COMMAND_END) to estimate the kernel execution time.
I want to know how to choose event types to estimate the transfer time from CPU to GPU when we use the following APIs:
clEnqueueWriteBuffer - clEnqueueReadBuffer
clEnqueueSVMMap - clEnqueueSVMUnmap
On Kaveri (A10-7850K), is the implementation of USE_HOST_PTR the same as SVM?
I wrote a program using both USE_HOST_PTR and SVM, but the total execution time (CPU + GPU) with USE_HOST_PTR is faster than with SVM.
So what's the difference between USE_HOST_PTR and SVM on an APU (Kaveri)?
Yes, the difference between CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END can be used to estimate the kernel execution time. This is not limited to the clEnqueueNDRangeKernel call only; it applies to any command submitted via clEnqueue<>, including clEnqueue<Read/Write>Buffer, clEnqueue<Map/Unmap>, etc.
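For instance, a transfer command can be timed the same way as a kernel. This is only a sketch: it assumes `queue` was created with profiling enabled and that `buffer`, `hostPtr`, `svmPtr`, and `bytes` already exist, and it omits error checks:

```cpp
// Sketch: timing a host-to-device transfer through its event.
cl_event ev;
clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, bytes, hostPtr, 0, NULL, &ev);

cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
std::cout << "Write transfer: " << (end - start) * 1.0e-6 << " (ms)" << std::endl;
clReleaseEvent(ev);

// The same pattern applies to clEnqueueSVMMap / clEnqueueSVMUnmap:
cl_event mapEv;
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                svmPtr, bytes, 0, NULL, &mapEv);
// ...then query CL_PROFILING_COMMAND_START/END on mapEv as above.
```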
They are not exactly the same. USE_HOST_PTR acts as pinned host memory, whereas SVM allows both the host and devices to share the same virtual address space, so any SVM pointer can be directly accessed by both the host and devices. The runtime takes care of everything behind the scenes. SVM also needs special support for memory consistency (for example, IOMMU support). Depending on the type of SVM, the access time may vary greatly. Due to this consistency mechanism, accessing an SVM buffer may be costlier than normal pinned host memory.
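To illustrate the difference, here is a hedged sketch contrasting the two allocation styles; `context`, `queue`, `kernel`, and `bytes` are assumed to exist, and error checks are omitted:

```cpp
// (a) CL_MEM_USE_HOST_PTR: wrap existing host memory (the runtime may pin it).
//     The kernel receives the cl_mem handle; host access goes through map/unmap.
float* hostData = (float*)malloc(bytes);
cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            bytes, hostData, &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

// (b) Coarse-grained SVM: one pointer valid on both host and device.
float* svmData = (float*)clSVMAlloc(context, CL_MEM_READ_WRITE, bytes, 0);
clSetKernelArgSVMPointer(kernel, 0, svmData);   // pass the pointer directly
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, svmData, bytes, 0, NULL, NULL);
svmData[0] = 1.0f;                              // host writes via the same pointer
clEnqueueSVMUnmap(queue, svmData, 0, NULL, NULL);
```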
Using these options (USE_HOST_PTR / ALLOC_HOST_PTR / SVM) would reduce the transfer overhead but not totally eliminate it, right?
So if I want to estimate the transfer overhead, I could use CL_PROFILING_COMMAND_START / CL_PROFILING_COMMAND_END to get the time?
Although I have looked up the event types from the Khronos Group and other websites, I still don't understand how to use the other event types.
Yes, there may be some overhead when using these buffers. Such pinned host memory can be directly accessed from the kernel, but in that case the kernel may take longer due to the data transfer overhead. You can't measure this extra overhead separately, as it is included in the overall kernel execution time reported by the event profiling information. Profiling the kernel with various buffer types may give you a better indication. However, you can measure the map/unmap time explicitly. Sometimes these pinned host buffers are more efficient for transferring data between host and device than normal host memory (created with calloc/malloc).
Although I have looked up the event types from the Khronos Group and other websites, I still don't understand how to use the other event types
It's not clear to me. I hope you are not mixing up the event execution status (e.g. CL_QUEUED, CL_SUBMITTED, etc.) with the event command type (e.g. CL_COMMAND_NDRANGE_KERNEL, CL_COMMAND_READ_BUFFER, CL_COMMAND_WRITE_BUFFER, etc.) [ref. clGetEventInfo]
If you're looking for an example to estimate the time, then here it is:
queue = clCreateCommandQueue(context, deviceId, CL_QUEUE_PROFILING_ENABLE, NULL); // enable the profiling
- - -
cl_event event;
clEnqueue<COMMAND> (queue, ..., &event); // any clEnqueue<> command
clWaitForEvents(1, &event); // profiling info is valid only after the command completes
- - -
cl_ulong end, start;
// gather the timing information
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
std::cout << "Command Execution time: " << (end - start) * 1.0e-6 << " (ms)" << std::endl;
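And to inspect what kind of command an event belongs to, as opposed to its execution status, a short sketch (assuming `event` comes from an earlier clEnqueue<> call):

```cpp
// Command *type*: what kind of command produced this event.
cl_command_type type;
clGetEventInfo(event, CL_EVENT_COMMAND_TYPE, sizeof(type), &type, NULL);
// e.g. CL_COMMAND_NDRANGE_KERNEL, CL_COMMAND_WRITE_BUFFER, ...

// Execution *status*: how far the command has progressed.
cl_int status;
clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(status), &status, NULL);
// CL_QUEUED -> CL_SUBMITTED -> CL_RUNNING -> CL_COMPLETE
```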