Archives Discussions

jbrussell · ‎04-13-2011

Profiling shows the same time whether blocking is set to CL_TRUE or CL_FALSE

clGetEventProfilingInfo() reports the same time (421 nanoseconds for a 10 MB image, that's fast) for both clEnqueueWriteImage() and clEnqueueReadImage(), and the time doesn't change whether blocking is set to true or false. Linux, SDK 2.3, 5870 GPU. Loading the same amount of data to global memory using clEnqueueWriteBuffer() takes approx. 10 ms, which is in line with the PCIe bandwidth test sample result of 0.8+ GB/s. Is the blocking parameter ignored, or is profiling perhaps not implemented for clEnqueueWriteImage()?

Thanks...

himanshu_gautam · ‎04-14-2011

jbrussell,

Can you please provide a test case and your system details?

In some situations no data transfer is actually required between host and device, so you get extremely fast buffer transfer timings. You can refer to the buffer bandwidth sample as an example. This might be the reason for such a fast transfer.

jbrussell · ‎04-14-2011

The code is pretty straightforward, snippet attached, lifted from some of the sample code.

#ifdef USE_TEXTURE_MEM
    /* Enqueue texture image write */
    size_t origin[] = {0, 0, 0};
    size_t region[] = {width/2, height/2, 1};
    status = clEnqueueWriteImage(
                 commandQueue, inputBuffer, CL_TRUE,
                 origin, region, 0, 0, input,
                 0, NULL, &events[0]);
#else
    /* Enqueue writeBuffer to put input image in gpu memory */
    status = clEnqueueWriteBuffer(
                 commandQueue, inputBuffer, CL_TRUE, 0,
                 width * height * sizeof(cl_ushort), input,
                 0, NULL, &events[0]);
#endif
    if (status != CL_SUCCESS)
    {
        std::cout << "Error: input image clEnqueueWriteBuffer failed. "
                     "(clEnqueueWriteBuffer) = " << status << std::endl;
        return 1;
    }

    /* Wait for the write buffer to finish execution */
    status = clWaitForEvents(1, &events[0]);
    if (status != CL_SUCCESS)
    {
        std::cout << "Error: Waiting for write buffer call to finish. "
                     "(clWaitForEvents) = " << status << std::endl;
        return 1;
    }

    /* Calculate performance (get_profile_time is defined below) */
    get_profile_time(events[0], &msec);
    std::cout << "Input image write time = " << msec << " msec" << std::endl;

int get_profile_time(cl_event event, double *millisecs)
{
    cl_int status = 0;
    cl_ulong startTime, endTime;

    /* Get profiling info: start timestamp in nanoseconds */
    status = clGetEventProfilingInfo(
                 event, CL_PROFILING_COMMAND_START,
                 sizeof(cl_ulong), &startTime, 0);
    if (status != CL_SUCCESS)
    {
        std::cout << "clGetEventProfilingInfo failed.(startTime)" << std::endl;
        return 1;
    }

    /* Get profiling info: end timestamp in nanoseconds */
    status = clGetEventProfilingInfo(
                 event, CL_PROFILING_COMMAND_END,
                 sizeof(cl_ulong), &endTime, 0);
    if (status != CL_SUCCESS)
    {
        std::cout << "clGetEventProfilingInfo failed.(endTime)" << std::endl;
        return 1;
    }

    /* Convert nanoseconds to milliseconds */
    *millisecs = 1e-6 * (endTime - startTime);
    return 0;
}
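One way to cross-check event profiling is to time the same blocking call with a host-side clock. The following is only a sketch under that assumption: wall_clock_msec is a hypothetical helper, not something from the SDK samples, and it measures host wall-clock time around whatever callable it is given, so the result includes API and driver overhead on top of the transfer.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed host
// wall-clock time in milliseconds, using a monotonic clock.
template <typename F>
double wall_clock_msec(F&& work)
{
    const auto t0 = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Intended use (not compiled here, since it needs an OpenCL context):
//   double ms = wall_clock_msec([&] {
//       clEnqueueWriteImage(commandQueue, inputBuffer, CL_TRUE,
//                           origin, region, 0, 0, input, 0, NULL, NULL);
//       clFinish(commandQueue);
//   });
```

If the host-side figure for the image write lands near the 10 ms seen for clEnqueueWriteBuffer() while the event timestamps still say 421 ns, that points at the profiling data rather than at the blocking flag.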

jbrussell · ‎04-14-2011

System info, output from CLInfo sample:

Number of platforms: 1
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 1.1 ATI-Stream-v2.3 (451)
  Platform Name: ATI Stream
  Platform Vendor: Advanced Micro Devices, Inc.
  Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

  Platform Name: ATI Stream
Number of devices: 2
  Device Type: CL_DEVICE_TYPE_GPU
  Device ID: 4098
  Max compute units: 20
  Max work items dimensions: 3
    Max work items[0]: 256
    Max work items[1]: 256
    Max work items[2]: 256
  Max work group size: 256
  Preferred vector width char: 16
  Preferred vector width short: 8
  Preferred vector width int: 4
  Preferred vector width long: 2
  Preferred vector width float: 4
  Preferred vector width double: 0
  Native vector width char: 0
  Native vector width short: 0
  Native vector width int: 0
  Native vector width long: 0
  Native vector width float: 0
  Native vector width double: 0
  Max clock frequency: 900Mhz
  Address bits: 32
  Max memory allocation: 134217728
  Image support: Yes
  Max number of images read arguments: 128
  Max number of images write arguments: 8
  Max image 2D width: 8192
  Max image 2D height: 8192
  Max image 3D width: 2048
  Max image 3D height: 2048
  Max image 3D depth: 2048
  Max samplers within kernel: 16
  Max size of kernel argument: 1024
  Alignment (bits) of base address: 32768
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
    Denorms: No
    Quiet NaNs: Yes
    Round to nearest even: Yes
    Round to zero: Yes
    Round to +ve and infinity: Yes
    IEEE754-2008 fused multiply-add: Yes
  Cache type: None
  Cache line size: 0
  Cache size: 0
  Global memory size: 536870912
  Constant buffer size: 65536
  Max number of constant args: 8
  Local memory type: Scratchpad
  Local memory size: 32768
  Kernel Preferred work group size multiple: 64
  Error correction support: 0
  Unified memory for Host and Device: 0
  Profiling timer resolution: 1
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
    Execute OpenCL kernels: Yes
    Execute native function: No
  Queue properties:
    Out-of-Order: No
    Profiling : Yes
  Platform ID: 0x2b3f3dce3880
  Name: Cypress
  Vendor: Advanced Micro Devices, Inc.
  Driver version: CAL 1.4.1016
  Profile: FULL_PROFILE
  Version: OpenCL 1.1 ATI-Stream-v2.3 (451)
  Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt

  Device Type: CL_DEVICE_TYPE_CPU
  Device ID: 4098
  Max compute units: 4
  Max work items dimensions: 3
    Max work items[0]: 1024
    Max work items[1]: 1024
    Max work items[2]: 1024
  Max work group size: 1024
  Preferred vector width char: 16
  Preferred vector width short: 8
  Preferred vector width int: 4
  Preferred vector width long: 2
  Preferred vector width float: 4
  Preferred vector width double: 0
  Native vector width char: 16
  Native vector width short: 8
  Native vector width int: 4
  Native vector width long: 2
  Native vector width float: 4
  Native vector width double: 0
  Max clock frequency: 3200Mhz
  Address bits: 64
  Max memory allocation: 1073741824
  Image support: No
  Max size of kernel argument: 4096
  Alignment (bits) of base address: 1024
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
    Denorms: Yes
    Quiet NaNs: Yes
    Round to nearest even: Yes
    Round to zero: Yes
    Round to +ve and infinity: Yes
    IEEE754-2008 fused multiply-add: No
  Cache type: Read/Write
  Cache line size: 64
  Cache size: 32768
  Global memory size: 3221225472
  Constant buffer size: 65536
  Max number of constant args: 8
  Local memory type: Global
  Local memory size: 32768
  Kernel Preferred work group size multiple: 1
  Error correction support: 0
  Unified memory for Host and Device: 1
  Profiling timer resolution: 999848
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
    Execute OpenCL kernels: Yes
    Execute native function: Yes
  Queue properties:
    Out-of-Order: No
    Profiling : Yes
  Platform ID: 0x2b3f3dce3880
  Name: Intel(R) Xeon(R) CPU W3570 @ 3.20GHz
  Vendor: GenuineIntel
  Driver version: 2.0
  Profile: FULL_PROFILE
  Version: OpenCL 1.1 ATI-Stream-v2.3 (451)
  Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_media_ops cl_amd_popcnt cl_amd_printf

bpurnomo · ‎04-14-2011

The profiling timestamps for image read and write commands are incorrect with SDK 2.3. Please try APP SDK 2.4 to get accurate timings.
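For anyone stuck on an SDK where the timestamps cannot be trusted, a simple guard is to reject profiled durations that imply an impossible transfer rate. The sketch below is illustrative only: transfer_time_is_plausible is a hypothetical helper, std::uint64_t stands in for cl_ulong, and the bandwidth ceiling passed in (for example 8 GB/s for the link) is an assumption the caller has to choose for their own hardware.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sanity check for profiled transfer times.
// start_ns / end_ns: values as returned for CL_PROFILING_COMMAND_START / END.
// max_gb_per_s: assumed upper bound on link bandwidth (1 GB = 1e9 bytes).
bool transfer_time_is_plausible(std::uint64_t start_ns,
                                std::uint64_t end_ns,
                                std::size_t bytes,
                                double max_gb_per_s)
{
    // Unsigned subtraction would wrap if end < start, so reject that first.
    if (end_ns <= start_ns) {
        return false;
    }
    const double elapsed_ns = static_cast<double>(end_ns - start_ns);
    // Bytes per nanosecond is numerically the same as GB per second.
    const double gb_per_s = static_cast<double>(bytes) / elapsed_ns;
    return gb_per_s <= max_gb_per_s;
}
```

With an assumed 8 GB/s ceiling, a 10 MB transfer reported as 421 ns fails the check, while the same transfer reported as 10 ms passes.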

jbrussell · ‎04-19-2011

Moving to SDK 2.4 solved the timing issue. Thanks...

Blocking parameter to clEnqueueReadImage() and clEnqueueWriteImage() ignored?