Hello.
I have linux ubuntu natty and windows 7 running on same machine. Problem is that in windows the AMD APP-SDK examples timing seems work properly but in linux they do not.
What i see as the source of the problem is that clGetEventProfilingInfo returns same value for start and end time (in MatrixMultiplication.cpp from samples).
Any idea what is going on here?
The results seems to be correct and everything (that i have tested so far) seems to work (though for some reason texture writes are also slower than with nvidia gts 250 ?)
I have installed AMD-APP-SDK-v2.4-lnx6 on this week from ATI-website and ati driver is from ubuntu natty repository. Catalyst control versions shows Driver version 8.84.60 (2D Driver version).
Thanks!
what Ubuntu version are you using.
Do you see the effect in all SDK samples or only in matrixmultiplication. It would be nice if you can post your clinfo output also.
Thanks for your fast reply.
I am using ubuntu natty, that is 11.04. My kernel is Linux sundberg-lant 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
The problem exists also in other examples, like GlobalMemoryBandwidth, ConstantBandwidth (but PCIeBandwidth and BufferBandwidth seems work).
Edit: Remove CPU info and update
CLinfo follows:
sundberg@sundberg-lant:~/Downloads/AMD-APP-SDK-v2.4-lnx64$ ./bin/x86_64/clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4098
Max compute units: 12
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 0
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 0
Max clock frequency: 790Mhz
Address bits: 32
Max memory allocation: 134217728
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 32768
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 536870912
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 0x7f2541cd9800
Name: Barts
Vendor: Advanced Micro Devices, Inc.
Driver version: CAL 1.4.1353
Profile: FULL_PROFILE
Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt
Unfortunately Ubuntu 11.04 is not officially supported as of now. You can try to run on a configuration which matches a configuration at http://developer.amd.com/sdks/AMDAPPSDK/pages/DriverCompatibility.aspx
I have the same issue on Ubuntu 10.04. It seems to only be an issue with events returned from Image read/write commands. I can get correct times from regulare buffer read/write and kernel executions. If you need a work around for the time being the Submit time is usually close to when it "starts" and it seems to be reported correctly.
Just as an note, also the AMD APP Profiler shows all 'Time' as 0's.
Timing with CPU side (from launch to clFinnish) naturally works fine.
asy1502,
Can you please guide me in reproducing this issue. Is it possible to reproduce it using some SDK Samples(those using images)? OR It would be very nice if you can post some post testcase.
Thanks
I don't really simple example code to supply. I will post my output from my timer which simply catalogs the information for cl::Event.
Here is a snippet of code from my test program which is responsible for writing the image to the GPU:
#pragma omp parallel for private(i)// schedule(static,1)
for(i = 0; i<num_gpus; i++)
{
try
{
//can't assign an event even though underneath it will enqueue a command to write to GPU.
cl_img_in = cl::Image2D(cli->context_vec,CL_MEM_READ_ONLY,fmt,x,y/num_gpus,0,NULL,NULL);
cl_img_out = cl::Image2D(cli->context_vec,CL_MEM_WRITE_ONLY,fmt,x,y/num_gpus,0,NULL,NULL);
if(kern==KERNEL_AVERAGE || kern== KERNEL_FILTER)
cl_filterwidth = cl::Buffer(cli->context_vec,CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,sizeof(int),&filterwidth,NULL);
if(kern == KERNEL_FILTER)
cl_filter = cl::Buffer(cli->context_vec,CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,sizeof(float)*filterwidth*filterwidth,&filter,NULL);
cl::size_t<3> origin;
origin[0]=0;
origin[1]=0;
//origin[1]=i*(y/num_gpus);//+(i?1:0);
origin[2]=0;
//printf("origin (%d,%d,%d)\n",origin[0],origin[1],origin[2]);
cl::size_t<3> region;
region[0] = x;
//region[1] = (i+1)*(y/num_gpus);
region[1] = (y/num_gpus);
region[2] = 1;
//printf("region (%d,%d,%d)\n",region[0],region[1],region[2]);
cli->err = cli->queue.enqueueWriteImage(cl_img_in,CL_FALSE,origin,region,0,0,(void*)&img[(x*i*comp*(y/num_gpus))],NULL,&event_img_write);
cli->queue.flush();
cli->queue.finish();
}
catch (cl::Error er)
{
printf("j = %d, num_gpus = %d, i = %d\n",j,num_gpus,i);
printf("ERROR: %s(%s)\n", er.what(), oclErrorString(er.err()));
}
}
Image read on GPU # 1 of 1:
Submit Time: 15960737.608512999489903
Queue Time: 15960737.605700999498367
Start Time: 15960792.039936998859048
End Time: 15960792.040448999032378
Minimum Time: 0.000505000000000
Maximum Time: 0.000512000000000
Average Time: 0.000510000000000
Total Time: 0.005100000000000
Count: 10
Image read on GPU # 1 of 2:
Submit Time: 15960862.573286999017000
Queue Time: 15960862.567387999966741
Start Time: 15960918.930052999407053
End Time: 15960918.930557999759912
Minimum Time: 0.000505000000000
Maximum Time: 0.000512000000000
Average Time: 0.000508100000000
Total Time: 0.005081000000000
Count: 10
Image read on GPU # 2 of 2:
Submit Time: 15960862.572125999256968
Queue Time: 15960862.567855998873711
Start Time: 15960910.594889998435974
End Time: 15960910.595394998788834
Minimum Time: 0.000504000000000
Maximum Time: 0.000512000000000
Average Time: 0.000507400000000
Total Time: 0.005074000000000
Count: 10
Image write on GPU # 1 of 1:
Submit Time: 15960641.159187998622656
Queue Time: 15960641.149013999849558
Start Time: 15960731.170045999810100
End Time: 15960731.170555999502540
Minimum Time: 0.000504000000000
Maximum Time: 0.000512000000000
Average Time: 0.000508500000000
Total Time: 0.005085000000000
Count: 10
Image write on GPU # 1 of 2:
Submit Time: 15960793.389790998771787
Queue Time: 15960793.377495998516679
Start Time: 15960835.404035000130534
End Time: 15960835.404542999342084
Minimum Time: 0.000501000000000
Maximum Time: 0.000512000000000
Average Time: 0.000508000000000
Total Time: 0.005080000000000
Count: 10
Image write on GPU # 2 of 2:
Submit Time: 15960793.407910998910666
Queue Time: 15960793.398457000032067
Start Time: 15960860.419271999970078
End Time: 15960860.419784000143409
Minimum Time: 0.000504000000000
Maximum Time: 0.000512000000000
Average Time: 0.000508200000000
Total Time: 0.005082000000000
Count: 10
Negative calculation on GPU # 1 of 1:
Submit Time: 15960734.133491998538375
Queue Time: 15960734.113944999873638
Start Time: 15960734.382627999410033
End Time: 15960737.557519998401403
Minimum Time: 3.174305000000000
Maximum Time: 3.177864000000000
Average Time: 3.175332400000000
Total Time: 31.753323999999999
Count: 10
Negative calculation on GPU # 1 of 2:
Submit Time: 15960860.638787999749184
Queue Time: 15960860.606035999953747
Start Time: 15960860.884366000071168
End Time: 15960862.475106999278069
Minimum Time: 1.590198000000000
Maximum Time: 1.591197000000000
Average Time: 1.590588300000000
Total Time: 15.905882999999999
Count: 10
Negative calculation on GPU # 2 of 2:
Submit Time: 15960860.623779999092221
Queue Time: 15960860.588281998410821
Start Time: 15960860.937602998688817
End Time: 15960862.529036998748779
Minimum Time: 1.590707000000000
Maximum Time: 3.380671000000000
Average Time: 1.769900800000000
Total Time: 17.699007999999999
Count: 10
Hi asy1502,
does the OpenCl examples work with your machines -- for example the matrix multiplication?
So it seems like we may have 2 different issues. I get accurate start/end times on kernel executions which appears to be the onlything timed by the sdk samples. I get inaccurate timings on enqueueImageRead/Write commands where the start/end times are the same. When I run the MatrixMulImage sample I get accurate timings for exectution.
BTW, My system has 2 Firepro3D 7800's, Ubuntu 10.04, APP SDK 2.4, Catalyst 11.6.
Sorry everyone,
I realized I was linking to an old version of libOpenCL.so due to incorrect environment paths. I am now using APP SDK 2.5 and GPU memory transfers are reported correctly.