cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

flare5009
Journeyman III

Problem with events timing with Linux

On Linux AMD-APP-SDK-v2.4-lnx64 reports inf bandwidth and gflops as event start and end time are same

Hello.

I have linux ubuntu natty and windows 7 running on same machine. Problem is that in windows the AMD APP-SDK examples timing seems work properly but in linux they do not.

What i see as the source of the problem is that clGetEventProfilingInfo returns same value for start and end time (in MatrixMultiplication.cpp from samples).

Any idea what is going on here?

The results seems to be correct and everything (that i have tested so far) seems to work (though for some reason texture writes are also slower than with nvidia gts 250 ?)

I have installed AMD-APP-SDK-v2.4-lnx6 on this week from ATI-website and ati driver is from ubuntu natty repository. Catalyst control versions shows Driver version 8.84.60 (2D Driver version).

Thanks!

 

 

 

0 Likes
10 Replies
himanshu_gautam
Grandmaster

what Ubuntu version are you using.

Do you see the effect in all SDK samples or only in matrixmultiplication. It would be nice if you can post your clinfo output also.

0 Likes

Thanks for your fast reply.

I am using ubuntu natty, that is 11.04. My kernel is Linux sundberg-lant 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

The problem exists also in other examples, like GlobalMemoryBandwidth, ConstantBandwidth (but PCIeBandwidth and BufferBandwidth seems work).

Edit: Remove CPU info and update

CLinfo follows:

sundberg@sundberg-lant:~/Downloads/AMD-APP-SDK-v2.4-lnx64$ ./bin/x86_64/clinfo
Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback cl_amd_offline_devices


  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               2
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Device ID:                                     4098
  Max compute units:                             12
  Max work items dimensions:                     3
    Max work items[0]:                           256
    Max work items[1]:                           256
    Max work items[2]:                           256
  Max work group size:                           256
  Preferred vector width char:                   16
  Preferred vector width short:                  8
  Preferred vector width int:                    4
  Preferred vector width long:                   2
  Preferred vector width float:                  4
  Preferred vector width double:                 0
  Native vector width char:                      16
  Native vector width short:                     8
  Native vector width int:                       4
  Native vector width long:                      2
  Native vector width float:                     4
  Native vector width double:                    0
  Max clock frequency:                           790Mhz
  Address bits:                                  32
  Max memory allocation:                         134217728
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            8192
  Max image 2D height:                           8192
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    16
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              32768
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     No
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    None
  Cache line size:                               0
  Cache size:                                    0
  Global memory size:                            536870912
  Constant buffer size:                          65536
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             32768
  Kernel Preferred work group size multiple:     64
  Error correction support:                      0
  Unified memory for Host and Device:            0
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:                               
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue properties:                             
    Out-of-Order:                                No
    Profiling :                                  Yes
  Platform ID:                                   0x7f2541cd9800
  Name:                                          Barts
  Vendor:                                        Advanced Micro Devices, Inc.
  Driver version:                                CAL 1.4.1353
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
  Extensions:                                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt

 

0 Likes

Unfortunately Ubuntu 11.04 is not officially supported as of now. You can try to run on a configuration which matches a configuration at http://developer.amd.com/sdks/AMDAPPSDK/pages/DriverCompatibility.aspx

0 Likes

I have the same issue on Ubuntu 10.04. It seems to only be an issue with events returned from Image read/write commands. I can get correct times from regulare buffer read/write and kernel executions. If you need a work around for the time being the Submit time is usually close to when it "starts" and it seems to be reported correctly.

0 Likes

Just as an note, also the AMD APP Profiler shows all 'Time' as 0's.

Timing with CPU side (from launch to clFinnish) naturally works fine.

 

0 Likes

asy1502,

Can you please guide me in reproducing this issue. Is it possible to reproduce it using some SDK Samples(those using images)? OR It would be very nice if you can post some post testcase.

Thanks

 

0 Likes

I don't really simple example code to supply. I will post my output from my timer which simply catalogs the information for cl::Event.

Here is a snippet of code from my test program which is responsible for writing the image to the GPU:

            #pragma omp parallel for private(i)// schedule(static,1)

            for(i = 0; i<num_gpus; i++)

            {

                try

                {

                    //can't assign an event even though underneath it will enqueue a command to write to GPU.

                    cl_img_in = cl::Image2D(cli->context_vec,CL_MEM_READ_ONLY,fmt,x,y/num_gpus,0,NULL,NULL);

                    cl_img_out = cl::Image2D(cli->context_vec,CL_MEM_WRITE_ONLY,fmt,x,y/num_gpus,0,NULL,NULL);

                    if(kern==KERNEL_AVERAGE || kern== KERNEL_FILTER)

                        cl_filterwidth = cl::Buffer(cli->context_vec,CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,sizeof(int),&filterwidth,NULL);

                    if(kern == KERNEL_FILTER)

                        cl_filter = cl::Buffer(cli->context_vec,CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,sizeof(float)*filterwidth*filterwidth,&filter,NULL);

 

                    cl::size_t<3> origin;

                    origin[0]=0;

                    origin[1]=0;

                    //origin[1]=i*(y/num_gpus);//+(i?1:0);                   

                    origin[2]=0;

//printf("origin (%d,%d,%d)\n",origin[0],origin[1],origin[2]);

 

                    cl::size_t<3> region;

                    region[0] = x;

                    //region[1] = (i+1)*(y/num_gpus);

                    region[1] = (y/num_gpus);

                    region[2] = 1;

//printf("region (%d,%d,%d)\n",region[0],region[1],region[2]);

 

                    cli->err = cli->queue.enqueueWriteImage(cl_img_in,CL_FALSE,origin,region,0,0,(void*)&img[(x*i*comp*(y/num_gpus))],NULL,&event_img_write);

                    cli->queue.flush();

                    cli->queue.finish();

                }

                catch (cl::Error er)

                {

                    printf("j = %d, num_gpus = %d, i = %d\n",j,num_gpus,i);

                    printf("ERROR: %s(%s)\n", er.what(), oclErrorString(er.err()));

                }

            }

Here are the timings:


Image read on GPU # 1 of 1:

Submit Time: 15960737.608512999489903

Queue Time: 15960737.605700999498367

Start Time: 15960792.039936998859048

End Time: 15960792.040448999032378

Minimum Time: 0.000505000000000

Maximum Time: 0.000512000000000

Average Time: 0.000510000000000

Total Time: 0.005100000000000

Count: 10

 

Image read on GPU # 1 of 2:

Submit Time: 15960862.573286999017000

Queue Time: 15960862.567387999966741

Start Time: 15960918.930052999407053

End Time: 15960918.930557999759912

Minimum Time: 0.000505000000000

Maximum Time: 0.000512000000000

Average Time: 0.000508100000000

Total Time: 0.005081000000000

Count: 10

 

Image read on GPU # 2 of 2:

Submit Time: 15960862.572125999256968

Queue Time: 15960862.567855998873711

Start Time: 15960910.594889998435974

End Time: 15960910.595394998788834

Minimum Time: 0.000504000000000

Maximum Time: 0.000512000000000

Average Time: 0.000507400000000

Total Time: 0.005074000000000

Count: 10

 

Image write on GPU # 1 of 1:

Submit Time: 15960641.159187998622656

Queue Time: 15960641.149013999849558

Start Time: 15960731.170045999810100

End Time: 15960731.170555999502540

Minimum Time: 0.000504000000000

Maximum Time: 0.000512000000000

Average Time: 0.000508500000000

Total Time: 0.005085000000000

Count: 10

 

Image write on GPU # 1 of 2:

Submit Time: 15960793.389790998771787

Queue Time: 15960793.377495998516679

Start Time: 15960835.404035000130534

End Time: 15960835.404542999342084

Minimum Time: 0.000501000000000

Maximum Time: 0.000512000000000

Average Time: 0.000508000000000

Total Time: 0.005080000000000

Count: 10

 

Image write on GPU # 2 of 2:

Submit Time: 15960793.407910998910666

Queue Time: 15960793.398457000032067

Start Time: 15960860.419271999970078

End Time: 15960860.419784000143409

Minimum Time: 0.000504000000000

Maximum Time: 0.000512000000000

Average Time: 0.000508200000000

Total Time: 0.005082000000000

Count: 10

 

Negative calculation on GPU # 1 of 1:

Submit Time: 15960734.133491998538375

Queue Time: 15960734.113944999873638

Start Time: 15960734.382627999410033

End Time: 15960737.557519998401403

Minimum Time: 3.174305000000000

Maximum Time: 3.177864000000000

Average Time: 3.175332400000000

Total Time: 31.753323999999999

Count: 10

 

Negative calculation on GPU # 1 of 2:

Submit Time: 15960860.638787999749184

Queue Time: 15960860.606035999953747

Start Time: 15960860.884366000071168

End Time: 15960862.475106999278069

Minimum Time: 1.590198000000000

Maximum Time: 1.591197000000000

Average Time: 1.590588300000000

Total Time: 15.905882999999999

Count: 10

 

Negative calculation on GPU # 2 of 2:

Submit Time: 15960860.623779999092221

Queue Time: 15960860.588281998410821

Start Time: 15960860.937602998688817

End Time: 15960862.529036998748779

Minimum Time: 1.590707000000000

Maximum Time: 3.380671000000000

Average Time: 1.769900800000000

Total Time: 17.699007999999999

Count: 10



0 Likes

Hi asy1502,

does the OpenCl examples work with your machines -- for example the matrix multiplication?

 

 

0 Likes

So it seems like we may have 2 different issues. I get accurate start/end times on kernel executions which appears to be the onlything timed by the sdk samples. I get inaccurate timings on enqueueImageRead/Write commands where the start/end times are the same. When I run the MatrixMulImage sample I get accurate timings for exectution.

BTW, My system has 2 Firepro3D 7800's, Ubuntu 10.04, APP SDK 2.4, Catalyst 11.6.

 

0 Likes

Sorry everyone,

I realized I was linking to an old version of libOpenCL.so due to incorrect environment paths. I am now using APP SDK 2.5 and GPU memory transfers are reported correctly.

0 Likes