
andreac
Journeyman III

R290 linux - Two stage reduction - Low performance

Hi,

I'm using an R9 290 under Ubuntu 12.04 with OpenCL (catalyst-13.12).

This is what I find strange:

I wrote a program that does a two-stage reduction (as described here http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-c...)

I run it on a vector of 100000000 elements and loop the kernel enqueue 10000 times, using a local work-group size of 64.
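
For readers unfamiliar with the pattern, here is a minimal CPU-side sketch of a two-stage reduction in plain C. It is not the poster's kernel: the names `reduce_group` and `two_stage_sum`, the float element type, and the group count of 16 are assumptions made for illustration. Only the work-group size of 64 comes from the post.

```c
#include <stddef.h>

#define GROUP_SIZE 64   /* local work-group size, as in the post */
#define NUM_GROUPS 16   /* hypothetical number of work-groups */

/* Stage 1 for one work-group: every work-item walks the input with a
   stride equal to the total number of work-items, then the group combines
   its partial sums with a tree reduction in "local memory" (a plain array
   here). */
static double reduce_group(const float *in, size_t n, size_t group)
{
    const size_t total_items = (size_t)NUM_GROUPS * GROUP_SIZE;
    double local[GROUP_SIZE];

    for (size_t lid = 0; lid < GROUP_SIZE; lid++) {
        double acc = 0.0;
        for (size_t i = group * GROUP_SIZE + lid; i < n; i += total_items)
            acc += in[i];
        local[lid] = acc;
    }

    /* On a GPU each halving step would be followed by a barrier. */
    for (size_t offset = GROUP_SIZE / 2; offset > 0; offset /= 2)
        for (size_t lid = 0; lid < offset; lid++)
            local[lid] += local[lid + offset];

    return local[0];
}

/* Stage 2: reduce the per-group partial sums to a single value. */
double two_stage_sum(const float *in, size_t n)
{
    double partial[NUM_GROUPS];
    double total = 0.0;

    for (size_t g = 0; g < NUM_GROUPS; g++)
        partial[g] = reduce_group(in, n, g);
    for (size_t g = 0; g < NUM_GROUPS; g++)
        total += partial[g];
    return total;
}
```

Note that the `local` array is sized by `GROUP_SIZE`, so in this sketch the scratch storage and the tree-reduction loop have to change together whenever the group size changes.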

While the program is running I typically get:

~$ aticonfig --adapter=0 --od-getclocks

Adapter 0 - AMD Radeon R9 290 Series
                            Core (MHz)    Memory (MHz)
           Current Clocks :    600           400
             Current Peak :    600           400
  Configurable Peak Range : [300-800]     [400-700]
                 GPU load :    0%

~$ aticonfig --adapter=0 --od-gettemperature

Adapter 0 - AMD Radeon R9 290 Series
            Sensor 0: Temperature - 53.50 C

My question is: why doesn't the clock go up? The temperature doesn't rise either, and the GPU load stays at 0%...

The resulting performance is 8.3e+09 red/s, which according to the article is not very impressive.

Thanks, any help is welcome.

8 Replies
andreac
Journeyman III

Some additional information: the temperature and GPU load readings turned out to be a false problem.

Setting DISPLAY=:0.0 now I can read the correct temperature and load.

The question about the performance for the two stage reduction is still open.


@andreac Can you verify that OpenCL lists it as an active device with `clinfo`?  I have a Sapphire r290x that works with aticonfig but is not recognized by OpenCL


My card is listed by clinfo and works with OpenCL.

I followed the information found here: http://devgurus.amd.com/thread/167775

ravkum
Staff

Hi,

What happens when you set the vector size to 10000000? Do you get performance similar to that mentioned in the article?

What if you change the work-group size from 64 to 256? In that case you may need to adjust anything in the kernel that sizes or indexes local memory by the work-group size.

Regards,

Ravi


Changing the work-group size from 64 to 256, the results become wrong, as you predicted, and the performance in terms of red/s is exactly the same.

With a work-group size of 64 and a vector length of 10000000 I measure 8.32e+08 red/s (10 times slower).
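
A quick back-of-the-envelope check on these two figures, as a sketch. It assumes that "red/s" means elements reduced per second, so that the time per kernel launch is elements divided by rate; the thread does not define the unit, and the helper name `seconds_per_launch` is made up for illustration.

```c
/* Time per kernel launch, in seconds, implied by a reported rate.
   Assumption: rate = elements_per_launch / time_per_launch. */
double seconds_per_launch(double elements_per_launch, double red_per_sec)
{
    return elements_per_launch / red_per_sec;
}
```

Under that assumption, `seconds_per_launch(1e8, 8.3e9)` and `seconds_per_launch(1e7, 8.32e8)` both come out at roughly 12 ms. A launch time that barely changes when the input shrinks tenfold suggests the measurement is dominated by a cost that does not scale with the vector length.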


Thanks for the tests andreac. This is strange. Is it possible for you to send me your code?


Yes, of course. I'll send you the code in the next few days.

Thanks

  Andrea


Hi Andrea,

   It seems that the way you are measuring the time taken by kernel knl_reduce (using get_timestamp) is not very accurate. It includes the time spent queuing and launching the kernel as well as executing it. In your case it also counts host code time, for example the "if" condition and the printf inside the timed loop.

   A better method is to use the OpenCL profiling API, which reports the kernel's execution time on the device. I am posting the relevant code.

cl_event timing_event;
cl_ulong start_time, end_time;      /* device timestamps, in nanoseconds */
unsigned long elapsed1 = 0;
unsigned long total_elapsed1 = 0;   /* accumulated kernel time, in nanoseconds */

for (int i = 0; i < ntrips; i++) {
    CALL_CL_GUARDED(clEnqueueNDRangeKernel,
                    (queue, knl_Reduce,
                     /*dimensions*/ 1, NULL, gdim, ldim,
                     0, NULL, &timing_event));

    /* Wait for the kernel to finish before reading its timestamps. */
    CALL_CL_GUARDED(clFinish, (queue));

    CALL_CL_GUARDED(clGetEventProfilingInfo,
                    (timing_event, CL_PROFILING_COMMAND_START,
                     sizeof(cl_ulong), &start_time, NULL));
    CALL_CL_GUARDED(clGetEventProfilingInfo,
                    (timing_event, CL_PROFILING_COMMAND_END,
                     sizeof(cl_ulong), &end_time, NULL));

    elapsed1 = (unsigned long) (end_time - start_time);
    total_elapsed1 = total_elapsed1 + elapsed1;

    /* Release the event so that one is not leaked per iteration. */
    CALL_CL_GUARDED(clReleaseEvent, (timing_event));
}

You will also need to create the command queue with profiling enabled, i.e. pass CL_QUEUE_PROFILING_ENABLE in the properties argument of clCreateCommandQueue.

Using this and running your code on a Radeon HD 6570 with an array size of 100000 and 100 iterations, we get the following results:

1. Using your method: 1.36e+08 red/s

2. With native OpenCL profiling enabled: 4.61e+08 red/s
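
For reference, a small sketch of how accumulated profiling time could be turned into a red/s figure. OpenCL profiling timestamps are in nanoseconds; the function name `reductions_per_second` and the definition of red/s as total elements reduced divided by total device time are assumptions, since the thread does not spell out the formula.

```c
#include <stdint.h>

/* Elements reduced per second, given the device time accumulated over all
   launches (nanoseconds, as reported by clGetEventProfilingInfo).
   Assumption: red/s = elements_per_launch * launches / elapsed seconds. */
double reductions_per_second(uint64_t elements_per_launch,
                             uint64_t launches,
                             uint64_t total_elapsed_ns)
{
    if (total_elapsed_ns == 0)
        return 0.0;   /* no timing data; avoid dividing by zero */

    double seconds = (double)total_elapsed_ns * 1e-9;
    return (double)elements_per_launch * (double)launches / seconds;
}
```

Under this definition, 100000 elements and 100 iterations at 4.61e+08 red/s would correspond to roughly 21.7 ms of total device time.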

Please let us know what you get using OpenCL profiling. Meanwhile, we are looking at your code for further optimization opportunities.

Thanks,

AMD Support
