
    R290 linux - Two stage reduction - Low performance

    andreac

      Hi,

      I'm using an R9 290 under Ubuntu 12.04 with OpenCL (Catalyst 13.12).

      This is what I find strange:

      I wrote a program that does a two-stage reduction (as described here: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/).

      I run it on a vector of 100000000 elements and loop the kernel enqueue 10000 times, using a local work-group size of 64.
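
      For reference, the first stage of my reduction follows the scheme from the article, roughly like this (the kernel and variable names below are an illustrative sketch, not my exact code):

      [code]
      // Stage 1: each work-item accumulates a strided slice of the input,
      // then the work-group reduces those accumulators in local memory and
      // writes one partial sum per group. Stage 2 reduces the partial sums.
      __kernel void reduce_stage1(__global const float *in,
                                  __global float *partial,  // one value per work-group
                                  __local float *scratch,
                                  unsigned int n)
      {
          size_t gid = get_global_id(0);
          size_t gsz = get_global_size(0);
          float acc = 0.0f;

          // Serial, strided accumulation over the whole vector.
          for (size_t i = gid; i < n; i += gsz)
              acc += in[i];

          // Parallel tree reduction within the work-group
          // (assumes a power-of-two local size, e.g. 64).
          size_t lid = get_local_id(0);
          scratch[lid] = acc;
          barrier(CLK_LOCAL_MEM_FENCE);
          for (size_t offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
              if (lid < offset)
                  scratch[lid] += scratch[lid + offset];
              barrier(CLK_LOCAL_MEM_FENCE);
          }

          if (lid == 0)
              partial[get_group_id(0)] = scratch[0];
      }
      [/code]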

      While the program is running I typically get:

       

      ~$ aticonfig --adapter=0 --od-getclocks

       

      Adapter 0 - AMD Radeon R9 290 Series
                                  Core (MHz)    Memory (MHz)
                 Current Clocks :    600           400
                   Current Peak :    600           400
        Configurable Peak Range : [300-800]     [400-700]
                       GPU load :    0%

       

      ~$ aticonfig --adapter=0 --od-gettemperature

       

      Adapter 0 - AMD Radeon R9 290 Series
                  Sensor 0: Temperature - 53.50 C

       

      The question is: why don't the clocks go up? The temperature doesn't rise either, and the GPU load is 0%...

      The resulting performance is 8.3e+09 red/s, which, according to the article, is not very exciting.

       

      Thanks, any help is welcome.

        • Re: R290 linux - Two stage reduction - Low performance
          andreac

          Some additional information: the temperature and GPU load readings were a false problem.

          After setting DISPLAY=:0.0 I can read the correct temperature and load.
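
          For example, running the query against the local X display:

          ~$ DISPLAY=:0.0 aticonfig --adapter=0 --od-gettemperature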

          The question about the performance for the two stage reduction is still open.

          • Re: R290 linux - Two stage reduction - Low performance
            ravkum

            Hi,

             

            What happens when you use a vector size of 10000000? Do you get performance similar to what is mentioned in the article?

            What if you change the work-group size from 64 to 256? You may need to take care of any local memory usage in that case.

             

            Regards,

            Ravi

              • Re: R290 linux - Two stage reduction - Low performance
                andreac

                Changing the work-group size from 64 to 256, the results become wrong as you predicted, and the performance in terms of red/s is exactly the same.

                With a work-group size of 64 and a vector length of 10000000 I measure 8.32e+08 red/s (10 times slower).

                  • Re: R290 linux - Two stage reduction - Low performance
                    ravkum

                    Thanks for the tests, andreac. This is strange. Is it possible for you to send me your code?

                      • Re: R290 linux - Two stage reduction - Low performance
                        andreac

                        Yes, of course. I'll send you the code in the next few days.

                        Thanks

                          Andrea

                          • Re: R290 linux - Two stage reduction - Low performance
                            amd_support

                            Hi Andrea,

                               It seems that the way you are measuring the time taken by the knl_reduce kernel (using get_timestamp) is not very accurate: it includes the time spent on kernel queuing, launching and execution. In your case it also counts host-code time, for example the "if" condition below and the printf.

                               A better method is to use the native OpenCL event-profiling API to measure kernel execution time on the device. I am posting the relevant code.

                             

                            [code]
                            cl_event timing_event;
                            unsigned long total_elapsed1 = 0;
                            cl_ulong start_time;
                            cl_ulong end_time;
                            unsigned long elapsed1 = 0;

                            for (int i = 0; i < ntrips; i++) {
                                CALL_CL_GUARDED(clEnqueueNDRangeKernel,
                                                (queue, knl_Reduce,
                                                 /*dimensions*/ 1, NULL, gdim, ldim,
                                                 0, NULL, &timing_event));

                                /* Wait for the kernel to finish so the event
                                   carries valid timestamps. */
                                CALL_CL_GUARDED(clFinish, (queue));

                                CALL_CL_GUARDED(clGetEventProfilingInfo,
                                                (timing_event,
                                                 CL_PROFILING_COMMAND_START,
                                                 sizeof(cl_ulong),
                                                 &start_time, NULL));
                                CALL_CL_GUARDED(clGetEventProfilingInfo,
                                                (timing_event,
                                                 CL_PROFILING_COMMAND_END,
                                                 sizeof(cl_ulong),
                                                 &end_time, NULL));

                                /* Device timestamps are in nanoseconds. */
                                elapsed1 = (unsigned long) (end_time - start_time);
                                total_elapsed1 = total_elapsed1 + elapsed1;
                            }
                            [/code]

                             

                            Also, you will need to create the command queue with profiling enabled (the CL_QUEUE_PROFILING_ENABLE property).
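
                            A minimal sketch, assuming the context and device have already been created:

                            [code]
                            /* Profiling must be enabled when the queue is created,
                               otherwise clGetEventProfilingInfo will not return
                               valid timestamps. */
                            cl_int status;
                            cl_command_queue queue = clCreateCommandQueue(context, device,
                                                                          CL_QUEUE_PROFILING_ENABLE,
                                                                          &status);
                            [/code]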

                             

                            Using this and running your code on a Radeon HD 6570 with an array size of 100000 and 100 iterations, the results we are getting are:

                            1. Using your method: 1.36e+08 red/s

                            2. By enabling native OpenCL profiling: 4.61e+08 red/s
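
                            For reference, this is how we assume the red/s figure is derived from the profiled time (the vector-length variable name "n" below is an assumption about your code):

                            [code]
                            /* Event timestamps are in nanoseconds, so convert the
                               accumulated time to seconds and divide the total number
                               of elements reduced across all trips by it. */
                            double seconds   = (double) total_elapsed1 * 1e-9;
                            double red_per_s = ((double) n * (double) ntrips) / seconds;
                            printf("Throughput: %.3g red/s\n", red_per_s);
                            [/code]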

                             

                            Please let us know what you are getting using OpenCL profiling. Meanwhile, we are looking at your code for further optimization opportunities.

                             

                            Thanks,

                            AMD Support