5 Replies Latest reply on Jul 4, 2011 5:37 AM by Steveyoungs

    Big slowdown when slightly changing the work size

    rbarreira

      I was trying out OpenCL with a small benchmark application I wrote. The kernel is a simple numerical calculation with unsigned ints, and clEnqueueNDRangeKernel is called with a 1-dimensional work load. So there's nothing strange going on in the benchmark, no cooperation between threads, each thread simply reads 3 values from input arrays (using "get_global_id (0)" as the index), calculates something from it, and stores the result in an output array (same index).

      Very strangely, the time taken to finish 10 iterations of this work varies widely when just changing the work size from 1,000,000 to 1,194,877 (this happens both with the CPU and the GPU).

      Here are the benchmark results for the GPU (HD 5750):

      work_size = 1000000, time taken = 4296 ms

      work_size = 1194877, time taken = 36612 ms

      work_size = 2000000, time taken = 7913 ms

      As you can see, there's a huge slowdown when using that particular number as the input size. A bigger (but rounder) number actually runs much faster than this odd size, which means it would be faster to pad the input arrays with garbage just to reach a rounder size, even though we want to calculate fewer values.

      The OpenCL SDK must be doing something strange here if it can't run non-round input sizes efficiently.

        • Big slowdown when slightly changing the work size
          rbarreira

          I just noticed 1194877 is a prime number... might this be the reason?

          edit - One more weird thing: when running on the CPU (a Phenom II X6, so 6 cores), only half the cores are being used, even when the work size is a multiple of 6.

          For reference, this is how I'm issuing the work:


                // Run kernel on all input data
                work_size[0] = N_INPUT_ITEMS;
                clSetKernelArg (kernel, 0, sizeof (device_x_in), &device_x_in);
                clSetKernelArg (kernel, 1, sizeof (device_n_in), &device_n_in);
                clSetKernelArg (kernel, 2, sizeof (device_ni_in), &device_ni_in);
                clSetKernelArg (kernel, 3, sizeof (device_buf_out), &device_buf_out);

                error_code = clEnqueueNDRangeKernel (queue, kernel, 1, NULL, work_size, NULL, 0, NULL, NULL);


            • Big slowdown when slightly changing the work size
              maximmoroz

              > I just noticed 1194877 is a prime number... might this be the reason?

              Most likely. By not specifying a local work size you instruct the runtime to determine it automatically. Try profiling this benchmark application in the AMD APP Profiler (Visual Studio integrated); you will see a "wavefront count" column there. When I first inspected the profiler results and saw the wavefront count values... well, I rewrote all my clEnqueueNDRangeKernel calls and modified the kernels a little so that I could specify the local work size (and, effectively, the wavefront count).

                • Big slowdown when slightly changing the work size
                  rbarreira

                  Thanks, I will look into that.

                  But I have to say it's disappointing that this happens with a supposedly "write once run everywhere" API, it makes me think that different hardware may cause performance to randomly go down unless the code is very careful to set work sizes optimally. Not specifying a local work size should tell the API to do whatever is optimal...

                  The fact that I apparently can't get OpenCL to use all cores on my CPU doesn't impress me either...

                    • Big slowdown when slightly changing the work size
                      maximmoroz


                      Originally posted by: rbarreira Thanks, I will look into that.


                      But I have to say it's disappointing that this happens with a supposedly "write once run everywhere" API, it makes me think that different hardware may cause performance to randomly go down unless the code is very careful to set work sizes optimally. Not specifying a local work size should tell the API to do whatever is optimal... The fact that I apparently can't get OpenCL to use all cores on my CPU doesn't impress me either...


                      Of course, an OpenCL-based application is portable, but not "performance-portable".

                      Besides, I am not sure that the problem of only half the cores being occupied by the kernel is related to the unspecified local work size.

                      • Big slowdown when slightly changing the work size
                        Steveyoungs


                        Originally posted by: rbarreira


                        The fact that I apparently can't get OpenCL to use all cores on my CPU doesn't impress me either...


                        This must be a problem local to you. When I run on the CPU, I see all cores utilised, even when the local work size is not an exact multiple of the number of cores.

                        Like maximmoroz, I always specify the local work size, rounding the global work size up to a multiple of it, and use a simple check in the kernel to prevent out-of-bounds accesses.