7 Replies Latest reply on Oct 13, 2010 5:59 PM by tdeneau

    Clues to runtime tuning of ATI's OpenCL

    jcpalmer

      I have just completed the OpenCL portion of my application, using hardcoded work and work-group dimensions, on a two-GPU Mac OS X system. The program will be supported on OS X and Windows. I want to build a calibration routine that takes into account the OS and the OpenCL information about each device, then uses timed test runs of each kernel to determine the best work and work-group size for each device.

      I can find no guidance on how to determine sizes for ATI GPUs. Is this because it is simply a matter of getting the number of processors? Is I/O latency hiding important? Are there any magic numbers like NVIDIA's warp size? Will specifying too high a work-group size cause a crash? Are the max compute units, max work-item sizes and max work-group sizes the same across the entire GPU product line?

      I have not been able to actually try my program on an ATI GPU due to my use of textures. This has not been an issue till now. Developing on more than one system at a time is too much work. I would like to pencil something in for Windows / ATI while I am here, though.

        • Clues to runtime tuning of ATI's OpenCL
          genaganna

          Make sure that your work-group size is a multiple of the wavefront size. The wavefront size is 64 for the HD 4xxx and HD 5xxx series. It is similar to a warp in CUDA.

           

          Use clGetDeviceInfo (CL_DEVICE_MAX_WORK_GROUP_SIZE) to query the maximum work-group size.

           

          Look at any sample shipped with ATI STREAM SDK BETA4
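A minimal sketch of the padding arithmetic that usually goes with this advice. In OpenCL 1.x, when a local size is passed to clEnqueueNDRangeKernel, the global work size has to be evenly divisible by it, so the global size is typically rounded up and the kernel guards against the extra work-items. The class and method names here are hypothetical, not part of any OpenCL binding.

```java
// Hypothetical helper; not part of any OpenCL binding.
final class WorkSizes {
    private WorkSizes() {}

    /** Rounds globalSize up to the next multiple of groupSize (e.g. a multiple of the wavefront size). */
    static long roundUpToMultiple(long globalSize, long groupSize) {
        if (globalSize < 0 || groupSize <= 0) {
            throw new IllegalArgumentException("globalSize must be >= 0 and groupSize must be > 0");
        }
        long remainder = globalSize % groupSize;
        // Already aligned: keep as is. Otherwise pad up to the next full group.
        return remainder == 0 ? globalSize : globalSize + (groupSize - remainder);
    }
}
```

For example, 1000 work-items with a work-group size of 64 would be launched as 1024, with the kernel returning early when get_global_id(0) is 1000 or more.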

          • Clues to runtime tuning of ATI's OpenCL
            MicahVillmow
            Genaganna,
            Actually, the wavefront size is 64 only for the high-end cards (48XX, 58XX, 57XX); it is 32 for the mid-range cards and 16 for the low-end cards. You can query CAL for the value on your specific card.
              • Clues to runtime tuning of ATI's OpenCL
                emuller

                @Micah

                So basically, there's no way to query the wavefront/warp size in the OpenCL spec?

                 

                • Clues to runtime tuning of ATI's OpenCL
                  jcpalmer

                  I am starting to think that an optimum work-group size should be added to OpenCL as a device query. Having multiple possible values makes this difficult. My application is written in Java and loads the DLL dynamically, so querying CAL is not an option.

                  What I have coded up for now is a function which tries to reverse engineer it. I plan to test this out first on Mac OS X NVIDIA GPUs. Here is the code (it should be readable as pseudocode for those not familiar with Java). The premise is that, testing from low to high, once the wavefront size for the device is reached, the execution time for a kernel at that size should be clearly less than the time at wavefront + 1. Does this even look feasible (max will be set to 64 when called)?

                  private int reverseEngWorkGroupSizeMultiple(int max){
                      Kernel kernel = null;
                      try{
                          // The loop body is wrapped in braces: a bare declaration is not a valid loop body in OpenCL C.
                          kernel = new Kernel(this, "__kernel void foo(){for(int i = 0; i < 1000000; i++){ float x = 10.0f * 10.0f; } }", "-cl-fast-relaxed-math", "foo");
                          NativeLong[] optimumWorkSize  = new NativeLong[] { new NativeLong() };
                          NativeLong[] horribleWorkSize = new NativeLong[] { new NativeLong() };
                          float optimumTime, horribleTime;

                          int ret = 8;
                          while(ret <= max){
                              // Time one work group of size ret, then one of size ret + 1.
                              optimumWorkSize [0].setValue(ret);
                              horribleWorkSize[0].setValue(ret + 1);
                              optimumTime  = cmdQueue.executeKernel(kernel, optimumWorkSize, optimumWorkSize);
                              horribleTime = cmdQueue.executeKernel(kernel, horribleWorkSize, horribleWorkSize);

                              if (optimumTime < horribleTime) break;
                              ret *= 2;
                          }
                          return ret;
                      }finally{
                          if (kernel != null) kernel.releaseKernel();
                      }
                  }
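The probing idea above can be tried out without a GPU by separating the search from the timing. The following is only a sketch under stated assumptions: the timer is passed in as a function (hypothetical, standing in for something like the poster's cmdQueue.executeKernel), each size is timed several times and the fastest run is kept, and a size only counts as a hit when size + 1 is slower by a chosen ratio rather than by any margin at all, since a strict "less than" comparison may trip on timing noise. All names are illustrative.

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical sketch; the timer stands in for a real timed kernel launch.
final class WavefrontProbe {
    static final int NOT_FOUND = -1;

    private WavefrontProbe() {}

    /**
     * Doubles the size from start up to max and returns the first size where
     * timing (size + 1) is slower than timing (size) by more than jumpRatio.
     * Returns NOT_FOUND when no such jump is seen.
     */
    static int probe(int start, int max, int repeats, double jumpRatio, IntToDoubleFunction timer) {
        if (start <= 0 || repeats <= 0 || jumpRatio <= 1.0 || max == Integer.MAX_VALUE) {
            throw new IllegalArgumentException("invalid probe parameters");
        }
        // long loop variable so that doubling cannot overflow int
        for (long size = start; size <= max; size *= 2) {
            double atSize = fastest(timer, (int) size, repeats);
            double pastSize = fastest(timer, (int) size + 1, repeats);
            if (pastSize > atSize * jumpRatio) {
                return (int) size;
            }
        }
        return NOT_FOUND;
    }

    /** Keeps the fastest of several runs to reduce the effect of timing noise. */
    private static double fastest(IntToDoubleFunction timer, int size, int repeats) {
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < repeats; i++) {
            best = Math.min(best, timer.applyAsDouble(size));
        }
        return best;
    }
}
```

With a simulated timer such as n -> Math.ceil(n / 64.0), which models one time unit per started wavefront of 64, the probe stops at 64. Unlike the function above, it reports NOT_FOUND instead of returning a value larger than max when no jump is seen.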

                • Clues to runtime tuning of ATI's OpenCL
                  MicahVillmow
                  This is correct.
                  The best you can do is query CL_KERNEL_WORK_GROUP_SIZE with the clGetKernelWorkGroupInfo API call. That tells you the largest work-group size you can execute for that kernel, and that value will most likely be an integer multiple of the wavefront/warp size. In some cases the value will equal the wavefront/warp size, but that is usually because of resource constraints.
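One way a calibration routine might use this value is to build its list of sizes to time from it. The sketch below assumes that the per-kernel maximum has already been read through whatever binding is in use (the CL_KERNEL_WORK_GROUP_SIZE query) and that the wavefront size is only a guess such as 64, 32 or 16 from the posts above. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; kernelMax is assumed to come from CL_KERNEL_WORK_GROUP_SIZE.
final class GroupSizeCandidates {
    private GroupSizeCandidates() {}

    /** Largest multiple of wavefrontGuess that does not exceed kernelMax, or kernelMax itself if it is smaller. */
    static int largestFit(int kernelMax, int wavefrontGuess) {
        check(kernelMax, wavefrontGuess);
        if (kernelMax < wavefrontGuess) {
            return kernelMax;
        }
        return (kernelMax / wavefrontGuess) * wavefrontGuess;
    }

    /** Every multiple of wavefrontGuess up to kernelMax: the sizes worth timing in a calibration run. */
    static List<Integer> candidates(int kernelMax, int wavefrontGuess) {
        check(kernelMax, wavefrontGuess);
        List<Integer> sizes = new ArrayList<>();
        // long loop variable so that the addition cannot overflow int
        for (long s = wavefrontGuess; s <= kernelMax; s += wavefrontGuess) {
            sizes.add((int) s);
        }
        if (sizes.isEmpty()) {
            sizes.add(kernelMax); // the kernel limit is below one wavefront
        }
        return sizes;
    }

    private static void check(int kernelMax, int wavefrontGuess) {
        if (kernelMax <= 0 || wavefrontGuess <= 0) {
            throw new IllegalArgumentException("both sizes must be positive");
        }
    }
}
```

For a kernel limit of 256 and a guessed wavefront of 64, the candidates are 64, 128, 192 and 256; each could then be timed per device and the fastest kept.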