
jcpalmer
Adept I

Clues to runtime tuning of ATI's OpenCL

I have just completed the OpenCL portion of my application, using hardcoded work & work-group dimensions, in a 2-GPU Mac OS X environment. The program will be supported on OS X & Windows.  I want to build a calibration routine that takes into account the OS and the OpenCL information about each device, then, through timed test runs of each kernel, determines the best work & work-group size for each device.
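A minimal sketch of what I mean by a calibration loop, assuming a `timeRun` callback that wraps a timed kernel launch on one device (the class and method names here are made up for illustration):

```java
import java.util.function.IntToDoubleFunction;

class Calibrator {
    // Time a test run at each candidate work-group size and keep the
    // fastest. timeRun stands in for a real timed kernel launch.
    static int bestSize(int[] candidates, IntToDoubleFunction timeRun) {
        int best = candidates[0];
        double bestTime = Double.MAX_VALUE;
        for (int size : candidates) {
            double t = timeRun.applyAsDouble(size);
            if (t < bestTime) {
                bestTime = t;
                best = size;
            }
        }
        return best;
    }
}
```

The same loop would run once per device at startup, with the winning size cached per OS / device combination.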

I can find no guidance on how to determine sizes for ATI GPUs.  Is this because it is simply a matter of getting the number of processors?  Is I/O latency hiding important?  Are there any magic numbers like NVIDIA's warp size?  Will specifying too high a work-group size cause a crash?  Are the max compute units, max work-item sizes & max work-group sizes the same across the entire GPU product line?

I have not been able to actually try my program on an ATI GPU, due to my use of textures.  This has not been an issue till now.  Developing on more than one system at a time is too much work, but I would like to pencil something in for Windows / ATI while I am here.

7 Replies
genaganna
Journeyman III

Make sure that your work-group size is a multiple of the wavefront size. The wavefront size is 64 for the HD 4xxx and HD 5xxx series. It is similar to the warp in CUDA.
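In code, "a multiple of the wavefront" just means rounding the desired size up; `roundUpToWavefront` is a hypothetical helper name, not an SDK call:

```java
class WavefrontUtil {
    // Round a desired work-group size up to the next multiple of the
    // device's wavefront size, so no wavefront runs partially filled.
    static int roundUpToWavefront(int size, int wavefront) {
        return ((size + wavefront - 1) / wavefront) * wavefront;
    }
}
```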

 

Use the clGetDeviceInfo function to query the maximum limit on work-group size.

 

Look at any sample shipped with the ATI Stream SDK Beta 4.


Genaganna,
Actually the wavefront size is only 64 for the high-end cards (48xx, 58xx, 57xx); it is 32 for the mid-range cards and 16 for the low-end cards. You can query CAL for information on your specific card.

@Micah

So basically, there's no way to query the wavefront/warp size in the OpenCL spec?


I am starting to think that an optimum work-group size should be added to OpenCL as a device query.  Having multiple possible values makes this difficult.  My application is written in Java, dynamically loading the DLL, so querying CAL is not an option.

What I have coded up for now is a function that tries to reverse-engineer it.  I plan to test it first on Mac OS X NVIDIA GPUs.  Here is the code (it should be readable as pseudocode for those not familiar with Java).  The premise is that, testing from low to high, once the wavefront size for the device is reached, the execution time for a kernel should easily be less than the time at wavefront + 1.  Does this even look feasible (max will be set to 64 when called)?

private int reverseEngWorkGroupSizeMultiple(int max) {
    Kernel kernel = null;
    try {
        // Busy-loop test kernel. The accumulator is declared outside the
        // loop (a bare declaration is not a valid loop body in OpenCL C)
        // and updated each iteration so the compiler is less likely to
        // optimize the whole loop away.
        kernel = new Kernel(this,
            "__kernel void foo(){ float x = 10.0f; for(int i = 0; i < 1000000; i++) x *= 10.0f; }",
            "-cl-fast-relaxed-math", "foo");
        NativeLong[] optimumWorkSize  = new NativeLong[] { new NativeLong() };
        NativeLong[] horribleWorkSize = new NativeLong[] { new NativeLong() };
        float optimumTime, horribleTime;
        int ret = 8;
        while (ret <= max) {
            optimumWorkSize [0].setValue(ret);
            horribleWorkSize[0].setValue(ret + 1);
            optimumTime  = cmdQueue.executeKernel(kernel, optimumWorkSize,  optimumWorkSize);
            horribleTime = cmdQueue.executeKernel(kernel, horribleWorkSize, horribleWorkSize);
            if (optimumTime < horribleTime) break;
            ret *= 2;
        }
        return ret;
    } finally {
        if (kernel != null) kernel.releaseKernel();
    }
}
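For what it's worth, the doubling search above can be exercised without a GPU by abstracting the timing call behind an interface; `WavefrontProbe` and the simulated timer below are hypothetical stand-ins for `cmdQueue.executeKernel`, just to sanity-check the logic:

```java
import java.util.function.IntToDoubleFunction;

class WavefrontProbe {
    // timeAt.applyAsDouble(n) stands in for a timed kernel launch at
    // work-group size n. Starting at 8 and doubling, stop when size n
    // is clearly faster than n + 1 -- the extra work-item spills into
    // a second, mostly idle wavefront.
    static int probe(IntToDoubleFunction timeAt, int max) {
        int ret = 8;
        while (ret <= max) {
            double atMultiple = timeAt.applyAsDouble(ret);
            double pastMultiple = timeAt.applyAsDouble(ret + 1);
            if (atMultiple < pastMultiple) break;
            ret *= 2;
        }
        return ret;
    }
}
```

With a simulated timer proportional to the number of wavefronts needed, e.g. `n -> Math.ceil(n / 64.0)`, the search stops at 64, as hoped.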


I agree. It's hard to determine the optimum work-group size for a kernel.

Some kind of parameter indicating this should be added to the device's capabilities.

 

Also, it would be fantastic to know whether an OpenCL device is being used as the primary display, so we can avoid that device for computations, due to the 5 s watchdog, or to avoid GUI lag while it is performing the computations.


jcpalmer,

Just a note that if you're interested in accessing OpenCL from Java, you may want to take a look at the Aparapi tool at

http://developer.amd.com/aparapi

Aparapi allows you to write your parallel kernel code in Java.

 


This is correct.
The best you can do is query CL_KERNEL_WORK_GROUP_SIZE with the clGetKernelWorkGroupInfo API call; that tells you the largest work-group size you can execute, and that value will most likely be an integer multiple of the wavefront/warp size. In some cases the value will be equal to the wavefront/warp size, but that is usually because of resource constraints.
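Building on that, since CL_KERNEL_WORK_GROUP_SIZE caps the sizes worth trying and the wavefront size (16, 32, or 64) is itself a power of two, one practical approach is to benchmark only power-of-two candidates up to that cap; `candidates` below is a hypothetical helper, not part of any SDK:

```java
import java.util.ArrayList;
import java.util.List;

class CandidateSizes {
    // Enumerate power-of-two work-group sizes up to the
    // CL_KERNEL_WORK_GROUP_SIZE limit reported for a kernel; the
    // wavefront size (16, 32, or 64) always appears in this list.
    static List<Integer> candidates(int maxWorkGroupSize) {
        List<Integer> sizes = new ArrayList<>();
        for (int n = 1; n <= maxWorkGroupSize; n *= 2) {
            sizes.add(n);
        }
        return sizes;
    }
}
```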