I've written a very simple OpenCL kernel that fills pixels based on the work-item ID.
I get terribly slow performance from this kernel with ATI Stream SDK 2.0 beta on Windows Vista 64.
It takes about 8 seconds to execute, which is unbelievable to me. On the other hand, Snow Leopard executes the same kernel in 0.0001 seconds.
Does anyone know why it is so slow on Windows?
More details are available at the following site.
http://lucille.atso-net.jp/blog/?p=907
__kernel void main(__global uint *out,
                   uint col)  // not used.
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    out[x + y * get_global_size(0)] = (uint)(x | (y << 8) | (255 << 16) | (255 << 24));
}
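For checking the kernel's output on the host, a plain C reference of the same pixel packing can be handy. This is only a sketch: it assumes the shift applied to y is 8 bits, and the names reference_pixel and reference_fill are hypothetical helpers, not part of AO Bench or the SDK.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side reference for one pixel. Mirrors the kernel expression,
   assuming the y shift is 8 bits (hypothetical helper). */
uint32_t reference_pixel(uint32_t x, uint32_t y)
{
    return x | (y << 8) | (255u << 16) | (255u << 24);
}

/* Fill a width*height buffer in row-major order, using the same
   x + y * width indexing the kernel uses with get_global_size(0). */
void reference_fill(uint32_t *out, size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            out[x + y * width] = reference_pixel((uint32_t)x, (uint32_t)y);
        }
    }
}
```

Comparing the buffer read back from the device against reference_fill for a 256x256 image would confirm that both platforms compute the same result, so that only the timing differs.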
Which hardware are you running the Snow Leopard implementation on and what is the global work size you are using?
I'll venture a guess (slightly obvious): as stated in the release notes, the ATI Stream SDK 2.0 beta can only use the CPU for computation in the current release. Snow Leopard, on the other hand, has the drivers and kernel hooks built directly into the OS. So on Windows only your CPU is doing the math, whereas Snow Leopard can use a capable GPU.
Hint: look at your screenshot and the CL_DEVICE_NAME. It shows that the CPU is being used, not your GPU.
Hope you have fun practicing on Snow Leopard until a Microsoft release has GPU support.
Both are running on the CPU (CL_DEVICE_TYPE_CPU), and the program uses the following work sizes:
global work size = (256, 256)
local work size = (1, 1)
Are you using the same source code (host and kernel code) on both platforms? Could you post the host-side code?
I am getting around 140 fps in the NBody sample running on the CPU (using the ATI Stream SDK 2.0 sample on a Phenom quad-core). It shouldn't take 8 secs to execute a simple kernel like yours. Can you post the host and kernel code?
The host code is the same as in OpenCL AO Bench.
http://kioku.sys-k.net/archives/2009/08/opencl_ao_bench.html
I am using VS2009 and I've found that running the OpenCL app through [Debug] -> [Start Debugging] causes a terrible slowdown in my case (8 secs, even if the app was built with Release settings). Running it through [Debug] -> [Start without debugging] gives normal performance (0.05 secs).
Hope this helps when you develop OpenCL apps with VS2009.
This particular issue with AO Bench is a known problem. It is actually due to a number of small things that add up; each has been addressed, and the fixes will appear in an upcoming refresh.
One thing to note: a local work size of 1,1 may not always be the best choice on our implementation, and it might be worth trying different values, e.g. 8x8 or 16x16.
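To make the local work size suggestion concrete, here is a rough sketch in plain C of how many work-groups each choice produces for the 256x256 global size mentioned earlier in the thread. The helper names groups_per_dim and total_groups_2d are hypothetical; the only OpenCL rule assumed is that in OpenCL 1.0 the global size must be evenly divisible by the local size in each dimension.

```c
#include <stddef.h>

/* Work-groups along one dimension. Returns 0 when the local size is
   zero or does not evenly divide the global size, which OpenCL 1.0
   rejects as an invalid work-group size (hypothetical helper). */
size_t groups_per_dim(size_t global, size_t local)
{
    if (local == 0 || global % local != 0) {
        return 0;
    }
    return global / local;
}

/* Total work-groups for a 2D NDRange (hypothetical helper). */
size_t total_groups_2d(size_t gx, size_t gy, size_t lx, size_t ly)
{
    return groups_per_dim(gx, lx) * groups_per_dim(gy, ly);
}
```

For a 256x256 global size, a 1x1 local size gives 65536 work-groups, 8x8 gives 1024, and 16x16 gives 256. Whether fewer, larger groups actually run faster depends on the implementation and device, so it is worth timing each setting rather than assuming.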