Hello Everyone!
For at least the last week I've been trying to figure out what's going wrong in my program, and I'm close to giving up...
Here's my story:
I have written an OpenCL kernel that calculates the Voigt function. That task comes down to numerically performing a convolution of a Gaussian and a Lorentzian. The algorithm depends heavily on "Numerical Recipes", which is why I cannot post it here. Anyway, I suppose there's nothing wrong with it so far.
As C# is the programming language I'm most familiar with, I used Cloo (OpenCL C# bindings) to run the kernel. Calculating 5000 points of the Voigt function gives a speedup of about a factor of 50 when using an HD5850 instead of a Core i5 750. So far so good.
Then for some reason I had to switch to C++. Running the same kernel, things changed: CPU and GPU are now about equally slow. Code running on the CPU has almost the same speed as with Cloo, and code running on the GPU is always slightly slower (about 10%-20%). That I'm actually using the GPU is easily confirmed by looking at the processor usage.
I really cannot imagine what's causing this behaviour. My first naive idea was that it's the fault of the C++ bindings, but it is not: doing the same thing in plain C results in the same behaviour...
I really don't do anything fancy. Initialization and kernel launch are done just the way you see them in the tutorials, though most of those use a CPU device. My suspicion is that there's more to do than just changing CL_DEVICE_TYPE_CPU to CL_DEVICE_TYPE_GPU, is there? The funny thing is, the code runs and produces correct results; it's just far too slow...
Does anyone have an idea what's going wrong?
-> I just realized that I cannot access my source code right now, as it's still on my office PC. I'll post it as soon as I get hold of it again. <-
What is your local workgroup size? You should set it to at least 64.
thanks for the quick reply!
The local work size is not the problem. In this case I let the runtime decide which work size is best, but I also tried setting it manually to different values. No improvement...
I'm facing the same problem ( http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=131629&enterthread=y ).
Specification says:
If local_work_size is specified, the values specified in global_work_size[0], ... global_work_size[work_dim - 1] must be evenly divisible by the corresponding values specified in local_work_size[0], ... local_work_size[work_dim – 1].
Not every problem is such that it can be broken into parts of 64. I can't see why a smaller work group size should create such a performance penalty.
Because every workgroup runs on one SIMD core. Each SIMD core has 16 five-wide VLIW units, and on those 16 units a wavefront of 64 threads is executed over 4 cycles. If you specify fewer than 64 threads, for example 16, you cannot get 100% utilization of the hardware, because those 16 threads fill only part of one wavefront and the remaining slots sit idle.
I made some tests:
I used an image (4096x4096) and convolved it with a Gaussian filter of size 8x8. The kernel is fully unrolled and the filter values are constants in the code. All calculations are in single precision. I only measured whole seconds and ran the tests in normal desktop mode, so accuracy is not perfect. One loop looks like this:
queue.enqueueNDRangeKernel(...)
queue.finish()
*edit: first and second column are global work size and local work size*
**Intel Core2 Duo E8400 @ 3GHz**

| Global work size | Local work size | Loops | Secs | Secs per loop | Kernel attributes |
|---|---|---|---|---|---|
| (4096 - 32)x(4096 - 32) | 32x32 | 100 | 72 | 0.72 | `__attribute__((reqd_work_group_size(32, 32, 1)))` |
| (4096 - 32)x(4096 - 32) | 0x0 | 100 | 72 | 0.72 | `__attribute__((reqd_work_group_size(32, 32, 1)))` |
| (4096 - 32)x(4096 - 32) | 0x0 | 100 | 72 | 0.72 | |
| (4096 - 32)x(4096 - 32) | 1x1 | 100 | 1228 | 12.28 | |

**HD4850**

| Global work size | Local work size | Loops | Secs | Secs per loop | Kernel attributes |
|---|---|---|---|---|---|
| (4096 - 32)x(4096 - 32) | 0x0 | 1,000,000 | 45 | 0.000045 | `__attribute__((reqd_work_group_size(16, 16, 1)))` |
| (4096 - 32)x(4096 - 32) | 16x16 | 100 | 7 | 0.07 | `__attribute__((reqd_work_group_size(16, 16, 1)))` |
| (4096 - 32)x(4096 - 32) | 16x16 | 100 | 7 | 0.07 | |
| (4096 - 32)x(4096 - 32) | 0x0 | 100 | 5 | 0.05 | |
| (4096 - 32)x(4096 - 32) | 1x1 | 100 | 187 | 1.87 | |
The first GPU result is somewhat strange, but I ran it several times. Can this be?