Hello Everyone!
For at least the last week I've been trying to figure out what's going wrong in my program, and I'm close to giving up...
Here's my story:
I have written an OpenCL kernel that calculates the Voigt function. That task comes down to numerically performing a convolution of a Gaussian and a Lorentzian. The algorithm depends heavily on "Numerical Recipes", which is why I cannot post it here. Anyway, I suppose there's nothing wrong with it so far.
As C# is the programming language I'm most familiar with, I used Cloo (OpenCL C# bindings) to run the kernel. Calculating 5000 points of the Voigt function gives a speedup of about a factor of 50 when using an HD5850 instead of a Core i5 750. So far so good.
Then for some reason I had to switch to C++. Running the same kernel, things changed: CPU and GPU are now about equally slow. Code running on the CPU has almost the same speed as with Cloo, and code running on the GPU is always slightly slower (about 10%-20%). That I'm actually using the GPU is easily confirmed by looking at the processor usage.
I really cannot imagine what's causing this behaviour. My first naive idea was that it's the fault of the C++ bindings, but it is not: doing the same thing in plain C results in the same behaviour...
I really don't do anything fancy. Initialization and kernel launch are done just the way you see them in the tutorials, though most of those use a CPU device. My suspicion is that there's more to do than just changing CL_DEVICE_TYPE_CPU to CL_DEVICE_TYPE_GPU, is there? The funny thing is, the code runs and produces correct results; it's just far too slow...
Does anyone have an idea what's going wrong?
-> I just realized that I cannot access my source code right now, as it's still on my office PC. I'll post it as soon as I get hold of it again. <-
What is your local workgroup size? You should set it to at least 64.
thanks for the quick reply!
The local work size is not the problem. In this case I let the runtime decide which work size is best, but I also tried setting it manually to different values. No improvement...
I'm facing the same problem ( http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=131629&enterthread=y ).
Specification says:
If local_work_size is specified, the values specified in global_work_size[0], ... global_work_size[work_dim - 1] must be evenly divisible by the corresponding values specified in local_work_size[0], ... local_work_size[work_dim – 1].
Not every problem is such that it can be broken into parts of 64. I can't see why a smaller work group size should create such a performance penalty.
Because every workgroup runs on one SIMD core. Each SIMD core has 16 five-wide VLIW units, and on those 16 units a wavefront of 64 threads is executed over 4 cycles. If you specify fewer than 64 threads, for example 16, you cannot get 100% utilization of the hardware, because those 16 threads fill only part of one wavefront and the remaining slots sit idle.
I made some tests:
I used an image (4096x4096) and convolved it with a Gaussian filter of size 8x8. The kernel is fully unrolled and the filter values are constants in the code. All calculations are in single precision. I only measured whole seconds and ran the tests in normal desktop mode, so accuracy is not perfect. One loop looks like this:
queue.enqueueNDRangeKernel(...)
queue.finish()
*edit: first and second column are global work size and local work size*
**Intel Core2 Duo E8400 @ 3GHz**

| Global work size | Local work size | Loops | Secs | Secs per loop | Kernel attributes |
|---|---|---|---|---|---|
| (4096 - 32)x(4096 - 32) | 32x32 | 100 | 72 | 0.72 | `__attribute__((reqd_work_group_size(32, 32, 1)))` |
| (4096 - 32)x(4096 - 32) | 0x0 | 100 | 72 | 0.72 | `__attribute__((reqd_work_group_size(32, 32, 1)))` |
| (4096 - 32)x(4096 - 32) | 0x0 | 100 | 72 | 0.72 | |
| (4096 - 32)x(4096 - 32) | 1x1 | 100 | 1228 | 12.28 | |

**HD4850**

| Global work size | Local work size | Loops | Secs | Secs per loop | Kernel attributes |
|---|---|---|---|---|---|
| (4096 - 32)x(4096 - 32) | 0x0 | 1,000,000 | 45 | 0.000045 | `__attribute__((reqd_work_group_size(16, 16, 1)))` |
| (4096 - 32)x(4096 - 32) | 16x16 | 100 | 7 | 0.07 | `__attribute__((reqd_work_group_size(16, 16, 1)))` |
| (4096 - 32)x(4096 - 32) | 16x16 | 100 | 7 | 0.07 | |
| (4096 - 32)x(4096 - 32) | 0x0 | 100 | 5 | 0.05 | |
| (4096 - 32)x(4096 - 32) | 1x1 | 100 | 187 | 1.87 | |
The first GPU result is somewhat strange, but I ran it several times. Can this be?