pratapk
Journeyman III

Global worksize

optimal Global worksize

We need to choose the local work-group size to be a multiple of the warp size for optimal performance.

Does that apply to the global work size? I've seen the following code in one of NVIDIA's OpenCL slides:

 

size_t localWorkSize = 256;   // or 64

// round up so the global size is a multiple of the local size
size_t numWorkGroups = (N + localWorkSize - 1) / localWorkSize;

size_t globalWorkSize = numWorkGroups * localWorkSize;
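For context, I assume the rounded-up size ends up in the launch call something like this (queue, kernel, and N here are placeholders, not part of the slides):

// Hedged sketch: 'queue', 'kernel', and 'N' are assumed to be set up elsewhere.
size_t localWorkSize  = 256;
size_t numWorkGroups  = (N + localWorkSize - 1) / localWorkSize;
size_t globalWorkSize = numWorkGroups * localWorkSize;

cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    1,                // one work dimension
                                    NULL,             // no global offset
                                    &globalWorkSize,  // rounded-up global size
                                    &localWorkSize,   // work-group size
                                    0, NULL, NULL);   // no event wait list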

 

But rounding up and multiplying makes the global work size larger than N.

 

1) Some work-items will have a global_id beyond the required range; how do we handle that?

2) It can waste OpenCL work-items; is there any advantage in doing it this way?

pratapk
Journeyman III

Please let me know if you want me to rephrase the question or if you need more information.

maximmoroz
Journeyman III

1) Pass the actual size as a kernel parameter and compare get_global_id(0) against it at the start of the kernel (see the sketch after this list).

2) It is actually the most efficient way to use the GPU. The number of wavefronts required to execute the kernel matters much more than the global work size.

3) Read the AMD OpenCL Programming Guide.
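For (1), a minimal kernel sketch of that guard (the kernel name and arguments are just an illustration, assuming a simple float buffer):

// Illustrative kernel: 'n' is the actual element count passed from the host.
__kernel void scale(__global float* data, const uint n)
{
    size_t gid = get_global_id(0);

    // Extra work-items in the rounded-up tail simply exit.
    if (gid >= n)
        return;

    data[gid] *= 2.0f;
}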


In answer to your specific questions:

1) Yes, it goes beyond the range of the data, but as the example probably shows, and as maximmoroz pointed out, you can pass the real size in and catch the extra work-items with a conditional.

2) This is why defining "thread" sensibly is important. You're not wasting wavefronts, and the wavefront is the real execution multiple of the machine (a wavefront is a thread if you think in CPU terms). You may waste a few lanes of the last one, but who cares? The machine will be running maybe 150 wavefronts concurrently, and thousands over a dispatch; if a quarter of one of those thousands of wavefronts is unused, you won't even notice the difference.

The correct answer to the general question of whether you want to round your launch up, though, is "maybe".

It depends on whether the if test at the beginning of the kernel costs more than padding your data. If you can lay your data out with ignorable junk at the end, up to the next wavefront boundary, then on any reasonably sized dataset the extra memory reads will be negligible; but if the kernel is short, the branch will be costly.

On the other hand, rearranging your data in that way might be difficult under some circumstances, and it might incur the overhead of performing a copy.

In general, for peak performance you should lay your data out to reflect the fact that you're running on a vector processor when you use the GPU, and that includes padding to avoid doing a range test in the kernel.
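To illustrate that padding approach, a rough host-side sketch (the buffer, context, and the zero fill value are assumptions for the example, not a prescribed recipe):

// Hypothetical setup: pad the allocation up to a work-group multiple so the
// kernel can skip the range test; 'context' and 'err' are assumed to exist.
size_t localWorkSize = 64;
size_t paddedN = ((N + localWorkSize - 1) / localWorkSize) * localWorkSize;

float* hostData = (float*)malloc(paddedN * sizeof(float));
/* ... fill hostData[0..N-1] with the real data ... */
for (size_t i = N; i < paddedN; ++i)
    hostData[i] = 0.0f;   // ignorable junk past the end of the real data

cl_mem buf = clCreateBuffer(context,
                            CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                            paddedN * sizeof(float), hostData, &err);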
