
timchist
Elite

How to query wavefront size from kernel?

How can I query the wavefront size from a kernel? (Is there an analog of the warpSize built-in variable in CUDA?)


9 Replies
nou
Exemplar

Pass the preferred work-group size value into the kernel as a parameter.
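
A minimal host-side sketch of that approach, assuming the kernel and device have already been created; the helper name and the argument index parameter are placeholders, and error checking is omitted:

#include <CL/cl.h>

/* Query the kernel's preferred work-group size multiple on the host
 * and hand it to the kernel as an ordinary uint argument. */
static cl_uint pass_preferred_multiple(cl_kernel kernel,
                                       cl_device_id device,
                                       cl_uint arg_index)
{
    size_t preferred = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred), &preferred, NULL);

    cl_uint value = (cl_uint)preferred;
    clSetKernelArg(kernel, arg_index, sizeof(value), &value);
    return value;
}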

roger512
Adept II

Well, I haven't heard of a way to do that. As far as I know, the wavefront size is 64 for AMD GPUs and 32 for NVIDIA GPUs.

So you need to look at CL_DEVICE_VENDOR with clGetDeviceInfo, infer the wavefront size from the vendor, and then pass it to the kernel as a parameter or via a define.
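
A sketch of that vendor-sniffing approach, baking the guessed value into the kernel as a build-time define; the helper names and the 64/32 defaults merely restate the assumption above and are not guaranteed for every device:

#include <CL/cl.h>
#include <stdio.h>
#include <string.h>

/* Guess the wavefront size from the device vendor string.
 * The 64/32 values are assumptions, not queried facts. */
static unsigned guess_wave_size(cl_device_id device)
{
    char vendor[256] = "";
    clGetDeviceInfo(device, CL_DEVICE_VENDOR, sizeof(vendor), vendor, NULL);
    if (strstr(vendor, "Advanced Micro Devices") || strstr(vendor, "AMD"))
        return 64;
    if (strstr(vendor, "NVIDIA"))
        return 32;
    return 1;   /* unknown vendor: make no lockstep assumption */
}

/* Bake the guessed value into the program as a preprocessor define. */
static cl_int build_with_wave_define(cl_program program, cl_device_id device)
{
    char options[64];
    snprintf(options, sizeof(options), "-D WAVE_SIZE=%u", guess_wave_size(device));
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}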

AMD's new HSA specification has that kind of feature, but it is only a specification for now.


Thanks nou and roger for your inputs

gbilotta
Adept III

You cannot query the wavefront width or warp size from a kernel, but you can query it from the host and pass it to the device as a parameter (or in a constant-memory struct or whatever). The host-side query can be done using vendor-specific extensions:

(1) for devices that support cl_nv_device_attribute_query, you can call clGetDeviceInfo with the CL_DEVICE_WARP_SIZE_NV flag. This will return 32 on all current NVIDIA devices, but beware that NVIDIA has started warning CUDA developers that the warp size may change in the future, and this will obviously affect OpenCL users as well (assuming the flag returns the correct value, which, considering how little NVIDIA seems to care about supporting OpenCL, is not guaranteed).

(2) for devices that support cl_amd_device_attribute_query, there is the equivalent CL_DEVICE_WAVEFRONT_WIDTH_AMD flag for clGetDeviceInfo. While most AMD devices have a wavefront width of 64, some older, lower-end devices actually have a wavefront width of 32, so querying this property (when possible) is better than just assuming AMD => 64. (Also note that this flag is only present on AMD _GPU_ devices; you'll get an invalid-value error when querying it on CPU devices supported by the AMD platform.)
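
A host-side sketch of these two queries, guarded by the extension string. The token values are the ones published in the Khronos cl_ext.h header and are repeated here only in case your headers predate the extensions; the helper name is made up and error handling is minimal:

#include <CL/cl.h>
#include <stdlib.h>
#include <string.h>

#ifndef CL_DEVICE_WARP_SIZE_NV
#define CL_DEVICE_WARP_SIZE_NV        0x4003  /* cl_nv_device_attribute_query */
#endif
#ifndef CL_DEVICE_WAVEFRONT_WIDTH_AMD
#define CL_DEVICE_WAVEFRONT_WIDTH_AMD 0x4043  /* cl_amd_device_attribute_query */
#endif

/* Return the warp/wavefront width reported by a vendor extension,
 * or 0 if neither extension is available or the query fails
 * (as it does on CPU devices in the AMD platform). */
static cl_uint query_wave_width(cl_device_id device)
{
    size_t ext_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, NULL, &ext_size);
    if (ext_size == 0)
        return 0;

    char *ext = malloc(ext_size);
    if (!ext)
        return 0;
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, ext_size, ext, NULL);

    cl_uint width = 0;
    cl_int err = CL_SUCCESS;
    if (strstr(ext, "cl_nv_device_attribute_query"))
        err = clGetDeviceInfo(device, CL_DEVICE_WARP_SIZE_NV,
                              sizeof(width), &width, NULL);
    else if (strstr(ext, "cl_amd_device_attribute_query"))
        err = clGetDeviceInfo(device, CL_DEVICE_WAVEFRONT_WIDTH_AMD,
                              sizeof(width), &width, NULL);
    free(ext);
    return (err == CL_SUCCESS) ? width : 0;
}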

Ideally, there should be a common extension, or even better a core property, to query this kind of information (since it could be useful for vendors other than NVIDIA and AMD as well). You could try opening an issue in the Khronos Bugzilla and/or proposing the change on the Khronos OpenCL forum.

EDIT: actually, the value can be obtained by querying the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE property of a kernel using clGetKernelWorkGroupInfo(). Although this is a kernel-specific property, on both AMD and NVIDIA GPUs it currently returns the warp size/wavefront width.


Yes, the MULTIPLE query is the right thing to do - as I understand.

-

Bruhaspati


It will work, but it isn't standard that it has those semantics. There's no reason why a compiler couldn't (and VERY good reasons why it should) pack multiple work items into a single SIMD lane so that, in effect, we end up with 128 work-items in a wavefront. At that point the kernel should report a CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE of 128, not 64. At this time the toolchain for the GPU may not do this, but the fact remains that it could and the specification does not preclude this behaviour.

The device-specific flags seem the only safe way, and even then because the toolchain makes only very weak guarantees you have to be careful what assumptions you make about the toolchain behaviour. For example, dropping barriers within wave-synchronous code is obviously the right way to program vector hardware, but it is not necessarily the right way to program OpenCL because the compiler makes no guarantee that it will respect any communication you are doing.



LeeHowes wrote:



It will work, but it isn't standard that it has those semantics. There's no reason why a compiler couldn't (and VERY good reasons why it should) pack multiple work items into a single SIMD lane so that, in effect, we end up with 128 work-items in a wavefront. At that point the kernel should report a CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE of 128, not 64. At this time the toolchain for the GPU may not do this, but the fact remains that it could and the specification does not preclude this behaviour.



The device-specific flags seem the only safe way, and even then because the toolchain makes only very weak guarantees you have to be careful what assumptions you make about the toolchain behaviour. For example, dropping barriers within wave-synchronous code is obviously the right way to program vector hardware, but it is not necessarily the right way to program OpenCL because the compiler makes no guarantee that it will respect any communication you are doing.


Actually, in some way, this is precisely the reason why you should query the kernel-specific flag rather than the platform-specific device flag, in most cases.

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE is essentially a performance hint: to get full utilization of the hardware, the work-group size should be a multiple of the given amount, otherwise some of the lanes may remain idle. (Interesting tidbit: Intel's CPU platform usually recommends work-group sizes that are a multiple of 128, even though that is NOT the actual SIMD width of the CPU.) In most cases, this is what you actually want.
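
To make the hint concrete, a tiny host-side sketch of rounding a desired size up to the reported multiple (the helper name is made up):

#include <stddef.h>

/* Round a desired work size up to the next multiple of the value
 * reported by CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
 * e.g. with a multiple of 64, a desired size of 100 becomes 128. */
static size_t round_up_to_multiple(size_t desired, size_t multiple)
{
    if (multiple == 0)
        return desired;
    return ((desired + multiple - 1) / multiple) * multiple;
}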

The only case I can think of where you would want to know the actual width of the wave/warp is when you are doing wave-synchronous programming, which is, or rather used to be, a very efficient way of GPU coding, since it relieves the need for block-synchronization instructions. However, as GPU hardware becomes more sophisticated and GPU compilers try to cover latency more efficiently by interleaving independent instructions, wave-synchronous programming becomes less and less reliable. In fact, even NVIDIA has started recommending that warp-synchronous programming not be used in CUDA, so this is not even strictly related to the programming model assumptions made in OpenCL.

Additionally, as you mention, in OpenCL itself wave-synchronous programming cannot be guaranteed (although you can go a long way by enforcing ordering with volatile qualifiers). Consider the case of a (future) GPU platform that does autovectorization and work-item merging. If your kernel uses wave-synchronous programming (no barriers) and the compiler decides, e.g., to merge two work-items into a single hardware thread, there is no way to guarantee the correctness of execution, even if you use the hardware-specific wavefront size as your work-group size.

Assume, for example, a 64-wide wave. After the compiler merges work-items, 128 work-items are processed by each wave. If you set a work-group size of 128, a single wave will be executed per work-group, but the order in which the two work-items are processed by each thread is unknown (they might be processed concurrently, interleaved, or in any other combination). If you set a work-group size of 64 (the hardware wavefront width), you still have the problem of the unknown ordering within the single thread, and in addition you will only be using half of the wave per group. So by using the hardware wave size you still have no guarantee that the wave-synchronous code will behave correctly, and by not using the preferred work-group size multiple you are underutilizing your hardware.
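
For concreteness, the kind of wave-synchronous code being discussed looks like the sketch below: a work-group reduction whose final steps drop the barriers and rely on lockstep execution plus a volatile local pointer. As argued above, nothing in the OpenCL specification guarantees this is safe; it is shown only to illustrate the pattern. The kernel name is made up, WAVE_SIZE is assumed to come from the build options (e.g. -D WAVE_SIZE=64), and the local size is assumed to be a power of two of at least 2*WAVE_SIZE.

/* Wave-synchronous reduction sketch (NOT guaranteed correct by the
 * OpenCL spec, as discussed above). WAVE_SIZE comes from the build
 * options; the local size must be a power of two >= 2*WAVE_SIZE,
 * and 'scratch' must hold get_local_size(0) floats. */
__kernel void reduce_sum(__global const float *in,
                         __global float *out,
                         __local volatile float *scratch)
{
    const size_t lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction with barriers until only one wavefront's worth
     * of active work-items remains. */
    for (size_t s = get_local_size(0) / 2; s > WAVE_SIZE; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* Wave-synchronous tail: no barriers, relying on the (unportable)
     * assumption that these work-items execute in lockstep. */
    if (lid < WAVE_SIZE) {
        for (size_t s = WAVE_SIZE; s > 0; s >>= 1)
            scratch[lid] += scratch[lid + s];
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}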

So, from what I see, two things are emerging from this discussion.

(1) you should always use the preferred work-group size multiple flag instead of the vendor-specific device attributes; they give the same result when the compiler does not merge work-items, and the kernel-specific flag still gives the correct result (for that specific kernel), performance-wise, with a vectorizing compiler;

(2) wave-synchronous programming cannot be reliably guaranteed in OpenCL, because the specification allows work-item merging without any indication of the relative execution order of the work-items merged into a hardware thread.

A useful addition for the next OpenCL specification would be something to improve support for wave-synchronous programming. This would require at least two changes:

(1) a kernel attribute to prevent work-item merging;

(2) a device attribute indicating the number of _threads_ that run physically in lockstep.

Note, however, that while this solves the problem for current GPUs, it might be absolutely inefficient for CPUs (which only have one thread per physical or virtual core, but may be able to process parts of a kernel more efficiently by vectorizing code) or for some non-CPU, non-GPU hardware (e.g. the Xeon Phi), and might not even be a good programming model for future GPUs.

> Actually, in some way, this is precisely the reason why you should query the kernel-specific flag rather than the platform-specific device flag, in most cases.

Of course, and indeed precisely because of that work-item merging. The problem is that the kernel flag does not mean "wavefront size". It means "a hint about what multiple we should use for group dispatch", and there could be other reasons for that hint that are independent of the wavefront size. Knowing that value, under the OpenCL definition, guarantees you nothing about the size of a hardware thread, what unit of execution makes forward progress, what unit of execution acts synchronously so as to allow barrier elision, and so on. You're right that a per-kernel flag is important (and watch this space on that subject), but the current OpenCL flag isn't strong enough to give the behaviour that many people need, beyond the fact that we and NVIDIA use it in a particular way - and given that neither we nor NVIDIA change it per kernel right now, there are assumptions either way.

I would hope that a vendor that did provide work-item merging would add a similar custom query to allow querying that from the kernel. For the CPU, compiler flows are already fairly flexible about this. One work-item per core on the CPU isn't very efficient; we should at least be packing work-items into the CPU wavefront properly (4 or 8 wide, generally). However, the inflexibility of CPU vector units means that this can vary a little in efficiency, as we see in Intel's toolflow.


Thanks for your detailed comment, gbilotta. I was indeed intending to use warp/wavefront-synchronous programming. Yes, I'm aware of the potential problems it might cause, and of course I'm familiar with NVIDIA's recommendations on the issue. However, when you need speed... on given hardware... you just use everything that works.
