
timchist
Elite

How to query wavefront size from kernel?

How can I query the wavefront size from a kernel? (Is there an analog of the warpSize built-in variable in CUDA?)


9 Replies
nou
Exemplar

Pass the preferred work-group size value into the kernel as a parameter.
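
A minimal host-side sketch of that approach, assuming the kernel and device have already been created; the helper name and the argument index parameter are placeholders, and error checking is omitted:

#include <CL/cl.h>

/* Query the kernel's preferred work-group size multiple on the host
 * and hand it to the kernel as an ordinary uint argument. */
static cl_uint pass_preferred_multiple(cl_kernel kernel,
                                       cl_device_id device,
                                       cl_uint arg_index)
{
    size_t preferred = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred), &preferred, NULL);

    cl_uint value = (cl_uint)preferred;
    clSetKernelArg(kernel, arg_index, sizeof(value), &value);
    return value;
}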

roger512
Adept II

Well, I haven't heard of a way to do that. As far as I know, the wavefront size is 64 for AMD GPUs and 32 for NVIDIA GPUs.

So you need to look at CL_DEVICE_VENDOR with clGetDeviceInfo, infer the wavefront size from the vendor, and then pass it to the kernel as a parameter or via a define.
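
A sketch of that vendor-sniffing approach, baking the guessed value into the kernel as a build-time define; the helper names and the 64/32 defaults merely restate the assumption above and are not guaranteed for every device:

#include <CL/cl.h>
#include <stdio.h>
#include <string.h>

/* Guess the wavefront size from the device vendor string.
 * The 64/32 values are assumptions, not queried facts. */
static unsigned guess_wave_size(cl_device_id device)
{
    char vendor[256] = "";
    clGetDeviceInfo(device, CL_DEVICE_VENDOR, sizeof(vendor), vendor, NULL);
    if (strstr(vendor, "Advanced Micro Devices") || strstr(vendor, "AMD"))
        return 64;
    if (strstr(vendor, "NVIDIA"))
        return 32;
    return 1;   /* unknown vendor: make no lockstep assumption */
}

/* Bake the guessed value into the program as a preprocessor define. */
static cl_int build_with_wave_define(cl_program program, cl_device_id device)
{
    char options[64];
    snprintf(options, sizeof(options), "-D WAVE_SIZE=%u", guess_wave_size(device));
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}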

AMD's new HSA specification has that kind of feature, but it is only a specification for now.


Thanks nou and roger for your inputs

gbilotta
Adept III

You cannot query the wavefront width or warp size from a kernel, but you can query it from the host and pass it to the device as a parameter (or in a constant-memory struct or whatever). The host-side query can be done using vendor-specific extensions:

(1) for devices that support cl_nv_device_attribute_query, you can call clGetDeviceInfo with the CL_DEVICE_WARP_SIZE_NV flag. This will return 32 on all current NVIDIA devices, but beware that NVIDIA has started warning CUDA developers that the warp size may change in the future, and this will obviously affect OpenCL users as well (assuming the flag returns the correct value, which, considering how little NVIDIA seems to care about supporting OpenCL, is not guaranteed).

(2) for devices that support cl_amd_device_attribute_query, there is the equivalent CL_DEVICE_WAVEFRONT_WIDTH_AMD flag for clGetDeviceInfo. While most AMD devices have a wavefront width of 64, some older, lower-end devices actually have a wavefront width of 32, so querying this property (when possible) is better than just assuming AMD => 64. (Also note that this flag is only present on AMD _GPU_ devices; you'll get an invalid-value error when querying it on CPU devices supported by the AMD platform.)
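
A host-side sketch of these two queries, guarded by the extension string. The token values are the ones published in the Khronos cl_ext.h header and are repeated here only in case your headers predate the extensions; the helper name is made up and error handling is minimal:

#include <CL/cl.h>
#include <stdlib.h>
#include <string.h>

#ifndef CL_DEVICE_WARP_SIZE_NV
#define CL_DEVICE_WARP_SIZE_NV        0x4003  /* cl_nv_device_attribute_query */
#endif
#ifndef CL_DEVICE_WAVEFRONT_WIDTH_AMD
#define CL_DEVICE_WAVEFRONT_WIDTH_AMD 0x4043  /* cl_amd_device_attribute_query */
#endif

/* Return the warp/wavefront width reported by a vendor extension,
 * or 0 if neither extension is available or the query fails
 * (as it does on CPU devices in the AMD platform). */
static cl_uint query_wave_width(cl_device_id device)
{
    size_t ext_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, NULL, &ext_size);
    if (ext_size == 0)
        return 0;

    char *ext = malloc(ext_size);
    if (!ext)
        return 0;
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, ext_size, ext, NULL);

    cl_uint width = 0;
    cl_int err = CL_SUCCESS;
    if (strstr(ext, "cl_nv_device_attribute_query"))
        err = clGetDeviceInfo(device, CL_DEVICE_WARP_SIZE_NV,
                              sizeof(width), &width, NULL);
    else if (strstr(ext, "cl_amd_device_attribute_query"))
        err = clGetDeviceInfo(device, CL_DEVICE_WAVEFRONT_WIDTH_AMD,
                              sizeof(width), &width, NULL);
    free(ext);
    return (err == CL_SUCCESS) ? width : 0;
}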

Ideally, there should be a common extension, or even better a core property, to query this kind of information (since it could be useful for vendors other than NVIDIA and AMD as well). You could try opening an issue in the Khronos Bugzilla and/or proposing the change on the Khronos OpenCL forum.

EDIT: actually, the value can be obtained by querying the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE property of a kernel using clGetKernelWorkGroupInfo(). Although this is a kernel-specific property, on both AMD and NVIDIA GPUs it currently returns the warp size/wavefront width.


Yes, the MULTIPLE query is the right thing to do - as I understand.

-

Bruhaspati


It will work, but it isn't standard that it has those semantics. There's no reason why a compiler couldn't (and VERY good reasons why it should) pack multiple work items into a single SIMD lane so that, in effect, we end up with 128 work-items in a wavefront. At that point the kernel should report a CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE of 128, not 64. At this time the toolchain for the GPU may not do this, but the fact remains that it could and the specification does not preclude this behaviour.

The device-specific flags seem the only safe way, and even then because the toolchain makes only very weak guarantees you have to be careful what assumptions you make about the toolchain behaviour. For example, dropping barriers within wave-synchronous code is obviously the right way to program vector hardware, but it is not necessarily the right way to program OpenCL because the compiler makes no guarantee that it will respect any communication you are doing.



LeeHowes wrote:



It will work, but it isn't standard that it has those semantics. There's no reason why a compiler couldn't (and VERY good reasons why it should) pack multiple work items into a single SIMD lane so that, in effect, we end up with 128 work-items in a wavefront. At that point the kernel should report a CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE of 128, not 64. At this time the toolchain for the GPU may not do this, but the fact remains that it could and the specification does not preclude this behaviour.



The device-specific flags seem the only safe way, and even then because the toolchain makes only very weak guarantees you have to be careful what assumptions you make about the toolchain behaviour. For example, dropping barriers within wave-synchronous code is obviously the right way to program vector hardware, but it is not necessarily the right way to program OpenCL because the compiler makes no guarantee that it will respect any communication you are doing.


Actually, in some way, this is precisely the reason why you should query the kernel-specific flag rather than the platform-specific device flag, in most cases.

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE is essentially a performance hint: to get full utilization of the hardware, the work-group size should be a multiple of the given amount, otherwise some of the lanes may remain idle. (Interesting tidbit: Intel's CPU platform usually recommends work-group sizes that are a multiple of 128, even though that is NOT the actual SIMD width of the CPU.) In most cases, this is what you actually want.
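
To make the hint concrete, a tiny host-side sketch of rounding a desired size up to the reported multiple (the helper name is made up):

#include <stddef.h>

/* Round a desired work size up to the next multiple of the value
 * reported by CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
 * e.g. with a multiple of 64, a desired size of 100 becomes 128. */
static size_t round_up_to_multiple(size_t desired, size_t multiple)
{
    if (multiple == 0)
        return desired;
    return ((desired + multiple - 1) / multiple) * multiple;
}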

The only case I can think of where you would want to know the actual width of the wave/warp is when you are doing wave-synchronous programming, which is, or rather used to be, a very efficient way of GPU coding, since it relieves the need for block-synchronization instructions. However, as GPU hardware becomes more sophisticated and GPU compilers try to cover latency more efficiently by interleaving independent instructions, wave-synchronous programming becomes less and less reliable. In fact, even NVIDIA has started recommending that warp-synchronous programming not be used in CUDA, so this is not even strictly related to the programming model assumptions made in OpenCL.

Additionally, as you mention, in OpenCL itself wave-synchronous programming cannot be guaranteed (although you can go a long way by enforcing ordering with volatile qualifiers). Consider the case of a (future) GPU platform that does autovectorization and work-item merging. If your kernel uses wave-synchronous programming (no barriers) and the compiler decides, e.g., to merge two work-items into a single hardware thread, there is no way to guarantee the correctness of execution, even if you use the hardware-specific wavefront size as your work-group size.

Assume, for example, a 64-wide wave. After the compiler merges work-items, 128 work-items are processed by each wave. If you set a work-group size of 128, a single wave will be executed per work-group, but the order in which the two work-items are processed by each thread is unknown (they might be processed concurrently, interleaved, or in any other combination). If you set a work-group size of 64 (the hardware wavefront width), you still have the problem of the unknown ordering within the single thread, and in addition you will only be using half of the wave per group. So by using the hardware wave size you still have no guarantee that the wave-synchronous code will behave correctly, and by not using the preferred work-group size multiple you are underutilizing your hardware.
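
For concreteness, the kind of wave-synchronous code being discussed looks like the sketch below: a work-group reduction whose final steps drop the barriers and rely on lockstep execution plus a volatile local pointer. As argued above, nothing in the OpenCL specification guarantees this is safe; it is shown only to illustrate the pattern. The kernel name is made up, WAVE_SIZE is assumed to come from the build options (e.g. -D WAVE_SIZE=64), and the local size is assumed to be a power of two of at least 2*WAVE_SIZE.

/* Wave-synchronous reduction sketch (NOT guaranteed correct by the
 * OpenCL spec, as discussed above). WAVE_SIZE comes from the build
 * options; the local size must be a power of two >= 2*WAVE_SIZE,
 * and 'scratch' must hold get_local_size(0) floats. */
__kernel void reduce_sum(__global const float *in,
                         __global float *out,
                         __local volatile float *scratch)
{
    const size_t lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction with barriers until only one wavefront's worth
     * of active work-items remains. */
    for (size_t s = get_local_size(0) / 2; s > WAVE_SIZE; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* Wave-synchronous tail: no barriers, relying on the (unportable)
     * assumption that these work-items execute in lockstep. */
    if (lid < WAVE_SIZE) {
        for (size_t s = WAVE_SIZE; s > 0; s >>= 1)
            scratch[lid] += scratch[lid + s];
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}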

So, from what I see, two things are emerging from this discussion.

(1) you should always use the preferred work-group size multiple flag instead of the vendor-specific device attributes; they give the same result when the compiler does not merge work-items, and the kernel-specific flag still gives the correct result (for that specific kernel), performance-wise, with a vectorizing compiler;

(2) wave-synchronous programming cannot be reliably guaranteed in OpenCL, because the specification allows work-item merging without any indication of the relative execution order of the work-items merged into a hardware thread.

A useful addition for the next OpenCL specification would be something to improve support for wave-synchronous programming. This would require at least two changes:

(1) a kernel attribute to prevent work-item merging;

(2) a device attribute indicating the number of _threads_ that run physically in lockstep.

Note, however, that while this solves the problem for current GPUs, it might be absolutely inefficient for CPUs (which only have one thread per physical or virtual core, but may be able to process parts of a kernel more efficiently by vectorizing code) or for some non-CPU, non-GPU hardware (e.g. the Xeon Phi), and might not even be a good programming model for future GPUs.

> Actually, in some way, this is precisely the reason why you should query the kernel-specific flag rather than the platform-specific device flag, in most cases.

Of course, and indeed precisely because of that work-item merging. The problem is that the kernel flag does not mean "wavefront size". It means "a hint about what multiple we should use for group dispatch", and there could be other reasons for that hint that are independent of the wavefront size. Knowing that value, under the OpenCL definition, guarantees you nothing about the size of a hardware thread, what unit of execution makes forward progress, what unit of execution acts synchronously so as to allow barrier elision, and so on. You're right that a per-kernel flag is important (and watch this space on that subject), but the current OpenCL flag isn't strong enough to give the behaviour that many people need, beyond the fact that we and NVIDIA use it in a particular way - and given that neither we nor NVIDIA change it per kernel right now, there are assumptions either way.

I would hope that a vendor that did provide work-item merging would add a similar custom query to allow querying that from the kernel. For the CPU, compiler flows are already fairly flexible about this. One work-item per core on the CPU isn't very efficient; we should at least be packing work-items into the CPU wavefront properly (4 or 8 wide, generally). However, the inflexibility of CPU vector units means that this can vary a little in efficiency, as we see in Intel's toolflow.


Thanks for your detailed comment, gbilotta. I was indeed intending to use warp/wavefront-synchronous programming. Yes, I'm aware of the potential problems it might cause, and of course I'm familiar with NVIDIA's recommendations on the issue. However, when you need speed... on given hardware... you just use everything that works.
