Is there any way to control via OpenCL which variables will allocate scalar instead of vector registers on GCN devices? Using expressions like get_global_id(0)/64, which should contain the same value for all workitems within a wavefront, seem to be allocated on vector registers.
OpenCL programmers do not have a direct control over which registers will be used for _private allocations. vGPRs are used for private memory as sGPRs are reserved for internal purposes.
Proper (and also portable) way to share a value across a wave-front is to put it in local memory with _local qualifier, and leave lower level tools (compiler/linker/driver) take care of their mapping to exact hardware resources.
Register allocation is the job of compiler. During the compilation, the compiler decides how to allocate these variables to small, finite set of registers with aim to allocate as many variables to register as possible. The compiler tries to map private memory allocations to the pool of GPRs in the GPU. In the event GPRs are not available, private memory is mapped to the “scratch” region, GPRs have some restrictions about which register ports can be read on each cycle; but generally, these are not exposed to the OpenCL programmer.
In GCN devices, there are two types of GPRs: scalar GPRs (SGPRs) and vector GPRs (VGPRs). Each CU has four vector units and one scalar unit and each vector unit has its own SGPR and VGPR pool. There are 512 SGPRs and 256 VGPRs per vector unit. The vector unit handles all vector instructions (any instruction that is handled per thread). And SGPRs are used for scalar instructions: any instruction that is executed once per wavefront, such as a branch, a scalar ALU instruction and constant cache fetches. SGPRs are also used for constants, all buffer/texture definitions, and sampler definitions; some kernel arguments are stored, at least temporarily, in SGPRs.
So if a programmer wants to use scalar registers instead of vector registers for variables, he should try to use scalar instructions which can be coded using branches, computation on constant memory and etc as mentioned above.
My intension was not to share values within wave-fronts but to examine it as an optimization method. I thought it provided an opportunity to reduce vector register pressure and off-load operations from the vector units to the scalar unit when it is certain that all workitems perform the same operation on the same values.