I wrote some GCN assembly that uses 32 VGPRs and ran it on Hawaii (R9 290X) through OpenCL using clCreateProgramWithBinary()/clBuildProgram()/clCreateKernel(), with global work size = 64 (threads per wavefront) * 44 (compute units on the R9 290X) * 8 (waves per CU) = 22,528 and local work size = 64. This corresponds to one OpenCL workgroup per wave. Since the kernel uses 32 VGPRs and does not exceed the other limits, such as SGPR count and LDS allocation, the entire graphics card should be able to accommodate all of these threads at once.
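For reference, the launch geometry above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function and key names are made up for illustration, and the constants are the ones quoted in this post.

```python
# Constants taken from the post: 64 threads per wavefront,
# 44 compute units on the R9 290X, 8 waves requested per CU.
THREADS_PER_WAVE = 64
COMPUTE_UNITS = 44
WAVES_PER_CU = 8


def launch_geometry(threads_per_wave=THREADS_PER_WAVE,
                    compute_units=COMPUTE_UNITS,
                    waves_per_cu=WAVES_PER_CU):
    """Return the work sizes for a one-workgroup-per-wave launch.

    Hypothetical helper, not part of OpenCL or any assembler.
    """
    local_work_size = threads_per_wave
    global_work_size = threads_per_wave * compute_units * waves_per_cu
    return {
        "global_work_size": global_work_size,
        "local_work_size": local_work_size,
        "workgroups": global_work_size // local_work_size,
    }


geometry = launch_geometry()
print(geometry)
```

So the launch consists of 352 workgroups of one wave each.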
Following the Sea Islands ISA document found at http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf, table 5.9 on page 48, I've tried querying the base VGPR and the number of VGPRs assigned to each wavefront like so:
s_getreg_b32 s0, HWREG(GPR_ALLOC, 0, 6) /* vgpr base ofs, ofs = 4*result ? */
store s0 to vgprBaseOutputArray[global_thread_id]
s_getreg_b32 s0, HWREG(GPR_ALLOC, 8, 6) /* vgpr size, num VGPRs = 4*(size+1) ? */
store s0 to vgprSizeOutputArray[global_thread_id]
The ISA doc says that when reading VGPR_BASE, I should be getting the 'Physical address of first VGPR assigned to this wavefront, as [7:2]', and I should be getting the 'Number of VGPRs assigned to this wavefront, as [7:2], 0=4 VGPRs, 1=8 VGPRs, etc.' when reading VGPR_SIZE.
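Here is how I am decoding the two 6-bit fields, written out as a small Python sketch. It assumes my reading of the "[7:2]" notation is right, i.e. that the value returned by s_getreg_b32 is the register number with its two low bits dropped, so it has to be multiplied by 4. The function names are mine, not from the ISA document.

```python
FIELD_MASK = 0x3F  # both fields are read with a width of 6 bits


def decode_vgpr_base(raw):
    """Physical index of the first VGPR, assuming raw holds bits [7:2]."""
    return (raw & FIELD_MASK) * 4


def decode_vgpr_size(raw):
    """Number of VGPRs, assuming 0 = 4 VGPRs, 1 = 8 VGPRs, etc."""
    return ((raw & FIELD_MASK) + 1) * 4


# The two examples given in the ISA doc's description of VGPR_SIZE.
print(decode_vgpr_size(0))  # 4
print(decode_vgpr_size(1))  # 8
```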
I'm getting 7 for all threads when reading VGPR_SIZE, which corresponds to the correct number of 32 VGPRs. That is fine. However, I'm getting VGPR base set to 0 for threads with global_id = 0..11,263 (exactly the first half of 64*44*8 = 22,528 total threads) and VGPR base set to 8 (the value returned by s_getreg_b32) * 4 (due to the [7:2] format) = 32 for threads with global_id = 11,264..22,527 (the second half, exactly).
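To make the pattern easier to see, this is roughly how the output array can be collapsed into runs of identical base values on the host side. It is a sketch only: the input list below is synthetic, rebuilt from the numbers described above rather than copied from a real dump, and the helper name is hypothetical.

```python
from itertools import groupby

WAVE_SIZE = 64  # threads per wavefront


def summarize_base_runs(raw_bases):
    """Collapse per-thread raw VGPR_BASE values into contiguous runs.

    Returns a list of (first_global_id, last_global_id, decoded_base),
    where decoded_base = raw * 4 (assuming the [7:2] format).
    """
    runs = []
    start = 0
    for raw, group in groupby(raw_bases):
        length = sum(1 for _ in group)
        runs.append((start, start + length - 1, raw * 4))
        start += length
    return runs


# Synthetic stand-in for vgprBaseOutputArray: first half 0, second half 8.
synthetic = [0] * 11264 + [8] * 11264
runs = summarize_base_runs(synthetic)

for first, last, base in runs:
    print(f"threads {first}..{last} "
          f"(waves {first // WAVE_SIZE}..{last // WAVE_SIZE}): base {base}")
```

On this data it reports exactly two runs, each covering 176 of the 352 waves.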
As I understand the ISA doc, the first 64 threads (wave 0) should report a VGPR base of 0, the next 64 threads (wave 1) a base of 32 (after multiplying the s_getreg_b32 output by 4), and so on. Yet I'm getting 0 for a whole bunch of threads, and 32 for a whole bunch of other threads.
I feel like I'm making a rookie mistake. What am I missing or doing wrong, and what did I not understand correctly?
If it matters, I'm on Windows 10 Home 64-bit with 17.10.2 drivers, and I'm using CL Radeon Extender as the assembler. Note that there are two assembler directives, .pgmsrc1 and .pgmsrc2, that I didn't specify in the source code (I've specified .sgprsnum 11 and .vgprsnum 32 though), yet when I disassemble the compiled kernel, both of them appear in the output. Perhaps these are necessary, and they somehow make the register assignments clash between the waves? How do they work, and are they documented anywhere?
Thank you for your help in advance!