Hi Karl, some random notes that I hope might help. (I'm not an OpenCL programmer, so maybe someone else can help more.)
Each CU has 64K vgprs (aka 256KB), and this is shared between 4 sets (at the minimum) of 64 threads, so that leaves 256 per thread. Another way to look at it: 256KB / 4 bytes_per_reg = 64K vgprs. 64K vgprs / 4 wavefronts = 16K vgprs per wavefront. 16K vgprs / 64 threads = 256 regs. (256 vgprs/thread is also the hardware limit.)
With that said, I am not sure how "two uint16" fit in vgprs, because that would be 1024 regs (2 x 512). Also, vgprs cannot always be accessed via an index, so maybe this array is in shared memory (32KB max).
Performance note: using 92 vgprs will probably not result in good performance. Each of the 4 SIMDs has 256 vgprs per thread to hand out, so at 92 vgprs only 2 wavefronts fit per SIMD (256 / 92, rounded down), i.e. about 8 wavefronts/CU, and it will not be able to hide latencies very well. It's best to have 16-32 wavefronts per CU in most situations. Using anything over 128 vgprs will always hurt performance. If your code has lots of main memory access or is a large kernel, it's best to keep it at 64 or less if possible.
Thank you for your answer. I now use __global mem (as I need 128KB per thread) and try to hide latency via parallel execution of wavefronts (don't know if I got the terminology right). Anyhow, the OpenCL compiler pretty much does what it wants, is not predictable, and uses way too many vgprs... but I guess I will find my way somehow.