Hi Karl, some random notes that I hope might help. (I'm not an OpenCL programmer, so maybe someone else can help more.)
Each CU has 64K vgprs (aka 256KB), and this is shared between 4 sets (at the minimum) of 64 threads, so that leaves 256 per thread. Another way to look at it: 256KB / 4 bytes_per_reg = 64K vgprs. 64K vgprs / 4 wavefronts = 16K vgprs per wavefront. 16K vgprs / 64 threads = 256 regs per thread. (256 vgprs/thread is also the hardware limit.)
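If it helps, the arithmetic above can be checked with a few lines of Python. This is only a sketch: the constant names are mine, and the values are just the figures quoted above (256KB of vgpr storage per CU, 4 bytes per register, 4 wavefronts minimum, 64 threads per wavefront).

```python
# Per-CU register budget, using the figures quoted in this post.
# All names here are made up for illustration.
VGPR_FILE_BYTES = 256 * 1024   # 256KB of vgpr storage per CU
BYTES_PER_REG = 4              # one 32-bit register
MIN_WAVEFRONTS = 4             # at least 4 sets of 64 threads share the file
THREADS_PER_WAVEFRONT = 64

total_vgprs = VGPR_FILE_BYTES // BYTES_PER_REG                    # 64K
vgprs_per_wavefront = total_vgprs // MIN_WAVEFRONTS               # 16K
vgprs_per_thread = vgprs_per_wavefront // THREADS_PER_WAVEFRONT   # 256

print(total_vgprs, vgprs_per_wavefront, vgprs_per_thread)
```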
With that said, I am not sure how "two uint16[512]" fit in vgprs, because that would be 1024 regs (2 x 512). Also, vgprs cannot always be accessed via an index, so maybe this array is in shared memory (32KB max).
Performance note: using 92 vgprs will probably not result in good performance. This will only fit about 10 wavefronts/CU, so it will not be able to hide latencies very well. It's best to have 16-32 wavefronts in most situations. Using anything over 128 vgprs will always hurt performance. If your code has lots of main memory accesses or is a large kernel, it's best to keep it at 64 or less if possible.
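Here is a rough way to see how the register count limits the number of wavefronts per CU. It is a simple upper bound I put together for illustration (the function name is hypothetical, not from any SDK): it divides the 64K vgprs by what one 64-thread wavefront needs. It assumes nothing else is limiting, and ignores register allocation granularity and other limits such as shared memory, so real numbers can be lower.

```python
# Upper bound on wavefronts per CU from vgpr usage alone.
# Assumption: 64K vgprs per CU and 64 threads per wavefront, as quoted above.
TOTAL_VGPRS_PER_CU = 64 * 1024
THREADS_PER_WAVEFRONT = 64
MAX_VGPRS_PER_THREAD = 256  # hardware limit mentioned above


def max_wavefronts_by_vgprs(vgprs_per_thread):
    """Return how many wavefronts fit in the register file (upper bound)."""
    if not 1 <= vgprs_per_thread <= MAX_VGPRS_PER_THREAD:
        raise ValueError("vgprs_per_thread must be between 1 and 256")
    vgprs_per_wavefront = vgprs_per_thread * THREADS_PER_WAVEFRONT
    return TOTAL_VGPRS_PER_CU // vgprs_per_wavefront


for n in (32, 64, 92, 128, 256):
    print(n, "vgprs ->", max_wavefronts_by_vgprs(n), "wavefronts/CU at most")
```

With 64 vgprs or fewer the bound reaches the 16-32 range suggested above. At 92 it gives 11, close to the rough figure of 10 above, and at 128 it drops to 8.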