SRs provide a limited form of sharing. They're called "shared_temp" in IL; see dcl_shared_temp.
On most GPUs, where the wavefront size is 64, declaring one SR gives you 64 distinct registers, one per lane, all called "SR0". If there are 150 wavefronts executing on the SIMD (e.g. 7 at any one time), they will all share the declared SRs.
Sharing is by lane, and there are 64 lanes in each wavefront. So, for example, lane 3 has its own "SR0", which is shared by all 150 wavefronts. Lane 4 has a separate "SR0", which is also shared by all 150 wavefronts.
You can declare multiple SRs. If you declare 2, the total number of SRs reserved on the SIMD core is 128: SR0 and SR1 for each of the 64 lanes.
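To make the per-lane sharing concrete, here is a rough Python model of it (not IL, and not how the hardware is implemented). The class name SharedRegisterFile and its methods are made up for illustration. The only facts taken from the description above are the 64 lanes, the 150 wavefronts and the 2 declared SRs.

```python
WAVEFRONT_SIZE = 64  # lanes per wavefront, as in the description above


class SharedRegisterFile:
    """Toy model (hypothetical): one slot per (declared SR, lane),
    shared by every wavefront that runs on the SIMD."""

    def __init__(self, num_declared):
        self.slots = [[0] * WAVEFRONT_SIZE for _ in range(num_declared)]

    def total_slots(self):
        # Number of SRs reserved on the SIMD for this kernel.
        return len(self.slots) * WAVEFRONT_SIZE

    def read(self, wavefront_id, lane, sr):
        # wavefront_id is deliberately ignored: all wavefronts see the same slot.
        return self.slots[sr][lane]

    def write(self, wavefront_id, lane, sr, value):
        self.slots[sr][lane] = value


srf = SharedRegisterFile(num_declared=2)            # SR0 and SR1
srf.write(wavefront_id=0, lane=3, sr=0, value=42)   # wavefront 0, lane 3 writes SR0

print(srf.total_slots())                            # 2 * 64 reserved SRs
print(srf.read(wavefront_id=149, lane=3, sr=0))     # same lane, other wavefront: sees the write
print(srf.read(wavefront_id=149, lane=4, sr=0))     # other lane: separate SR0, untouched
```

Note that the lane index is the only thing that selects a slot, so a value written by lane 3 can never be read by lane 4, whatever the wavefront.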
Obviously there are many concepts on this chip that I don't know about. I have looked into the documentation as well as the samples, but there wasn't much information, to say the least.
So I'm afraid I have to ask: what's a lane?
If I wanted thread #0 to transmit one register to thread #16, what would the IL sequence look like?
"lane" just refers to any one of the 64 paths through the virtual, 64-wide, SIMD.
The SIMD is physically 16-wide, but it emulates a 64-wide SIMD by running each instruction over 4 successive cycles.
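A small Python sketch of that 16-wide / 4-cycle arrangement. The assumption here (mine, not stated above) is that lanes are issued in contiguous groups of 16, i.e. lanes 0-15 on the first cycle, 16-31 on the second, and so on. The function name lane_schedule is made up for illustration.

```python
PHYSICAL_WIDTH = 16   # physical SIMD width, from the description above
WAVEFRONT_SIZE = 64   # virtual SIMD width (lanes per wavefront)
CYCLES_PER_INSTRUCTION = WAVEFRONT_SIZE // PHYSICAL_WIDTH  # 4 successive cycles


def lane_schedule(lane):
    """Return (cycle, physical_alu) for a lane.

    Assumes contiguous groups of 16 lanes per cycle, which is an
    illustrative assumption rather than something the thread confirms.
    """
    if not 0 <= lane < WAVEFRONT_SIZE:
        raise ValueError("lane must be in 0..63")
    return divmod(lane, PHYSICAL_WIDTH)


for lane in (0, 15, 16, 63):
    cycle, alu = lane_schedule(lane)
    print(f"lane {lane:2d} -> cycle {cycle}, physical ALU {alu}")
```

Under that assumption, lanes 0 and 16 would run on the same physical ALU, one cycle apart.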
Moving data from work item 0 to work item 16 can be done either through local memory (LDS on the chip) or through global memory (or even through GDS, but that'll be very tricky).
SRs are not suitable for this.
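The local-memory route follows the usual write, barrier, read pattern. Below is a plain Python simulation of that pattern, not IL. The list `lds` is just a stand-in for LDS, the function name is hypothetical, and the barrier is modelled by finishing all writes before any read.

```python
GROUP_SIZE = 64  # one wavefront's worth of work items, for simplicity


def exchange_via_local_memory(private_values, src=0, dst=16):
    """Simulate passing one value from work item `src` to work item `dst`
    through a shared array (hypothetical stand-in for LDS)."""
    lds = [None] * GROUP_SIZE

    # Phase 1: every work item stores its private value in its own slot.
    for work_item in range(GROUP_SIZE):
        lds[work_item] = private_values[work_item]

    # A barrier would go here on real hardware, so that all writes are
    # visible before any work item reads another one's slot.

    # Phase 2: only `dst` reads the slot written by `src`.
    result = list(private_values)
    result[dst] = lds[src]
    return result


values = [10 * i + 1 for i in range(GROUP_SIZE)]  # work item i holds 10*i + 1
after = exchange_via_local_memory(values, src=0, dst=16)
print(after[16])  # work item 16 now holds work item 0's value
print(after[17])  # other work items are unchanged
```

The key difference from SRs is that any work item can address any slot of `lds`, whereas an SR is only ever reachable from its own lane.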