CaptainN

SR (Shared Registers): sharing level

Discussion created by CaptainN on Feb 10, 2010
Latest reply on Feb 18, 2010 by CaptainN
Is SRx shared across all threads of a SIMD, across threads of the even/odd wavefront, or across threads of a given wavefront # (on a SIMD)?

Ultimately, the task is to pass data between kernels using SR (shared registers). SR registers are supposed to be persistent from thread to thread (within a wavefront) across different kernel invocations when the kernels are launched via calCtxRunProgramGridArray.
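For reference, a minimal host-side sketch of the launch. This is only a sketch: it assumes the CALprogramGrid / CALprogramGridArray layout from the CAL SDK's cal.h (field names may differ between SDK versions) and that ctx and the three CALfunc handles are already created.

/* Hedged sketch: chain krn#1..krn#3 in one calCtxRunProgramGridArray call. */
#include "cal.h"

CALresult run_chain(CALcontext ctx, CALfunc krn1, CALfunc krn2, CALfunc krn3)
{
    CALprogramGrid      grid[3];
    CALprogramGridArray arr;
    CALevent            ev = 0;
    CALfunc             funcs[3] = { krn1, krn2, krn3 };
    int i;

    for (i = 0; i < 3; ++i) {
        grid[i].func             = funcs[i];
        grid[i].flags            = 0;
        grid[i].gridBlock.width  = 32; /* thread block size */
        grid[i].gridBlock.height = 1;
        grid[i].gridBlock.depth  = 1;
        grid[i].gridSize.width   = 8;  /* group size: 8 x 32 = 256 threads */
        grid[i].gridSize.height  = 1;
        grid[i].gridSize.depth   = 1;
    }

    arr.gridArray = grid;
    arr.num       = 3;
    arr.flags     = 0;

    return calCtxRunProgramGridArray(&ev, ctx, &arr);
}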

Micah,

I really tried to make this a short post.

Based on a number of answers and posts (http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=116932, http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=115872, http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=121826) I wrote a number of tests and still cannot say for sure what the sharing level of SR registers is. According to the documentation (http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf), SR is shared per SIMD.

However, the results do not really confirm this statement. I used an RV730 (HD4670) chip for these experiments.

I allocated a 128x4 array of float4. The size is doubled; the reason will become clear below.

A total of 3 kernels is used for the experiments (run via calCtxRunProgramGridArray): krn#1 resets sr0 and g[0..511]; krn#2 increments sr0.x by 1; krn#3 reads sr0 and writes it to g[0..511] (so even and odd wavefronts can be compared). Each element of the g[] (float4) array holds the sr0 value: the number of increments by krn#2 (x), the tid of krn#1 (y), the tid of krn#2 (z), and the tid of krn#3 (w) (yes, I am following lpw's idea).
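For what it's worth, krn#2 is essentially just one IL add on the shared register. A minimal sketch of it as a CAL IL source string follows; the version tag and dcl lines are from memory and may need adjustment, and 0x3F800000 is the bit pattern of 1.0f, so l0.w == 1.0.

const char krn2_il[] =
    "il_cs_2_0\n"                    /* compute shader; version tag may vary */
    "dcl_num_thread_per_group 32\n"  /* matches the thread block size */
    "dcl_literal l0, 0x00000000, 0x00000000, 0x00000000, 0x3F800000\n"
    "add sr0.x, sr0.x, l0.w\n"       /* sr0.x += 1.0, once per thread */
    "end\n";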

The experiments:

1. Thread block == 32, Group size == 8. A total of 256 threads to run, occupying full capacity once. g[0..255] was populated as expected, i.e. g[0..255].x == 1 (incremented by 1 in krn#2) and g[0..255].y == g[0..255].z == g[0..255].w == 0..255. In this case SR looks shared per SIMD.


2. Thread block == 32, Group size == 16 for krn#1 and krn#3, but Group size == 8 for krn#2. A total of 512 threads to run, occupying full capacity twice (for the setup kernel and the read-SR-to-g[] kernel). Every SIMD gets scheduled twice for krn#1 and krn#3, so there are odd and even wavefronts there. However, I set the group size of krn#2 to 8, so every thread of krn#2 runs only once, on the odd wavefront, where sr0.x is incremented by 1.

Krn#3 populates g[0..511] this way: g[0..511].x == 1 (sr0.x), g[0..511].y == g[0..511].z == g[0..511].w == 0..511. Both the even and odd wavefronts of krn#3 read sr0.x as 1, so again it looks like SR is shared per SIMD.


3. Thread block == 32, Group size == 16. A total of 512 threads to run, occupying full capacity twice. Every SIMD gets scheduled twice, with odd and even wavefronts for all 3 kernels. g[0..511] was populated this way: g[0..511].x == 1, g[0..511].y == g[0..511].z == g[0..511].w == 0..511. If SR were shared per SIMD, this value would have to be == 2, because the even and odd wavefronts should each increment it once, so it should total 2 by the time krn#3 reads sr0 and places it into g[].x. So it makes me think SR is shared per wavefront (one "instance" of sr0 for the even wavefront, and another sr0 "instance" for the odd wavefront); the toy model below makes the counting explicit.
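To make the counting explicit, here is a toy model (plain C, names are mine, not GPU code): the value krn#3 should read is just the number of krn#2 wavefronts sharing one physical sr0, times the increments each performs.

#include <stdio.h>

/* sharers = krn#2 wavefronts bumping the same physical sr0;
   incs    = increments performed by each of those wavefronts. */
static int expected_sr(int sharers, int incs)
{
    return sharers * incs;
}

int main(void)
{
    /* Experiment 3: two krn#2 wavefronts (even + odd) per SIMD, 1 inc each. */
    printf("shared per SIMD     : %d\n", expected_sr(2, 1)); /* predicts 2 */
    printf("shared per wavefront: %d\n", expected_sr(1, 1)); /* predicts 1, as observed */
    return 0;
}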


If I increase the GroupSize for krn#2 to 32 (while keeping krn#1 and krn#3 at GroupSize == 16), the value found in SR is neither 2 (if it were shared by odd/even wavefront) nor 4 (if it were shared per SIMD), but == 3. How come? This is possibly a wrong setup, since the group size differs between kernel invocations (it seems OK, though), but it possibly looks like SR is shared per wavefront # (so each of the 4 wavefronts has its own sr0; not confirmed, though).
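Plugging this case into the toy model above: per-SIMD sharing predicts expected_sr(4, 1) == 4, sharing per even/odd slot predicts expected_sr(2, 1) == 2, and fully per-wavefront sharing predicts 1; the observed 3 matches none of them.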

Also, if krn#2 increments sr0.x by 2 (two add sr0.x, sr0.x, l0.w ops, where l0.w is set to 1.0), with Thread block == 32 and group size == 16 for all 3 kernels (example #3 above), g[0..511].x is seen == 3, which I cannot explain either.

So, what is the sharing level of SR?
