I have read many threads about this topic here, and one of the best is probably this one:
http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=99919
however, there are a number of simple questions outstanding:
a) It seems conclusive that wavefronts execute within a Thread Group. The Thread Group size is defined by the gridBlock.width parameter of the CALprogramGrid structure, and the number of Thread Groups is the domain execution size (in pixels) divided by the Thread Group size.
b) If the Thread Group size (set in the kernel and in gridBlock.width) is twice the number of actual execution units (for 7xx the number of execution units seems to be 64), will the Thread Group queue 2 wavefronts on the same SIMD, still within the same Thread Group, without interruption?
c) If fence_ works per Group, the Group size is larger than the available execution units, and execution is split into 2 wavefronts (case b above), will the first wavefront be deferred until the second wavefront reaches the barrier before the first is allowed to continue? Or is it simply an incorrect setting to have a Group size greater than the actual execution units per SIMD?
d) If the wavefront size is set to ½ of the execution units of a SIMD, will half of the SIMD be wasted, or will another Group be started on the other half of the SIMD?
e) If more Groups are set than there are available SIMDs, will the groups be scheduled for execution one after another, in some non-predictable order, until finished?
f) Once a wavefront's execution has finished, does the LDS content remain persistent between wavefront runs, so the next Thread Group will find the LDS content from the previous wavefront and can reuse it?
Thank you, Micah!!!
I appreciate your time answering these questions.
To clarify:
B) Do you mean that if 1 thread group is scheduled to 1 SIMD, and the thread group requires more resources than the SIMD can give it, then the scheduler will wait until part of the thread group finishes first, schedule another wavefront for the same group, and not release the SIMD resources until all threads are finished? That is, does control not return until all threads for the given Group have finished?
D) Presuming 1 SIMD has 64 execution units (7xx case):
Did you describe the case where the kernel has (pseudo code below)
dcl_num_thread_per_group 64
CALProgramGrid.gridBlock::width = 64
CALProgramGrid.gridBlock::height = 1
CALProgramGrid.gridBlock::depth = 1
CALProgramGrid.gridSize::width = 1
CALProgramGrid.gridSize::height = 1
CALProgramGrid.gridSize::depth = 1
But
CALdomain3D::width = 8
CALdomain3D::height = 4
(making the domain size equal to 32, half of the threads declared), so in this case will half of the SIMD be wasted?
-------------------
Will the second half of the SIMD be wasted when
1 SIMD has 64 execution units (7xx case) and
Kernel has
dcl_num_thread_per_group 32
CALProgramGrid.gridBlock::width = 32
CALProgramGrid.gridBlock::height = 1
CALProgramGrid.gridBlock::depth = 1
CALProgramGrid.gridSize::width = 1
CALProgramGrid.gridSize::height = 1
CALProgramGrid.gridSize::depth = 1
CALdomain3D::width = 8
CALdomain3D::height = 4
?
Or will it be able to accept another similar Thread Group, say, from another context? Even if the kernel program in the other context is different?
E) What is the behavior for 8xx?
Dear Micah,
just want to nail this down:
B.1) Using the following declarations (pseudo code below)
dcl_num_thread_per_group 64
CALProgramGrid.gridBlock::width = 64
CALProgramGrid.gridBlock::height = 1
CALProgramGrid.gridBlock::depth = 1
CALProgramGrid.gridSize::width = 1024
CALProgramGrid.gridSize::height = 1
CALProgramGrid.gridSize::depth = 1
With
CALdomain3D::width = 256
CALdomain3D::height = 256
(making domain Thread Group count eq. to 1024).
The HD4600 seems to be fine with a declaration of 64 threads per group, while the attributes call returns a wavefrontSize of 32.
This way 1 Thread Group has a size of 64, but there are only 32 execution units.
How does this work?
B.2) Also, by saying “Thread groups are scheduled on SIMD's until the SIMD cannot hold any more thread groups and then it waits for more resources to be cleared by the execution of a thread group finishing” do you mean that if 2 Thread Groups can fit into 1 SIMD then they will execute together?
For example, in case
dcl_num_thread_per_group 32
CALProgramGrid.gridBlock::width = 32
CALProgramGrid.gridBlock::height = 1
CALProgramGrid.gridBlock::depth = 1
CALProgramGrid.gridSize::width = 2048
CALProgramGrid.gridSize::height = 1
CALProgramGrid.gridSize::depth = 1
With
CALdomain3D::width = 256
CALdomain3D::height = 256
(making the domain Thread Group count equal to 2048).
Then will 2 Thread Groups execute on 1 SIMD of a 7xx (64 exec. units per SIMD), unless the address is declared as absolute?
E)
In case of example
dcl_num_thread_per_group 64
CALProgramGrid.gridBlock::width = 64
CALProgramGrid.gridBlock::height = 1
CALProgramGrid.gridBlock::depth = 1
CALProgramGrid.gridSize::width = 1024
CALProgramGrid.gridSize::height = 1
CALProgramGrid.gridSize::depth = 1
With
CALdomain3D::width = 256
CALdomain3D::height = 256
(making domain Thread Group count eq. to 1024).
Does it mean that Thread Groups will be allocated 1 per SIMD, round-robin across the SIMDs, and that if one of the SIMDs takes longer to execute (due to a code branch, for example) it will slow down the whole dispatch, possibly waiting on that SIMD to finish before scheduling the rest? Or does the dispatcher schedule onto SIMDs on an availability basis?
Micah,
Can you please comment at least on the B.1 question above?
Thanks.
Ok, I see.
Therefore, the question is whether
a) a calCtxRunProgramGrid call will schedule work for only 1 SIMD (assigning gridBlock*gridSize threads to the chosen SIMD),
b) or a single calCtxRunProgramGrid call will distribute the load across the SIMDs (one gridBlock per SIMD), scheduling the remaining gridBlocks as SIMDs become free?
(I.e., should I call calCtxRunProgramGrid in a loop to load all SIMDs on the GPU?)
Also,
Will shared registers (the sr# ones) be persistent across kernel invocations only when the kernels are called via calCtxRunProgramGrid as well?
Micah, thanks for all the things you have posted so far!
So, as a result of a single calCtxRunProgramGrid call, the GPU will schedule runs across ALL available SIMDs, not just on a single SIMD. Correct?