
DTop (Staff)

Compute Shader scheduling

About deterministic scheduler behavior, wavefront execution, and LDS access.

I have read many threads about this topic here, and one of the best is probably this one:

http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=99919

However, a number of simple questions remain outstanding:

 

a)      It seems conclusive that wavefronts execute within a Thread Group. The Thread Group size is defined by the gridBlock.width parameter of the CALprogramGrid structure, and the number of Thread Groups is the domain execution size (in pixels) divided by the Thread Group size (see the sketch after this list).

b)      If the Thread Group size is twice the number of actual execution units (for 7xx the number of execution units per SIMD seems to be 64), set both in the kernel and in gridBlock.width, will the Thread Group queue 2 wavefronts on the same SIMD, still within the same Thread Group, without interruption?

c)      If fence_ works per Group, and the Group size is larger than the available execution units so that execution splits into 2 wavefronts (case b above), will the first wavefront be deferred until the second wavefront reaches the barrier before the first is allowed to continue? Or is it simply an incorrect setting to have a Group size larger than the actual execution units per SIMD?

d)      If the wavefront size is set to half the execution units of a SIMD, will half of the SIMD be wasted, or will another Group be started on the other half of the SIMD?

e)      If more Groups are set than there are available SIMDs, will the Groups be scheduled for execution one after another, in some non-predictable order, until finished?

f)      Once a wavefront's execution has finished, does the LDS content remain persistent between wavefront runs, so that the next Thread Group finds the LDS content from the previous wavefront and can reuse it?
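
To make the arithmetic in (a) concrete, here is a minimal C sketch; the Domain3D struct merely mirrors the width/height/depth fields of CALdomain3D used in this thread, and the helper name is mine, not part of CAL.

#include <stdio.h>

/* Mirrors the CALdomain3D fields referenced in this thread. */
typedef struct { unsigned width, height, depth; } Domain3D;

/* Number of Thread Groups = domain execution size (in threads/pixels)
   divided by the Thread Group size, rounded up per dimension. */
static unsigned num_thread_groups(Domain3D domain, Domain3D gridBlock)
{
    unsigned gx = (domain.width  + gridBlock.width  - 1) / gridBlock.width;
    unsigned gy = (domain.height + gridBlock.height - 1) / gridBlock.height;
    unsigned gz = (domain.depth  + gridBlock.depth  - 1) / gridBlock.depth;
    return gx * gy * gz;
}

int main(void)
{
    Domain3D domain    = { 256, 256, 1 };  /* execution domain  */
    Domain3D gridBlock = {  64,   1, 1 };  /* Thread Group size */
    /* 256*256 threads / 64 threads per group = 1024 groups */
    printf("thread groups: %u\n", num_thread_groups(domain, gridBlock));
    return 0;
}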


A) All wavefronts within a thread group are executed on a single SIMD. Two wavefronts can execute in parallel (commonly referred to as the even and odd wavefronts) on a single SIMD.
B) Thread groups are scheduled onto SIMDs until a SIMD cannot hold any more thread groups; it then waits for resources to be freed up by a thread group finishing execution.
C) If your group size is larger than a single wavefront, then when the first wavefront hits the barrier, it will wait for the rest of the wavefronts to hit the barrier before continuing execution.
D) The other half of the wavefront is marked inactive and no execution will occur on those threads.
E) The order is sequential over the SIMDs before wrapping back to the original SIMD. On 7XX this behavior can be modified by setting the addressing mode to wavefront-absolute instead of wavefront-relative; this causes only a single group to be executed per SIMD no matter what. There can only be as many groups executing per SIMD as resources allow.
F) Within a single kernel execution, LDS content is persistent. Between kernel executions, LDS content is persistent within the same command buffer, which is only guaranteed with the calCtxRunProgramGridArray API call. If this API call cannot fit all kernels in a single command buffer, then the call fails.
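
As a sketch of the batching that (F) relies on: this is my reading of the CAL 1.x headers, where calCtxRunProgramGridArray takes several CALprogramGrid entries wrapped in a CALprogramGridArray. The exact struct fields and signature may differ between SDK versions, so treat this as illustrative, not verified.

#include "cal.h"

/* Two kernels submitted in one command buffer so that the second can
   reuse the LDS content of the first (per answer F above). */
CALresult run_pair_in_one_cmdbuf(CALcontext ctx,
                                 CALfunc producer, CALfunc consumer,
                                 CALdomain3D block, CALdomain3D grid)
{
    CALprogramGrid grids[2];

    grids[0].func      = producer;  /* writes LDS */
    grids[0].gridBlock = block;
    grids[0].gridSize  = grid;
    grids[0].flags     = 0;

    grids[1]      = grids[0];
    grids[1].func = consumer;       /* reads the producer's LDS */

    CALprogramGridArray arr;
    arr.gridArray = grids;
    arr.num       = 2;
    arr.flags     = 0;

    CALevent ev = 0;
    /* Per (F), persistence is only guaranteed if both kernels fit in one
       command buffer; otherwise this call fails rather than silently
       splitting the batch. */
    return calCtxRunProgramGridArray(&ev, ctx, &arr);
}
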
DTop (Staff)

Thank you, Micah!!!

I appreciate your time answering these questions.

 

To clarify:

 

B) Do you mean that if 1 thread group is scheduled onto 1 SIMD, and the thread group requires more resources than the SIMD can give it, then the scheduler will wait until part of the thread group finishes first, schedule another wavefront for the same group, and not release the SIMD resources until all threads have finished? Is control not returned until all threads of the given Group have finished?

 

D) Presuming 1 SIMD has 64 execution units (7xx case):

Did you describe the case where the kernel has (pseudo code below):

dcl_num_thread_per_group 64

 

CALProgramGrid.gridBlock::width = 64

CALProgramGrid.gridBlock::height = 1

CALProgramGrid.gridBlock::depth = 1

 

CALProgramGrid.gridSize::width = 1

CALProgramGrid.gridSize::height = 1

CALProgramGrid.gridSize::depth = 1

 

 

But

CALdomain3D::width = 8

CALdomain3D::height = 4

(making the domain size equal to 32, which is half of the threads declared), so in this case half of the SIMD will be wasted?

 

-------------------

Will the second half of the SIMD be wasted when

1 SIMD has 64 execution units (7xx case) and

the kernel has

dcl_num_thread_per_group 32

 

CALProgramGrid.gridBlock::width = 32

CALProgramGrid.gridBlock::height = 1

CALProgramGrid.gridBlock::depth = 1

 

CALProgramGrid.gridSize::width = 1

CALProgramGrid.gridSize::height = 1

CALProgramGrid.gridSize::depth = 1

 

CALdomain3D::width = 8

CALdomain3D::height = 4

?

 

Or will it be able to accept another similar Thread Group, say, from another context? Even if the kernel program in the other context is different?

 

E) What is the behavior for 8xx?

 


B) If your thread group requires more resources than the entirety of the SIMD, execution or compilation will fail. It is an all or nothing approach.

D) Yes, half of the threads will be wasted in both examples. Multiple groups are not packed into a single wavefront (see the sketch after these answers).

E) 8XX is a derivative of 7XX, so the behaviour should be the same, but I have not had a chance to look deeply into it yet.
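
A small C illustration of the answer to D, assuming (per the question) a 64-lane wavefront on 7xx; the helper is illustrative only.

#include <stdio.h>

/* Per answer D: groups are not packed into one wavefront, so lanes
   beyond the live thread count are marked inactive and simply idle. */
static void report(const char *which, unsigned live_threads,
                   unsigned wavefront_size)
{
    unsigned active = live_threads < wavefront_size ? live_threads
                                                    : wavefront_size;
    printf("%s: %u of %u lanes active, %u idle\n",
           which, active, wavefront_size, wavefront_size - active);
}

int main(void)
{
    report("domain 8x4 (32 threads), group of 64", 32, 64);
    report("group of 32 on a 64-lane SIMD",        32, 64);
    return 0;  /* both cases leave half the lanes idle */
}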

Dear Micah,

I just want to nail this down.

B.1) Using the following declarations (pseudo code below):

dcl_num_thread_per_group 64

 

CALProgramGrid.gridBlock::width = 64

CALProgramGrid.gridBlock::height = 1

CALProgramGrid.gridBlock::depth = 1

 

CALProgramGrid.gridSize::width = 1024

CALProgramGrid.gridSize::height = 1

CALProgramGrid.gridSize::depth = 1

 

 

With

CALdomain3D::width = 256

CALdomain3D::height = 256

(making the domain Thread Group count equal to 1024).

 

The HD4600 seems to be fine with this (with a declaration of threads per group == 64), while the attributes query returns a wavefrontSize equal to 32.

 

This way 1 Thread Group has a size of 64, but there are only 32 execution units.

How does this work?
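
For what it is worth, the B.1 numbers are internally consistent; a quick C check, with the values copied from the pseudo code above:

#include <assert.h>

int main(void)
{
    unsigned domain_w = 256, domain_h = 256; /* CALdomain3D      */
    unsigned group_size = 64;                /* gridBlock::width */
    unsigned grid_size  = 1024;              /* gridSize::width  */

    unsigned total_threads = domain_w * domain_h;    /* 65536       */
    assert(total_threads / group_size == grid_size); /* 1024 groups */
    return 0;
}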

 

B.2) Also, by saying “Thread groups are scheduled on SIMDs until the SIMD cannot hold any more thread groups and then it waits for more resources to be cleared by the execution of a thread group finishing”, do you mean that if 2 Thread Groups can fit into 1 SIMD, then they will execute together?

For example, in case

dcl_num_thread_per_group 32

 

CALProgramGrid.gridBlock::width = 32

CALProgramGrid.gridBlock::height = 1

CALProgramGrid.gridBlock::depth = 1

 

CALProgramGrid.gridSize::width = 2048

CALProgramGrid.gridSize::height = 1

CALProgramGrid.gridSize::depth = 1

 

With

CALdomain3D::width = 256

CALdomain3D::height = 256

(making the domain Thread Group count equal to 2048, matching gridSize::width).

then 2 Thread Groups will execute on 1 SIMD of a 7xx (64 exec. units per SIMD), unless the addressing is declared as absolute?

 

E)

In the case of the example

dcl_num_thread_per_group 64

 

CALProgramGrid.gridBlock::width = 64

CALProgramGrid.gridBlock::height = 1

CALProgramGrid.gridBlock::depth = 1

 

CALProgramGrid.gridSize::width = 1024

CALProgramGrid.gridSize::height = 1

CALProgramGrid.gridSize::depth = 1

 

 

With

CALdomain3D::width = 256

CALdomain3D::height = 256

(making the domain Thread Group count equal to 1024).

 

Does it mean that Thread Groups will be allocated 1 per SIMD in round-robin fashion across the SIMDs, and that if one of the SIMDs takes longer to execute (due to a code branch, for example) it will slow down the whole dispatch, possibly waiting on that SIMD to finish before scheduling the rest? Or does the dispatcher schedule onto SIMDs on an availability basis?

 


Micah,

Can you please comment at least on question B.1 above?

Thanks.


A thread group can contain more than a single wavefront. A wavefront is a hardware construct, a thread group is a software construct, and there is not a 1-to-1 correspondence between them.
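
Applied to B.1 above: a 64-thread group on an HD4600 reporting wavefrontSize == 32 is simply cut into 2 wavefronts. A one-line C sketch of that decomposition (my helper, not a CAL call):

/* Hardware wavefronts needed to run one software thread group,
   rounding the last, possibly partial, wavefront up. */
static unsigned wavefronts_per_group(unsigned group_size,
                                     unsigned wavefront_size)
{
    return (group_size + wavefront_size - 1) / wavefront_size;
}
/* wavefronts_per_group(64, 32) == 2: the B.1 case. */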

Ok, I see.

Therefore, which is it:

a) does a calCtxRunProgramGrid call schedule work for only 1 SIMD (assigning gridBlock*gridSize threads to the chosen SIMD),

b) or does a single calCtxRunProgramGrid call distribute the load across the SIMDs (one gridBlock per SIMD) and schedule the remaining gridBlocks as SIMDs become free?

(I.e., shall I call calCtxRunProgramGrid in a loop to load all SIMDs in the GPU?)

Also,

Will shared registers (the sr# ones) be persistent over kernel invocations only when the kernels are launched via calCtxRunProgramGridArray, as with LDS?

Micah, thanks for everything you have posted so far!

 

 


As long as there is work to be finished and there is space for a thread group to be scheduled on a SIMD, the GPU will schedule work.
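
So a single dispatch should be enough; there should be no need to call calCtxRunProgramGrid in a loop per SIMD. A sketch, assuming the CALprogramGrid layout quoted earlier in this thread and the calCtxRunProgramGrid signature as I recall it from the CAL 1.x headers:

#include "cal.h"

/* One call describes the whole domain; the hardware then distributes
   the 1024 groups across all SIMDs as resources become free. */
CALresult dispatch_whole_domain(CALcontext ctx, CALfunc func)
{
    CALprogramGrid pg;
    pg.func             = func;
    pg.gridBlock.width  = 64;    /* threads per group          */
    pg.gridBlock.height = 1;
    pg.gridBlock.depth  = 1;
    pg.gridSize.width   = 1024;  /* number of groups in flight */
    pg.gridSize.height  = 1;
    pg.gridSize.depth   = 1;
    pg.flags            = 0;

    CALevent ev = 0;
    return calCtxRunProgramGrid(&ev, ctx, &pg);
}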

So, as a result of a single calCtxRunProgramGrid call, the GPU will schedule runs among ALL available SIMDs, not on just a single SIMD. Correct?
