
arvin99
Adept II

How does GCN scheduling work?

Hi, I am confused about how GCN scheduling works.

Let's say I define a work-group size of 32 and 262,144 global work items in OpenCL. Then, over the four clock cycles of a wavefront, only 32 PEs (processing elements) do useful work (16 PEs in the first clock cycle and 16 PEs in the second), and the other PEs stay idle for that single wavefront, right?

The total number of wavefronts that run would be 262144 / 32 = 8192.
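To make my arithmetic explicit, here is a quick C sketch (assuming the 64-lane hardware wavefront, and that a 32-item group still occupies a whole wavefront with half its lanes masked off):

    #include <stdio.h>

    int main(void) {
        const size_t global_size = 262144; /* global work items */
        const size_t local_size  = 32;     /* requested work-group size */
        const size_t wave_size   = 64;     /* hardware wavefront width */

        /* Each work group is padded up to whole wavefronts, so a 32-item
           group still takes one 64-lane wavefront, half of it masked off. */
        size_t groups     = global_size / local_size;                            /* 8192 */
        size_t wavefronts = groups * ((local_size + wave_size - 1) / wave_size); /* 8192 */
        double util       = (double)local_size / (double)wave_size;              /* 0.5 */

        printf("groups=%zu wavefronts=%zu lane utilization=%.0f%%\n",
               groups, wavefronts, util * 100.0);
        return 0;
    }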

How about GCN? If I define a work-group size of less than 64, are there idle PEs, or do they dynamically work on the next wavefront?

Can someone explain this to me using these two images?


a) Taken from AnandTech Portal | AMD's Graphics Core Next Preview: AMD's New GPU, Architected For Compute

    Wavefront Execution Example: SIMD vs. VLIW. Not To Scale - Wavefront Size 16

http://images.anandtech.com/doci/4455/Wavefront.png


b) Taken from AMD: Southern Islands (7*** series) Speculation/ Rumour Thread - Page 33 - Beyond3D Forum

    Copyright Hiroshige Goto

developer
Adept II

Work-group sizes < 64 will result in idle cycles.

- Bruhapati

Thanks for the reply. So if the wavefront size is 64, then in image (a) the idle clock cycles are cycles 2, 3, and 4, right?

realhet
Miniboss

The second image is fairly accurate. You can see that a wavefront (which always has 64 work items, even when you specify only 32) has a 4-cycle latency and is processed over 4 clocks, 16 lanes at a time. There is also a slow double-precision instruction which takes 16 clocks. Notice that the 4 SIMD units aren't in sync: they follow each other with a 1-clock offset. That's because of the scalar ALU, which is not shown in the image. The scalar ALU starts an instruction every clock for a different vector ALU and gives the result back 4 cycles later. The scalar ALU is also 4x pipelined, and every 4 clocks it serves all 4 vector SIMD ALUs.
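A toy model of that issue pattern (just a sketch of my understanding, not actual hardware behavior): one instruction is issued per clock, rotating across the 4 SIMD units, and each vector instruction then keeps its SIMD busy for 4 clocks (64 lanes, 16 per clock).

    #include <stdio.h>

    int main(void) {
        for (int clock = 0; clock < 8; ++clock) {
            int simd = clock % 4; /* SIMD unit receiving a new instruction */
            printf("clock %d: issue to SIMD %d (busy through clock %d)\n",
                   clock, simd, clock + 3);
        }
        return 0;
    }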

If you look at a single clock line, you can sum up the performance of the whole unit: there you can issue 16*4 float32 multiply-adds, which is 16*4*2 = 128 flops. With the scalar ALU you can do an additional 64-bit integer operation, too. There are 32 such compute units in Tahiti, so the whole card does 128*32 = 4096 flops/cycle (plus 32 int64 ops/cycle). The default clock is 925 MHz, so the total performance is 4096*925 MHz = 3.7888 Tflops/s, plus 29.6 Gops/s of int64 operations.
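The same arithmetic, spelled out in C with the numbers from above:

    #include <stdio.h>

    int main(void) {
        const int    lanes_per_simd = 16;    /* lanes processed per SIMD per clock */
        const int    simds_per_cu   = 4;     /* SIMD units per compute unit */
        const int    flops_per_fma  = 2;     /* a multiply-add counts as 2 flops */
        const int    cus            = 32;    /* compute units on Tahiti */
        const double clock_ghz      = 0.925; /* 925 MHz default clock */

        int    per_cu     = lanes_per_simd * simds_per_cu * flops_per_fma; /* 128 */
        int    per_card   = per_cu * cus;                                  /* 4096 */
        double tflops     = per_card * clock_ghz / 1000.0;                 /* 3.7888 */
        double int64_gops = cus * clock_ghz;                               /* 29.6 */

        printf("%d flops/clock -> %.4f Tflops/s, plus %.1f Gops/s int64\n",
               per_card, tflops, int64_gops);
        return 0;
    }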

32 work items in a work group -> that's way too low.

To achieve the above performance on GCN, you must give it at least number_of_streams (32*64 = 2048 on Tahiti) * 4 work items, and you must have 64, 128, 192, or 256 work items in a work group. That's the minimum to keep every execution unit working.
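For example, a hypothetical host-side enqueue that meets those minimums (queue and kernel setup omitted; only the sizes matter here):

    #include <CL/cl.h>

    /* At least number_of_streams * 4 = 2048 * 4 = 8192 work items in flight,
       with a work-group size that is a multiple of the 64-wide wavefront. */
    cl_int enqueue_saturating(cl_command_queue queue, cl_kernel kernel) {
        size_t global_size = 2048 * 4; /* 8192 work items */
        size_t local_size  = 256;      /* multiple of 64 */
        return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                      &global_size, &local_size,
                                      0, NULL, NULL);
    }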

Thanks for the reply, realhet.

I know that I must give each work group a large number of work items, in multiples of 64.

But what I actually want to know about is dynamic scheduling.

If I define a work-group size of 16, are there idle ALUs or not?

In VLIW, if I define a work-group size of 32 and 262,144 global work items in OpenCL, then over the four clock cycles of a wavefront only 32 PEs (processing elements) do useful work (16 PEs in the first clock cycle and 16 in the second), and the other PEs are idle for that single wavefront. The total number of wavefronts is 8192.

How about GCN?


A wavefront is the smallest execution granularity and has a size of 64. Any smaller work-group size leads to underutilized hardware.
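A sketch of what that underutilization looks like, assuming work groups are rounded up to whole 64-lane wavefronts and the surplus lanes are masked off:

    #include <stdio.h>

    /* Fraction of wavefront lanes doing useful work for a given work-group size. */
    double lane_utilization(size_t local_size) {
        const size_t wave = 64;
        size_t waves = (local_size + wave - 1) / wave; /* wavefronts occupied */
        return (double)local_size / (double)(waves * wave);
    }

    int main(void) {
        size_t sizes[] = {16, 32, 64, 96, 128, 256};
        for (int i = 0; i < 6; ++i)
            printf("local_size %3zu -> %.0f%% of lanes busy\n",
                   sizes[i], lane_utilization(sizes[i]) * 100.0);
        return 0;
    }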


Here's what happens when there are only 16 items in a group: the grayed areas don't get anything to do, so they will be idle (physically, the 64-bit execution mask will disable them; only the first 16 items will be enabled). In the 4-stage pipeline there will be only one task at any given time, not 4.

[Attachment: 16workitems.png]
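In code terms, the execution mask for 16 live items would look something like this (a sketch of the idea, not actual hardware or driver code):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* 64-bit EXEC mask for a partially filled wavefront: with 16 live work
       items only the low 16 bits are set; the other 48 lanes sit idle. */
    uint64_t exec_mask(unsigned live_items) {
        return live_items >= 64 ? ~0ULL : (1ULL << live_items) - 1;
    }

    int main(void) {
        printf("EXEC for 16 items: 0x%016" PRIx64 "\n", exec_mask(16));
        return 0;
    }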


Ooh, thank you very much!
