
OpenCL

foomanchoo
Adept I

concurrent execution of wavefront of 1 workgroup

Good morning.

Just to clarify: is executing 4 workgroups of 1 wavefront each per CU as fast as executing 1 workgroup of 256 work items (ignoring the minor overhead associated with workgroup scheduling)? I.e., is it OK to run 1 workgroup of size 256 per CU?

And if yes: if 4 wavefronts work on the same 64 data paths while sharing intermediate results via LDS, will the same work on 64 data paths take 1/4 of the time compared to working on 256 data paths with one thread doing all the work, instead of 4 threads coordinating per data path via LDS (ignoring the overhead that would result from wavefronts not reaching barriers at exactly the same time)?

Also, the wavefronts would be executing barrier instructions in synchronization with other threads, but the wavefronts would diverge, so the barriers would not be at the same address in the instruction stream. Is that legal?

Thanks for helping.

2 Replies
nou
Exemplar

concurrent execution of wavefront of 1 workgroup

No. Each work item in the workgroup must hit the SAME barrier. Even hitting the same number of barriers is not enough.
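To make the rule concrete, here is a rough OpenCL C sketch of the 4-wavefronts-on-64-data-paths idea from the question. The kernel name `sum_per_lane` and its arguments are made up for illustration. It assumes a local size of 256, 64-wide wavefronts, and that work items 0..63, 64..127, and so on land in separate wavefronts (a common mapping, but not something the OpenCL spec guarantees).

```c
// Hypothetical kernel: 4 wavefronts cooperate on 64 data paths via LDS.
// Assumes local size 256 and 64-wide wavefronts (not guaranteed by the spec).
__kernel void sum_per_lane(__global const float *in,
                           __global float *out,
                           __local float *scratch)   // 256 floats of LDS
{
    const size_t lid  = get_local_id(0);  // 0..255 within the workgroup
    const size_t lane = lid % 64;         // data path 0..63
    const size_t wave = lid / 64;         // assumed wavefront index 0..3

    // Each wavefront publishes its intermediate result to LDS.
    scratch[lid] = in[get_global_id(0)];

    // NOT legal: two different barriers, each reached by only part of the group,
    // even though every work item executes exactly one barrier call.
    //
    //   if (wave == 0) { barrier(CLK_LOCAL_MEM_FENCE); }
    //   else           { barrier(CLK_LOCAL_MEM_FENCE); }

    // Legal: one barrier in uniform control flow, reached by all 256 work items.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Divergence after the barrier is fine: only wavefront 0 combines the results.
    if (wave == 0) {
        float acc = 0.0f;
        for (size_t w = 0; w < 4; ++w) {
            acc += scratch[w * 64 + lane];
        }
        out[get_group_id(0) * 64 + lane] = acc;
    }
}
```

The point is that the wavefronts may diverge between barriers, but each `barrier` call site has to be reached by the whole workgroup. If a barrier sits inside a conditional, either all work items of the group enter that conditional or none do.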

LeeHowes
Staff

Re: concurrent execution of wavefront of 1 workgroup

Further to nou's comment, just one workgroup of 256 work items is not likely to occupy the device efficiently. Any cache miss or barrier is likely to leave you with gaps in the instruction stream.

There isn't any significant difference between four groups of 64 and one of 256 unless you get benefits from reuse between the wavefronts, or (in the negative direction) the larger groups keep you from fully occupying the CU, reducing the total number of wavefronts and the device's ability to cover for stalls.
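As a back-of-the-envelope check of the comparison above, the small host-side C sketch below counts resident wavefronts per CU for a given launch shape. The helper names are made up for illustration, and the 64-wide wavefront is an assumption taken from the numbers in this thread. It only does the arithmetic; it says nothing about register or LDS limits, which also cap how many groups fit on a CU.

```c
/* Hypothetical helpers for counting wavefronts; not part of any OpenCL API. */

/* Wavefronts needed to hold one workgroup, rounding up for partial wavefronts. */
static unsigned wavefronts_per_group(unsigned group_size, unsigned wave_size)
{
    return (group_size + wave_size - 1) / wave_size;
}

/* Wavefronts resident on one CU when it hosts groups_per_cu workgroups. */
static unsigned resident_wavefronts(unsigned groups_per_cu,
                                    unsigned group_size,
                                    unsigned wave_size)
{
    return groups_per_cu * wavefronts_per_group(group_size, wave_size);
}
```

With a wavefront size of 64, `resident_wavefronts(4, 64, 64)` and `resident_wavefronts(1, 256, 64)` both come out to 4, which is why the two launch shapes look alike to the CU. The difference shows up only if the larger group size reduces how many groups fit, so the total wavefront count drops and there is less other work to hide stalls behind.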
