Suppose a standard saxpy kernel using float returns 64 from a call to clGetKernelWorkGroupInfo(..., CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, ...). If one takes the same kernel and just changes float to float4, would the same query return 16 or 64?
Also, does a float4 get striped across the 4 or 5 lanes (processing elements) of each of the 16 SIMD units in a compute unit, so that f.x, f.y, f.z, and f.w execute concurrently? Or does a float4 get stacked onto a single lane, so that f.x, f.y, f.z, and f.w execute sequentially, with different float4s executing the same way on the other lanes?
And if the wavefront size is 64, how can VLIW5 ever be fully utilized? VLIW4 times four cycles yields 64, but VLIW5 times four cycles yields 80. I'm really confused about how this works. It makes me think VLIW5 should have a wavefront size of 80, not 64.