From the AMD "stream computing user guide", section 1.2.2 (thread processing): to hide latencies, up to four threads can do four VLIW over four cycles. For example, the 16 TPs (thread processors) of one SE (SIMD engine) execute the same instruction, with each TP processing four threads at a time; this results in a 64-wide SIMD engine and a wavefront size of 64 threads.
Also, the 16 TPs of one SE contain 16*5 = 80 cores (reported as "shaders" by GPU-Z), because one VLIW instruction on one TP can carry up to five operations.
Really, there are too many "four"s here, and I cannot work out how exactly 4, 16 and 64 combine.
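The only part I can reproduce with confidence is the two products themselves. Here is a minimal sketch of that arithmetic; the constant names are my own and only the numbers come from the guide. It does not say how the four threads are scheduled inside a TP, which is exactly what I am asking about.

```python
# Numbers as quoted from the guide; the names are hypothetical, chosen for this sketch.
TPS_PER_SE = 16      # thread processors in one SIMD engine
THREADS_PER_TP = 4   # threads each TP processes "at a time"
OPS_PER_VLIW = 5     # maximum operations in one VLIW instruction of one TP

# 16 TPs, each handling 4 threads, give the quoted wavefront size.
wavefront_size = TPS_PER_SE * THREADS_PER_TP

# 16 TPs, each with 5 VLIW slots, give the "shader" count per SE.
cores_per_se = TPS_PER_SE * OPS_PER_VLIW

print(wavefront_size, cores_per_se)
```

So 64 comes from 16 and 4, while 80 comes from 16 and 5; the five VLIW slots do not enter the wavefront size at all.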
1. Is "64-wide SIMD engine" the same as "wavefront size of 64"?
2. "four threads can do four VLIW over four cycles" - can the instruction differ from cycle to cycle, or only the data?
3. "each TP processing four threads at a time" - do the "four threads" come from the "four float/int cores of a TP", or from "four VLIW over four cycles" = "one 'effective VLIW' over one cycle" = "one 'at a time' per four cycles"?