
Vector types, VLIW, and wavefront size
MicahVillmow Aug 30, 2011 11:05 AM (in response to settle)
Settle,
The width of the VLIW unit is the number of ALU instructions that can be executed in parallel by a single workitem (i.e. software thread). The wavefront size is the number of workitems that execute in parallel in a single hardware thread.
Bdot Aug 30, 2011 8:38 PM (in response to settle)
Use the kernel analyzer and have a look at the assembly. From experience with my kernels, the compiler reorders statements heavily wherever there are no dependencies. So for one operation, f.x, f.y, f.z and f.w may be executed together with a fifth, totally unrelated operation, while the next vector operation may be spread over 4 cycles (together with other, unrelated operations). It also depends on whether the operation can be performed by all units. For instance, convert_float4 and convert_uint4, when converting float vectors to uint vectors and back, need to run in the t unit of the SIMD, so their components are executed sequentially. If the compiler cannot find suitable operations for the other slots, those slots go unused in that cycle, which is of course something to look at when aiming for top performance.
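
The packing behaviour described above can be illustrated with a toy model. This is only a sketch, not the real compiler: the names `Op` and `pack_vliw` are hypothetical, the greedy strategy is an assumption, and the model assumes 4 general slots plus one t slot that can also take a simple operation when it is free.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    deps: tuple = ()      # names of ops whose results this op needs
    t_only: bool = False  # assumption: only the t unit can run this op


def pack_vliw(ops, general_slots=4, t_slots=1):
    """Greedily pack ops into bundles; one bundle issues per cycle."""
    done = set()
    pending = list(ops)
    bundles = []
    while pending:
        general, t_unit = [], []
        for op in pending:
            if not all(d in done for d in op.deps):
                continue  # inputs come from an earlier bundle, not ready yet
            if op.t_only:
                if len(t_unit) < t_slots:
                    t_unit.append(op)
            elif len(general) < general_slots:
                general.append(op)
            elif len(t_unit) < t_slots:
                t_unit.append(op)  # assumed: t unit may also run simple ops
        issued = general + t_unit
        if not issued:
            raise ValueError("circular or missing dependency")
        bundles.append([op.name for op in issued])
        done.update(op.name for op in issued)
        pending = [op for op in pending if op.name not in done]
    return bundles


# A float4 add plus one unrelated scalar op: five independent operations.
vec_add = [Op("add.x"), Op("add.y"), Op("add.z"), Op("add.w"), Op("other")]
# A float4 -> uint4 conversion where every component needs the t unit.
convert = [Op(f"cvt.{c}", t_only=True) for c in "xyzw"]

add_bundles = pack_vliw(vec_add)
cvt_bundles = pack_vliw(convert)
print(len(add_bundles), len(cvt_bundles))
```

In this model the five independent operations fit into a single bundle, while the four conversions need four bundles with the general slots left empty, which is the sequential behaviour Bdot describes.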

fpaboim Aug 31, 2011 7:22 PM (in response to Bdot)
I'm no expert, but I believe the 5 and 4 after "VLIW" stand for the number of "math units". Cypress had 4 AUs (arithmetic units) + 1 SFU (special function unit), while Cayman has 4 AUs which can be rearranged when an SFU is needed to work as 1 AU + 1 SFU (3 AUs are grouped to serve as an SFU). I believe that SFUs in general just weren't being used that much (in relation to AUs, that is; technically you still have one SFU per stream processor), so they preferred to do this and gain a little more space to pack in more compute units. Since branching is handled separately in the stream processor, you could say the stream processor itself is the "ALU", with 4 AUs + 1 SFU (in the Cypress case) + 1 LU (logical unit). Since we have 16 stream processors per compute unit, the VLIW width is 16. 4 workitems can be pipelined, so that's why AMD recommends a wavefront size of 64.

rick.weber Aug 31, 2011 7:43 PM (in response to fpaboim)
The recommended wavefront size is 64 because there are 16 stream processors per compute unit and each stream processor interleaves 4 threads over 4 clock cycles.
Each stream processor has 4 ALUs. This implies that code that has little instruction parallelism will use little of the available performance. In the worst case, you only use 1/4 of the available ALUs every clock cycle even with 64 threads. This is why float4s are highly recommended if you can get by with them, because they trivially expose independent calculations (in addition to higher bandwidth from global memory and textures).
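
The arithmetic in this post can be written out as a short sketch. The constants are taken from the figures quoted above; the helper `alu_utilization` is a hypothetical name, and the model simply assumes that each cycle fills as many ALUs as there are independent operations available.

```python
STREAM_PROCESSORS_PER_CU = 16  # figure quoted in the post
INTERLEAVED_WORKITEMS = 4      # workitems interleaved per stream processor
ALUS_PER_STREAM_PROCESSOR = 4  # figure quoted in the post

# 16 stream processors, each interleaving 4 workitems over 4 cycles.
wavefront_size = STREAM_PROCESSORS_PER_CU * INTERLEAVED_WORKITEMS


def alu_utilization(independent_ops, alus=ALUS_PER_STREAM_PROCESSOR):
    """Fraction of ALUs kept busy when `independent_ops` can issue together."""
    return min(independent_ops, alus) / alus


scalar_chain = alu_utilization(1)  # fully dependent scalar code
float4_math = alu_utilization(4)   # four independent float4 components

print(wavefront_size, scalar_chain, float4_math)
```

Under these assumptions the wavefront size comes out as 64, a dependent scalar chain keeps 1/4 of the ALUs busy, and float4 arithmetic keeps all four busy.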

settle Sep 1, 2011 9:39 AM (in response to rick.weber)
I think my misunderstanding comes from assuming that the SPU in Cypress (VLIW5) could also execute add, sub, mul, etc., in which case a float5 would seem to be the best fit. However, if the SPU can only execute functions like sin, cos, exp, etc., then I understand why it's float4. Can you confirm that the SPU in Cypress (VLIW5) cannot execute add, sub, mul, etc.?
Are there any cases where VLIW5 (4 ALUs + 1 SPU) could be fully used back to back over 4 cycles, given that you said the wavefront size is 64? Also, when you say it interleaves 4 threads over 4 cycles, what exactly is a thread (a workitem?), and does that happen like a waterfall or in lock step?

himanshu.gautam Sep 1, 2011 10:02 AM (in response to settle)
Proper usage of the SPU in Cypress does not depend on the wavefront size.
To fully utilize the VLIW5 of Cypress you need enough instruction-level parallelism inside your kernel code. To be clear, only a single kernel instance runs on a VLIW5.
I am not sure about the capabilities of the 5th processing unit in VLIW5, though.
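
As a rough way to reason about "enough instruction-level parallelism", one common estimate divides the number of operations by the length of the longest dependency chain. The sketch below uses hypothetical helper names and made-up dependency graphs; it is not how the AMD compiler measures anything, only an illustration of the idea.

```python
def critical_path_length(deps):
    """deps maps each op name to the list of op names it depends on."""
    memo = {}

    def depth(op):
        if op not in memo:
            memo[op] = 1 + max((depth(d) for d in deps[op]), default=0)
        return memo[op]

    return max(depth(op) for op in deps)


def average_ilp(deps):
    """Operations per step if every op issued as early as its inputs allow."""
    return len(deps) / critical_path_length(deps)


# Five ops where each needs the previous result: nothing can run in parallel.
chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "e": ["d"]}
# Five ops with no dependencies: all could issue in the same bundle.
independent = {"a": [], "b": [], "c": [], "d": [], "e": []}

chain_ilp = average_ilp(chain)
independent_ilp = average_ilp(independent)
print(chain_ilp, independent_ilp)
```

By this estimate the dependent chain has an ILP of 1 and the independent set an ILP of 5. A kernel would need an average close to 5 to keep every VLIW5 slot busy, assuming all five units can execute the operations involved.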



