6 Replies Latest reply on Sep 1, 2011 10:02 AM by himanshu.gautam

    Vector types, VLIW, and wavefront size

    settle

      Suppose a standard saxpy kernel using float returns 64 from a call to clGetKernelWorkGroupInfo(..., CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, ...).  If one takes the same kernel and just changes float to float4, would the same query return 16 or 64?

       

      Also, does float4 get striped across the 4 or 5 lanes (processing elements) in each of the 16 SIMD units in each compute unit, so that f.x, f.y, f.z, f.w execute concurrently?  Or does float4 get stacked onto a single lane (f.x, f.y, f.z, f.w execute sequentially), with different float4s executing similarly on the other lanes?

       

      And if the wavefront size is 64, how can VLIW5 ever be fully utilized?  VLIW4 times four cycles yields 64, but VLIW5 times four cycles yields 80.  I'm really confused about how this works.  It makes me think VLIW5 should have a wavefront size of 80, not 64.

        • Vector types, VLIW, and wavefront size
          MicahVillmow
          Settle,
          The width of the VLIW unit is the number of ALU instructions that can be executed in parallel by a single work-item (i.e. software thread). The wavefront size is the number of work-items that are executing in parallel in a single hardware thread.
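
To make the distinction concrete, here is a rough back-of-the-envelope sketch. It assumes the figures quoted elsewhere in this thread (16 stream processors per compute unit, 4 work-items interleaved over 4 cycles); the function names are made up for illustration and are not part of any AMD or OpenCL API.

```python
# Toy model only: the 16 lanes and 4 cycles are the figures quoted in this thread.
LANES_PER_COMPUTE_UNIT = 16   # stream processors (VLIW units) per compute unit
CYCLES_PER_INSTRUCTION = 4    # work-items interleaved on each lane


def wavefront_size(lanes=LANES_PER_COMPUTE_UNIT, cycles=CYCLES_PER_INSTRUCTION):
    """Work-items sharing one hardware thread; the VLIW width plays no part."""
    return lanes * cycles


def alu_slots_per_wavefront_bundle(vliw_width):
    """ALU slots on offer when one VLIW bundle is issued for a whole wavefront."""
    return wavefront_size() * vliw_width


for width in (4, 5):
    print(f"VLIW{width}: wavefront = {wavefront_size()} work-items, "
          f"{alu_slots_per_wavefront_bundle(width)} ALU slots per bundle")
```

Under these assumptions the wavefront stays at 64 work-items for both VLIW4 and VLIW5. The VLIW width only changes how many ALU slots each of those 64 work-items can fill per bundle.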
          • Vector types, VLIW, and wavefront size
            Bdot

            Use the Kernel Analyzer and have a look at the assembly. In my experience, the compiler reorders statements heavily wherever there are no dependencies. So for one operation, f.x, f.y, f.z and f.w may be executed together with a fifth, totally unrelated operation, while the next vector operation may be spread over 4 cycles (together with other, unrelated operations). It also depends on whether the operation can be performed by all units. For instance, convert_float4 and convert_uint4, when converting float vectors to uint vectors and back, need to run in the t unit of the SIMD, so their components are executed sequentially. If the compiler cannot find suitable operations for the other slots, those slots go unused in that cycle, which is of course something to look at when aiming for top performance.
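
The packing behaviour described above can be imitated with a toy greedy scheduler. This is only a sketch and not how the real compiler works: `pack_bundles`, the op names, and the rule "at most one t-only op per bundle, any other op may use any slot" are simplifying assumptions made for illustration.

```python
def pack_bundles(ops, width=5):
    """Greedily pack ops into VLIW bundles of at most `width` slots.

    ops: list of (name, deps, needs_t) tuples in program order. `deps` is a
    set of op names that must have completed in an earlier bundle, and
    `needs_t` marks an op that can only run in the single t slot.
    """
    done = set()
    pending = list(ops)
    bundles = []
    while pending:
        bundle, t_used = [], False
        for op in pending:
            name, deps, needs_t = op
            if len(bundle) == width:
                break
            if not deps <= done:
                continue          # an input is not ready yet
            if needs_t and t_used:
                continue          # only one t slot per bundle
            bundle.append(op)
            t_used = t_used or needs_t
        if not bundle:
            raise ValueError("unsatisfiable dependencies")
        for op in bundle:
            pending.remove(op)
            done.add(op[0])
        bundles.append([op[0] for op in bundle])
    return bundles


# Hypothetical workload: a 4-component conversion (t-only) plus an
# unrelated, independent 4-component add.
example_ops = (
    [(f"cvt.{c}", set(), True) for c in "xyzw"]
    + [(f"add.{c}", set(), False) for c in "xyzw"]
)
bundles = pack_bundles(example_ops)
for i, names in enumerate(bundles):
    print(f"bundle {i}: {names} ({5 - len(names)} slots unused)")
```

In this model the four conversions serialize on the t slot, so the eight operations need four bundles, and three of those bundles carry a single operation with the other slots left empty.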

              • Vector types, VLIW, and wavefront size
                fpaboim

                I'm no expert, but I believe the 5 and 4 after "VLIW" stand for the number of "math units": Cypress had 4 AUs (arithmetic units) + 1 SFU (special function unit), while Cayman has 4 AUs which can be rearranged, when an SFU is needed, to work as 1 AU + 1 SFU (3 AUs are grouped to serve as an SFU). I believe that SFUs in general just weren't being used that much (relative to AUs, that is; technically you still have one SFU per stream processor), so they preferred to do this and get a little more space to pack in more compute units. Since branching is handled separately in the stream processor, you could say the stream processor itself is the "ALU", with 4 AUs + 1 SFU (in the Cypress case) + 1 LU (logical unit). Since we have 16 stream processors per compute unit, the SIMD width is 16. 4 work-items can be pipelined, so that's why AMD recommends a wavefront size of 64.

                  • Vector types, VLIW, and wavefront size
                    rick.weber

                     

                    The recommended wavefront size is 64 because there are 16 stream processors per compute unit and each stream processor interleaves 4 threads over 4 clock cycles.

                    Each stream processor has 4 ALUs. This implies that code that has little instruction parallelism will use little of the available performance. In the worst case, you only use 1/4 of the available ALUs every clock cycle even with 64 threads. This is why float4s are highly recommended if you can get by with them, because they trivially expose independent calculations (in addition to higher bandwidth from global memory and textures).
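
A rough way to put numbers on this, as a sketch under simplifying assumptions: count only the multiply-adds of saxpy (ignoring address arithmetic and whatever else the compiler might pack alongside) and treat every slot as interchangeable. `slot_utilization` is a made-up helper, not an AMD tool.

```python
def slot_utilization(independent_ops, vliw_width=4):
    """Fraction of ALU slots filled in one bundle for one work-item."""
    return min(independent_ops, vliw_width) / vliw_width


scalar = slot_utilization(1)   # float saxpy: one multiply-add per work-item
vector = slot_utilization(4)   # float4 saxpy: four independent multiply-adds
print(f"float  saxpy: {scalar:.0%} of slots busy")
print(f"float4 saxpy: {vector:.0%} of slots busy")
print(f"float4 on a 5-wide unit: {slot_utilization(4, vliw_width=5):.0%}")
```

With 4 slots, the scalar kernel lands on the 1/4 worst case and the float4 kernel fills the unit. On a 5-wide unit, float4 alone would leave one slot free unless the compiler finds another independent operation to put there.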

                      • Vector types, VLIW, and wavefront size
                        settle

                         

                        I think my misunderstanding originates from thinking the SPU in Cypress (VLIW5) could also execute add, sub, mul, etc. in what would seem to work best on float5.  However, if the SPU can only execute functions like sin, cos, exp, etc. then I understand why it's float4.  Can you confirm that the SPU in Cypress (VLIW5) cannot execute add, sub, mul, etc.?

                        Are there any cases where VLIW5 (4 ALUs + 1 SPU) could be fully used back to back over 4 cycles given that you said the wavefront size is 64?  Also, when you say interleave 4 threads over 4 cycles, what exactly is a thread (work-item?) and is that like a waterfall or in lock step?

                          • Vector types, VLIW, and wavefront size
                            himanshu.gautam

                            Proper use of the SPU in Cypress does not depend on the wavefront size.

                            To fully utilize the VLIW5 of Cypress, you need enough instruction-level parallelism inside your kernel code. To be clear, only a single kernel instance runs on a VLIW5 unit.

                            I am not sure, though, about the capabilities of the fifth processing unit in VLIW5.