10 Replies Latest reply on Dec 28, 2013 12:11 PM by realhet

    VLIW 5 Architecture processing element

    arvin99

      I am really confused about the VLIW architecture. I have already read the AMD APP Programming Guide, and I understand the part about the GCN architecture (Southern Islands devices).

      In GCN, work-items map onto processing elements (16 PEs in each SIMD, and four SIMDs in one compute unit), and each SIMD array works on a different wavefront.

      It is easy to understand that a 64-element vector, called a wavefront, needs 4 cycles to execute (in each cycle, each of the four SIMD arrays processes a quarter, i.e. 16 work-items, of its own wavefront).

      But it is difficult for me to understand the VLIW architecture.

       

      From the AMD APP Programming Guide, Chapter 7, Performance and Optimization for Evergreen and Northern Islands Devices:

      The GPU consists of multiple compute units. Each compute unit contains 32 kB local (on-chip) memory, L1 cache, registers, and 16 processing elements (PEs).

      Each processing element contains a five-way (or four-way, depending on the GPU type) VLIW processor.

      Individual work-items execute on a single processing element; one or more work-groups execute on a single compute unit.

      On a GPU, hardware schedules the work-items.

      On the ATI Radeon™ HD 5000 series of GPUs, hardware schedules groups of work-items, called wavefronts, onto stream cores; thus, work-items within a wavefront execute in lock-step;

      the same instruction is executed on different data.




      What is a processing element in VLIW? Is it one of the 16 PEs inside a SIMD, or one of the 64 ALUs (16 x 4 ALUs per VLIW instruction)?

      If the processing elements are the 64 ALUs, then work-items can be mapped to 64 processing elements, so why does it need four cycles to execute a wavefront (64 ALUs -> 64 work-items, which is already the full wavefront size)?

      It is difficult to understand because the documentation uses many terms, such as ALU, processing element, and stream core.

        • Re: VLIW 5 Architecture processing element
          nou

          There are 16 processing elements in one CU. The 4/5 ALUs execute vectorized operations from the same work-item. So if you write float4 a,b,c; c=a+b; it will execute as a single VLIW instruction. On GCN it will take four instructions.

            • Re: VLIW 5 Architecture processing element
              arvin99

              Thanks for the reply, nou.

              So, in the VLIW architecture, a work-item is mapped to a processing element (16 in total), and a single work-item can execute several of the same operation in one instruction, right?

              I have two more questions:

               

              1. Can the VLIW architecture execute more than one wavefront (wavefront = 16) in one cycle if there is no dependency between the wavefronts (the book says that 16 work-items are executed in each cycle)? It looks impossible on VLIW, but the link (Graphics Core Next: The Southern Islands Architecture - AMD Radeon HD 7970: Promising Performance, Paper-Launched) shows that more than one wavefront can be executed in one cycle. One cycle on VLIW must come from one wavefront, mustn't it? Or is my understanding wrong?

               

              The image is taken from the link: Graphics Core Next: The Southern Islands Architecture - AMD Radeon HD 7970: Promising Performance, Paper-Launched

               

              2. How about float8? How do the VLIW and GCN architectures execute it?

                • Re: VLIW 5 Architecture processing element
                  nou

                  That is an incorrect use of the word wavefront. First, the article should be talking about instructions A-O. Second, it shows the compiler packing them inefficiently into six VLIW instructions. VLIW stands for Very Long Instruction Word, so you don't have simple ADD, SUB, MUL instructions but ADDSUBSUBMUL instructions, which execute operations from a single work-item. Those images are inaccurate because for VLIW they assume that D is dependent on A-C, but for GCN D is executed together with A-B. If D could execute independently, as shown in the GCN example, the compiler would pack A,B,D,E into one VLIW instruction. It would be more precise to say that GCN can take four work-groups and execute them on a single CU in parallel (don't forget that a group of 64 work-items gets executed over four cycles), which VLIW can't. VLIW is really about vectorized code, where you can operate on float4 or longer data types.

                   

                  A float8 is split in half on VLIW, so it executes in two instructions.
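The instruction counts mentioned in the replies so far (float4 in one VLIW instruction versus four on GCN, float8 in two VLIW instructions) can be sketched as a tiny Python model. It assumes that one VLIW instruction carries four vector slots (x, y, z, w) per work-item and that GCN issues one scalar operation per work-item per instruction. The function name is hypothetical and the model ignores everything except element count.

```python
import math

def instructions_per_work_item(vector_width, arch):
    """Instructions one work-item needs for a single element-wise vector op.

    Assumption: a VLIW instruction holds 4 vector slots (x, y, z, w);
    GCN handles one element per instruction for each work-item.
    """
    if arch == "vliw":
        return math.ceil(vector_width / 4)
    if arch == "gcn":
        return vector_width
    raise ValueError(f"unknown architecture: {arch}")

for width in (1, 4, 8):
    vliw = instructions_per_work_item(width, "vliw")
    gcn = instructions_per_work_item(width, "gcn")
    print(f"float{width}: VLIW {vliw} instruction(s), GCN {gcn} instruction(s)")
```

This only counts instructions per work-item; it says nothing about how many work-items each architecture keeps in flight at once.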

              • Re: VLIW 5 Architecture processing element
                realhet

                Hi,

                 

                Processing element: this corresponds to a single work-item in OpenCL terminology. That's what your OpenCL program is all about: one 'thread'.

                How the hardware realizes it is another matter:

                On VLIW, the compiler optimizes your code for 5 or 4 parallel 32-bit processors. It's like a superscalar x86 processor, which has more than one execution unit, but the main difference is that the allocation of those 4 or 5 execution units is resolved at compile time, not at run time. On VLIW5 you have to write your program so that it can be vectorized across 4-5 execution units. If your code is nothing but a long dependency chain, then only 1 execution unit can be utilized.

                Then came VLIW4 (HD6xxx), which got rid of the fifth, complex ALU and made the remaining four ALUs able to execute complex instructions like cosine or bit_field_insert. But you still had to vectorize your program into 4 parallel executions.

                And finally there is GCN, where every thread (or PE) of your program is executed by one 32-bit ALU, so you don't have to make your code parallel at all; you can feed a big dependency chain into each 16-wide SIMD. It will process it with 4 cycles of latency, as there is a 4-stage pipeline. So that's 16 instructions per SIMD per cycle. In one CU there are 4 SIMDs. In an HD7970 there are 32 CUs, so the only thing you have to think about is giving the GPU 4x more work-items than its stream_count. So for an HD7970, 8K threads is the minimum. On VLIW there is only 2x pipelining, so there you can issue half as many threads, but with 4-5x vectorized code.
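The "8K threads" figure follows directly from the numbers in this post. A minimal sketch of that arithmetic in Python, where the function and parameter names are made up for illustration and the defaults are the GCN figures quoted above:

```python
def min_work_items(compute_units, simds_per_cu=4, lanes_per_simd=16, pipeline_depth=4):
    """Rule of thumb from the post: pipeline depth times the stream count."""
    stream_count = compute_units * simds_per_cu * lanes_per_simd
    return stream_count * pipeline_depth

# HD7970: 32 CUs x 4 SIMDs x 16 lanes = 2048 lanes, times 4 = 8192
print(min_work_items(32))
```

Treat the result as a lower bound for keeping the ALUs busy; hiding memory latency typically needs more work-items than this.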

                You can check VLIW utilization by looking at the disassembled ISA code: if all the x,y,z,w,t cores have work to do, then the utilization is 100% (and if your program utilizes all 5 ALUs all the time, then make sure that your 5xxx has proper cooling).

                 

                Hope I haven't complicated things even further.

                  • Re: VLIW 5 Architecture processing element
                    arvin99

                    Thanks for the reply,

                    Hmm, so in conclusion a work-item maps to a processing element.

                    If I declare float a,b,c and write a = b*c, one operation will execute on a single ALU of each processing element in one cycle.

                    But if I declare float4 a,b,c and write a = b*c, four operations will execute on four ALUs of each processing element in one cycle.

                    Am I right?

                    How about using float8 in OpenCL? Each processing element has just 4/5 ALUs in the VLIW architecture.

                    Will the ALUs execute the rest of the operations in the next cycle?

                      • Re: VLIW 5 Architecture processing element
                        realhet

                        Float4 or float8 is just a high-level thing. It lets your OpenCL program look nicer.

                         

                        First, your OpenCL code (with float8s, for example) is compiled to the AMD_IL language (which has a float4 type too, but no float8).

                        After that, the final compilation to machine ISA occurs. In this step, the whole code is 'unpacked' into simple int, float or double math. It is the AMD_IL compiler that assigns these basic arithmetic operations to the actual VLIW5/4 or GCN execution units.

                         

                        On VLIW5 your calculation will be compiled to this:

                        x: a=b*c; y:nop; z:nop; w:nop; t:nop;   //only one execution unit has a job to do, the rest are sleeping.

                        (there are 5 execution units working on the same workitem: x,y,z,w and the special one: t)

                        If your program contains another math instruction which is independent of the previous one, then it can be executed in the Y unit in the same cycle.

                        On VLIW4 there is no T unit, but XYZW can do complicated instructions.

                        On GCN there is only one execution unit (which is working on a workitem).

                         

                        These units are grouped as follows. VLIW5: 16 xyzwt units make a SIMD engine. VLIW4: 16 xyzw units. GCN: 64 simple units.

                         

                        How about float4:

                        VLIW5: x: a0=b0*c0; y: a1=b1*c1; z: a2=b2*c2; w: a3=b3*c3; t: does nothing    //1 cycle total

                        VLIW4: x: a0=b0*c0; y: a1=b1*c1; z: a2=b2*c2; w: a3=b3*c3;                     //1 cycle total

                        GCN:

                        cycle0: a0=b0*c0

                        cycle1: a1=b1*c1

                        cycle2: a2=b2*c2

                        cycle3: a3=b3*c3

                         

                        It took 4x longer on GCN for a single work-item, but GCN also has 4x more 'things'. It's just a different layout. GCN's benefit is that it can execute long dependency chains 4x as effectively as VLIW. That means 4x fewer registers and 4x shorter program code.
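To make the "different layout" point concrete, here is a rough throughput sketch in Python. It is a simplified model of my own, not vendor data: a VLIW4 compute unit is taken as one 16-PE SIMD that needs 4 cycles per bundle for a 64-work-item wavefront, and a GCN compute unit as four 16-lane SIMDs, each needing 4 cycles per scalar instruction for its own wavefront. All names are hypothetical.

```python
WAVEFRONT = 64        # work-items per wavefront
CYCLES_PER_ISSUE = 4  # a 16-wide SIMD covers a wavefront in 4 cycles

def vliw4_items_per_cycle(bundles):
    """One 16-PE SIMD; every work-item runs `bundles` VLIW instructions."""
    return WAVEFRONT / (bundles * CYCLES_PER_ISSUE)

def gcn_items_per_cycle(scalar_ops, simds=4):
    """Four 16-lane SIMDs, each processing its own wavefront."""
    return simds * WAVEFRONT / (scalar_ops * CYCLES_PER_ISSUE)

# float4 multiply: four independent ops, packed into one VLIW bundle.
independent = (vliw4_items_per_cycle(1), gcn_items_per_cycle(4))

# Four ops in a dependency chain: VLIW needs one bundle per op.
chained = (vliw4_items_per_cycle(4), gcn_items_per_cycle(4))

print("independent (VLIW4, GCN):", independent)
print("chained     (VLIW4, GCN):", chained)
```

Under these assumptions both layouts finish the same number of work-items per cycle on the independent float4 case, while on the dependent chain the VLIW4 figure drops to a quarter and the GCN figure stays the same.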

                         

                        float4, float8, etc. are an easy way to eliminate dependency chains in your program. However, you can do this manually too; the compiler will unvectorize your code anyway. And only VLIW needs reduced instruction dependency.

                         

                        For example, this is a poorly utilized VLIW instruction stream. It takes 5 cycles.

                              0  y: LSHL        ____,  R1.x,  8                          

                              1  w: ADD_INT     ____,  R0.x,  PV0.y                 dependency of PV0.y (PreviousValue of Y unit in the 0 cycle)

                              2  z: LSHL        ____,  PV1.w,  2                        dependency of PV1.w

                              3  y: ADD_INT     ____,  PV2.z,  KC0[0].x           dependency of PV2.z

                              4  x: LSHR        R1.x,  PV3.y,  2                        dependency of PV3.y

                        so this is a long dependency chain on VLIW.

                         

                        And this is what 5 cycles of fully utilized VLIW4 code look like:

                            169 ALU: ADDR(1043) CNT(121)

                                256  x: ADD_INT     ____,  -1,  R18.y     

                                     y: ADD         R1.y,  R27.z,  R41.x     

                                     z: MUL_UINT24  R0.z,  R18.y,  4     

                                     w: ADD_INT     R1.w,  R18.y,  1     

                                257  x: ADD_INT     ____,  1,  PV256.z     

                                     y: LSHL        ____,  PV256.z,  2     

                                     z: ADD_INT     ____,  2,  PV256.z     

                                     w: MAX_INT     R0.w,  PV256.x,  0.0f     

                                258  x: LDS_WRITE   ____,  PV257.y,  R26.z     

                                     y: LSHL        R0.y,  PV257.z,  2     

                                     z: ADD_INT     ____,  3,  R0.z      VEC_021

                                     w: LSHL        ____,  PV257.x,  2     

                                259  x: LDS_WRITE   ____,  PV258.w,  R26.y     

                                     y: MUL_UINT24  ____,  R0.w,  4     

                                     z: MIN_INT     ____,  R1.w,  63      VEC_120

                                     w: LSHL        R0.w,  PV258.z,  2     

                                260  x: LDS_WRITE   ____,  R0.y,  R14.x     

                                     y: MUL_UINT24  R0.y,  PV259.z,  4     

                                     z: ADD_INT     ____,  3,  PV259.y     

                                     w: ADD_INT     ____,  2,  PV259.y
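Utilization can be estimated from a disassembly by counting how many slots each bundle fills. A minimal Python sketch, assuming a slot width of 4 for both listings (the thread only states VLIW4 for the second one; on VLIW5 the first would score lower). `slot_utilization` is a hypothetical helper, and the slot lists are transcribed by hand from the two listings in this post.

```python
def slot_utilization(bundles, width):
    """Fraction of ALU slots that carry an instruction across all bundles."""
    used = sum(len(slots) for slots in bundles)
    return used / (len(bundles) * width)

# First listing: one instruction per cycle (the dependency chain).
chain = [["y"], ["w"], ["z"], ["y"], ["x"]]

# Second listing: bundles 256-260, each filling x, y, z and w.
packed = [["x", "y", "z", "w"] for _ in range(5)]

print(slot_utilization(chain, 4))
print(slot_utilization(packed, 4))
```

The chain fills 5 of 20 slots, the packed code fills all 20.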

                          • Re: VLIW 5 Architecture processing element
                            arvin99

                            Thanks, it has become clearer, realhet.

                            Just to make sure: if my kernel has 8 simple instructions (just float) and I have three wavefronts of size 64 (wavefronts A, B and C running on VLIW), then:

                            a) If all of the instructions are independent, there will be 24 clock cycles in total

                               AAAABBBBCCCCAAAABBBBCCCC

                             

                            b) If all of the instructions depend on each other (a very rare case), there will be 96 clock cycles in total

                               AAAABBBBCCCCAAAABBBBCCCCAAAABBBBCCCCAAAABBBBCCCCAAAABBBBCCCCAAAABBBBCCCCAAAABBBBCCCCAAAABBBBCCCC

                             

                            Am I right?
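The two totals above can be reproduced with a few lines of Python. This is only the arithmetic behind the question, under assumptions that the post implies but does not state: 4 slots per VLIW bundle with perfect packing in the independent case, and 4 cycles per bundle per wavefront. Latency, memory access and clause switching are ignored, and the function name is hypothetical.

```python
import math

def total_cycles(instructions, wavefronts, independent, slots=4, cycles_per_bundle=4):
    """Issue cycles for `wavefronts` wavefronts running the same kernel."""
    if independent:
        bundles = math.ceil(instructions / slots)  # perfect packing
    else:
        bundles = instructions                     # one op per bundle
    return bundles * cycles_per_bundle * wavefronts

print(total_cycles(8, 3, independent=True))
print(total_cycles(8, 3, independent=False))
```

Case a) gives 2 bundles x 4 cycles x 3 wavefronts, case b) gives 8 bundles x 4 cycles x 3 wavefronts.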

                             

                            And if it needs just one clock cycle to execute an instruction, why must we have at least two wavefronts to hide the eight cycles of read-after-write latency on VLIW?

                            Is it because fetch-decode-execute takes more than one cycle?

                            The process that takes place in the ALU is just instruction execution, isn't it?

                            How many clock cycles does an instruction need to do its job (including fetch, decode, execute, and write-back to memory) on AMD hardware?

                            Is it 5 clock cycles for an ADD instruction (according to the Instruction Throughput table in the AMD APP Programming Guide)?

                              • Re: VLIW 5 Architecture processing element
                                realhet

                                IMO that TomsHW article is quite misleading.

                                It says something about wavefront dependency, and presents an awfully badly utilized example on VLIW and a very good one on GCN.

                                 

                                If I understand the article correctly, it draws the VLIW4 processor as a 16x4 array.

                                The width of 16 represents the 16 lanes, and the height of 4 is for the X, Y, Z, W execution units respectively. That gives 64 stream processors in total.

                                But a wavefront in the GPU does not stand for a "set of instructions"; it is 64 work-items bundled together, which is the smallest unit of work that a SIMD core can handle.

                                The SIMD core is only 16 wide, not 64 wide like a wavefront. But there is 4x pipelining, so in every cycle (not the gpu rate) a quarter of the current wavefront can be processed.
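The quarter-per-cycle idea can be pictured with a short Python sketch. Which work-items land in which cycle is my assumption (consecutive IDs per quarter); the post only says that a quarter of the wavefront is processed per cycle. The function name is hypothetical.

```python
def quarter_waves(wavefront_size=64, simd_width=16):
    """Split one wavefront into the slices a 16-wide SIMD handles per cycle."""
    items = list(range(wavefront_size))
    return [items[i:i + simd_width] for i in range(0, wavefront_size, simd_width)]

quarters = quarter_waves()
for cycle, quarter in enumerate(quarters):
    print(f"cycle {cycle}: work-items {quarter[0]}..{quarter[-1]}")
```

With the default sizes this yields four slices of 16 work-items, so one instruction for a full wavefront occupies the SIMD for four cycles.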

                                 

                                That 'wavefront' dependency visualisation is a bit weird.

                                What it describes as 'wavefront A,B,C...' could be instructions, I guess. ("Sometimes, it turns out that an instruction set, called a wavefront, can’t execute until another wavefront has been resolved." An instruction set, which is called a wavefront?!!!)

                                And there we can see an example of how badly an in-order queue handles dependencies.

                                But in reality there is out-of-order execution, and most importantly the X,Y,Z,W scheduling is resolved at COMPILE time. It is the machine code that tells the 4 units (xyzw) what they should do in every cycle.

                                With a given instruction stream, abcdefghijklmno, and the given bc, ef, fg, kl dependencies, the offline compiler will simply write machine code that reorders execution like this: abde, cfhi, gjkm, lno (if I'm right; l cannot share an instruction with k, since it depends on it).
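A compile-time reordering of this kind can be sketched as a greedy packer in Python. This is an illustration, not the actual AMD compiler algorithm. It reads a dependency such as "bc" as "c needs the result of b" (my interpretation) and only lets an operation into a bundle once everything it depends on sits in an earlier bundle. The function name is hypothetical.

```python
def pack_bundles(instructions, depends_on, width=4):
    """Greedy in-order packing of single-letter ops into VLIW bundles."""
    done = set()                    # ops already placed in earlier bundles
    remaining = list(instructions)
    bundles = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) == width:
                break
            # Results of the current bundle are not visible yet,
            # so only ops whose inputs are in `done` may join.
            if depends_on.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("cyclic dependency, nothing can be scheduled")
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
        bundles.append("".join(bundle))
    return bundles

deps = {"c": {"b"}, "f": {"e"}, "g": {"f"}, "l": {"k"}}
bundles = pack_bundles("abcdefghijklmno", deps)
print(bundles)
```

For this stream the packer needs four bundles for the fifteen operations, with l landing one bundle after k because of the kl dependency.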

                                It's no worse than GCN, because that example is so lightly dependent that even VLIW4 can handle it with 100% efficiency.

                                 

                                Please read this: "AMD_Evergreen-Family_Instruction_Set_Architecture 1.0d.pdf". Chapters 3 and 4 describe precisely how things work inside VLIW.

                                Also here's an article which is technically outstanding (imo): AMD's Cayman GPU Architecture
