3 Replies Latest reply on May 16, 2011 10:20 AM by desertman909

    Need help understanding VLIW4/5

    TheyDroppedMe
      Need some clarification for my senior thesis

      Hey everyone,

      First of all, I'd like to introduce myself - my name is Greg and I'm a student at Georgia Tech writing my senior thesis. I've been searching for and reading some articles about exactly how ATI/AMD's VLIW architecture works (wikipedia, anandtech, bit-tech, etc) but there are still a few things that I'm a little unclear about and I'm hoping that someone here will be able to answer for me.

      Here's what I do know so far (using Cayman's VLIW4 architecture for this example): the chip's shaders are organized into blocks of SIMD cores, each of which is made up of 16 VLIW4 bundles. Each VLIW4 bundle has 4 identical execution units (EUs), usually referred to as the X, Y, Z, and W elements. That gives each SIMD core a total of 64 execution units -- giving the Radeon HD 6970, which has 24 SIMD cores IIRC, a total of 1536 EUs.
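      That arithmetic can be sanity-checked with a trivial sketch (Python just for illustration; the counts are the ones quoted above, not taken from an official spec):

```python
# Back-of-envelope check of the Cayman / HD 6970 counts quoted above.
SIMD_CORES = 24              # SIMD cores on the HD 6970 (Cayman), as stated above
VLIW_BUNDLES_PER_SIMD = 16   # VLIW4 bundles per SIMD core
EUS_PER_BUNDLE = 4           # X, Y, Z, W execution units per bundle

eus_per_simd = VLIW_BUNDLES_PER_SIMD * EUS_PER_BUNDLE
total_eus = SIMD_CORES * eus_per_simd

print(eus_per_simd)  # 64
print(total_eus)     # 1536
```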

      Now, from what I can understand, each SIMD core can execute independent threads, correct? I'm thinking that each SIMD core is separate in the way that CPU cores are separate (independent threads and execution, but with shared cache and other higher-level hardware), or am I mistaken? Also, I am assuming that the "SIMD" in SIMD core refers to Single Instruction, Multiple Data. So does the grouping of execution hardware into SIMD cores make the GPU effectively a MIMD/SIMD hybrid (in that it can do multiple SIMD operations simultaneously)?

      OK, with that out of the way, it's on to my next question. From what I understand, the main advantage of VLIW is that it takes advantage of instruction-level parallelism (which is just non-dependent instructions that can be executed simultaneously, right?). So each VLIW4 bundle is able to execute 4 different instructions simultaneously. Now, I know that they could just all execute the same instruction on different data, becoming a whole hell of a lot of SIMD EUs. So my question is, seeing that the VLIW4 bundle is MIMD by definition, is the SIMD core named so because all 16 VLIW4 bundles receive the same instructions? In that case, wouldn't a SIMD core be able to execute 4 different "threads"/operations at a time, with one EU from each VLIW4 bundle executing a different thread?

      However, in trying to learn some OpenCL, I keep hearing that the GPU is a SIMD engine, and that you must be able to divide your function into non-dependent operations that can be applied to multiple data simultaneously in order for it to be effective. But from what I've been able to determine, in the case of the 6970, there are 24 separate SIMD "cores", each of which is theoretically capable of executing 4 different threads/operations (each thread on 16 EUs), for a grand total of 96 different simultaneous threads/operations. So in the sense that each thread (probably the wrong term, so sorry - I'm not a CS major) is executed on 16 different EUs, regardless of whether it has 16 sets of data to operate on or just 1, it is SIMD.

      But when thinking of the GPU as a whole, isn't it a bit of a misnomer to refer to it as SIMD? Shouldn't it really be described as MIMD divided into SIMD compute units (again, sorry, probably the wrong term - I'm referring to the symmetrical EUs in a SIMD core, such as all the X EUs in one SIMD core), where each compute unit does the same instruction on 16 sets of data?

      I'm probably wrong on a lot of things and terms here, so please excuse me as I'm not a CS major, but I would REALLY appreciate any help you could provide, either as direct answers to my questions, or as links to articles that describe it in more detail.

      Thanks in advance for everything and sorry about the length of the post - it's hard to be concise when you don't completely understand something.

       

      -Greg

       

      *Also, on a more personal line of curiosity, I know that the Very Long Instruction Words must be generated by the "compiler" that I keep hearing about, but what exactly is the compiler? I'm guessing that it is a part of the driver that creates the VLIWs in JIT fashion from the hardware calls, or is it something else? Is it some sort of pre-compiled microcode that AMD updates with preset translations from the hardware calls to VLIW instructions? Basically, what do they mean when they refer to the compiler? If it's a software JIT compiler in the driver, how much does it impact CPU performance (e.g., does Radeon 5870/6970 graphics performance depend on CPU speed more than GeForce's does, purely in terms of GPU performance)? If it does work in real time in the driver and takes significant CPU cycles (5-10%), then wouldn't it be better suited to an FPGA that can be reconfigured as AMD optimizes their compiler?

      P.P.S. Again, sorry for this being so long. I'm just a very inquisitive person by nature; it's not enough for me to KNOW how something works, I have to UNDERSTAND how and why something works. My parents and babysitter used to call me the Why Monster when I was a kid because I asked about everything lol

        • Need help understanding VLIW4/5
          nou

          Each SIMD core has one branch unit, so when you have a branch in your code and the path the code takes diverges across one workgroup, the SIMD core must take both paths. So the whole SIMD core must indeed execute the same instruction.
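          The divergence behavior described above can be modelled in plain Python (a toy simulation, not AMD's actual predication hardware: "lanes" stand in for the work-items that run in lockstep, and masking stands in for what the branch unit does):

```python
# Toy model of branch divergence on a lockstep SIMD: when lanes disagree
# on a branch, the hardware executes BOTH paths, masking out the lanes
# that did not take each path.
def simd_branch(data):
    cond = [x > 0 for x in data]      # per-lane branch condition
    passes = 0
    results = [None] * len(data)

    if any(cond):                     # "then" path: one pass over all lanes
        passes += 1
        for i, x in enumerate(data):
            if cond[i]:               # lanes where cond is False sit idle
                results[i] = x * 2
    if not all(cond):                 # "else" path: a second full pass
        passes += 1
        for i, x in enumerate(data):
            if not cond[i]:
                results[i] = -x

    return results, passes

uniform, p1 = simd_branch([1, 2, 3, 4])      # all lanes agree -> 1 pass
divergent, p2 = simd_branch([1, -2, 3, -4])  # lanes disagree  -> 2 passes
print(p1, p2)  # 1 2
```

When the condition diverges, the work takes twice as many passes even though each lane only uses one of the two results -- which is why divergent branches inside a workgroup are expensive.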

          Different SIMD cores can execute different parts of the same program; currently in OpenCL you can execute only one program at a time.

          The GPU does have the capability to execute multiple programs in parallel, but it is not exposed yet.

          Also, because of memory access latencies, one SIMD core can switch between multiple workgroups.
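          That latency-hiding idea can be put in back-of-envelope form (the 400-clock latency and 100 clocks of ALU work below are made-up numbers, just to show the shape of the trade-off):

```python
# If a memory read takes `latency` clocks and a workgroup has
# `alu_clocks` of independent ALU work to issue between reads, the SIMD
# needs roughly latency / alu_clocks workgroups in flight to stay busy:
# while one workgroup waits on memory, the others issue ALU work.
import math

def workgroups_to_hide(latency, alu_clocks):
    return math.ceil(latency / alu_clocks)

print(workgroups_to_hide(400, 100))  # 4
```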

          • Need help understanding VLIW4/5
            bridgman

            Hi Greg;

            Here are some quick answers. I'm sure they will result in more questions.

            >>Now from what I can understand, each SIMD core can execute independent threads, correct? I'm thinking that each SIMD core is separate in the way that CPU cores are separate (independent threads and execution, but with shared cache and other higher-level hardware), or am I mistaken?

            Yes, but the SIMDs are "less independent" than a CPU core. From a graphics perspective you might have many SIMDs but they are all working together in a single graphics pipeline -- at any given instant you might have 10 SIMDs processing vertices and 14 SIMDs processing pixels (ignoring geometry shaders, hull shaders, domain shaders etc.) but they are all working on different bits of the same workload. Each SIMD has its own program counter but you don't get to control each SIMD independently. This might be a bit different for compute shaders; I'm just learning about them now.

            >>Also, I am assuming that the "SIMD" in SIMD core refers to Single Instruction, Multiple Data. So does the grouping of execution hardware into SIMD cores makes the GPU effectively a MIMD/SIMD hybrid (in that it can do multiple SIMD operations simultaneously)?

            Well, I guess strictly speaking it's a MIMD made up of SIMDs made up of VLIWs, but that's too hard to pronounce... so "yeah".

            >>Ok, with that out of the way, it's on to my next question. From what I understand, the main advantage of VLIW is that it takes advantage of instruction-level parallelism (which is just non-dependent instructions that can be executed simultaneously, right?). So each VLIW4 bundle is able to execute 4 different instructions simultaneously.

            Yes, but if you want to understand VLIW quickly think about working on the RGBA components of a pixel - VLIW4 allows all four components of the pixel to be processed at the same time picking different fields out of the same (wide) registers, which makes for a very efficient implementation.
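            The RGBA illustration above can be sketched as follows (hypothetical code, Python for illustration only; the point is that the four component operations have no dependencies on each other, so a VLIW4 bundle could issue them together):

```python
# One pixel's RGBA components: four independent multiply-adds that a
# VLIW4 bundle could issue in the same cycle, one per X/Y/Z/W unit.
def shade_pixel(pixel, gain, bias):
    r, g, b, a = pixel
    # The four lines below have no data dependencies on each other,
    # so a VLIW compiler is free to pack them into a single bundle.
    r2 = r * gain + bias   # X unit
    g2 = g * gain + bias   # Y unit
    b2 = b * gain + bias   # Z unit
    a2 = a * gain + bias   # W unit
    return (r2, g2, b2, a2)

print(shade_pixel((1.0, 2.0, 3.0, 4.0), 2.0, 0.5))  # (2.5, 4.5, 6.5, 8.5)
```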

            >>Now, I know that they could just all execute the same instruction on different data, just becoming a whole hell of a lot of SIMD EU's. So my question is, seeing that the VLIW4 bundle is MIMD by definition, is the SIMD core named so because all 16 VLIW4 bundles receive the same instructions? In that case, wouldn't a SIMD core be able to execute 4 different "threads"/operations at a time, with one EU from each VLIW4 bundle executing a different thread?

            Yes, a SIMD runs the same instruction on 16 different datasets at the same time. Could be 16 vertices, or 16 pixels (4 2x2 quads), or 16 matrix elements etc...

            Strictly speaking a 16-way SIMD works on 64 different pieces of data (a wavefront) in 4 clocks, IIRC.
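            The wavefront arithmetic, as recalled above (hedged with the same IIRC):

```python
# A 16-wide SIMD repeats the same VLIW instruction over 4 clocks, so one
# "wavefront" covers 64 work-items even though only 16 bundles exist
# physically.
SIMD_WIDTH = 16            # VLIW4 bundles issuing in lockstep
CLOCKS_PER_INSTRUCTION = 4 # the instruction is repeated over 4 clocks

wavefront_size = SIMD_WIDTH * CLOCKS_PER_INSTRUCTION
print(wavefront_size)  # 64
```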

            >>However, in trying to learn some OpenCL, I keep hearing that the GPU is a SIMD engine, and that you must be able to divide your function into non-dependent operations that can be applied to multiple data simultaneously in order for it to be effective. But from what I've been able to determine, in the case of the 6970, there are 24 separate SIMD "cores", each of which is theoretically capable of executing 4 different threads/operations (each thread on 16 EUs)

            Don't think about the 4 instructions in a VLIW as 4 separate threads, that way lies madness -- think about workloads where you can extract some ILP and take advantage of it in the hardware by scheduling and executing more than one instruction at a time (think about a matrix calculation where each element of the result matrix involves data from 8 other matrices, i.e. a = (b*c) + (d*e) + (f*g) + (h*i) - you can execute the 4 multiplies independently EVEN WHEN WORKING ON A SINGLE DATA RESULT)
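            The matrix-element example in the reply, written out (plain Python; the comments mark which operations are independent and which are not):

```python
# ILP within a SINGLE result: the four multiplies below share no inputs
# or outputs, so a VLIW4 compiler can schedule them in one bundle even
# though we are computing one scalar value.
def matrix_element(b, c, d, e, f, g, h, i):
    m0 = b * c   # independent multiply 1 (could go to the X unit)
    m1 = d * e   # independent multiply 2 (Y unit)
    m2 = f * g   # independent multiply 3 (Z unit)
    m3 = h * i   # independent multiply 4 (W unit)
    # The additions DO depend on the products, so they must come later.
    return m0 + m1 + m2 + m3

print(matrix_element(1, 2, 3, 4, 5, 6, 7, 8))  # 100
```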

            >>, for a grand total of 96 different simultaneous threads/operations (each thread being executed on 16 EU's). So in the sense that each thread (probably the wrong term, so sorry - I'm not a CS major) is executed on 16 different EU's, regardless of if it has 16 sets of data to operate on or just 1, it is SIMD.

            24 somewhat independent SIMDs, each executing the same "instruction" on 16 different chunks of data, where that instruction is VLIW and may actually include 4 different operations in parallel.

            >>But when thinking of the GPU as a whole, isn't it a bit of a misnomer to refer to it as SIMD; shouldn't it really be described as MIMD divided into SIMD compute units (again, sorry probably the wrong term - I'm referring to the symmetrical EU's in a SIMD core, such as all the X EU's in one SIMD Core), where each compute unit does the same instruction on 16 sets of data?

            I don't think anyone refers to the GPU itself as a SIMD, it's a pile of SIMDs flying in loose formation.

            >>*Also, on a more personal line of curiosity, I know that the Very Long Instruction Words must be generated by the "compiler" that I keep hearing about, but what exactly is the compiler? I'm guessing that it is a part of the driver that creates the VLIW's in JIT fashion from the hardware calls, or is it something else? Is it some sort of pre-compiled microcode that AMD updates with preset translations from the hardware calls to VLIW instructions? Basically, what do they mean when they refer to the compiler?

            Yep, it's a JIT compiler used by the DirectX driver, the OpenGL driver, the OpenCL driver and a couple of other drivers as well. It goes from "il" (which is documented as part of the APP SDK) to hardware instructions. We usually call it "the shader compiler" for historical reasons.

            >>If it's a software JIT compiler in the driver, how much does it impact CPU performance (e.g. does Radeon 5870, 6970, etc graphics performance depend on CPU speed, or more than Geforce does - but only purely gpu performance). If it does work in real-time in the driver and does take significant CPU cycles (5-10%), then wouldn't it be more suited to an FPGA that can be reconfigured as AMD optimizes their compiler?

            Remember that shaders (programs running on the SIMDs) are typically compiled once when the app starts up then executed a bazillion or two times before the app shuts down. There are a few exceptions to this but they are pretty rare so normally the shader compiler execution time only affects application startup. There is also a precompilation option to avoid the JIT step where necessary.