First of all, I'd like to introduce myself - my name is Greg and I'm a student at Georgia Tech writing my senior thesis. I've been searching for and reading articles about exactly how ATI/AMD's VLIW architecture works (Wikipedia, AnandTech, bit-tech, etc.), but there are still a few things that I'm a little unclear about, and I'm hoping that someone here will be able to answer them for me.
Here's what I do know so far (using Cayman's VLIW4 architecture for this example): the chip's shaders are organized into blocks of SIMD cores, each of which is composed of 16 VLIW4 bundles. Each VLIW4 bundle has 4 identical execution units (EUs), usually referred to as the X, Y, Z, and W elements. That gives each SIMD core a total of 64 execution units, so the Radeon HD 6970, which has 24 SIMD cores IIRC, has a total of 1536 EUs.
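Just to show my arithmetic, here is a quick Python sketch of that count. The constant names are my own, and the figures are only my understanding of Cayman from the articles I've read, not a verified spec:

```python
# Back-of-the-envelope count of execution units (EUs).
# All figures are my understanding of Cayman / HD 6970, not a verified spec.
EUS_PER_VLIW4_UNIT = 4        # the X, Y, Z and W elements
VLIW4_UNITS_PER_SIMD = 16     # VLIW4 units in one SIMD core
SIMD_CORES = 24               # HD 6970, if I recall correctly

eus_per_simd = EUS_PER_VLIW4_UNIT * VLIW4_UNITS_PER_SIMD
total_eus = eus_per_simd * SIMD_CORES

print(eus_per_simd)  # 64
print(total_eus)     # 1536
```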
Now, from what I can understand, each SIMD core can execute independent threads, correct? I'm thinking that each SIMD core is separate in the way that CPU cores are separate (independent threads and execution, but with shared cache and other higher-level hardware), or am I mistaken? Also, I am assuming that the "SIMD" in SIMD core refers to Single Instruction, Multiple Data. So does the grouping of execution hardware into SIMD cores make the GPU effectively a MIMD/SIMD hybrid (in that it can do multiple SIMD operations simultaneously)?
OK, with that out of the way, it's on to my next question. From what I understand, the main advantage of VLIW is that it exploits instruction-level parallelism (which is just non-dependent instructions that can be executed simultaneously, right?). So each VLIW4 bundle is able to execute 4 different instructions simultaneously. Now, I know that they could all just execute the same instruction on different data, effectively becoming a whole hell of a lot of SIMD EUs. So my question is: seeing that the VLIW4 bundle is MIMD by definition, is the SIMD core so named because all 16 VLIW4 bundles receive the same instructions? In that case, wouldn't a SIMD core be able to execute 4 different "threads"/operations at a time, with one EU from each VLIW4 bundle executing a different thread?
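To check that I have the idea of instruction-level parallelism right, here is a toy Python sketch of packing non-dependent instructions into 4-wide bundles. The function `pack_bundles`, the instruction names, and the greedy strategy are all made up by me for illustration; I am not claiming this is how AMD's compiler actually does it:

```python
# Toy model of packing non-dependent instructions into 4-wide VLIW bundles.
# Everything here is hypothetical and for illustration only.

def pack_bundles(instructions, width=4):
    """instructions: list of (name, set of names it depends on), in program order.
    Returns a list of bundles, each a list of instruction names."""
    done = set()                  # results produced by earlier bundles
    remaining = list(instructions)
    bundles = []
    while remaining:
        bundle = []
        for name, deps in remaining:
            # Ready only if every input came from an EARLIER bundle,
            # and there is still a free slot in this bundle.
            if deps <= done and len(bundle) < width:
                bundle.append(name)
        if not bundle:
            raise ValueError("circular or unknown dependency")
        done.update(bundle)
        remaining = [(n, d) for n, d in remaining if n not in done]
        bundles.append(bundle)
    return bundles

# Four instructions with no dependencies fill one bundle.
independent = [("a", set()), ("b", set()), ("c", set()), ("d", set())]
# A chain where each instruction needs the previous result cannot be packed.
chain = [("a", set()), ("b", {"a"}), ("c", {"b"}), ("d", {"c"})]

print(pack_bundles(independent))  # [['a', 'b', 'c', 'd']]
print(pack_bundles(chain))        # [['a'], ['b'], ['c'], ['d']]
```

If I have this right, the dependent chain leaves three of the four slots empty in every bundle, which is why the amount of independent work available matters so much for VLIW.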
However, in trying to learn some OpenCL, I keep hearing that the GPU is a SIMD engine, and that to use it effectively you must be able to divide your function into non-dependent operations that can be applied to multiple data simultaneously. But from what I've been able to determine, in the case of the 6970, there are 24 separate SIMD "cores", each of which is theoretically capable of executing 4 different threads/operations, for a grand total of 96 different simultaneous threads/operations (each thread being executed on 16 EUs). So in the sense that each thread (probably the wrong term, so sorry - I'm not a CS major) is executed on 16 different EUs, regardless of whether it has 16 sets of data to operate on or just 1, it is SIMD.
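Here is my mental model of that last point as a toy Python sketch, in case it helps someone spot where I'm going wrong. `simd_step` and `LANES` are names I invented, this is not OpenCL, and the 16-lane figure is just the 16 EUs per thread from my description above:

```python
# Toy model of one SIMD step: a single instruction applied to every lane.
# Hypothetical names; this only encodes my current (possibly wrong) understanding.
LANES = 16

def simd_step(instruction, lane_data):
    """Apply the same instruction to each lane's data, as if in lockstep.
    Returns (results for the useful lanes, number of idle lanes)."""
    lane_data = list(lane_data)
    if len(lane_data) > LANES:
        raise ValueError("more data elements than lanes")
    idle = LANES - len(lane_data)   # lanes with nothing useful to do
    results = [instruction(x) for x in lane_data]
    return results, idle

def double(x):
    return x * 2

full_results, full_idle = simd_step(double, range(16))
one_result, one_idle = simd_step(double, [21])

print(full_idle)             # 0
print(one_result, one_idle)  # [42] 15
```

In this model the step costs the same whether there are 16 data elements or just 1; with a single element, 15 of the 16 lanes simply sit idle.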
But when thinking of the GPU as a whole, isn't it a bit of a misnomer to refer to it as SIMD? Shouldn't it really be described as MIMD divided into SIMD compute units (again, sorry, probably the wrong term - I'm referring to the symmetrical EUs in a SIMD core, such as all the X EUs in one SIMD core), where each compute unit performs the same instruction on 16 sets of data?
I'm probably wrong on a lot of things and terms here, so please excuse me as I'm not a CS major, but I would REALLY appreciate any help you could provide, either as direct answers to my questions, or as links to articles that describe it in more detail.
Thanks in advance for everything and sorry about the length of the post - it's hard to be concise when you don't completely understand something.
P.S. On a more personal line of curiosity, I know that the Very Long Instruction Words must be generated by the "compiler" that I keep hearing about, but what exactly is the compiler? I'm guessing that it is a part of the driver that creates the VLIWs in JIT fashion from the hardware calls, or is it something else? Is it some sort of pre-compiled microcode that AMD updates with preset translations from the hardware calls to VLIW instructions? Basically, what do they mean when they refer to the compiler? If it's a software JIT compiler in the driver, how much does it impact CPU performance? For example, does the graphics performance of the Radeon 5870, 6970, etc. depend on CPU speed, or depend on it more than GeForce performance does (looking purely at GPU performance)? If it does work in real time in the driver and does take significant CPU cycles (5-10%), then wouldn't it be better suited to an FPGA that can be reconfigured as AMD optimizes their compiler?
P.P.S. Again, sorry for this being so long. I'm just a very inquisitive person by nature; it's not enough for me to KNOW how something works, I have to UNDERSTAND how and why something works. My parents and babysitter used to call me the Why Monster when I was a kid because I asked about everything lol