There are 16 processing elements in one CU. The 4/5 ALUs execute vectorized operations from the same work-item, so if you write float4 a,b,c; c=a+b; it will execute as a single VLIW instruction. On GCN it will take four instructions.
Thanks for the reply, Nou.
So in the VLIW architecture a work-item is mapped to a processing element (16 in total), where a single work-item can execute the same operation on multiple data at once, right?
I have two more questions:
1. Can the VLIW architecture execute more than one wavefront (wavefront = 16) in a single cycle if there is no dependency between the wavefronts (the book says that 16 work-items are executed in each cycle)? It looks impossible on VLIW, but the linked article (Graphics Core Next: The Southern Islands Architecture - AMD Radeon HD 7970: Promising Performance, Paper-Launched) shows more than one wavefront being executed in one cycle. One VLIW cycle must belong to a single wavefront, mustn't it? Or is my understanding wrong?
The image is taken from the linked article: Graphics Core Next: The Southern Islands Architecture - AMD Radeon HD 7970: Promising Performance, Paper-Launched
2. How about float8? How do the VLIW and GCN architectures execute it?
That is an incorrect use of the word wavefront. The first figure should be talking about instructions A-O; the second one shows how the compiler packs them, inefficiently, into six VLIW instructions. VLIW stands for Very Long Instruction Word: you don't have simple ADD, SUB, MUL instructions but ADD+SUB+SUB+MUL bundles which execute operations from a single work-item. Those images are inaccurate, because for VLIW they assume that D is dependent on A-C, while for GCN D is executed together with A-B. If D could execute independently, as the GCN example shows, the compiler would pack A, B, D, E into one VLIW instruction. It would be more precise to say that GCN can take four work-groups and execute them on a single CU in parallel. Don't forget that a wavefront of 64 work-items gets executed there in four cycles, which VLIW can't do. VLIW is really about vectorized code, where you operate on float4 or longer data types.
A float8 is divided in half on VLIW, so it executes as two instructions.
Thanks Nou, you explained the image clearly.
But I still don't understand float8.
If I use float4, then four ALUs are filled with the same operation in each processing element (in one cycle), aren't they?
And with a float8, will the GPU fetch, decode, and execute two instructions (one per float4 half) in different cycles?
Yes, a float8 will be executed in two cycles, with two instructions, on the VLIW architecture.
Processing Element: a single work-item in OpenCL terminology. That's what your OpenCL program is all about: one 'thread'.
How the hardware realizes it is another thing:
On VLIW the compiler will optimize your code for 5 or 4 parallel 32-bit processors. It's like a superscalar x86 processor which has more than one execution unit, but the main difference is that the allocation of those 4 or 5 execution units is resolved at compile time, not at run time. On VLIW5 you have to write your program so that it can be vectorized across 4-5 execution units. If your code is nothing but a long dependency chain, then only 1 execution unit can be utilized.
Then came VLIW4 (HD6xxx), which got rid of the fifth, complicated ALU and made the remaining four ALUs able to execute complicated instructions like cosine or bit_field_insert. But you still had to vectorize your program into 4 parallel executions.
And finally there is GCN, where every thread (or PE) of your program is executed by one 32-bit ALU, so you don't have to make your code parallel at all: you can feed a big dependency chain into each 16-wide SIMD, and it will handle it within 4 cycles of latency, as there is a 4-stage pipeline. That's 16 instructions per SIMD per cycle. In one CU there are 4 SIMDs, and in an HD7970 there are 32 CUs, so the only thing you have to think about is giving the GPU 4x more work than its stream core count. For an HD7970 that makes 8K threads the minimum. On VLIW there is only 2x pipelining, so there you can issue half as many threads, but with 4-5x vectorized code.
You can check VLIW utilization by looking at the disassembled ISA code: if all the x,y,z,w,t cores have work to do, then the utilization is 100% (and if your program utilizes all 5 ALUs all the time, make sure your 5xxx has proper cooling).
Hope I haven't complicated things even further.
Thanks for the reply.
Hmm, so in conclusion a work-item maps to a processing element.
If I declare float a,b,c and write the formula a = b*c, one operation executes on a single ALU for each processing element in one cycle.
But if I declare float4 a,b,c and write the formula a = b*c, four operations execute on four ALUs for each processing element in one cycle.
Am I right?
And how about using float8 in OpenCL? Each processing element has only 4/5 ALUs in the VLIW architecture.
Will the ALUs execute the rest of the operations in the next cycle?
A float4 or float8 is just a high-level thing. It lets your OpenCL program look nicer.
First, your OpenCL code (with float8s, for example) is compiled to the AMD_IL language (which has a float4 type too, but no float8).
After that, the final compilation to machine ISA occurs. In this step the whole code is 'unpacked' into simple int, float or double math. It is the AMD_IL compiler that assigns these basic arithmetic operations to the actual VLIW5/4 or GCN execution units.
On VLIW5 your calculation will be compiled to this:
x: a=b*c; y:nop; z:nop; w:nop; t:nop; //only one execution unit has a job to do, the rest are sleeping.
(there are 5 execution units working on the same workitem: x,y,z,w and the special one: t)
If your program contains another math instruction that is independent of the previous one, then it can be executed in the Y unit in the same cycle.
On VLIW4 there is no T unit, but XYZW can do complicated instructions.
On GCN there is only one execution unit (which is working on a workitem).
They are grouped like this: VLIW5: 16 pieces of xyzwt units make up a SIMD engine. VLIW4: 16 pieces of xyzw. GCN: 64 simple units.
How about float4:
VLIW5: x: a0=b0*c0; y: a1=b1*c1; z: a2=b2*c2; w: a3=b3*c3; t: does nothing //1 cycle total
VLIW4: x: a0=b0*c0; y: a1=b1*c1; z: a2=b2*c2; w: a3=b3*c3; //1 cycle total
It takes 4x longer on GCN for a single work-item, but GCN also has 4x more 'things'; it's just a different layout. GCN's benefit is that it can execute long dependency chains 4x more effectively than VLIW. That means 4x fewer registers and 4x shorter program code.
float4, float8, etc. are an easy way to eliminate dependency chains in your program. However, you can do this manually too; the compiler will unvectorize your code anyway. And only VLIW needs reduced instruction dependency.
For example, this is a poorly efficient VLIW instruction stream; it takes 5 cycles:
0 y: LSHL ____, R1.x, 8
1 w: ADD_INT ____, R0.x, PV0.y // depends on PV0.y (PreviousValue of the Y unit in cycle 0)
2 z: LSHL ____, PV1.w, 2 // depends on PV1.w
3 y: ADD_INT ____, PV2.z, KC0.x // depends on PV2.z
4 x: LSHR R1.x, PV3.y, 2 // depends on PV3.y
So this is a long dependency chain on VLIW.
And this is how 5 cycles of fully utilized VLIW4 code look:
169 ALU: ADDR(1043) CNT(121)
256 x: ADD_INT ____, -1, R18.y
y: ADD R1.y, R27.z, R41.x
z: MUL_UINT24 R0.z, R18.y, 4
w: ADD_INT R1.w, R18.y, 1
257 x: ADD_INT ____, 1, PV256.z
y: LSHL ____, PV256.z, 2
z: ADD_INT ____, 2, PV256.z
w: MAX_INT R0.w, PV256.x, 0.0f
258 x: LDS_WRITE ____, PV257.y, R26.z
y: LSHL R0.y, PV257.z, 2
z: ADD_INT ____, 3, R0.z VEC_021
w: LSHL ____, PV257.x, 2
259 x: LDS_WRITE ____, PV258.w, R26.y
y: MUL_UINT24 ____, R0.w, 4
z: MIN_INT ____, R1.w, 63 VEC_120
w: LSHL R0.w, PV258.z, 2
260 x: LDS_WRITE ____, R0.y, R14.x
y: MUL_UINT24 R0.y, PV259.z, 4
z: ADD_INT ____, 3, PV259.y
w: ADD_INT ____, 2, PV259.y
Thanks, it's getting clearer, realhet.
Just to make sure: if my kernel has 8 simple instructions (just float) and I have three wavefronts of size 64 (wavefronts A, B and C running on VLIW), then:
a) If all of the instructions are independent, there will be 24 clock cycles in total.
b) If all of the instructions are dependent on each other (a very rare case), there will be 96 clock cycles in total.
Am I right?
And if it takes just one clock cycle to execute an instruction, why must we create at least two wavefronts to hide the eight-cycle read-after-write latency on VLIW?
Is it because fetch-decode-execute takes more than one cycle?
The ALU only performs the instruction execution step, doesn't it?
How many clock cycles does an instruction need to do its job (including fetch, decode, execute, and write back to memory) on AMD hardware?
Is it 5 clock cycles for an ADD instruction (according to the Instruction Throughput table in the AMD APP Programming Guide)?
IMO that TomsHW article is quite misleading.
It says something about wavefront dependency, and presents an awfully bad utilization example on VLIW and a very good one on GCN.
If I read the article right, it draws the VLIW4 processor as a 16x4 array.
The 16 width represents the 16 lanes, and the 4 height is for the X, Y, Z, W execution units respectively. That's where the 64 stream processors in total come from.
But a wavefront in the GPU does not stand for a "set of instructions": it is 64 work-items bundled together, which is the smallest unit of work a SIMD core can handle.
The SIMD core is only 16 wide, not 64 wide like a wavefront. But there is 4x pipelining, so in every cycle (not the GPU rate) a quarter of the current wavefront can be processed.
That 'wavefront' dependency visualisation is a bit weird.
What it describes as 'wavefront A, B, C...' could be instructions, I guess. ("Sometimes, it turns out that an instruction set, called a wavefront, can't execute until another wavefront has been resolved." An instruction set, which is called a wavefront?!)
And there we can see an example of how badly an in-order queue handles dependencies.
But in reality the reordering happens offline, and most importantly the X,Y,Z,W scheduling is resolved at COMPILE time. It is the machine code that tells the 4 units (xyzw) what they should do in every cycle.
With a given instruction stream abcdefghijklmno, and the given bc, ef, fg, kl dependencies, the offline compiler will simply emit machine code that reorders execution like this: abde, cfhi, gjkl, mno (if I'm right).
It's not worse than GCN, because that example is so lightly dependent that even VLIW4 can handle it with 100% efficiency.
Please read "AMD_Evergreen-Family_Instruction_Set_Architecture 1.0d.pdf": chapters 3 and 4 describe precisely how things work inside VLIW.
Also, here's an article which is technically outstanding (IMO): AMD's Cayman GPU Architecture.