Archives Discussions

liquid
Journeyman III

Why is the GCN architecture better than VLIW4?

I understand that there may be dependencies between wavefronts. But while GCN can switch wavefronts dynamically between SIMDs, I don't understand why VLIW4 can't do this. If that's because VLIW4 lacks the buffers, queues, or other hardware needed to handle dynamic switching, why didn't AMD add a buffer/queue/other hardware instead of changing the entire architecture?

Thank you.

1 Solution

VLIW4 is a slightly modified VLIW5: on VLIW5 there were 4 units for simple instructions (x,y,z,w) and a fifth one (t) for the complicated ones. On VLIW4 they dropped the T unit and upgraded the remaining X,Y,Z,W units to be able to handle those complicated transcendental instructions. That leaves 4 equivalent units bundled in a VLIW.

GCN was designed from scratch and it's roughly like an x86 processor with 2048-bit SSE support.

- You can program the scalar ALU like a classic x86: you can change the program counter, for example. In VLIW the only program flow elements were an IF/ELSE block, a LOOP block, and an EXIT instruction. On GCN you can have subroutines, for example, so much bigger programs can fit in the small 32KB instruction cache.

- The scalar ALU works in parallel with the 64-element vector ALU. It is possible to write a loop that spends only 1 cycle on loop management code. On VLIW the loop overhead can be as long as 10-40 cycles.

- No complicated register access. On VLIW it was very complicated to feed the 3*4 input parameters, as they were read from 3x 16-byte parts of the register bank.

- 50% smaller instruction encoding (there are 32-bit instructions too, not just 64-bit ones). That's why the instruction cache was reduced from 48KB to 32KB. And less cache means more space for additional compute units.

- As others said earlier: absolutely no need for code vectorization. Every 16-wide SIMD unit processes 4*16 workitems (1 wavefront) in a pipeline with 4 stages. That is 2x more than in VLIW, and that's why GCN needs 2x more minimum workitems than VLIW. For Tahiti it's a minimum of 8192 workitems.

- If you like to program in asm, GCN is much simpler to program than VLIW. Back then I didn't have the courage for the extremely complicated VLIW asm, but this new language is even simpler than AMD_IL. It's very well designed; I can't say a bad thing about it.
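
The numbers in the list above (64-element vectors, "2048bit", 8192 workitems on Tahiti) can be reproduced with some simple arithmetic. This is only a sketch: the 16 lanes and 4 pipeline steps come from the post, while 4 SIMDs per compute unit, 32 compute units for a full Tahiti, and 32-bit lanes are assumptions added here for illustration.

```python
# Toy arithmetic for the workitem numbers quoted in the post.
LANES_PER_SIMD = 16        # from the post: each SIMD unit is 16 wide
STEPS_PER_WAVEFRONT = 4    # from the post: 4*16 workitems go through in 4 steps
SIMDS_PER_CU = 4           # assumption: 4 SIMD units per GCN compute unit
TAHITI_CUS = 32            # assumption: a full Tahiti chip has 32 compute units
BITS_PER_LANE = 32         # assumption: one 32-bit value per lane


def wavefront_size():
    """Workitems that one SIMD processes per vector instruction."""
    return LANES_PER_SIMD * STEPS_PER_WAVEFRONT


def vector_width_bits():
    """Width of one vector operand if every workitem holds a 32-bit value."""
    return wavefront_size() * BITS_PER_LANE


def min_workitems(compute_units=TAHITI_CUS):
    """Workitems needed to give every SIMD on the chip one wavefront."""
    return compute_units * SIMDS_PER_CU * wavefront_size()


print(wavefront_size(), vector_width_bits(), min_workitems())
```

With these assumptions the three values come out as 64, 2048 and 8192, which matches the figures in the post.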

7 Replies
nou
Exemplar

VLIW5/4 needs code that can be vectorized, but many computation tasks can't be vectorized, which leads to underutilized HW.
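
To make the underutilization concrete, here is a rough sketch of what a compiler has to do for VLIW4: pack the operations of one workitem into 4-slot bundles, where an operation can only be placed once its inputs were produced by an earlier bundle. The function names (`pack_bundles`, `utilization`) and the greedy strategy are made up for illustration; this is not how AMD's actual shader compiler works.

```python
def pack_bundles(deps, width=4):
    """Greedily pack ops into VLIW bundles of at most `width` ops.

    deps maps each op name to the list of ops whose results it needs.
    An op is only eligible once all of its inputs come from earlier bundles.
    """
    done = set()
    remaining = list(deps)  # dicts keep insertion order
    bundles = []
    while remaining:
        ready = [op for op in remaining if all(d in done for d in deps[op])]
        if not ready:
            raise ValueError("cyclic dependency between ops")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
    return bundles


def utilization(bundles, width=4):
    """Fraction of bundle slots that actually hold an op."""
    used = sum(len(b) for b in bundles)
    return used / (len(bundles) * width)


# Vectorizable code: four independent ops (x, y, z, w of a vector add).
vector_code = {"x": [], "y": [], "z": [], "w": []}
# Scalar code: a chain where each op needs the previous result.
chain_code = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}

print(utilization(pack_bundles(vector_code)))  # 1.0
print(utilization(pack_bundles(chain_code)))   # 0.25
```

The independent ops fill one bundle completely, while the dependent chain needs four bundles with one op each, so three of every four slots sit idle.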

liquid
Journeyman III

For example? I guess that a SIMD Engine (VLIW4) must create an instruction like this: SUM_MUL_SUM_SUB for wavefronts A_B_C_D, one for each ALU in a VLIW processor. If wavefront D depends on wavefront C, why can't a SIMD Engine switch between D and F, for example?

Thanks a lot.

VLIW instructions are generated at compile time and the GPU can't change them. With GCN they broke the 16*4-wide SIMD into four 16-wide SIMDs.

liquid
Journeyman III

I understand, but couldn't AMD simply have included some kind of hardware to handle that and continued with a slightly modified VLIW4 architecture?

Because the hardware is pipelined. VLIW4 requires 4 instructions to be packed into a single ALU bundle. This means the compiler has to do all the scheduling to make that work. The hardware can't do it because when it sees an ALU bundle, everything has to work in a pipeline, so it can't short-circuit the pipeline to switch to the next instruction.

This is why GCN is better: No VLIW to worry about so it's much easier to get peak ALU rates.
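
A very simplified throughput model shows the difference. It assumes one VLIW4 SIMD has 16 lanes with 4 slots each, and one GCN compute unit has four 16-lane SIMDs that each work on their own wavefront; both need 4 cycles to push a 64-workitem wavefront through. Memory latency, the scalar unit and everything else are ignored, and the function names are invented for this sketch.

```python
LANES = 16                               # lanes per SIMD, as stated in the thread
WAVEFRONT = 64                           # workitems per wavefront
CYCLES_PER_ISSUE = WAVEFRONT // LANES    # 4 cycles per wavefront instruction


def vliw4_ops_per_cycle(ops_per_bundle):
    """One VLIW4 SIMD: rate depends on how many slots the compiler filled."""
    if not 1 <= ops_per_bundle <= 4:
        raise ValueError("a VLIW4 bundle holds 1 to 4 ops")
    return WAVEFRONT * ops_per_bundle / CYCLES_PER_ISSUE


def gcn_ops_per_cycle(resident_wavefronts):
    """One GCN compute unit: rate depends on how many SIMDs have a wavefront."""
    busy_simds = min(resident_wavefronts, 4)
    return busy_simds * WAVEFRONT / CYCLES_PER_ISSUE


# Fully dependent scalar code: the compiler can fill only 1 slot per bundle.
print(vliw4_ops_per_cycle(1))   # 16.0 of a possible 64
# The same code on GCN with 4 wavefronts resident, one per SIMD.
print(gcn_ops_per_cycle(4))     # 64.0
```

In this model both designs have the same peak of 64 ops per cycle, but VLIW4 only reaches it when the compiler finds 4 independent ops per workitem, while GCN reaches it whenever enough wavefronts are available.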

Thank you so much guys, I think I've understood it correctly now.
