arvin99
Adept II

VLIW 5 Architecture processing element

I am really confused about the VLIW architecture. I have already read the AMD APP Programming Guide, and I understand the part about the GCN architecture (Southern Islands devices).

In GCN, work-items map onto processing elements (16 PEs in each SIMD, and four SIMDs in one compute unit), and each SIMD array works on a different wavefront.

It is easy to understand that executing a 64-element vector, called a wavefront, takes four cycles, since in each cycle every SIMD array processes a quarter (16 work-items) of its own wavefront, with four different wavefronts in flight.

But it is difficult for me to understand the VLIW architecture.
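The four-cycle figure can be checked with a little arithmetic. The sketch below is only an illustration of the numbers quoted in this post (16 PEs per SIMD, four SIMDs per compute unit, 64 work-items per wavefront); the function names are made up for this example and do not come from any AMD tool.

```python
# Figures taken from the post above; names are hypothetical.
WAVEFRONT_SIZE = 64   # work-items per wavefront
LANES_PER_SIMD = 16   # processing elements in one SIMD
SIMDS_PER_CU = 4      # SIMD arrays in one GCN compute unit


def cycles_per_wavefront(wavefront_size=WAVEFRONT_SIZE, lanes=LANES_PER_SIMD):
    """Cycles one SIMD needs to push a whole wavefront through its lanes."""
    return -(-wavefront_size // lanes)  # ceiling division


def lane_schedule(wavefront_size=WAVEFRONT_SIZE, lanes=LANES_PER_SIMD):
    """Work-item ids that occupy the lanes of one SIMD in each cycle."""
    return [
        list(range(start, min(start + lanes, wavefront_size)))
        for start in range(0, wavefront_size, lanes)
    ]


cycles = cycles_per_wavefront()
schedule = lane_schedule()
# Each SIMD works on its own wavefront, so this many work-items
# are processed per cycle across the whole compute unit.
work_items_per_cycle_per_cu = SIMDS_PER_CU * LANES_PER_SIMD

print(cycles, work_items_per_cycle_per_cu)
```

Under these assumptions each SIMD spends four cycles on one instruction of its wavefront, while the compute unit as a whole still touches 64 work-items per cycle (16 from each of four different wavefronts).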

From the AMD APP Programming Guide, Chapter 7, Performance and Optimization for Evergreen and Northern Islands Devices:

The GPU consists of multiple compute units. Each compute unit contains 32 kB local (on-chip) memory, L1 cache, registers, and 16 processing elements (PE).

Each processing element contains a five-way (or four-way, depending on the GPU type) VLIW processor.

Individual work-items execute on a single processing element; one or more work-groups execute on a single compute unit.

On a GPU, hardware schedules the work-items.

On the ATI Radeon™ HD 5000 series of GPUs, hardware schedules groups of work-items, called wavefronts, onto stream cores; thus, work-items within a wavefront execute in lock-step;

the same instruction is executed on different data.




What is a processing element in VLIW? Is it one of the 16 PEs inside the SIMD, or one of the 64 ALUs (16 x 4 ALUs per VLIW instruction)?

If the processing elements are the 64 ALUs, then work-items can be mapped onto 64 processing elements, so why does it need four cycles to execute a wavefront (64 ALUs -> 64 work-items, which is already the full wavefront size)?

It is difficult to understand because the documentation uses many terms, such as ALU, processing element, and stream core.
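One way to read the quoted passage is sketched below, assuming that a "processing element" is one of the 16 lanes of a compute unit and that the four or five ALUs inside it all serve the same work-item. The helper name and dictionary keys are invented for this illustration.

```python
# Assumption: 16 processing elements per compute unit (from the quoted guide),
# each one a 4- or 5-way VLIW unit that serves ONE work-item at a time.
WAVEFRONT_SIZE = 64
PES_PER_CU = 16


def describe_compute_unit(vliw_width):
    """Summarise a VLIW compute unit under the assumptions stated above."""
    return {
        # Total ALUs is lanes times VLIW width ...
        "alus_per_cu": PES_PER_CU * vliw_width,
        # ... but only one work-item runs per processing element per cycle.
        "work_items_per_cycle": PES_PER_CU,
        "cycles_per_wavefront": -(-WAVEFRONT_SIZE // PES_PER_CU),
        # Extra ALUs buy more operations per work-item, not more work-items.
        "max_ops_per_work_item": vliw_width,
    }


vliw5 = describe_compute_unit(5)
vliw4 = describe_compute_unit(4)
print(vliw5)
print(vliw4)
```

Under this reading, the 64 ALUs of a four-way part do not hold 64 work-items: they hold 16 work-items with up to four operations each, so a 64-wide wavefront still takes four cycles.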

1 Solution

That is an incorrect use of the word wavefront. First, the discussion should be about instructions A-O. Second, it is about how inefficiently the compiler packs them into six VLIW instructions. VLIW stands for Very Long Instruction Word, so you don't have simple ADD, SUB, MUL instructions but ADDSUBSUBMUL instructions, which execute several operations from a single work-item. Those images are inaccurate, because for VLIW they assume that D depends on A-C, while for GCN D is executed together with A-B. If D could execute independently, as the GCN example shows, the compiler would pack A, B, D, E into one VLIW instruction. It would be more precise to say that GCN can take four work-groups and execute them in parallel on a single CU. Don't forget that a 64-work-item work-group gets executed in four cycles, but VLIW can't do that. VLIW is really about vectorized code, where you operate on float4 or longer data types.

float8 values are split in half on VLIW, so they execute in two instructions.
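The packing argument can be sketched with a toy in-order packer. This is not how the AMD compiler works internally; it is a minimal illustration that assumes each operation may only be placed in a bundle after the bundles holding its dependencies. The instruction names A-E follow the post, and the dependency lists are made up to match the two cases it describes.

```python
def pack_vliw(deps, width):
    """Greedily pack operations into VLIW bundles of at most `width` slots.

    deps maps each operation name to the names it depends on, listed in
    program order. An operation lands in the first non-full bundle that
    comes strictly after every bundle holding one of its dependencies.
    """
    bundles = []
    bundle_of = {}
    for name, needs in deps.items():
        index = max((bundle_of[d] + 1 for d in needs), default=0)
        while index < len(bundles) and len(bundles[index]) >= width:
            index += 1
        if index == len(bundles):
            bundles.append([])
        bundles[index].append(name)
        bundle_of[name] = index
    return bundles


# Case 1: D depends on A-C, so it cannot share their bundle.
dependent = pack_vliw({"A": [], "B": [], "C": [], "D": ["A", "B", "C"]}, width=4)

# Case 2: A, B, D, E are independent, so they fit one VLIW instruction.
independent = pack_vliw({"A": [], "B": [], "D": [], "E": []}, width=4)

# Case 3: a float8 operation modelled as 8 independent component operations
# on an assumed four-wide unit: it splits into two bundles.
float8 = pack_vliw({f"c{i}": [] for i in range(8)}, width=4)

print(dependent, independent, len(float8))
```

With a width of four, the dependent case needs two bundles, the independent case needs one, and the float8 case needs two, which matches the "split in half" remark above.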

10 Replies