I am really confused about the VLIW architecture. I have already read the AMD APP Programming Guide, and I understand the part about the GCN architecture (Southern Islands devices).
In GCN, work-items map onto processing elements (16 PEs in each SIMD, and four SIMDs in one compute unit), and each SIMD array works on a different wavefront.
It is easy to understand that a 64-element vector, called a wavefront, needs 4 cycles to execute (since in each cycle every SIMD array processes a quarter (16 work-items) of its own wavefront, and the four SIMD arrays hold four different wavefronts).
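To make the GCN numbers concrete, here is a minimal Python sketch of the arithmetic as I understand it. The constant and function names are my own and do not come from the guide, and the sketch only counts cycles; it does not model the real hardware pipeline.

```python
# Numbers as I read them from the guide for GCN (Southern Islands).
WAVEFRONT_SIZE = 64   # work-items per wavefront
LANES_PER_SIMD = 16   # processing elements (lanes) per SIMD array
SIMDS_PER_CU = 4      # SIMD arrays per compute unit


def cycles_per_wavefront(wavefront_size=WAVEFRONT_SIZE, lanes=LANES_PER_SIMD):
    """Cycles one SIMD array needs to push a whole wavefront through its lanes."""
    return -(-wavefront_size // lanes)  # ceiling division


def work_items_per_cycle(lanes=LANES_PER_SIMD, simds=SIMDS_PER_CU):
    """Work-items processed per cycle across all SIMD arrays of one compute unit."""
    return lanes * simds


def lanes_for_cycle(cycle, lanes=LANES_PER_SIMD):
    """Work-item indices (within one wavefront) handled in the given cycle."""
    return list(range(cycle * lanes, (cycle + 1) * lanes))


print(cycles_per_wavefront())   # 4
print(work_items_per_cycle())   # 64
print(lanes_for_cycle(1)[:3])   # [16, 17, 18]
```

So each SIMD array needs 4 cycles for its own wavefront, while the compute unit as a whole still touches 64 work-items per cycle (16 from each of four different wavefronts).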
But it is difficult for me to understand the VLIW architecture.
From the AMD APP Programming Guide, Chapter 7, Performance and Optimization for Evergreen and Northern Islands Devices:
The GPU consists of multiple compute units. Each compute unit contains 32 kB local (on-chip) memory, L1 cache, registers, and 16 processing element (PE).
Each processing element contains a five-way (or four-way, depending on the GPU type) VLIW processor.
Individual work-items execute on a single processing element; one or more work-groups execute on a single compute unit.
On a GPU, hardware schedules the work-items.
On the ATI Radeon™ HD 5000 series of GPUs, hardware schedules groups of work-items, called wavefronts, onto stream cores; thus, work-items within a wavefront execute in lock-step;
the same instruction is executed on different data.
What is a processing element in VLIW? Is it one of the 16 PEs inside the SIMD, or one of the 64 ALUs (16 PEs x 4 ALUs per VLIW instruction)?
If the processing elements are the 64 ALUs, then work-items can be mapped onto 64 processing elements, so why does it need four cycles to execute a wavefront (64 ALUs -> 64 work-items, which is already a full wavefront)?
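The two readings I am trying to choose between can be written out as plain arithmetic. This is only a sketch of my own interpretations, with hypothetical function names; it is not taken from the guide and does not claim to describe the real hardware.

```python
WAVEFRONT_SIZE = 64   # work-items per wavefront
PES_PER_CU = 16       # processing elements per compute unit, per the quoted text


def cycles_reading_a(vliw_width):
    """Reading A: one work-item per PE.

    The 4 or 5 VLIW slots of a PE all belong to the same work-item,
    so the VLIW width does not change how many work-items run per cycle.
    """
    work_items_per_cycle = PES_PER_CU
    return -(-WAVEFRONT_SIZE // work_items_per_cycle)  # ceiling division


def cycles_reading_b(vliw_width):
    """Reading B: one work-item per ALU (each VLIW slot is its own PE)."""
    work_items_per_cycle = PES_PER_CU * vliw_width
    return -(-WAVEFRONT_SIZE // work_items_per_cycle)  # ceiling division


for width in (4, 5):
    print(width, cycles_reading_a(width), cycles_reading_b(width))
# 4 4 1
# 5 4 1
```

Only reading A gives the four cycles, and it also seems to match the quoted sentence "Individual work-items execute on a single processing element", but I am not sure this is the right way to look at it.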
It is difficult to understand because the documentation uses many different terms, such as ALU, processing element, and stream core.