What is the relationship between Compute Units, Stream Cores, Processing Elements and ALU?
The definition of them has already been answered in
But the description of Stream Cores doesn't match the device parameters. For example, the HD6850 has 12 Compute Units, 192 Stream Cores, and 960 Processing Elements. How can that be explained?
Also, I'm a little confused by the wavefront. The documentation says:
That instruction is repeated over four cycles to make the 64-element vector called a wavefront
Is a wavefront constructed by 16 ALUs (4 PEs) repeating 4 times, or by 64 ALUs (16 PEs)?
Also, because PEs come from the vector unit, does the scalar unit do any work in GPGPU? How do they work?
A Compute Unit can be considered equivalent to a core in a CPU. A workgroup in OpenCL is assigned to a Compute Unit, and the workgroup uses the resources the Compute Unit provides, such as LDS, private registers, and instruction and data caches. Stream Cores and ALUs are the same thing, and there are 64 of them per Compute Unit in both GCN and VLIW4 cards; the only difference is how they are arranged. Up to VLIW4/5 there were 16 Processing Elements, each of them a 4- or 5-wide SIMD (so 16 PEs and 64 ALUs). With GCN there are 4 separate SIMDs, each 16 lanes wide. GCN could be considered to have 4 PEs, but I'm not sure about that.
More explanation from GCN whitepaper:
Compute units are the basic computational building block of the GCN Architecture. These CUs implement an entirely new instruction set that is much simpler for compilers and software developers to use and delivers more consistent performance than previous designs.
The shader arrays in earlier generations of AMD GPUs consisted of a number of SIMD engines, each of which consisted of up to 16 ALUs. Each ALU could execute bundles of 4 or 5 independent instructions co-issued in a VLIW (Very Long Instruction Word) format, with the shader compiler being largely responsible for scheduling and finding co-issue opportunities. SIMD engines were issued groups of 64 work items, called wavefronts, and would execute one wavefront at a time. This design aligned well with a data flow pattern that is very common in graphics processing (for manipulating RGBA color values in a pixel shader, for example), making it possible to sustain high levels of utilization in most cases. However, the underlying data formats can be more complex and unpredictable for general purpose applications, making it more difficult to consistently find sets of 4 or 5 independent operations that could execute in parallel every cycle and keep the processing resources fully utilized.
In GCN, each CU includes 4 separate SIMD units for vector processing. Each of these SIMD units simultaneously executes a single operation across 16 work items, but each can be working on a separate wavefront. This places emphasis on finding many wavefronts to be processed in parallel, rather than relying on the compiler to find independent operations within a single wavefront.
The HD6850 is a VLIW5 model, which means every 4 vector ALUs have a fifth, transcendental ALU (it does trigonometry and multimedia operations, so the 4 vector ALUs can be simpler).
So in this particular context a 'Stream Core' is this 4+1 group. 192 of these give the total of 960 Processing Elements (which are called streams in the specifications).
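The arithmetic behind those numbers can be checked with a short sketch. It uses only the three figures quoted in the question; the variable names are illustrative, and the 4+1 split is the VLIW5 grouping described above.

```python
# Figures quoted in the question for the HD6850 (a VLIW5 part).
compute_units = 12
stream_cores = 192
processing_elements = 960

# Stream cores per compute unit: 192 / 12
stream_cores_per_cu = stream_cores // compute_units

# Processing elements per stream core: 960 / 192
# (interpreted above as 4 vector ALUs + 1 transcendental ALU)
pes_per_stream_core = processing_elements // stream_cores

print("stream cores per CU:", stream_cores_per_cu)
print("PEs per stream core:", pes_per_stream_core)
print("total PEs:", compute_units * stream_cores_per_cu * pes_per_stream_core)
```

So the three device parameters are consistent once a stream core is read as a 5-wide VLIW group rather than as a single ALU.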
On GCN it takes 4 clocks to execute an instruction in a 4-stage pipeline.
GCN has 4 vector SIMD units. Each SIMD is 16 × 32 bits wide and can handle a full 64-element wavefront in its 4 pipeline stages.
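As a rough illustration of how one 16-lane SIMD covers a 64-element wavefront in 4 passes, here is a minimal sketch. The function name is hypothetical, and the assumption that each pass takes a contiguous block of 16 work-items is made only for illustration; the real hardware mapping may differ.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront
SIMD_LANES = 16      # lanes in one GCN vector SIMD

def schedule_wavefront(wavefront_size=WAVEFRONT_SIZE, lanes=SIMD_LANES):
    """Return, for each cycle, the work-item indices one SIMD processes.

    Assumes contiguous blocks of `lanes` work-items per cycle (illustrative only).
    """
    cycles = wavefront_size // lanes
    return [list(range(c * lanes, (c + 1) * lanes)) for c in range(cycles)]

schedule = schedule_wavefront()
for cycle, items in enumerate(schedule):
    print(f"cycle {cycle}: work-items {items[0]}..{items[-1]}")
```

In this model a wavefront is 16 lanes repeated 4 times, not 64 lanes operating at once.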
The 64-bit scalar unit also has a 4-stage pipeline. There is 1 scalar unit for every 4 vector SIMD units. So while the 4 SIMDs process 4 wavefronts (that's 4*64 processing elements), the scalar ALU can also execute 4 instructions. This lets a GCN chip execute one 64-lane vector instruction in the same time as it executes one 64-bit scalar instruction; this is the most extreme case.
Older VLIW chips have no scalar ALU; they can do only simple flow control (no goto, just if/else, loop and return), and they have dedicated hardware for constants. In GCN these tasks are given to the flexible scalar ALU, which has a cache for constants.
Still very, very confused...
Does 'Stream Core' have two definitions? One equal to an ALU, the other a new name used for VLIW5 only?
So, for GCN, a CU contains 4 SIMDs. Each SIMD contains 16 PUs. Each PU contains 4 ALUs?
For VLIW, I cannot work out the decomposition. Is a SIMD just like the second definition of Stream Core?
This is the problem with using so many different terms in different ways.
If you think in CPU terms, the GPU CU is a core. The 7970 has 32 cores. Each of those cores has 4 vector units (which are analogous to the AVX pipes on the CPU). Each of those vector units has 16 lanes/ALUs/processing elements (which sometimes get called stream cores). The VLIW-4 chips are basically the same, but instructions are issued differently. The 6970 had 24 cores, I think; each one had four vector units and each vector unit was 16 lanes wide. The difference was that instructions were issued in compiler-generated packets instead of from four different wavefronts, and as a result we generally call one ALU from each of the 4 SIMD units, taken together, a single stream core, because it is treated that way by the tool flow. So in a sense a stream core on the 6970 is the same as four stream cores on the 7970, and this is purely a matter of naming rather than of capability.
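To make the naming difference concrete, here is a small sketch that counts lanes using only the figures given above (32 CUs on the 7970, 24 on the 6970, 4 vector units of 16 lanes each). The helper name is hypothetical, and the 6970 CU count is taken from the "I think" above rather than from a datasheet.

```python
def total_lanes(compute_units, simds_per_cu=4, lanes_per_simd=16):
    """Total vector lanes (ALUs) on a chip: CUs x vector units x lanes."""
    return compute_units * simds_per_cu * lanes_per_simd

hd7970_lanes = total_lanes(32)  # GCN: each lane may be called a stream core
hd6970_lanes = total_lanes(24)  # VLIW4: same per-CU layout, different issue model

# Under the VLIW4 naming, 4 ALUs (one per vector unit) form one stream core.
hd6970_stream_cores = hd6970_lanes // 4

print("7970 lanes:", hd7970_lanes)
print("6970 lanes:", hd6970_lanes)
print("6970 stream cores (VLIW4 naming):", hd6970_stream_cores)
```

Both chips have 64 ALUs per CU; only the unit that gets the name "stream core" changes.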
On the 7970 there is also a scalar unit. You can't directly access this from OpenCL (or indeed from AMDIL or HSAIL). Scalar operations are extracted from the instruction stream by the compiler chain. Each OpenCL work-item usually maps to one ALU/PE/stream core on the 7970. As realhet mentions, without the scalar unit there is only a simple flow control unit to handle control flow, so gotos and arbitrary control flow aren't feasible.