Archives Discussions

grizlyk · ‎06-25-2011

what is the wavefront

Hi.

From AMD "stream computing user guide", 1.2.2 thread processing: to hide latencies up to four threads can do four VLIW over four cylces. For example, 16 TPs of one SE execute the same command with each TP processing four threads at a time, that results in 64-wide SIMD engine and has wavefront size of 64 threads.

Also, the 16 TPs of one SE have 16*5 = 80 cores ("shaders" in GPU-Z); up to five operations can be done by one VLIW of one TP.

Really, there are too many "four"s, and I cannot work out the concrete combinations of 4, 16 and 64.

1. Is "64-wide SIMD engine" the same as "wavefront size of 64"?
2. "four threads can do four VLIW over four cylces" - can be different instructions for each cylce or data only?
3. "each TP processing four threads at a time" - the "four threads" appear due to "four float/int cores of TP" or due to "four VLIW over four cylces" = "one 'effective VLIW' over one cylce" = "one 'at a time' per four cylces"?

Meteorhead · ‎06-26-2011

I am not familiar with the abbreviations you use, but let me clarify it for you (although I am 100% sure this has been discussed in at least 3 other topics).

One Compute Unit (CU) features 16 pieces of 4-wide VLIW processors. These are the Processing Elements (PE), or, as you can find them in marketing papers, Stream Cores. (Every piece of the VLIW is called a Stream Processor, but let's forget this for the moment.)

The basis of thread packing is a wavefront, which holds 64 threads (or in OpenCL terms, workitems). If workgroup size (WGS) is larger than 64, it will be subdivided into groups of 64 (totally transparent to the programmer) and if it's smaller, it will be extended by dummy threads (yet again transparently).
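The packing rule above can be put into a few lines of Python. This is only a sketch of the arithmetic, not of anything the driver really does: the function name `wavefronts_for` is made up for illustration, and it assumes that a partially filled last wavefront of a larger work-group is padded the same way as a small work-group.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, as stated in the post above


def wavefronts_for(work_group_size):
    """Return (number of wavefronts, number of dummy work-items added).

    Hypothetical helper: it only models the 'split into groups of 64,
    pad the rest' rule described in the post.
    """
    if work_group_size <= 0:
        raise ValueError("work-group size must be positive")
    count = -(-work_group_size // WAVEFRONT_SIZE)  # ceiling division
    padding = count * WAVEFRONT_SIZE - work_group_size
    return count, padding


for size in (32, 64, 100, 256):
    print(size, wavefronts_for(size))
```

So a work-group of 32 still occupies one full wavefront with half of it idle, which is why work-group sizes that are multiples of 64 are usually suggested for this hardware.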

The magic numbers come from the following idea: one thread in one tick of the core clock executes 4 instructions at the same time, each instruction going down one lane of the 4-wide VLIW processor. VLIW means that the 4 instructions can differ, but naturally they operate on different data. (Data independence is required; read more about it on the wiki.) VLIW is SIMD-like, but much more flexible.

The reason why one CU is called a SIMD engine is because it has 16 VLIW processors, and although VLIW processors have different instructions down the 4 lanes, the 16 processors do completely identical operations.

The reason why the wavefront size is 64 is that it takes 4 clock cycles to access a register. Thus each wavefront is divided into 4 smaller parts, each 16 threads in size, to fit onto the 16 Stream Cores.

This architecture, though, is about to change. You can check the relevant topics, which also discuss the present architecture through its differences from the coming HD7000 architecture.

Hope I answered all your questions.

grizlyk · ‎07-03-2011

Originally posted by: Meteorhead I am not familiar with the abbreviations you use

TPs means Thread Processors
SE means SIMD Engine
"cylces" is a repeated copy-and-paste error for the word "cycles".

The file is CAL 1.4.0_beta\Stream_Computing_User_Guide.pdf; it can be obtained here: http://developer.amd.com/archive/gpu/ATIStreamSDKv1.4Beta/Pages/default.aspx

Hope I answered all your questions.

Not really.

One Compute Unit (CU) features 16 pieces of 4-wide VLIW processors.

Let us assume "4-wide VLIW processor" is the same as TP (figure 1.9 simplified block diagram of the stream processor).

1. The question is whether
"the TP is 4-wide because 4 separate instructions (for example, a piece of an abstract opcode stream: 1- add, 2- mov, 3- sub, 4- mul) can be prefetched and executed during the next 4 cycles. This execution is not shown on the flat (x,y space) 'figure 1.9', but if we add 3 TPs behind each TP in a Z-dimension, the 3 added TPs show us 4 virtual TPs per hardware TP, because '4 threads are executing at a time in different stages of execution and they share the same cores (only one core executes per clock tick and uses the ALU)'" (I think this is true), or
"the TP is 4-wide because there are 4 int/float cores per TP, so we need no Z-dimension for 'figure 1.9'"?

2. It is not true that every piece of hardware has an SE with 16 TPs. In that case the total number of cores would have to be 80*n (n is the number of SEs), for example 80*1 for the HD 5450, 80*5 for the HD 5570, etc. But the HD2400 has 40 cores, and 40/5 = 8, which means 1 SE with 8 TPs, or 2 SEs with 4 TPs, etc. (there are no details for the HD2400). While writing this, though, I guessed that the wavefront of 16 on the HD2400 could point to "2 SE with 4 TPs" and the wavefront of 32 on the HD2600 to "3 SE with 8 TPs".

One Compute Unit

The next guess is that "Compute Unit" is not the same as the real hardware processor, meaning there are two levels of abstraction: the upper one is related to real hardware but has the wavefront size fixed at 64, and for simpler hardware the "software wavefront" is mapped to the lower level of hardware abstraction by CAL (transparently). Is that true, or (as I think) is it not, and the CAL wavefront is exactly the hardware wavefront?

VLIW means that the 4 instructions can differ, but naturally they operate on different data.

I think that if instructions are different, they are really different (add, mov, etc.), and if only the data differs, they are the same instruction but wider (more data processed). So does "4-wide VLIW processor" mean one wider instruction, or 4 separate normal-size instructions?

I am not familiar with CAL, and possibly I need to read all the existing material before asking questions, but it is not easy to understand how programs execute if the hardware abstraction is not known.

As for the questions in the original post:

1. Is "64-wide SIMD engine" the same as "wavefront size of 64"?
I still cannot tell.

2. "four threads can do four VLIW over four cylces" - can be different instructions for each cylce or data only?
I still cannot tell.

3. "each TP processing four threads at a time" - the "four threads" appear due to "four float/int cores of TP" or due to "four VLIW over four cylces"
It's not answered.

Jawed · ‎07-09-2011

In current GPUs the SIMD has 16 lanes (though there are some cards with fewer, e.g. 8 lanes). The hardware defines a hardware thread ("wavefront") as consisting of 64 work items (OpenCL term) because on 4 successive clock cycles the same instruction is issued to sets of 16 work items. So in the first cycle work items 0-15 run the instruction. Then in the next cycle work items 16-31. Same for 32-47 and 48-63.

The document you are referring to is out of date. The OpenCL terminology is preferable, so you should look at AMD's documents for that.
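To make the issue order concrete, here is a small Python sketch of the mapping just described. The function names are invented for illustration, and it assumes the simple linear order given above (work items 0-15 in the first cycle, and so on) on a 16-lane SIMD.

```python
SIMD_LANES = 16       # lanes per SIMD, per the description above
WAVEFRONT_SIZE = 64   # work items per wavefront


def issue_slot(work_item):
    """Return (cycle, lane) for a work item index inside one wavefront."""
    if not 0 <= work_item < WAVEFRONT_SIZE:
        raise ValueError("work item index must be in 0..63")
    return divmod(work_item, SIMD_LANES)


def items_in_cycle(cycle):
    """Return the work items that run the instruction in a given cycle (0..3)."""
    start = cycle * SIMD_LANES
    return list(range(start, start + SIMD_LANES))


for cycle in range(WAVEFRONT_SIZE // SIMD_LANES):
    items = items_in_cycle(cycle)
    print("cycle", cycle, "-> work items", items[0], "to", items[-1])
```

In this model work item 20, for example, runs in the second cycle (cycle 1) on lane 4.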

Each lane in current GPUs does a VLIW-5 instruction (on most of the chips) or a VLIW-4 instruction on the chip called Cayman (available in HD6950, HD6970 and HD6990).

A VLIW instruction can contain a mixture of operations: e.g. a VLIW-5 lane can do MUL, ADD, MAD, ADD and RCP (reciprocal) operations in one cycle. These 5 operations are then repeated for the next 3 cycles. All 16 lanes run exactly the same set of 5 operations.
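A toy model in Python may help separate "different operations" from "different data". It is only an illustration: the bundle below reuses the MUL, ADD, MAD, ADD, RCP example from this post, while the choice of operands (`a`, `b`, `c`) and the function names are assumptions made up for the sketch, not how the hardware selects registers.

```python
# One hypothetical VLIW-5 bundle: five different operations issued together.
# Each entry is (mnemonic, function of this work item's own operands).
BUNDLE = [
    ("MUL", lambda a, b, c: a * b),
    ("ADD", lambda a, b, c: a + b),
    ("MAD", lambda a, b, c: a * b + c),
    ("ADD", lambda a, b, c: b + c),
    ("RCP", lambda a, b, c: 1.0 / a),
]


def run_bundle(a, b, c):
    """Execute all five operations of the bundle for ONE work item."""
    return [op(a, b, c) for _, op in BUNDLE]


def run_wavefront(inputs, lanes=16):
    """Run the same bundle for every work item, 16 work items per cycle."""
    results = []
    for start in range(0, len(inputs), lanes):       # one iteration = one cycle
        for a, b, c in inputs[start:start + lanes]:  # the 16 lanes of that cycle
            results.append(run_bundle(a, b, c))
    return results


inputs = [(float(i + 1), 2.0, 1.0) for i in range(64)]  # per-work-item data
out = run_wavefront(inputs)
print(len(out), "work items,", len(out[0]), "operations each")
```

The operations inside the bundle differ from each other, but every work item in the wavefront runs that same bundle; only the data changes from work item to work item.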

So the SIMD over 4 clock cycles runs the same VLIW instruction on 64 work items. Each VLIW instruction consists of a maximum of 5 operations. So that's 320 operations in total over the 4 cycles per wavefront, or 80 operations per cycle.
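The totals above are plain multiplication, which can be checked with a short Python sketch. The helper name `peak_ops` is made up, and the VLIW-4 line simply applies the same arithmetic to Cayman's width mentioned earlier; it is not a figure taken from AMD documentation.

```python
WAVEFRONT_SIZE = 64  # work items per wavefront
SIMD_LANES = 16      # work items issued per cycle


def peak_ops(vliw_width):
    """Return (operations per wavefront instruction, operations per cycle)."""
    cycles = WAVEFRONT_SIZE // SIMD_LANES          # 4 cycles per VLIW instruction
    per_wavefront = WAVEFRONT_SIZE * vliw_width    # every slot of every bundle used
    return per_wavefront, per_wavefront // cycles


print("VLIW-5:", peak_ops(5))
print("VLIW-4:", peak_ops(4))
```

These are upper bounds: they are reached only when the compiler manages to fill every slot of the VLIW instruction.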

Hope that helps.
