arvin99
Adept II

program counter in instruction vs program counter in wavefront

Hi,

I have two questions.

As I understand it, each wavefront has its own program counter.

So:

1. What is the difference between the program counter for a wavefront and the program counter for each instruction?

2. What is the difference between the instruction pointer and the program counter for each instruction? (They look the same, but I am not 100% sure.)


0 Likes
10 Replies
drallan
Challenger

1. The wavefront and instruction counters are the same; there is only one counter, which is helpful to know when programming. When a wave hits an if, while, or other branching statement, all 64 threads in the wavefront pass through the body of the statement. They don't really jump, with one exception. Threads that do not meet the condition do not save their results and so appear inactive. The exception is when all 64 threads have the same result in the conditional statement, i.e., all are true or all are false. Then the program counter branches if necessary. Unlike scalar programming, a conditional statement does not save time when at least one thread meets the condition.

2. Instruction pointer and program counter usually mean the same thing.
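To make the masking behaviour in point 1 concrete, here is a minimal Python sketch. It is only a toy model, not how the hardware is implemented: the function name `run_if_else`, the example condition (even or odd), and the two bodies are made up for illustration. It shows one shared PC walking through both bodies of an if/else, with inactive lanes discarding their results, and a body being skipped only when no thread in the wave needs it.

```python
WAVE_SIZE = 64  # threads per wavefront, as stated in the thread


def run_if_else(values):
    """Toy model of `if v is even: v *= 2 else: v += 1` on one wavefront.

    Returns (results, passes), where `passes` lists the bodies that the
    single shared program counter actually walked through.
    """
    assert len(values) == WAVE_SIZE
    results = list(values)
    cond = [v % 2 == 0 for v in values]  # per-lane condition (the mask)
    passes = []

    # The PC only branches over a body when no lane needs it.
    if any(cond):
        passes.append("then")
        for lane in range(WAVE_SIZE):
            doubled = results[lane] * 2   # every lane does the work
            if cond[lane]:                # only active lanes save results
                results[lane] = doubled

    if not all(cond):
        passes.append("else")
        for lane in range(WAVE_SIZE):
            incremented = results[lane] + 1
            if not cond[lane]:
                results[lane] = incremented

    return results, passes


# Divergent wave: both bodies are executed, so the branch saves no time.
_, passes = run_if_else(list(range(WAVE_SIZE)))
print(passes)

# Uniform wave: all lanes agree, so the PC really skips the other body.
_, passes = run_if_else([2] * WAVE_SIZE)
print(passes)
```

In this model the cost of a branch depends on the whole wave, not on any single thread, which is the point being made above.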

Thanks for the reply. I am now clear on the answer to question 2.

But your answer to question 1 still confuses me.

So, in both the GPU and the CPU, there is only one program counter.

The difference is:

On a CPU, the step PC -> PC + 1 happens every clock cycle, but on a GPU the step PC -> PC + 1 happens every four clock cycles (since the GPU executes on a group of threads, a wavefront).

Am I right?

0 Likes


CPU

1. There is one program counter (PC).

2. If the CPU chip has multiple cores/threads, there is 1 PC per thread.

3. The PC advances after the instruction finishes, which can take from 1 to 100+ clocks depending on the instruction.

GPU

1. There is 1 PC per wave, and each wave has (usually) 64 threads. There are many waves.

2. Almost all instructions advance the PC every 4 clocks.

3. On a CU, 4 waves execute in parallel, so the average time to execute 1 instruction is 1 clock, not 4.

0 Likes

Sorry, it looks like there is a bit of a misunderstanding.

1. So, if I have four threads on a CPU (two physical cores with hyper-threading), will I have four physical PCs?

    And on a GPU there are many physical PCs, but if only eight wavefronts are active, then only eight PCs will be in use?

2. "On a CU, 4 waves execute in parallel, so the average time to execute 1 instruction is 1 clock, not 4."

     Why 1 clock? Isn't it (4 clocks + 4 + 4 + 4) / 4 = 4 clock cycles?

0 Likes

How many PCs there are is highly hardware dependent. On a CPU the PC is a single register, so each thread has one. A wavefront is the GPU equivalent of a CPU thread. A wavefront is executed 16 work-items per cycle over four cycles, so one wavefront contains 64 work-items. Executing a single instruction on the whole wavefront takes four cycles.


1. Yes, one CPU core with hyper-threading acts just like 2 CPUs with 2 PCs, so two cores will have 4 PCs. This is done by duplicating the part of the core that holds the "state" of the processor, such as the PC, registers, etc. The 2 threads are independent (2 PCs), so they easily share the execution hardware. It is fast.

Yes, a GPU running only 8 waves will have 8 PCs active. A GCN 7970 GPU can have up to 32 x 40 wavefronts running, or 1280 PCs.

2. On a GPU (GCN architecture) it takes 4 clocks to complete one instruction, but each execution unit (ALU) starts one new instruction every clock. The instructions execute at the same time, in parallel, in a pipeline.

So the total number of instructions finished by one ALU in 4 clocks = 4.

Thus (1+1+1+1 instructions)/(4 clocks) = 1, i.e., an average of 1 instruction/clock per ALU unit.

The 7970 has 2048 ALUs that can execute 2048 typical instructions per clock.
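The numbers in this exchange can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic using the figures quoted in the thread (64 work-items per wave, 16 per cycle, 4 waves in parallel per CU). Reading "32 x 40" as 32 CUs with up to 40 waves each is an assumption, and the constant and function names are made up for the sketch.

```python
WAVE_SIZE = 64        # work-items per wavefront (from the thread)
LANES_PER_GROUP = 16  # work-items one group of ALUs handles per clock
GROUPS_PER_CU = 4     # waves a CU executes in parallel (from the thread)
CUS = 32              # assumption: the "32" in "32 x 40" is the CU count
WAVES_PER_CU = 40     # assumption: the "40" is the max waves per CU

# One instruction for a whole wave: 64 work-items at 16 per clock.
clocks_per_instruction = WAVE_SIZE // LANES_PER_GROUP


def wave_instructions_per_clock(groups=GROUPS_PER_CU):
    """Average wave-instructions a CU completes per clock in steady state."""
    return groups / clocks_per_instruction


max_pcs = CUS * WAVES_PER_CU                      # one PC per resident wave
total_alus = CUS * GROUPS_PER_CU * LANES_PER_GROUP

print(clocks_per_instruction)         # latency of one wave instruction
print(wave_instructions_per_clock())  # throughput per CU
print(max_pcs, total_alus)
```

The key distinction is latency versus throughput: each instruction still takes 4 clocks, but because 4 of them are in flight at once, one completes per clock on average.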

0 Likes

In short: in VLIW, a single ALU can execute multiple independent instructions over 4 clock cycles for one wavefront (another wavefront will not execute an instruction, only fetch).

In GCN, a single ALU can execute only one instruction, which finishes in 4 clock cycles as in VLIW, but four wavefronts can execute four different instructions at the same time.

Am I right?

0 Likes

If you change "ALU" to "compute unit", then yes, you are right. On VLIW, each CU contains 16 vector ALUs, each 4 or 5 elements wide. On GCN there are 4 groups of 16 scalar ALUs. Each group executes one wavefront at a time.

A VLIW instruction consists of up to 4 or 5 operations. That means you can have, for example, four adds, or two adds and two muls, etc.
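As a rough illustration of what "up to 4 or 5 operations per VLIW instruction" means for a compiler, here is a small Python sketch that greedily packs a list of operations into bundles. It is a simplified, hypothetical model: the `Op` type, the `pack_bundles` function, and the in-order greedy rule are all invented for this example and are not taken from any AMD toolchain.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    dest: str               # register written
    opcode: str             # e.g. "add" or "mul"
    srcs: Tuple[str, ...]   # registers read


def pack_bundles(ops: List[Op], width: int = 4) -> List[List[Op]]:
    """Pack ops, in order, into bundles of at most `width` independent ops.

    An op starts a new bundle if the current one is full, or if it reads
    or overwrites a register written earlier in the same bundle.
    """
    bundles: List[List[Op]] = []
    current: List[Op] = []
    for op in ops:
        written = {o.dest for o in current}
        depends = op.dest in written or any(s in written for s in op.srcs)
        if current and (len(current) == width or depends):
            bundles.append(current)
            current = []
        current.append(op)
    if current:
        bundles.append(current)
    return bundles


ops = [
    Op("a", "add", ("x", "y")),
    Op("b", "mul", ("x", "y")),
    Op("c", "add", ("a", "b")),  # needs a and b, so it cannot share their bundle
    Op("d", "add", ("x", "x")),
]
for bundle in pack_bundles(ops):
    print([f"{o.dest}={o.opcode}{o.srcs}" for o in bundle])
```

With dependent code, many slots stay empty, which is one reason the scalar layout of GCN is easier to keep busy.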

0 Likes


Basically yes, except that VLIW takes 8 clocks to finish an instruction and make its results available, so the VLIW engine is also pipelined. This is a problem because the next instruction cannot see the results of the previous one, so the compiler has to schedule the instructions carefully. GCN does not have this problem.

As nou said, VLIW and GCN are quite different. An "ALU" in VLIW has 4 processors (5 in older parts), while a GCN "ALU" has 1, but there are more of them. The ideas are about the same, though. The VLIW ALU is written as ALU.[WXYZ] or ALU.[WXYZT] for the 4 or 5 processors.

0 Likes

Also, a single wavefront shares a single program counter. In the case of if() and other branching, the corresponding work-items are masked out.

0 Likes