
smatovic
Adept II

Webarticle about AMDs Graphics Core Next - GCN

AMDs Graphics Core Next - GCN

Information about AMD's next GPU architecture:

http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute

Changes:

  • non-VLIW Design
  • 16 wide SIMD Units
  • 4 SIMD Units / Compute Unit
  • 10 Wavefronts / SIMD Unit
  • 64 KB registers / SIMD Unit
ryta1203
Journeyman III


Wouldn't the WF/SIMD depend on the register usage?
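To put rough numbers on that question, here is a minimal sketch using the figures from the list above (64 KB of registers and at most 10 wavefronts per SIMD unit). The 64-wide wavefront and the 4-byte register width are assumptions, and the function name is made up for illustration. Real hardware likely allocates registers in coarser blocks, so treat this as an upper bound rather than a statement of how the chip behaves.

```python
# Figures from the article summary: 64 KB registers and 10 wavefronts per SIMD.
REGISTER_FILE_BYTES = 64 * 1024
MAX_WAVEFRONTS_PER_SIMD = 10

# Assumptions for this sketch (not stated in the summary list):
WAVEFRONT_SIZE = 64        # work-items per wavefront
BYTES_PER_REGISTER = 4     # one 32-bit vector register per work-item


def wavefronts_per_simd(registers_per_work_item: int) -> int:
    """Upper bound on resident wavefronts for a given register usage."""
    if registers_per_work_item <= 0:
        raise ValueError("a kernel uses at least one register")
    bytes_per_wavefront = (
        registers_per_work_item * WAVEFRONT_SIZE * BYTES_PER_REGISTER
    )
    limit_from_registers = REGISTER_FILE_BYTES // bytes_per_wavefront
    # The scheduler limit caps the result even when registers are plentiful.
    return min(MAX_WAVEFRONTS_PER_SIMD, limit_from_registers)


if __name__ == "__main__":
    for regs in (16, 25, 26, 64, 128, 256):
        print(regs, "registers ->", wavefronts_per_simd(regs), "wavefronts")
```

Under these assumptions, the limit of 10 would only be reachable by kernels using up to 25 registers per work-item; beyond that, register usage decides how many wavefronts fit.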

So the number of "processors" essentially stays the same as with the 4-wide VLIW, but with fewer restrictions? That sounds like a good thing.

It will be interesting to see how efficient AMD GPUs now are at math-intensive operations, and whether their advantage remains or decreases. If it decreases to close to where Nvidia performs, there will be little to no reason to use AMD, since Nvidia's software is so much more mature.

Also, with the added hardware scheduler they are losing die space, so the theoretical peak should go down, but hopefully the practical peak will go up!? Eh.

smatovic
Adept II


...maybe the next GPU series will also be produced in 28 nm? mopo -> more power.

...a German news site mentioned that it is unlikely that "GCN" will be introduced in 2011:

http://www.heise.de/newsticker/meldung/GPU-Architektur-AMD-will-Nvidia-das-Fuerchten-lehren-1262833.html

 

 

--

Srdja

 

eduardoschardong
Journeyman III


Originally posted by: ryta1203

It will be interesting to see how efficient AMD GPUs now are at math-intensive operations, and whether their advantage remains or decreases. If it decreases to close to where Nvidia performs, there will be little to no reason to use AMD, since Nvidia's software is so much more mature.

Also, with the added hardware scheduler they are losing die space, so the theoretical peak should go down, but hopefully the practical peak will go up!? Eh.

I'm curious about this too. Also, what happens to the performance of kernels that operate on components (x, y, z)? On Cayman there were not only enough registers, but those kernels also packed well. How will this new design handle register pressure?

ryta1203
Journeyman III


It seems to me that in order to get enough power they are going to have to clock the SIMDs higher, which they could do with a 28nm process, along with having more die space.

Essentially you are going to be doing 16 SIMD instructions per clock cycle versus 64 VLIW instructions (4 wide) per clock cycle, or maybe I'm missing something.

16 SIMD processors = 16 instr/clock

16TP*4VLIW = 64 instr/clock

This is per Compute Unit; however, it's certainly possible that they will have more compute units per device, since it would seem likely they have some extra space now going away from VLIW, though I could be wrong.

I just don't see this new design competing with Cayman in arithmetic-intensive algorithms (AMD's previous strong suit). For memory-bound problems, CUDA cards are currently the best option anyway.

Meteorhead
Challenger


I suggest reading a concurrently running topic:

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=139142&enterthread=y

dravisher
Journeyman III


Did your impression change, ryta1203, or did I misunderstand something? First you said the number of processors stays the same:

Originally posted by: ryta1203

So the number of "processors" essentially stays the same as with the 4-wide VLIW, but with fewer restrictions? That sounds like a good thing.

   



Then that it's only 1/4?

Originally posted by: ryta1203

Essentially you are going to be doing 16 SIMD instructions per clock cycle versus 64 VLIW instructions (4 wide) per clock cycle, or maybe I'm missing something.

16 SIMD processors = 16 instr/clock

16TP*4VLIW = 64 instr/clock

This is per Compute Unit; however, it's certainly possible that they will have more compute units per device, since it would seem likely they have some extra space now going away from VLIW, though I could be wrong.

 

   



Anyway, the GCN (Graphics Core Next) Compute Unit (CU) has about the same floating point power per clock as the previous one (i.e. Cayman). It also has the same amount of register space (for the vector units). Cayman has 16 4-wide VLIW processing elements for a total of 16x4=64 operations in parallel, while the new architecture has 4 16-wide vector processors, again for a total of 4x16=64 operations per clock. GCN also has a scalar processor that Cayman does not. The difference is basically that GCN does not need instruction-level parallelism: each of the four 16-wide vector units executes a different wavefront (the whole 64-sized wavefront taking four cycles). So the theoretical floating point power stays roughly the same per CU, but GCN should be more efficient since it does not require instruction-level parallelism (though it presumably costs some more area/transistors as well).
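The arithmetic above can be written out as a small sketch. It only models the counting argument from this post (16 elements x 4 VLIW slots versus 4 SIMDs x 16 lanes); the function names are hypothetical, and it says nothing about clocks, latencies, or the scalar unit.

```python
SIMD_WIDTH = 16       # lanes per GCN vector unit
WAVEFRONT_SIZE = 64   # work-items per wavefront
SIMDS_PER_CU = 4      # GCN vector units per compute unit
VLIW_SLOTS = 4        # slots per Cayman VLIW bundle
CAYMAN_ELEMENTS = 16  # VLIW processing elements per Cayman SIMD


def cayman_ops_per_clock(filled_vliw_slots: int) -> int:
    """Operations per clock when the compiler fills this many VLIW slots."""
    if not 0 <= filled_vliw_slots <= VLIW_SLOTS:
        raise ValueError("a 4-wide bundle has between 0 and 4 filled slots")
    return CAYMAN_ELEMENTS * filled_vliw_slots


def gcn_ops_per_clock(simds_with_a_wavefront: int) -> int:
    """Operations per clock when this many SIMDs have a wavefront to run."""
    if not 0 <= simds_with_a_wavefront <= SIMDS_PER_CU:
        raise ValueError("a compute unit has between 0 and 4 busy SIMDs")
    return simds_with_a_wavefront * SIMD_WIDTH


def cycles_per_wavefront_instruction() -> int:
    """A 64-wide wavefront passes through a 16-wide SIMD in four cycles."""
    return WAVEFRONT_SIZE // SIMD_WIDTH


if __name__ == "__main__":
    print("Cayman, fully packed bundles:", cayman_ops_per_clock(4))
    print("Cayman, one slot per bundle: ", cayman_ops_per_clock(1))
    print("GCN, four busy SIMDs:        ", gcn_ops_per_clock(4))
    print("Cycles per wavefront instr.: ", cycles_per_wavefront_instruction())
```

Both designs peak at 64 operations per clock. The difference is what it takes to get there: Cayman needs the compiler to find four independent operations per bundle, while GCN needs four wavefronts ready to run.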

Meteorhead
Challenger


Dravisher is right: raw power per CU stays roughly the same, but if your problem allows you to launch enough wavefronts, the ALUs can more easily reach 100% load.

Eventually it will definitely require more transistors, but on a new process that is not impossible. I have pointed out before that it would be great if a new process added not just more raw power, but functionality. Here it is. I think it looks good.

nou
Exemplar


Am I understanding it right that you now need a workgroup of size 64, which maps to one wavefront, which is executed in four ticks of 16 items each?

And with this, you need a workgroup of size 256, which is executed as 4 wavefronts on the four 16-wide SIMD blocks?

By workgroup I mean an OpenCL workgroup.

And maybe one CU will execute wavefronts from multiple workgroups to keep the ALUs busy?
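The counting behind these questions can be sketched as follows. This assumes a 64-wide wavefront and four SIMDs per CU as discussed in this thread, and it assumes that a CU may take wavefronts from several workgroups at once, which is exactly the open question here and not something the thread confirms. The function names are made up for illustration.

```python
import math

WAVEFRONT_SIZE = 64   # work-items per wavefront
SIMDS_PER_CU = 4      # 16-wide vector units per compute unit


def wavefronts_per_workgroup(workgroup_size: int) -> int:
    """Number of wavefronts an OpenCL workgroup of this size splits into."""
    if workgroup_size <= 0:
        raise ValueError("workgroup size must be positive")
    # A partially filled last wavefront still occupies a whole wavefront.
    return math.ceil(workgroup_size / WAVEFRONT_SIZE)


def workgroups_to_cover_cu(workgroup_size: int) -> int:
    """Workgroups needed so each SIMD has at least one wavefront.

    Assumption: the CU can hold wavefronts from different workgroups
    at the same time.
    """
    return math.ceil(SIMDS_PER_CU / wavefronts_per_workgroup(workgroup_size))


if __name__ == "__main__":
    for size in (64, 128, 256):
        print(
            size, "work-items:",
            wavefronts_per_workgroup(size), "wavefront(s),",
            workgroups_to_cover_cu(size), "workgroup(s) to cover one CU",
        )
```

So a workgroup of 256 would be enough on its own to give every SIMD one wavefront, while workgroups of 64 would only cover the CU if four of them can be resident together.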

ryta1203
Journeyman III


Originally posted by: dravisher

Did your impression change, ryta1203, or did I misunderstand something? First you said the number of processors stays the same:

Originally posted by: ryta1203

So the number of "processors" essentially stays the same as with the 4-wide VLIW, but with fewer restrictions? That sounds like a good thing.

Then that it's only 1/4?

Originally posted by: ryta1203

Essentially you are going to be doing 16 SIMD instructions per clock cycle versus 64 VLIW instructions (4 wide) per clock cycle, or maybe I'm missing something.

16 SIMD processors = 16 instr/clock

16TP*4VLIW = 64 instr/clock

This is per Compute Unit; however, it's certainly possible that they will have more compute units per device, since it would seem likely they have some extra space now going away from VLIW, though I could be wrong.

Anyway, the GCN (Graphics Core Next) Compute Unit (CU) has about the same floating point power per clock as the previous one (i.e. Cayman). It also has the same amount of register space (for the vector units). Cayman has 16 4-wide VLIW processing elements for a total of 16x4=64 operations in parallel, while the new architecture has 4 16-wide vector processors, again for a total of 4x16=64 operations per clock. GCN also has a scalar processor that Cayman does not. The difference is basically that GCN does not need instruction-level parallelism: each of the four 16-wide vector units executes a different wavefront (the whole 64-sized wavefront taking four cycles). So the theoretical floating point power stays roughly the same per CU, but GCN should be more efficient since it does not require instruction-level parallelism (though it presumably costs some more area/transistors as well).

Yes, 64: 4 SIMD/CU with 16 TP/SIMD vs. 16 TP/SIMD with 4 VLIW processors. What I was trying to get at was the overall number of processors on the device. I think this will be lower; that is my assumption, just looking at Nvidia's solution too. I apologize if I used the term "processor" interchangeably. I shouldn't have; you are correct.
