
smatovic
Adept II

Web article about AMD's Graphics Core Next - GCN

AMD's Graphics Core Next - GCN

Information about AMD's next GPU architecture:

http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute

Changes:

  • non-VLIW Design
  • 16 wide SIMD Units
  • 4 SIMD Units / Compute Unit
  • 10 Wavefronts / SIMD Unit
  • 64 KB registers / SIMD Unit
0 Likes
11 Replies
ryta1203
Journeyman III

Wouldn't the WF/SIMD depend on the register usage?
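
The question above can be explored with a rough calculation. This is only a sketch built from the two figures in the list (64 KB of registers and up to 10 wavefronts per SIMD unit). It assumes 32-bit registers, 64 work-items per wavefront, and no allocation granularity; the function name is hypothetical and the real hardware rules may differ.

```python
# Sketch: how per-work-item register use could cap wavefronts per SIMD.
# From the thread: 64 KB registers per SIMD, at most 10 wavefronts per SIMD.
# Assumed (not confirmed by the thread): 32-bit registers, 64 work-items
# per wavefront, registers allocated without any rounding granularity.

REGISTER_FILE_BYTES = 64 * 1024
MAX_WAVEFRONTS_PER_SIMD = 10
WAVEFRONT_SIZE = 64
BYTES_PER_REGISTER = 4


def wavefronts_per_simd(registers_per_work_item: int) -> int:
    """Return how many wavefronts fit on one SIMD under the assumptions above."""
    if registers_per_work_item <= 0:
        raise ValueError("registers_per_work_item must be positive")
    bytes_per_wavefront = (
        registers_per_work_item * WAVEFRONT_SIZE * BYTES_PER_REGISTER
    )
    fit_by_registers = REGISTER_FILE_BYTES // bytes_per_wavefront
    return min(MAX_WAVEFRONTS_PER_SIMD, fit_by_registers)


if __name__ == "__main__":
    for regs in (16, 32, 64, 128):
        print(regs, "registers/work-item ->", wavefronts_per_simd(regs), "wavefronts")
```

Under these assumptions the limit of 10 wavefronts only holds for kernels using a small number of registers per work-item; heavier kernels would be capped by the register file instead.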

So the number of "processors" essentially stays the same as with the 4-wide VLIW, but with fewer restrictions? That sounds like a good thing.

It will be interesting to see how efficient AMD GPUs are at math-intensive operations now, and whether their advantage remains or decreases. If it decreases to close to where Nvidia performs, there will be little to no reason to use AMD, since Nvidia's software is so much more mature.

Also, with the added hardware scheduler they are losing die space, so the theoretical peak should go down, but hopefully the practical peak will go up!? Eh.

0 Likes

...maybe the next GPU series will also be produced in 28 nm? mopo -> more power.

...a German news site mentioned that it is unlikely that "GCN" will be introduced in 2011:

http://www.heise.de/newsticker/meldung/GPU-Architektur-AMD-will-Nvidia-das-Fuerchten-lehren-1262833.html

 

 

--

Srdja

 

0 Likes

Originally posted by: ryta1203  It will be interesting to see how efficient AMD GPUs are at math-intensive operations now, and whether their advantage remains or decreases. If it decreases to close to where Nvidia performs, there will be little to no reason to use AMD, since Nvidia's software is so much more mature.

Also, with the added hardware scheduler they are losing die space, so the theoretical peak should go down, but hopefully the practical peak will go up!? Eh.



I'm curious about this too. Also, what happens to the performance of kernels that operate on components (x, y, z)? On Cayman, not only were there enough registers, but those kernels also packed well. How will the new architecture handle register pressure?

0 Likes

It seems to me that in order to get enough power they are going to have to clock the SIMDs higher, which they could do with a 28nm process, along with having more die space.

Essentially you are going to be doing 16 SIMD instructions per clock cycle versus 64 VLIW instructions (4 wide) per clock cycle, or maybe I'm missing something.

16 SIMD processors = 16 instr/clock

16TP*4VLIW = 64 instr/clock

This is per Compute Unit; however, it's certainly possible that they will have more compute units per device, since it would seem likely they have some extra space now going away from VLIW, though I could be wrong.

I just don't see this new design competing with Cayman in arithmetic-intensive algorithms (AMD's previous strong suit). For memory-bound problems, CUDA cards are currently certainly the best option.

0 Likes

I suggest reading a concurrently running topic:

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=139142&enterthread=y

0 Likes

Did your impression change, ryta1203, or did I misunderstand something? First you said the number of processors stays the same:

Originally posted by: ryta1203  So the number of "processors" essentially stays the same as with the 4-wide VLIW, but with fewer restrictions? That sounds like a good thing.

   



Then that it's only 1/4?

Originally posted by: ryta1203  Essentially you are going to be doing 16 SIMD instructions per clock cycle versus 64 VLIW instructions (4 wide) per clock cycle, or maybe I'm missing something.

16 SIMD processors = 16 instr/clock

16TP*4VLIW = 64 instr/clock

This is per Compute Unit; however, it's certainly possible that they will have more compute units per device, since it would seem likely they have some extra space now going away from VLIW, though I could be wrong.

 

   



Anyway, the GCN (Graphics Core Next) Compute Unit (CU) has about the same floating point power per clock as the previous one (i.e. Cayman). It also has the same amount of register space (for the vector units). Cayman has 16 4-wide VLIW processing elements for a total of 16x4=64 operations in parallel, while the new architecture has 4 16-wide vector processors, again for a total of 4x16=64 operations per clock. GCN also has a scalar processor that Cayman does not. The difference is basically that GCN does not need instruction level parallelism: each of the four 16-wide vector units executes a different wavefront (a whole 64-wide wavefront taking four cycles). So the theoretical floating point power stays roughly the same per CU, but GCN should be more efficient since it does not require instruction level parallelism (though it presumably costs some more area/transistors as well).
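
The per-CU arithmetic in the post above can be written out as a small model. This is a sketch only: the widths (16 x 4 for Cayman, 4 x 16 for GCN, 64-wide wavefronts) come from the post, while the function names and the "average filled VLIW slots" parameter are hypothetical stand-ins for how much instruction level parallelism the compiler finds.

```python
# Sketch: vector operations per clock for one Compute Unit, using the widths
# quoted in the thread. The slot-fill and busy-SIMD inputs are illustrative
# assumptions, not measured values.

WAVEFRONT_SIZE = 64


def cayman_ops_per_clock(avg_filled_vliw_slots: float,
                         elements: int = 16, vliw_width: int = 4) -> float:
    """16 VLIW processing elements; throughput scales with slots the compiler fills."""
    if not 0 <= avg_filled_vliw_slots <= vliw_width:
        raise ValueError("filled slots must be between 0 and the VLIW width")
    return elements * avg_filled_vliw_slots


def gcn_ops_per_clock(busy_simds: int,
                      simds: int = 4, simd_width: int = 16) -> int:
    """4 SIMD units of 16 lanes; throughput scales with SIMDs that have a wavefront."""
    if not 0 <= busy_simds <= simds:
        raise ValueError("busy_simds must be between 0 and the SIMD count")
    return busy_simds * simd_width


def cycles_per_wavefront_instruction(simd_width: int = 16) -> int:
    """A 64-wide wavefront on a 16-wide SIMD needs 64 / 16 = 4 cycles."""
    return WAVEFRONT_SIZE // simd_width


if __name__ == "__main__":
    print("Cayman peak:", cayman_ops_per_clock(4))         # all 4 slots filled
    print("Cayman, 2.5 slots:", cayman_ops_per_clock(2.5))  # limited ILP
    print("GCN peak:", gcn_ops_per_clock(4))               # all 4 SIMDs busy
    print("Cycles per wavefront:", cycles_per_wavefront_instruction())
```

Both designs peak at 64 operations per clock in this model; the difference is what limits them below peak: unfilled VLIW slots on Cayman, idle SIMD units on GCN.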

0 Likes

Dravisher is right: raw power per CU stays roughly the same, but if your problem allows you to launch enough wavefronts, the ALUs can more easily reach 100% load.

Eventually it will definitely require more transistors, but on a new process that is not impossible. I have pointed out before that it would be great if a new process added not just more raw power but also functionality. Here it is. I think it looks good.

0 Likes

Am I understanding it right that now you need a workgroup of size 64, which maps to one wavefront that is executed in four ticks, 16 items per tick?

And with this you need a workgroup of size 256, which is executed as 4 wavefronts on these four 16-wide SIMD blocks?

By workgroup I mean an OpenCL workgroup.

And maybe one CU will execute wavefronts from multiple workgroups to keep the ALUs busy?
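
The workgroup-to-wavefront mapping asked about above comes down to a ceiling division. The sketch below assumes only the 64-wide wavefront mentioned in the thread; whether the four wavefronts of a 256-item workgroup actually land on four different SIMD units is not confirmed here, and the function names are hypothetical.

```python
# Sketch: how many 64-wide wavefronts an OpenCL workgroup splits into, and how
# many lanes in the last wavefront are padding. Only the wavefront size is
# taken from the thread; scheduling onto SIMD units is not modeled.
import math

WAVEFRONT_SIZE = 64


def wavefronts_in_workgroup(workgroup_size: int) -> int:
    """Number of wavefronts needed to cover all work-items in the workgroup."""
    if workgroup_size <= 0:
        raise ValueError("workgroup_size must be positive")
    return math.ceil(workgroup_size / WAVEFRONT_SIZE)


def padding_lanes(workgroup_size: int) -> int:
    """Lanes in the last wavefront that have no work-item assigned."""
    return wavefronts_in_workgroup(workgroup_size) * WAVEFRONT_SIZE - workgroup_size


if __name__ == "__main__":
    for size in (64, 100, 256):
        print(size, wavefronts_in_workgroup(size), padding_lanes(size))
```

A workgroup of 256 gives exactly four wavefronts, which matches the four SIMD units per CU, while a size that is not a multiple of 64 leaves part of the last wavefront unused.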

0 Likes

Originally posted by: dravisher  Did your impression change, ryta1203, or did I misunderstand something? First you said the number of processors stays the same:

Originally posted by: ryta1203  So the number of "processors" essentially stays the same as with the 4-wide VLIW, but with fewer restrictions? That sounds like a good thing.

   



Then that it's only 1/4?

Originally posted by: ryta1203  Essentially you are going to be doing 16 SIMD instructions per clock cycle versus 64 VLIW instructions (4 wide) per clock cycle, or maybe I'm missing something.

16 SIMD processors = 16 instr/clock

16TP*4VLIW = 64 instr/clock

This is per Compute Unit; however, it's certainly possible that they will have more compute units per device, since it would seem likely they have some extra space now going away from VLIW, though I could be wrong.

 

   



Anyway, the GCN (Graphics Core Next) Compute Unit (CU) has about the same floating point power per clock as the previous one (i.e. Cayman). It also has the same amount of register space (for the vector units). Cayman has 16 4-wide VLIW processing elements for a total of 16x4=64 operations in parallel, while the new architecture has 4 16-wide vector processors, again for a total of 4x16=64 operations per clock. GCN also has a scalar processor that Cayman does not. The difference is basically that GCN does not need instruction level parallelism: each of the four 16-wide vector units executes a different wavefront (a whole 64-wide wavefront taking four cycles). So the theoretical floating point power stays roughly the same per CU, but GCN should be more efficient since it does not require instruction level parallelism (though it presumably costs some more area/transistors as well).

Yes, 64: 4 SIMD/CU with 16 TP/SIMD vs. 16 TP/SIMD with 4 VLIW processors. What I was trying to get at was the overall number of processors on the device. I think this will be lower; that is my assumption, just from looking at Nvidia's solution too. I apologize if I used the term "processor" interchangeably; I shouldn't have. You are correct.

0 Likes

My guess would be that there will be a few more processors on the die. With the last fabrication process jump, from 65nm to 40nm, the SIMD engines were doubled; basically two of the same chip fit into the same die area. This new architecture brings a lot of new functionality (virtual address space, C++, DLL, ...), but the VLIW cores were far from simple either. New functionality won't increase complexity to a degree that no additional raw power can be added.

I think there will be roughly a 25% increase in the number of processors, given the increased complexity. This architecture has a big tradeoff: you need a lot more threads to keep it busy. Of course, that is natural if the number of processors increases, but this time you need 4x more than with "traditional" VLIW. Some lattice computations are very parallel, but no more than O(1000) threads can be used at a time, meaning that although they are compute-hungry, multi-GPU is hard with Cayman, and with GCN it is out of the question.

0 Likes

I'm not so sure that we'll need 4x as many work-items to keep it busy. With Cayman we need two wavefronts per CU, with GCN we need four wavefronts. However (in my experience) at least four wavefronts are needed per CU in Cayman/Cypress to get decent performance, and it's not entirely clear that GCN will really need any more work-items at all in practice. I've asked a question about this in the other thread (whether GCN will require more work-items to hide memory latency).
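
To put numbers on the figures in the post above (two wavefronts per CU for Cayman, four for GCN), here is a rough sketch of the minimum number of work-items needed to give every CU that many wavefronts. The CU count of 24 is a hypothetical example value, the function name is made up for illustration, and latency hiding is not modeled at all.

```python
# Sketch: minimum work-items to put a given number of wavefronts on every CU.
# Wavefronts-per-CU figures are the ones quoted in the post; the CU count is
# an illustrative assumption and memory latency hiding is ignored.

WAVEFRONT_SIZE = 64


def min_work_items(compute_units: int, wavefronts_per_cu: int) -> int:
    """Work-items needed so each CU holds the requested number of wavefronts."""
    if compute_units <= 0 or wavefronts_per_cu <= 0:
        raise ValueError("both arguments must be positive")
    return compute_units * wavefronts_per_cu * WAVEFRONT_SIZE


if __name__ == "__main__":
    cus = 24  # hypothetical device size
    cayman = min_work_items(cus, 2)
    gcn = min_work_items(cus, 4)
    print("Cayman-style minimum:", cayman)
    print("GCN-style minimum:", gcn)
    print("ratio:", gcn / cayman)
```

With these inputs the bare minimum differs by a factor of two, not four, and if Cayman already needs four wavefronts per CU in practice, as the post suggests, the gap would shrink further.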

0 Likes