11 Replies Latest reply on Jun 21, 2011 8:13 AM by dravisher

    Webarticle about AMDs Graphics Core Next - GCN

    smatovic

      AMDs Graphics Core Next - GCN

      Infos about AMDs next GPU architecture:

      http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute

      Changes:

      • non-VLIW Design
      • 16 wide SIMD Units
      • 4 SIMD Units / Compute Unit
      • 10 Wavefronts / SIMD Unit
      • 64 KB registers / SIMD Unit
        • Webarticle about AMDs Graphics Core Next - GCN
          ryta1203

          Wouldn't the WF/SIMD depend on the register usage?
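A rough way to look at this question, as a minimal sketch rather than a statement of how the hardware allocates registers: it takes the 64 KB register file and the limit of 10 wavefronts per SIMD from the list above, and assumes 32-bit registers, 64 work-items per wavefront, and no allocation granularity. The function name and constants are illustrative only.

```python
# Sketch: wavefronts that fit on one SIMD for a given register usage.
REGISTER_FILE_BYTES = 64 * 1024  # 64 KB of registers per SIMD (from the list above)
MAX_WAVEFRONTS = 10              # wavefront limit per SIMD (from the list above)
WAVEFRONT_SIZE = 64              # assumption: work-items per wavefront
BYTES_PER_REGISTER = 4           # assumption: 32-bit registers

# Registers each work-item could use if one wavefront owned the whole file.
REGISTERS_PER_LANE = REGISTER_FILE_BYTES // (WAVEFRONT_SIZE * BYTES_PER_REGISTER)


def wavefronts_per_simd(registers_per_work_item: int) -> int:
    """Resident wavefronts, limited by register usage and the scheduler cap."""
    if registers_per_work_item < 1:
        raise ValueError("a kernel uses at least one register")
    by_registers = REGISTERS_PER_LANE // registers_per_work_item
    return min(MAX_WAVEFRONTS, by_registers)


if __name__ == "__main__":
    for regs in (16, 25, 26, 64, 128):
        print(f"{regs:3d} registers/work-item -> {wavefronts_per_simd(regs)} wavefronts")
```

Under these assumptions, the figure of 10 wavefronts is an upper bound that only holds while a kernel uses 25 or fewer registers per work-item.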

          So the number of "processors" essentially stays the same as with the 4-wide VLIW, but with fewer restrictions? That sounds like a good thing.

          It will be interesting to see how efficient AMD GPUs now are at math-intensive operations, and whether their advantage remains or shrinks. If it shrinks to near Nvidia's level, there will be little to no reason to use AMD, since Nvidia's software is so much more mature.

          Also, with the added hardware scheduler they are losing die space, so the theoretical peak should go down, but hopefully the practical peak will go up!? Eh.

            • Webarticle about AMDs Graphics Core Next - GCN
              smatovic

              ...maybe the next GPU series will also be produced in 28 nm? mopo -> more power.

              ...a German news site reported that "GCN" is unlikely to be introduced in 2011:

              http://www.heise.de/newsticker/meldung/GPU-Architektur-AMD-will-Nvidia-das-Fuerchten-lehren-1262833.html

               

               

              --

              Srdja

               

              • Webarticle about AMDs Graphics Core Next - GCN
                eduardoschardong

                 

                Originally posted by: ryta1203 It will be interesting to see how efficient AMD GPUs now are at math-intensive operations, and whether their advantage remains or shrinks. If it shrinks to near Nvidia's level, there will be little to no reason to use AMD, since Nvidia's software is so much more mature.

                Also, with the added hardware scheduler they are losing die space, so the theoretical peak should go down, but hopefully the practical peak will go up!? Eh.



                I'm curious about this too. Also, what happens to the performance of kernels that operate on components (x, y, z)? On Cayman there were not only enough registers, but those kernels also packed well. How will the new architecture handle register pressure?

                  • Webarticle about AMDs Graphics Core Next - GCN
                    ryta1203

                    It seems to me that in order to get enough power they are going to have to clock the SIMDs higher, which they could do with a 28nm process, along with having more die space.

                    Essentially you are going to be doing 16 SIMD instructions per clock cycle versus 64 VLIW instructions (4 wide) per clock cycle, or maybe I'm missing something.

                    16 SIMD processors = 16 instr/clock

                    16TP*4VLIW = 64 instr/clock

                    This is per Compute Unit; however, it's certainly possible that they will have more compute units per device, since it would seem likely they have some extra space now going away from VLIW, though I could be wrong.

                    I just don't see this new design competing with Cayman in arithmetic-intensive algorithms (AMD's previous strong suit). For memory-bound problems, CUDA cards are currently the best option.

                      • Webarticle about AMDs Graphics Core Next - GCN
                        dravisher

                         

                        Did your impression change, ryta1203, or did I misunderstand something? First you said the number of processors stays the same:

                         

                        Originally posted by: ryta1203 So the number of "processors" essentially stays the same as with the 4-wide VLIW, but with fewer restrictions? That sounds like a good thing.

                           



                        Then that it's only 1/4?

                         

                        Originally posted by: ryta1203 Essentially you are going to be doing 16 SIMD instructions per clock cycle versus 64 VLIW instructions (4 wide) per clock cycle, or maybe I'm missing something.

                        16 SIMD processors = 16 instr/clock

                        16TP*4VLIW = 64 instr/clock

                        This is per Compute Unit; however, it's certainly possible that they will have more compute units per device, since it would seem likely they have some extra space now going away from VLIW, though I could be wrong.

                         

                           



                        Anyway, the GCN (Graphics Core Next) Compute Unit (CU) has about the same floating-point power per clock as the previous one (i.e. Cayman). It also has the same amount of register space (for the vector units). Cayman has 16 4-wide VLIW processing elements for a total of 16x4=64 operations in parallel, while the new architecture has four 16-wide vector processors, again for a total of 4x16=64 operations per clock. GCN also has a scalar processor, which Cayman does not. The basic difference is that GCN does not need instruction-level parallelism: each of the four 16-wide vector units executes a different wavefront (a whole 64-wide wavefront takes four cycles). So the theoretical floating-point power stays roughly the same per CU, but GCN should be more efficient since it does not depend on instruction-level parallelism (though it presumably costs some more area/transistors as well).
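The arithmetic in this post can be written out as a short sketch. The unit counts and widths are the ones given above; the class and function names are made up for illustration, and nothing here models clocks, scheduling, or the scalar unit.

```python
from dataclasses import dataclass

WAVEFRONT_SIZE = 64  # work-items per wavefront, as stated in the post


@dataclass(frozen=True)
class ComputeUnit:
    name: str
    units: int  # VLIW processing elements (Cayman) or vector SIMD units (GCN)
    width: int  # operations each unit can issue per clock

    def ops_per_clock(self) -> int:
        """Peak vector operations per clock for one compute unit."""
        return self.units * self.width


def cycles_per_wavefront(simd_width: int) -> int:
    """Cycles for one SIMD to push a whole wavefront through one instruction."""
    return WAVEFRONT_SIZE // simd_width


cayman = ComputeUnit("Cayman", units=16, width=4)  # 16 elements, each 4-wide VLIW
gcn = ComputeUnit("GCN", units=4, width=16)        # 4 SIMDs, each 16 lanes wide

if __name__ == "__main__":
    for cu in (cayman, gcn):
        print(f"{cu.name}: {cu.units} x {cu.width} = {cu.ops_per_clock()} ops/clock")
    print(f"GCN cycles per wavefront instruction: {cycles_per_wavefront(gcn.width)}")
```

Both layouts give the same peak of 64 operations per clock per CU. What differs is where the parallelism comes from: Cayman's 4-wide slots are filled from a single work-item's instruction stream, while GCN's 16 lanes are filled from 16 different work-items.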

                          • Webarticle about AMDs Graphics Core Next - GCN
                            Meteorhead

                            Dravisher is right: raw power per CU stays roughly the same, but if your problem allows you to launch enough wavefronts, the ALUs can more easily reach 100% load.

                            Eventually it will definitely require more transistors, but on a new process that is not impossible. I have pointed out before that it would be great if a new process added not just more raw power but also functionality. Here it is. I think it looks good.

                              • Webarticle about AMDs Graphics Core Next - GCN
                                nou

                                Am I understanding it right that currently you need a workgroup of size 64, which maps to one wavefront that is executed in four ticks, 16 items per tick?

                                And with this new design you need a workgroup of size 256, which is executed as 4 wavefronts on the four 16-wide SIMD blocks?

                                By workgroup I mean an OpenCL workgroup.

                                And maybe one CU will execute wavefronts from multiple workgroups to keep the ALUs busy?
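The mapping described in this question can be sketched as below. It only shows how an OpenCL workgroup size translates into 64-wide wavefronts and 16-wide ticks. It does not answer whether a workgroup of 256 is actually required, or whether a CU mixes wavefronts from several workgroups. The function names are hypothetical.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, as discussed in the thread
SIMD_WIDTH = 16      # lanes per SIMD unit, as discussed in the thread


def wavefronts_in_workgroup(workgroup_size: int) -> int:
    """Number of wavefronts needed to cover a workgroup (rounded up)."""
    if workgroup_size < 1:
        raise ValueError("workgroup size must be positive")
    return (workgroup_size + WAVEFRONT_SIZE - 1) // WAVEFRONT_SIZE


def ticks_per_wavefront() -> int:
    """Ticks for one 16-wide SIMD to process all items of one wavefront."""
    return WAVEFRONT_SIZE // SIMD_WIDTH


if __name__ == "__main__":
    for size in (64, 100, 256):
        print(f"workgroup of {size}: {wavefronts_in_workgroup(size)} wavefront(s)")
    print(f"ticks per wavefront: {ticks_per_wavefront()}")
```

A workgroup of 100 still occupies two full wavefronts in this model, so sizes that are multiples of 64 avoid idle lanes.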

                              • Webarticle about AMDs Graphics Core Next - GCN
                                ryta1203

                                 

                                Originally posted by: dravisher Did your impression change, ryta1203, or did I misunderstand something? First you said the number of processors stays the same:

                                Originally posted by: ryta1203 So the number of "processors" essentially stays the same as with the 4-wide VLIW, but with fewer restrictions? That sounds like a good thing.

                                Then that it's only 1/4?

                                Originally posted by: ryta1203 Essentially you are going to be doing 16 SIMD instructions per clock cycle versus 64 VLIW instructions (4 wide) per clock cycle, or maybe I'm missing something.

                                16 SIMD processors = 16 instr/clock

                                16TP*4VLIW = 64 instr/clock

                                This is per Compute Unit; however, it's certainly possible that they will have more compute units per device, since it would seem likely they have some extra space now going away from VLIW, though I could be wrong.

                                Anyway, the GCN (Graphics Core Next) Compute Unit (CU) has about the same floating-point power per clock as the previous one (i.e. Cayman). It also has the same amount of register space (for the vector units). Cayman has 16 4-wide VLIW processing elements for a total of 16x4=64 operations in parallel, while the new architecture has four 16-wide vector processors, again for a total of 4x16=64 operations per clock. GCN also has a scalar processor, which Cayman does not. The basic difference is that GCN does not need instruction-level parallelism: each of the four 16-wide vector units executes a different wavefront (a whole 64-wide wavefront takes four cycles). So the theoretical floating-point power stays roughly the same per CU, but GCN should be more efficient since it does not depend on instruction-level parallelism (though it presumably costs some more area/transistors as well).

                                Yes, 64: 4 SIMD/CU with 16 TP/SIMD vs. 16 TP/SIMD with 4 VLIW processors. What I was trying to get at was the overall number of processors on the device. I think this will be lower; that is my assumption, just looking at Nvidia's solution too. I apologize if I used the term "processor" interchangeably. I shouldn't have; you are correct.

                                  • Webarticle about AMDs Graphics Core Next - GCN
                                    Meteorhead

                                    My guess would be that there will be a few more processors on the die. With the last fabrication process jump, from 65nm to 40nm, the SIMD engines were doubled; basically two of the same chips fit into the same die. This new architecture brings a lot of new functionality (virtual address space, C++, DLL, ...), but the VLIW cores were far from simple either. The new functionality won't increase complexity to a degree that no additional raw power can be added.

                                    I think there will be roughly a 25% increase in the number of processors, given the increased complexity. This architecture has a big tradeoff: you need a lot more threads to keep it busy. Of course, that is natural when the number of processors increases, but this time you need 4x more than with "traditional" VLIW. Some lattice computations are very parallel, but no more than O(1000) threads can be used at a time, meaning that although they are compute-hungry, multi-GPU is hard with Cayman, and with GCN it is out of the question.

                                      • Webarticle about AMDs Graphics Core Next - GCN
                                        dravisher

                                        I'm not so sure that we'll need 4x as many work-items to keep it busy. With Cayman we need two wavefronts per CU; with GCN we need four. However, in my experience at least four wavefronts per CU are needed on Cayman/Cypress to get decent performance, so it's not entirely clear that GCN will really need any more work-items in practice. I've asked about this in the other thread (whether GCN will require more work-items to hide memory latency).
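To put numbers on the comparison in this post, here is a minimal sketch. The wavefront counts per CU are the ones quoted above (two for Cayman in theory, four in practice, four for GCN); the 64 work-items per wavefront and the helper names are assumptions for illustration. It says nothing about how many wavefronts are needed to hide memory latency, which is the open question.

```python
WAVEFRONT_SIZE = 64  # assumption: work-items per wavefront on both architectures


def work_items_per_cu(wavefronts_per_cu: int) -> int:
    """Work-items needed to supply the given number of wavefronts to one CU."""
    return wavefronts_per_cu * WAVEFRONT_SIZE


# Wavefronts per CU as quoted in the post above.
scenarios = {
    "Cayman, theoretical minimum": 2,
    "Cayman, decent performance in practice": 4,
    "GCN, theoretical minimum": 4,
}

if __name__ == "__main__":
    for label, wavefronts in scenarios.items():
        print(f"{label}: {work_items_per_cu(wavefronts)} work-items per CU")
```

On these figures the theoretical minimum doubles from 128 to 256 work-items per CU rather than quadrupling, and it matches what Cayman already needs in practice.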