Info about AMD's next GPU architecture:
http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute
Changes:
Wouldn't the WF/SIMD depend on the register usage?
So the number of "processors" essentially stays the same as with the 4-wide VLIW, but with fewer restrictions? That sounds like a good thing.
It will be interesting to see how efficient AMD GPUs now are at math-intensive operations, and whether their advantage remains or decreases. If it decreases to near where Nvidia performs, there will be little to no reason to use AMD, since Nvidia's software is so much more mature.
Also, with the added hardware scheduler they are losing die space, so the theoretical peak should go down, but hopefully the practical peak will go up!? Eh.
...maybe the next GPU series will also be produced in 28 nm? mopo -> more power.
...a German news site mentioned that it is unlikely that "GCN" will be introduced in 2011:
http://www.heise.de/newsticker/meldung/GPU-Architektur-AMD-will-Nvidia-das-Fuerchten-lehren-1262833.html
--
Srdja
Originally posted by: ryta1203 It will be interesting to see how efficient AMD GPUs now are at math-intensive operations, and whether their advantage remains or decreases. If it decreases to near where Nvidia performs, there will be little to no reason to use AMD, since Nvidia's software is so much more mature.
Also, with the added hardware scheduler they are losing die space, so the theoretical peak should go down, but hopefully the practical peak will go up!? Eh.
I'm curious about this too. Also, what happens to the performance of kernels that operate on components (x, y, z)? On Cayman, not only were there enough registers, but those kernels also packed well. How will the new design handle register pressure?
It seems to me that in order to get enough power they are going to have to clock the SIMDs higher, which they could do with a 28nm process, along with having more die space.
Essentially you are going to be doing 16 SIMD instructions per clock cycle versus 64 VLIW instructions (4 wide) per clock cycle, or maybe I'm missing something.
16 SIMD processors = 16 instr/clock
16TP*4VLIW = 64 instr/clock
This is per Compute Unit; however, it's certainly possible that they will have more compute units per device, since it would seem likely they have some extra space now going away from VLIW, though I could be wrong.
I just don't see this new design competing with Cayman in arithmetic-intensive algorithms (AMD's previous strong suit). For memory-bound problems, CUDA cards are currently the best option anyway.
I suggest reading a concurrently running topic:
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=139142&enterthread=y
Did your impression change, ryta1203, or did I misunderstand something? First you said the number of processors stays the same:
Originally posted by: ryta1203 So the number of "processors" essentially stays the same as with the 4-wide VLIW, but with fewer restrictions? That sounds like a good thing.
Then that it's only 1/4?
Originally posted by: ryta1203 Essentially you are going to be doing 16 SIMD instructions per clock cycle versus 64 VLIW instructions (4 wide) per clock cycle, or maybe I'm missing something.
16 SIMD processors = 16 instr/clock
16TP*4VLIW = 64 instr/clock
This is per Compute Unit; however, it's certainly possible that they will have more compute units per device, since it would seem likely they have some extra space now going away from VLIW, though I could be wrong.
Anyway, the GCN (Graphics Core Next) Compute Unit (CU) has about the same floating-point power per clock as the previous one (i.e. Cayman). It also has the same amount of register space (for the vector units). Cayman has 16 4-wide VLIW processing elements for a total of 16x4=64 operations in parallel, while the new architecture has 4 16-wide vector processors, again for a total of 4x16=64 operations per clock. GCN also has a scalar processor that Cayman does not. The difference is basically that GCN does not need instruction-level parallelism: each of the four 16-wide vector units executes a different wavefront (a whole 64-item wavefront taking four cycles). So the theoretical floating-point power stays roughly the same per CU, but GCN should be more efficient since it does not depend on instruction-level parallelism (though it presumably costs some more area/transistors as well).
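The per-CU arithmetic in the post above can be checked with a few lines of Python. This is only a sketch of the numbers quoted in this thread; the function and variable names are made up for illustration and are not taken from any AMD documentation.

```python
# Numbers are the ones quoted in this thread; all names are hypothetical.
WAVEFRONT_SIZE = 64  # work-items per wavefront


def lanes_per_cu(units, width):
    """Operations issued in parallel per clock by one compute unit."""
    return units * width


def cycles_per_wavefront_instruction(simd_width, wavefront_size=WAVEFRONT_SIZE):
    """Cycles one SIMD needs to push a whole wavefront through (ceiling division)."""
    return -(-wavefront_size // simd_width)


cayman_lanes = lanes_per_cu(units=16, width=4)  # 16 processing elements, 4-wide VLIW
gcn_lanes = lanes_per_cu(units=4, width=16)     # 4 vector units, 16 lanes each
gcn_cycles = cycles_per_wavefront_instruction(simd_width=16)

print("Cayman lanes per CU:", cayman_lanes)
print("GCN lanes per CU:", gcn_lanes)
print("GCN cycles per wavefront instruction:", gcn_cycles)
```

Both layouts come out at the same number of lanes per CU; what differs is how those lanes are kept busy.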
Dravisher is right, raw power per CU stays roughly the same, but if your problem allows you to launch enough wavefronts, the ALUs can more easily reach 100% load.
It will definitely require more transistors in the end, but on a new process that is not impossible. I have pointed out before that it would be great if a new process added not just more raw power but also functionality. Here it is. I think it looks good.
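A toy model of the utilization point made above, in Python. It assumes (a simplification for illustration, not something stated in the preview) that a VLIW4 unit is only as busy as the number of independent operations the compiler packs into each bundle, while a GCN CU is as busy as the number of its four SIMDs that have a wavefront to run. Memory latency and everything else is ignored, and all names are hypothetical.

```python
# Toy model only: ignores memory latency, clocks, and scheduling details.
VLIW_SLOTS = 4      # slots per VLIW bundle (Cayman-style, per this thread)
SIMDS_PER_CU = 4    # 16-wide vector units per GCN CU (per this thread)


def vliw_utilization(ops_per_bundle, slots=VLIW_SLOTS):
    """Fraction of VLIW slots filled, given ops packed into each bundle."""
    filled = sum(min(n, slots) for n in ops_per_bundle)
    return filled / (slots * len(ops_per_bundle))


def gcn_utilization(ready_wavefronts, simds=SIMDS_PER_CU):
    """Fraction of SIMDs with work, assuming one ready wavefront keeps one SIMD busy."""
    return min(ready_wavefronts, simds) / simds


# A kernel with little instruction-level parallelism packs poorly on VLIW...
print(vliw_utilization([4, 2, 1, 1]))
# ...but in this model it fills a GCN CU as long as enough wavefronts are launched.
print(gcn_utilization(4))
```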
Am I understanding it right that currently you need a workgroup of size 64, which maps to one wavefront that is executed in four ticks, 16 items per tick?
And with this you need a workgroup of size 256, which is executed as 4 wavefronts on these four 16-wide SIMD blocks?
By workgroup I mean an OpenCL workgroup.
And maybe one CU will execute wavefronts from multiple workgroups to keep the ALUs busy?
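As a rough sketch of the mapping asked about above, assuming a 64-item wavefront and four 16-wide SIMDs per CU as described in this thread. How the hardware actually hands wavefronts to SIMDs is not confirmed here, so the round-robin assignment below is purely an assumption, and the names are hypothetical.

```python
# Sizes are the ones discussed in this thread; the assignment policy is assumed.
WAVEFRONT_SIZE = 64
SIMDS_PER_CU = 4


def wavefronts_in_workgroup(workgroup_size):
    """Number of wavefronts needed to cover a workgroup (ceiling division)."""
    return -(-workgroup_size // WAVEFRONT_SIZE)


def assign_round_robin(workgroup_size):
    """Hypothetical policy: wavefront i of the workgroup runs on SIMD i mod 4."""
    count = wavefronts_in_workgroup(workgroup_size)
    return [wf % SIMDS_PER_CU for wf in range(count)]


print(wavefronts_in_workgroup(64))   # a 64-item workgroup is a single wavefront
print(assign_round_robin(256))       # a 256-item workgroup spreads over the SIMDs
```

Under these assumptions a 64-item workgroup occupies only one of the four SIMDs, so the other three would need wavefronts from other workgroups (or a larger workgroup) to stay busy.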
Originally posted by: dravisher Did your impression change, ryta1203, or did I misunderstand something? First you said the number of processors stays the same:
Originally posted by: ryta1203 So the number of "processors" essentially stays the same as with the 4-wide VLIW, but with fewer restrictions? That sounds like a good thing.
Then that it's only 1/4?
Originally posted by: ryta1203 Essentially you are going to be doing 16 SIMD instructions per clock cycle versus 64 VLIW instructions (4 wide) per clock cycle, or maybe I'm missing something.
16 SIMD processors = 16 instr/clock
16TP*4VLIW = 64 instr/clock
This is per Compute Unit; however, it's certainly possible that they will have more compute units per device, since it would seem likely they have some extra space now going away from VLIW, though I could be wrong.
Yes, 64: 4 SIMD/CU with 16 TP/SIMD vs. 16 TP/SIMD with 4 VLIW processors. What I was trying to get at was the overall number of processors on the device. I think this will be lower; that is my assumption, just from looking at Nvidia's solution too. I apologize if I used the term "processor" interchangeably. I shouldn't have, you are correct.