46 Replies Latest reply on Feb 2, 2012 6:19 PM by Meteorhead

    Future HW and SDK

    Meteorhead
      Questions about upcoming tech

      Hi, I have opened this topic so there is a place for everyone to post questions about the capabilities and properties of the upcoming HW and SDK.

        • Future HW and SDK
          Meteorhead

          My first questions would be:

          - What CL_DEVICE_TYPE will the GPU inside the upcoming Llano APUs be?

          I ask because I fancy the thought of being able to write applications (and to see games) that run regularly on the CPU, calculate physics, AI and other highly parallel parts on the IGP inside the CPU, and use the GPU solely for graphics. Since APU stands for Accelerated Processing Unit, will the GPU inside Llano be a CL_DEVICE_TYPE_ACCELERATOR? It would be wise to make a distinction for devices that share their __global memory physically with the host (as Llano will do).

          - Will either Radeon 6xxx cards or the new APUs support out-of-order exec?

          Out-of-order execution on GPUs is useful, although hard to harness, but inside the APU it would be most useful: if one uses OpenCL events smartly, one could create massively optimized engines for games, where memory handling, window management, AI, physics, etc. could run wickedly fast.
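          The scheduling idea behind this can be sketched in plain Python. This is only an illustration of event-driven out-of-order dispatch, not the OpenCL API; the task names ("upload", "physics", etc.) are made up for the example:

          ```python
          # Hypothetical sketch: an out-of-order queue may start any task whose
          # event wait-list is satisfied, instead of following enqueue order.

          def schedule_out_of_order(tasks):
              """tasks: dict of name -> list of dependency names.
              Returns one valid execution order."""
              done = set()
              order = []
              pending = dict(tasks)
              while pending:
                  # pick every task whose dependencies have all completed
                  ready = [t for t, deps in pending.items()
                           if all(d in done for d in deps)]
                  if not ready:
                      raise RuntimeError("dependency cycle")
                  for t in sorted(ready):  # deterministic order for the sketch
                      order.append(t)
                      done.add(t)
                      del pending[t]
              return order

          # Physics and AI can run in either order once the upload finishes,
          # while rendering waits on both:
          deps = {"upload": [], "physics": ["upload"], "ai": ["upload"],
                  "render": ["physics", "ai"]}
          print(schedule_out_of_order(deps))  # ['upload', 'ai', 'physics', 'render']
          ```

          In real OpenCL the same shape is expressed with an out-of-order command queue and per-command event wait-lists; the runtime, not the application, makes the pick.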

          - How much effort would it take to have higher DP capacity and/or support for QP?

          I read somewhere how Radeon cards deal with DP operations, namely that 2 Stream Cores are linked inside a vector processor for the duration of the operation and the remaining 3 are non-operational for the time being. This is the reason DP capacity is 1/5 of SP. I do not know how NVIDIA implements DP, but since each CUDA core has a single INT and FP unit, I suspect there are 2 ways: some CUDA cores are native 64-bit while others are not, OR 32-bit INT and FP units do 64-bit operations at the cost of hidden register use. Since OpenCL is inherently able to query preferred vector widths at certain precisions, and Radeon SIMD engines are inherently capable of doing 64 (or even 128) bit operations with 32-bit shader processors via this linking, the question is the following: I know linking Stream Cores to do 64-bit operations takes up space inside the die, but how much more would it take to have 4*1 for SP, 2*2 for DP, and 1*4 for QP operations? Quadruple precision might be something that is a lot harder to implement on NVIDIA cards with the usage of single execution units, and AMD could win quite a few customers in the GPGPU segment by being first to support a healthy QP capacity on GPUs; the same goes for the sole ability to link 2*2 Stream Cores to reach double the DP capacity. The Radeon 6xxx series might not, but future 28nm GPUs might have the space on the SIMD engines to do the extra linking.

            • Future HW and SDK
              nou

              Maybe they will make a new type, CL_DEVICE_TYPE_APU.

              IMHO out-of-order is just a SW implementation of the queue. Concurrent running of multiple kernels is another story.

              Each 5D unit can do a MADD instruction, which is counted as two FLOPs. And with DP, two and two units are linked together to perform two DP +-* operations. So one 5D unit can do 10 SP ops/clock and 2 DP ops/clock.
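              This arithmetic is easy to check with a small calculation. The Cypress-like figures below (320 VLIW5 stream cores at 0.85 GHz) are used only for illustration:

              ```python
              # Peak-throughput arithmetic for a VLIW5 stream core as described
              # above: 5 PEs x MADD (2 FLOPs) = 10 SP FLOPs/clock, and two
              # linked 2-PE pairs = 2 DP FLOPs/clock (no MADD while linked).

              def peak_gflops(stream_cores, clock_ghz, ops_per_core_per_clock):
                  return stream_cores * clock_ghz * ops_per_core_per_clock

              SP_OPS = 5 * 2   # 5 PEs, each doing a fused multiply-add
              DP_OPS = 2       # two linked pairs, one DP op each

              print(peak_gflops(320, 0.85, SP_OPS))  # 2720.0 SP GFLOPS
              print(peak_gflops(320, 0.85, DP_OPS))  # 544.0 DP GFLOPS
              ```

              That reproduces the familiar 5:1 SP:DP ratio the thread starts from.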

               

                • Future HW and SDK
                  bubu

                   

                  Originally posted by: Meteorhead

                  - What CL_DEVICE_TYPE will be the GPU inside the upcoming Liano APUs?

                  I bet Llano will expose 2 OpenCL devices, one typed as a CPU and the other typed as a DX11 GPU.

                  - Will either radeon 6xxx cards or the new APUs support out-of-order exec?

                  I hope, as well as DMA transfers...

                   

                    • Future HW and SDK
                      nou

                      IMHO, again, DMA transfer is just a limitation of the current implementation. Even the 4xxx can do DMA transfers under CAL. IIRC someone from AMD stated that they are working on it.

                        • Future HW and SDK
                          Meteorhead

                          This is the part in the ATI OpenCL Computing Guide I have mentioned. So do I have it right that when linking is done, no MADD operations are available, so one operation cannot be counted as 2 FLOPs? This quote is misleading in some way: it says "two or four are linked... to perform a SINGLE DP operation". Shouldn't it be 1 DP FLOP when linking two, and 2 DP FLOPs when linking four?

                          But if this last is true, then DP capacity could only be increased by adding MADD capability under linked Processing Element mode. QP needs a little more linking, and perhaps also the ability to deal with MADD operations.

                          If it is true, though, that 2 DP operations can be dealt with at once, why does OpenCL report the preferred DP vector width to be 1 on the 5970?

                          A stream core is arranged as a five-way very long instruction word (VLIW) processor. Up to five scalar operations can be coissued in a VLIW instruction, each of which are executed on one of the corresponding five processing elements. Processing elements can execute single-precision floating point or integer operations. One of the five processing elements also can perform transcendental operations (sine, cosine, logarithm, etc.) Double-precision floating point operations are processed by connecting two or four of the processing elements (excluding the transcendental core) to perform a single double-precision operation. The stream core also contains one branch execution unit to handle branch instructions.

                            • Future HW and SDK
                              himanshu.gautam

                              hi all,

                              Nice to hear your thoughts.

                              meteorhead,

                              I confirm the bug in the document, but I hope the issue has been clarified by nou very well.

                              • Future HW and SDK
                                malcolm3141

                                I believe this is referred to in the Optimisation Guide - a DP add or sub requires two pipes (in other words two can be scheduled in one bundle), but a DP mul or fma takes all four pipes (and hence only one can be scheduled in each bundle).

                                Talking of future hardware, I would love to see AMD include 32bit multipliers in each of the xyzw pipes, and I could also see them provide enough hardware between two pipes to perform at least a DP mad or even better a full precision DP fma. To be able to claim >1TFlops DP performance from a single GPU would be amazing!

                                 

                                Malcolm

                                  • Future HW and SDK
                                    Meteorhead

                                    If I'm not mistaken, I recall AMD stating that it wishes to follow the APU approach on the Opteron front line beside desktop solutions. It would be nice to hear some bits (or even more) of information about these products. Is the plan only to integrate the IGP into the CPU to reduce energy consumption, or will there be processors with higher SIMD capacity?

                                    I am very much interested in every way parallel computing hardware can be neatly integrated into HPC clusters. I think all supercomputer owners (as well as those looking for HPC solutions) would welcome a way to have upgradeable HW, meaning an Opteron would include a maximum of 4 cores, and the rest of the die would be SIMD engines (and some cache). This way existing 1U racks could be reused for major upgrade in computing power.

                                    Right now the neatest and most compact way of creating a GPU cluster would be the solutions offered by *beep*, where a 1U rackmount can hold 2 double-width GPUs. The only problem is that the half-width motherboard offered holds 2 processor slots. GPU clusters (in my opinion) don't need very powerful processors, only fast RAM access and mediocre computing power. Having 1 quad-/hexa-/octa-core processor per GPU is a waste of money and computing power.

                                    If anyone has anything to add, or correct me at points, please do.

                                      • Future HW and SDK
                                        Meteorhead


                                        Instead of opening a new topic, let me post to a previous one:

                                        I know AMD employees will not speak about unreleased HW, so let me ask a theoretical question purely based on news, or information publicly available:

                                        Some future GPU of AMD (most likely top Southern Islands) will feature a brand new architecture designed from scratch, having kept in mind the needs of APU integration.

                                        http://wccftech.com/2011/06/15/amd-slides-detail-upcoming-radeon-hd-79-series-gpu-architecture/

                                        There is one thing I do not understand. How come they advertise this architecture as another step toward GPGPU applications, when I really cannot see how SIMD-vector processing is "general"? The VLIW architecture excelled at being the sweet spot between graphics and GPGPU. Graphics used the VLIW architecture as a vector processor, and GPGPU applications leveraged the compiler to vectorize scalar code. Having a 16-wide SIMD which 4 threads may share seems to mean that one thread gets a minimum of 4-wide SIMD. One thread simply cannot utilize a 4-wide SIMD unless it is vectorized code.

                                        As it seems to me:

                                        1) Say goodbye to cross-vendor OCL code. Scalar OCL code will utilize 25% of the card (35% max). Hail to HPC and scientific use, where we'll have to develop two separate host- and kernel-side codes.

                                        2) Applications where vectorization cannot be done efficiently will simply underperform expectations greatly on AMD HW.

                                        The new architecture seems awesome, I really like all the new stuff packed into it, and big gratz to AMD for that. However, VLIW seemed like AMD's strength to me, and I thought that as soon as the superscalar architecture, or VLIW, is left behind, all that will remain is an architecturally inferior Tesla. The architecture has greatly developed and the superscalar design remains, but SIMD is far inferior to VLIW.

                                        Please, someone tell me that I am wrong at some point. How will this be GPGPU?

                                          • Future HW and SDK
                                            maximmoroz

                                            Meteorhead, what's the problem with new architecture? From "instruction point of view":

                                            - Current architecture (VLIW): 16 stream cores, each contains 4 processing elements.

                                            - New one: 4x16 stream cores, each contains 1 processing element.

                                            You will no longer need to be frustrated about a low ALU Packing number :) The new architecture is similar to the NVidia one, but with more processing elements per compute unit (and more compute units, I guess).

                                            Looking at the pictures you linked to... I have other concerns. What if a single wavefront may be executed only on a single 16-wide SIMD? It would mean that to be efficient the kernel should provide 4 or even 8 wavefronts per compute unit.

                                              • Future HW and SDK
                                                Meteorhead

                                                My concern is that on Fermi there are 32 Processing Elements (CUDA cores) inside a Compute Unit, each PE has scalar FP and INT units, and one PE processes one thread only.

                                                Cayman has 16 Processing Elements (Stream Cores), each of them 4-wide VLIW. Each PE runs a single thread and can co-issue different operations down each VLIW lane!! 1 MUL, 2 ADD, 1 CMP for e.g. If scalar code was written, this packing was done by the compiler.

                                                The new architecture has inside one CU (I do not know if this CU is identical to the OpenCL CU terminology, but I suspect not) a 16-wide SIMD, which can be utilized by different threads, or by a single thread on the CU, but at most by 4, with the most granular split being 4 (threads) * 4 (wide SIMD). Different threads can co-issue different ALU operations down the SIMD lanes, BUT inside one thread the 4-wide SIMD must have all identical operations: 4 ADD or 4 MUL for e.g.

                                                From now on, if you do not have your code vectorized to at least 4-wide vector operations in every single thread, you will heavily underutilize the HW.

                                                This contradiction of the new architecture being GPGPU capable is only resolved if the CU is the same as the OpenCL CU in terminology; then there would have to be at least 96 Compute Units to result in at least the same number of real processing elements (Stream Processors) as there is on a Cayman. If this is the case, that one Compute Unit will look like this, and the 16-wide SIMD is only the implementation of the rule that all threads inside a workgroup MUST do identical operations at all times, then it is OK. If this is true, then the 4*4-wide SIMD approach is an extension, namely that one Compute Unit can process different kernels at the same time, which would be very much useful for the asynchronous thread dispatch processor. If this is true, then this architecture is really close to being black magic.

                                                Wavefront size will still remain 64 threads (if my speculation is true), because it will still take 4 cycles to reach a register, and with 16 ALUs (not counting the 17th, scalar one) the hardware will create 64-wide thread groups to hide register latency.

                                                If my speculation is true, then it is true that this approach is closer to NV's, but it might be even more flexible, with the ability to multitask on a single Compute Unit.

                                                  • Future HW and SDK
                                                    maximmoroz

                                                     

                                                    Originally posted by: Meteorhead

                                                    The new architecture has inside one CU (I do not know if this CU is identical to the OpenCL CU terminology, but I suspect not) a 16-wide SIMD, which can be utilized by different threads, or by a single thread on the CU, but at most by 4, with the most granular split being 4 (threads) * 4 (wide SIMD). Different threads can co-issue different ALU operations down the SIMD lanes, BUT inside one thread the 4-wide SIMD must have all identical operations: 4 ADD or 4 MUL for e.g.

                                                    From now on, if you do not have your code vectorized to at least 4-wide vector operations in every single thread, you will heavily underutilize the HW.

                                                    Meteorhead, I am sure that "16-wide SIMD" doesn't mean that at most 4 work-items will be executed at any given moment on this "16-wide SIMD"... block. No way. Instead, one wavefront at any given time will occupy this block entirely, executing 16 work-items out of its 64.

                                                    It seems you think that the VLIW4 stream core from the Cayman architecture is widened to SIMD-16? Well, I think the opposite is true: this stream core is narrowed to a single scalar processing element.

                                                      • Future HW and SDK
                                                        Meteorhead

                                                         

                                                        Originally posted by: maximmoroz
                                                        Originally posted by: Meteorhead

                                                         

                                                        The new architecture has inside one CU (I do not know if this CU is identical to the OpenCL CU terminology, but I suspect not) a 16-wide SIMD, which can be utilized by different threads, or by a single thread on the CU, but at most by 4, with the most granular split being 4 (threads) * 4 (wide SIMD). Different threads can co-issue different ALU operations down the SIMD lanes, BUT inside one thread the 4-wide SIMD must have all identical operations: 4 ADD or 4 MUL for e.g.

                                                         

                                                        From now on, if you do not have your code vectorized to at least 4-wide vector operations in every single thread, you will heavily underutilize the HW.

                                                         

                                                         

                                                        Meteorhead, I am sure that "16-wide SIMD" doesn't mean that at most 4 work-items will be executed at any given moment on this "16-wide SIMD"... block. No way. Instead, one wavefront at any given time will occupy this block entirely, executing 16 work-items out of its 64.

                                                         

                                                        It seems you think that the VLIW4 stream core from the Cayman architecture is widened to SIMD-16? Well, I think the opposite is true: this stream core is narrowed to a single scalar processing element.

                                                         

                                                         

                                                        There were two possibilities:

                                                        (CU != OCL_CU) && (4VLIW >> 16SIMD)

                                                        OR

                                                        (CU == OCL_CU) && (4VLIW >> 1SISD)

                                                        If the first is true, it will be a big mess. If the second is true, then it will be very capable, but there must be significantly more CUs in a GPU than there are now. (Roughly ~100)

                                                        One wavefront occupying the 16-wide SIMD is logical enough, and will most likely be the default case. But since a wavefront is ALWAYS 64 wide, even if your workgroup is only 16 threads, the thread dispatch processor will create dummy threads for you to fill it up to 64, with their operations masked from making output. Therefore there would be no sense in allowing a 4*4 breakup of the SIMD array if it were not for different kernels being able to run on the same compute unit.

                                                        Can you think of any other scenario where it is useful?

                                                          • Future HW and SDK
                                                            maximmoroz

                                                            Meteorhead, sorry, I have completely lost track of the discussion.

                                                            Let me state what I got from the slides you linked to: AMD leaves VLIW architecture behind.

                                                            - Advantage: no more ALU Packing issue

                                                            - Possible disadvantage: while in the VLIW architecture 2 wavefronts are enough to hide register access and ALU latency, the new architecture MIGHT require more wavefronts (4 or even 8) to hide that latency.

                                                              • Future HW and SDK
                                                                LeeHowes

                                                                Might; however, you have to remember that if you take a VLIW-5 packet and flatten it, you get 5 issue slots in time instead of space. That's 5 (instruction) cycles' worth of latency hiding :)

                                                                The architecture described in the talks had four 16-wide SIMD units per CU. It issues 4 waves over four cycles per CU - that's the same number of instructions as Cayman but laid out in time rather than space for each vector instruction.

                                                                Cayman has a 16 wide SIMD unit. The discussed architecture has a 16 wide SIMD unit. I'm not sure where the confusion is coming from?

                                                                Remember, Cayman, the discussed architecture, and Fermi are all vector architectures - there are just subtle differences in how they issue vector instructions and how wide the vectors are.

                                                                ETA: remember that a thread as far as Fermi is concerned is a 32-wide vector, as far as Cayman is concerned it is a 64-wide vector. This is slightly different from the width of the hardware SIMD unit and also different from the way the word thread is used in CUDA. For the purposes of discussion not using the word thread at all might be clearer ;)

                                                                  • Future HW and SDK
                                                                    eduardoschardong

                                                                    Lee, I'm a bit confused too by those blocks and arrows... Can you help?

                                                                    There is a diagram of the SIMDs with arrows between them, but only in one direction; what do those arrows mean?

                                                                     

                                                                    Also, it's now 10 wavefronts per SIMD and still 4 cycles per wavefront per instruction, and scalar. Was single-threaded performance on compute-bound kernels sacrificed a LOT?

                                                                     

                                                                      • Future HW and SDK
                                                                        Meteorhead

                                                                        OK, I think I got it. Basically exactly the same number of processors is found inside a CU, but instead of vectorizing scalar code in VLIW manner, all vector code is being "serialized".

                                                                        The drawback is that more wavefronts are needed to keep the ALUs busy. If I'm not mistaken, computation is done in the following manner (same amount of work done in the given time):

                                                                         

                                                                        Cayman:
                                                                        Tick01: 00-15 threads of Wavefront 0 compute 4 instructions at once in VLIW manner.
                                                                        Tick02: 16-31 threads of Wavefront 0 compute 4 instructions at once in VLIW manner.
                                                                        Tick03: 32-47 threads of Wavefront 0 compute 4 instructions at once in VLIW manner.
                                                                        Tick04: 48-63 threads of Wavefront 0 compute 4 instructions at once in VLIW manner.
                                                                        Tick05: 00-15 threads of Wavefront 1 compute 4 instructions at once in VLIW manner.
                                                                        Tick06: 16-31 threads of Wavefront 1 compute 4 instructions at once in VLIW manner.
                                                                        Tick07: 32-47 threads of Wavefront 1 compute 4 instructions at once in VLIW manner.
                                                                        Tick08: 48-63 threads of Wavefront 1 compute 4 instructions at once in VLIW manner.
                                                                        Tick09: 00-15 threads of Wavefront 2 compute 4 instructions at once in VLIW manner.
                                                                        Tick10: 16-31 threads of Wavefront 2 compute 4 instructions at once in VLIW manner.
                                                                        Tick11: 32-47 threads of Wavefront 2 compute 4 instructions at once in VLIW manner.
                                                                        Tick12: 48-63 threads of Wavefront 2 compute 4 instructions at once in VLIW manner.
                                                                        Tick13: 00-15 threads of Wavefront 3 compute 4 instructions at once in VLIW manner.
                                                                        Tick14: 16-31 threads of Wavefront 3 compute 4 instructions at once in VLIW manner.
                                                                        Tick15: 32-47 threads of Wavefront 3 compute 4 instructions at once in VLIW manner.
                                                                        Tick16: 48-63 threads of Wavefront 3 compute 4 instructions at once in VLIW manner.

                                                                        Southern Islands:
                                                                        Tick01: 00-15 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick02: 16-31 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick03: 32-47 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick04: 48-63 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick05: 00-15 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick06: 16-31 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick07: 32-47 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick08: 48-63 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick09: 00-15 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick10: 16-31 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick11: 32-47 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick12: 48-63 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick13: 00-15 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick14: 16-31 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick15: 32-47 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
                                                                        Tick16: 48-63 threads of Wavefronts 0-3 compute 1 instruction. Each wavefront on a different SIMD line.
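                                                                        The two issue patterns can also be generated programmatically. This sketch only reproduces the schedule described in this post (the quarter-wavefront-per-tick interpretation is the post's speculation, not a confirmed spec) and checks that both variants issue the same instruction volume over 16 ticks:

                                                                        ```python
                                                                        # Cayman: one wavefront issues a 4-op VLIW bundle for one quarter
                                                                        # (16 threads) of itself per tick, occupying 4 consecutive ticks.
                                                                        def cayman_ticks(n_ticks=16):
                                                                            sched = []
                                                                            for t in range(n_ticks):
                                                                                wavefronts = [t // 4]   # one wavefront at a time
                                                                                lo = (t % 4) * 16       # which quarter of the wavefront
                                                                                sched.append((t + 1, wavefronts, lo, lo + 15, 4))
                                                                            return sched

                                                                        # New architecture: 4 wavefronts each issue 1 scalar instruction
                                                                        # for one quarter of themselves per tick, on 4 separate SIMDs.
                                                                        def gcn_ticks(n_ticks=16):
                                                                            sched = []
                                                                            for t in range(n_ticks):
                                                                                lo = (t % 4) * 16
                                                                                sched.append((t + 1, [0, 1, 2, 3], lo, lo + 15, 1))
                                                                            return sched

                                                                        # Both issue the same instruction volume over 16 ticks:
                                                                        def volume(sched):
                                                                            return sum(len(wfs) * ops for _, wfs, _, _, ops in sched)

                                                                        print(volume(cayman_ticks()), volume(gcn_ticks()))  # 64 64
                                                                        ```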

                                                                        • Future HW and SDK
                                                                          LeeHowes

                                                                           

                                                                          There is a diagram of the SIMDs with arrows between them, but only in one direction; what do those arrows mean?


                                                                          Not a clue :)

                                                                           

                                                                          Also, it's now 10 wavefronts per SIMD and still 4 cycles per wavefront per instruction, and scalar. Was single-threaded performance on compute-bound kernels sacrificed a LOT?


                                                                          *Up to 10*. The current design allows up to some high number per macro sequencer (i.e. per 10 SIMDs)... 128, 256, something like that, I forget. This is saying up to 10 per SIMD, or up to 40 per CU. Another way to look at that is that each micro sequencer tracks 40 program counters - if you use too many registers, just like now, the number you actually have state for can be lower.

                                                                           

                                                                           

                                                                           

                                                                           

                                                                          Lee, I am able to efficiently load a Cayman compute unit with just 2 ALU-intensive wavefronts. Would that be possible in the new architecture? Only if the new compute unit is able to execute a single wavefront on several 16-wide SIMD blocks at the same time and the ALU and register access latency is 2 cycles. I doubt it. My guess is that it would require 4 or 8 ALU-intensive wavefronts to efficiently load a single compute unit in the new architecture (by the way, this is similar to NVidia's 6 wavefronts).


                                                                          Right. But remember that when you load a Cayman unit with 2 wavefronts, you are giving each wavefront 4 instructions - so in that same time with this architecture you can issue 4 arbitrary instructions. And you'd never reach peak that way, because every thread switch would leave a 40-cycle bubble in the pipeline. You'd need at least a third to cover those gaps. So yes, to fully occupy the machine, assuming the same interleaving as Cayman (which I haven't asked anyone about, so it may or may not be the case), you would need 4x the number of wavefronts to keep it busy - but of course the arithmetic density in terms of time would go up as you spread the instructions out.

                                                                          GPUs are throughput architectures. Over 24 cores Cayman tends to need a couple of hundred threads to keep it busy - you can imagine needing more with this design, but in either case you're getting no efficiency if you run single threaded scalar code anyway so the same rough programming rules apply. Think of it as nothing but a bonus.

                                                                            • Future HW and SDK
                                                                              maximmoroz

                                                                              Lee, I see no problem with new architecture targeting large tasks :)

                                                                              • Future HW and SDK
                                                                                eduardoschardong

                                                                                 

                                                                                Originally posted by: LeeHowes *Up to 10*. The current design allows up to some high number per macro sequencer (i.e. per 10 SIMDs)... 128, 256, something like that, I forget. This is saying up to 10 per SIMD, or up to 40 per CU. Another way to look at that is that each micro sequencer tracks 40 program counters - if you use too many registers, just like now, the number you actually have state for can be lower.


                                                                                Thank you for the response. Once more: what is the minimum number of wavefronts needed to fill up the compute resources on the new chip (2 in the case of Cayman)? Or, asking another way, what's the latency of each instruction in cycles (8 in the case of Cayman)?

                                                                                  • Future HW and SDK
                                                                                    dravisher

                                                                                    As I understood it from comments made by the guy who presented the GCN session (the first one in the parallel sessions, not the keynote speaker), GCN will require four wavefronts per CU to keep it fully occupied (not considering memory latencies, of course). The question I asked was how many more work-items I would need to feed the CU with to keep it fully occupied, and the answer was that it doubled from Cayman, so I don't think I misunderstood, but another confirmation here would be nice.
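                                                                                    The figures quoted in this exchange fit a simple latency / issue-rate model. The 8-cycle, 2-wavefront Cayman numbers come from the posts above; the doubled-latency case is pure speculation in line with the answers here, not a confirmed GCN figure:

                                                                                    ```python
                                                                                    # If each instruction of a wavefront occupies the SIMD for
                                                                                    # `issue` cycles and its result is usable after `latency`
                                                                                    # cycles, roughly latency / issue wavefronts must be
                                                                                    # interleaved to keep the ALUs busy (a simplified model).

                                                                                    def min_wavefronts(latency_cycles, issue_cycles=4):
                                                                                        return -(-latency_cycles // issue_cycles)  # ceiling division

                                                                                    print(min_wavefronts(8))   # 2 - the Cayman case quoted above
                                                                                    print(min_wavefronts(16))  # 4 - if latency doubled, as speculated
                                                                                    ```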

                                                                                    What I'm still wondering though, is how this affects global memory latencies? Basically my question is: If we feed a Cayman CU and a GCN CU with four wavefronts, will the GCN be more strangled by global memory latencies than Cayman? With Cayman only a single wavefront is actually executing at any one time, so it does have others to switch to when waiting for global memory. With GCN all four wavefronts are actually executing at the same time, and so there is nothing to switch to (other than within the wavefronts). Would this lead to us needing more wavefronts per GCN CU to hide global memory latencies than we do on Cayman? I find this interesting since needing more wavefronts per CU in practice increases pressure on both LDS and registers. The LDS has doubled so that's fine, but the registers have stayed the same size per CU.
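                                                                                    For what it's worth, the arithmetic behind those wavefront counts can be sketched from the numbers quoted in this thread (4-cycle issue per wavefront on a 16-wide SIMD, ~8-cycle ALU latency on Cayman, four SIMDs per GCN CU). These figures are assumptions taken from the discussion, not official AMD specs:

```python
# Back-of-the-envelope occupancy model based on numbers quoted in this
# thread. All latency/width figures here are assumptions from the
# discussion, not official specifications.

WAVEFRONT_SIZE = 64   # work-items per wavefront
ISSUE_CYCLES = 4      # a 64-wide wavefront issues over 4 cycles on a 16-wide SIMD

def min_wavefronts_cayman(alu_latency=8):
    # One VLIW SIMD per CU: need enough wavefronts in flight to cover
    # the ALU instruction latency.
    return alu_latency // ISSUE_CYCLES

def min_wavefronts_gcn(simds_per_cu=4):
    # Four SIMDs per CU: each needs at least one wavefront to issue to.
    return simds_per_cu * 1

cayman = min_wavefronts_cayman()   # 2 wavefronts
gcn = min_wavefronts_gcn()         # 4 wavefronts

print(cayman * WAVEFRONT_SIZE, "work-items to fill a Cayman CU")  # 128
print(gcn * WAVEFRONT_SIZE, "work-items to fill a GCN CU")        # 256
```

which matches the "it doubled from Cayman" answer: 128 vs 256 work-items per CU, before considering memory latency hiding.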

                                                                                    It would be very interesting if someone from AMD could clear this up, as it matters a great deal when designing kernels how many registers I can use without being totally screwed by global memory latencies :)

                                                                                    Edit: BTW, was the move away from VLIW4 generally known before the GCN parallel session? It was actually mentioned in an earlier parallel session on the JIT compiler (a session with far fewer attendees). It wasn't given much attention, just an "oh, by the way, the next architecture is no longer VLIW". My jaw literally dropped when I saw that slide :-P.

                                                                                      • Future HW and SDK
                                                                                        bubu

                                                                                        So you're finally killing the VLIW and SIMD approach and adopting a scalar SMT architecture?

                                                                                        • Future HW and SDK
                                                                                          Jawed

                                                                                           

                                                                                          Originally posted by: dravisher: What I'm still wondering though, is how this affects global memory latencies? Basically my question is: If we feed a Cayman CU and a GCN CU with four wavefronts, will the GCN be more strangled by global memory latencies than Cayman? With Cayman only a single wavefront is actually executing at any one time, so it does have others to switch to when waiting for global memory. With GCN all four wavefronts are actually executing at the same time, and so there is nothing to switch to (other than within the wavefronts). Would this lead to us needing more wavefronts per GCN CU to hide global memory latencies than we do on Cayman? I find this interesting since needing more wavefronts per CU in practice increases pressure on both LDS and registers. The LDS has doubled so that's fine, but the registers have stayed the same size per CU.


                                                                                          GCN will no longer waste registers like the VLIW chips do. Register allocation on the current chips is terrible, hence all the complaints about register spill.

                                                                                          So GPRs will prove to be less of a constraint on the number of hardware threads per SIMD as the compiler won't be so profligate (fingers-crossed).

                                                                                          Of course if your algorithm wants to use a small number of hardware threads per SIMD due to a large workgroup size or large local memory allocation per work item, then you're stuck.

                                                                                      • Future HW and SDK
                                                                                        settle

                                                                                         

                                                                                        Originally posted by: LeeHowes Might, however you have to remember that if you take a VLIW-5 packet and flatten it you get 5 issue slots in time instead of space. That's 5 (instruction) cycles worth of latency hiding :)

                                                                                         

                                                                                         

                                                                                        In current AMD GPUs each SIMD unit has 4 ALUs (plus possibly 1 SFU depending on the model).  I still can't understand how work-items, vector types, etc. get mapped to the ALUs in AMD GPUs (and CPUs).

                                                                                         

                                                                                        1. Does AMD APP SDK perform implicit vectorization for the GPU?  How about the CPU?  If not, any plans of providing it in the near future?
                                                                                        2. How are VLIW-4 (or VLIW-5) packets formed, from 4 independent operations within a single work-item or 4 independent operations among 4 contiguous work-items?  What happens in a kernel that doesn't have 4 independent operations but only has one operation like an fma or mad in saxpy with scalar float--one float saxpy per work-item?  Will only 1/4 of ALUs be utilized?
                                                                                        3. How are current VLIW-4 packets executed within a SIMD, using all 4 ALUs at once (issue slots in space), or using 1 of the 4 ALUs over several cycles (issue slots in time)?  Or do I have that reversed?

                                                                                         

                                                                                        I guess I'm looking for a simple and clear statement (I do scientific computing but don't have a formal CS background) from AMD similar to the following from "Writing Optimal OpenCL Code with Intel OpenCL SDK" in section 2.5 Benefiting from Implicit Vectorization:

                                                                                        "Vectorization module transforms scalar operations on adjacent work-items into an
                                                                                        equivalent vector operation. When vector operations already exist in the kernel source
                                                                                        code, they are scalarized (broken down into component operations) and re-vectored."

                                                                                         

                                                                                        Thanks for your help clarifying these issues for me.

                                                                                          • Future HW and SDK
                                                                                            himanshu.gautam

                                                                                            Question: Does AMD APP SDK perform implicit vectorization for the GPU?  How about the CPU?  If not, any plans of providing it in the near future?

                                                                                            Answer: The AMD APP SDK packs 4/5 independent instructions into a VLIW4/VLIW5 bundle if it is able to find them. On the CPU, vectorization is done similarly in that situation.

                                                                                             

                                                                                            Question: How are VLIW-4 (or VLIW-5) packets formed, from 4 independent operations within a single work-item or 4 independent operations among 4 contiguous work-items?

                                                                                            Answer: 4 independent instructions within a work-item.

                                                                                             Question: What happens in a kernel that doesn't have 4 independent operations but only has one operation like an fma or mad in saxpy with scalar float--one float saxpy per work-item?  Will only 1/4 of ALUs be utilized?

                                                                                            Answer: Yes.

                                                                                            Question: How are current VLIW-4 packets executed within a SIMD, using all 4 ALUs at once (issue slots in space), or using 1 of the 4 ALUs over several cycles (issue slots in time)?  Or do I have that reversed?

                                                                                            Answer: Instructions inside VLIW4/5 packets are executed simultaneously on a SIMD. VLIW packets cannot be created from multiple work-items; all instructions in a VLIW packet must come from the same work-item.
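                                                                                            To make those packing rules concrete, here is a toy greedy scheduler (instruction names and the dependency format are invented for illustration): four independent operations in one work-item fill a VLIW4 packet, while a dependent chain like scalar saxpy gets one operation per packet, so only 1/4 of the ALUs do useful work.

```python
# Toy model of VLIW4 packing from ONE work-item's instruction stream,
# per the answers above. Purely illustrative; assumes an acyclic
# dependency graph.

def pack_vliw(instrs, deps, width=4):
    """Greedy list scheduling: each packet holds up to `width`
    instructions whose dependencies were all issued in earlier packets."""
    packets, done = [], set()
    remaining = list(instrs)
    while remaining:
        packet = []
        for i in list(remaining):
            if len(packet) == width:
                break
            if deps.get(i, set()) <= done:  # all deps already issued
                packet.append(i)
        for i in packet:
            remaining.remove(i)
        done |= set(packet)
        packets.append(packet)
    return packets

# Four independent MADs in one work-item: one full packet, 4/4 slots busy.
independent = pack_vliw(["mad0", "mad1", "mad2", "mad3"], {})
# Scalar saxpy (y = a*x + y) is a dependent chain: one op per packet.
chain = pack_vliw(["mul", "add"], {"add": {"mul"}})

print(independent)  # [['mad0', 'mad1', 'mad2', 'mad3']]
print(chain)        # [['mul'], ['add']]
```

This is the sense in which a scalar-per-work-item saxpy kernel uses only 1/4 of a VLIW4 SIMD's issue slots.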

                                                                                             

                                                                                      • Future HW and SDK
                                                                                        maximmoroz

                                                                                         

                                                                                        Originally posted by: LeeHowes Might, however you have to remember that if you take a VLIW-5 packet and flatten it you get 5 issue slots in time instead of space. That's 5 (instruction) cycles worth of latency hiding :)

                                                                                        The architecture described in the talks had four 16-wide SIMD units per CU. It issues 4 waves over four cycles per CU - that's the same number of instructions as Cayman but laid out in time rather than space for each vector instruction.

                                                                                        Cayman has a 16 wide SIMD unit. The discussed architecture has a 16 wide SIMD unit. I'm not sure where the confusion is coming from?

                                                                                        Lee, I am able to efficiently load a Cayman compute unit with just 2 ALU-intensive wavefronts. Would that be possible on the new architecture? Only if the new compute unit can execute a single wavefront across several 16-wide SIMD blocks at the same time and the ALU and register access latency is 2 cycles, which I doubt. My guess is that it would take 4 or 8 ALU-intensive wavefronts to efficiently load a single compute unit on the new architecture (which, by the way, is similar to NVIDIA's 6 wavefronts).

                                                                                        Well, it is a problem only when there is a small number of wavefronts, that is, when the task is relatively small.

                                                            • Future HW and SDK
                                                              Meteorhead

                                                              I would be very much interested in what the DP throughput of this architecture is. It sometimes crosses my mind... "Maybe on the new 28nm process somebody pulls off a native 64-bit ALU."

                                                              Or will it link 2 processors on the same SIMD to perform a DP operation, similar to Cayman? Will DP performance yet again be 1/4 of SP, or 1/2?

                                                                • Future HW and SDK
                                                                  dravisher

                                                                  Well, the statement was that DP (double precision) performance would be 1/2, 1/4 or 1/16 depending on the product (and all GCN products will have DP support). It wasn't entirely clear to me whether they meant that DP would be a mix of 1/2 and 1/4 (like today), or 1/2 on some products and 1/4 on others. However, AnandTech's article states 1/2, but of course they could have misunderstood; I don't know.

                                                                    • Future HW and SDK
                                                                      Meteorhead

                                                                       

                                                                      Originally posted by: dravisher Well, the statement was that DP (double precision) performance would be 1/2, 1/4 or 1/16 depending on the product (and all GCN products will have DP support). It wasn't entirely clear to me whether they meant that DP would be a mix of 1/2 and 1/4 (like today), or 1/2 on some products and 1/4 on others. However, AnandTech's article states 1/2, but of course they could have misunderstood; I don't know.

                                                                       

                                                                      What do you mean by "mix of 1/2 and 1/4 (like today)"? How is it a mix of these today? As far as I know, on VLIW4 the processors link in pairs to perform a DP operation, and since in linked mode they cannot perform FMAD (by which GFLOPS is measured), performance is halved again, for a total of 1/4 = 1/2 (linking) x 1/2 (no FMAD). But it is not a mix of 1/2 and 1/4. Cayman has 1/4, period.

                                                                      1/2 on new architecture would ROCK, but I would be curious how it is achieved. :)

                                                                        • Future HW and SDK
                                                                          dravisher

                                                                          Meteorhead: There's some confusion on this point, but see for example table 4.14 in the AMD APP OpenCL Programming Guide 1.2d. For Cypress (but it basically stays the same for Cayman except we have one less unit from what I know) we have the following capabilities per processing element per clock (DP in parentheses):

                                                                          FMA: 4 (1)

                                                                          MAD: 5 (1)

                                                                          ADD: 5 (2)

                                                                          MUL: 5 (1)

                                                                          So the DP performance for Cypress is 1/5 for MAD and MUL, 2/5 for ADD. For Cayman the equivalent numbers are 1/4 and 1/2, QED :p
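                                                                          Spelling out the arithmetic (taking the table above at face value; the Cayman column is my extrapolation of simply dropping the fifth slot, as described above):

```python
# Ops per processing element per clock, SP and DP, as quoted above from
# table 4.14 of the AMD APP OpenCL Programming Guide for Cypress; the
# Cayman numbers are an extrapolation (one fewer slot), not from the table.
from fractions import Fraction

cypress = {"FMA": (4, 1), "MAD": (5, 1), "ADD": (5, 2), "MUL": (5, 1)}
cayman  = {"FMA": (4, 1), "MAD": (4, 1), "ADD": (4, 2), "MUL": (4, 1)}

def dp_ratio(table, op):
    sp, dp = table[op]
    return Fraction(dp, sp)   # DP throughput as a fraction of SP

print(dp_ratio(cypress, "MAD"))  # 1/5
print(dp_ratio(cypress, "ADD"))  # 2/5
print(dp_ratio(cayman, "MAD"))   # 1/4
print(dp_ratio(cayman, "ADD"))   # 1/2
```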

                                                                          • Future HW and SDK
                                                                            ED1980

                                                                            As I understand it, the professional version (FirePro) gets 1/2, the high-end gaming version 1/4, and the others 1/16.

                                                                            The restriction is likely to be artificial, as with NVIDIA (made in the driver software)?

                                                                            • Future HW and SDK
                                                                              nou

                                                                              IMHO high end chip like Cayman will have 1/2 in DP and mid-range will have 1/4 and low-end only 1/16.

                                                                                • Future HW and SDK
                                                                                  Meteorhead

                                                                                  I highly doubt AMD wishes to insert restrictions into their HW similar to NV, for the following reason: AMD does not follow a monolithic chip design (although GF116 is a viable chip). For AMD to perform in the top gaming and HPC segments, they have to create dual-GPU solutions, and up until now there have been no dual-GPU professional cards; only gaming cards are dual-GPU. FirePro is optimized for CAD programs, which are not optimized for multi-GPU use, so most likely there will be no multi-GPU FirePros in the future either.

                                                                                  If they were to insert restrictions into gaming HW for the sole reason of forcing people to buy FirePros, they would cut themselves off from the high-end HPC segment completely.

                                                                                  My guess goes with nou: how the chips perform in DP will be class-dependent.

                                                                                    • Future HW and SDK
                                                                                      laobrasuca

                                                                                       

                                                                                      Originally posted by: Meteorhead: FirePro is optimized for CAD programs, and they are not optimized for multi-GPU applications, so most likely there will be no multi-GPU FirePros in the future also.


                                                                                      unless they create a new line-up to compete directly with Teslas. Since AMD wants to make a name in GPGPU, it would not be so surprising. NVIDIA maybe hasn't had all the success they expected on the software side of CUDA (pushing every big software company to make plugins using CUDA), but they certainly sell lots of Teslas with the new supercomputers out there. I don't know how many dollars that represents, but I can guess that AMD would like to compete in this segment too.

                                                                                      All this makes me think about some people who said things like: "VLIW is the strength of AMD, it will never disappear". Well, it seems not... I'm really interested in this new architecture and how it will improve things on the GPGPU side. I only wonder whether it will be inside the HD8000 or HD9000 series (or maybe another name, why not!)

                                                                                        • Future HW and SDK
                                                                                          ryta1203

                                                                                           

                                                                                          All this makes me think about some people who said things like: "VLIW is the strength of AMD, it will never disappear". Well, it seems not...

                                                                                          I don't think that's necessarily the case; what it does mean is that AMD feels there is a market they can better compete in by moving to this new architecture (i.e. being more similar to NVIDIA). Like I said in another thread, I'd be really surprised if the 1st generation of these new cards can compete with the previous generation of VLIW cards on algorithms like MM (matrix multiplication), for example; my guess is that the peak is going to go down, and the most optimized MM is getting over 90% of peak...

                                                                                            • Future HW and SDK
                                                                                              bubu

                                                                                              APUs are very interesting for the HPC world: they are small enough to fit in a 1U rack and they consume low power. However, without DP support the product has a big handicap.

                                                                                              I hope AMD could make a Fusion APU version with full DP support soon ( Opteron APU? )

                                                                                              • Future HW and SDK
                                                                                                laobrasuca

                                                                                                I'd be really surprised if the 1st generation of these new cards can compete with a generation back of the VLIW cards when it comes to algorithms like MM

                                                                                                 

                                                                                                The problem is that MM is the perfectly parallel use case, while most of the algorithms we all use today are very, very far away from this perfect-fit case. Having an architecture that is on average better than the current one will make it sell better. How much is an architecture worth whose peak performance is the fastest but can only be achieved in a very small number of cases? Sure, games are one of those cases, for now. But even there shaders have become more and more complex, and the upcoming compute pipeline of OpenGL will make shaders even more flexible, compute-friendly, closer to OpenCL in some ways. AMD's architects have seen this coming for some time now. It's time to move on. It would surprise me if the new cards were slower than the current ones for games. The new architecture will be forged on a smaller process, so it will give us more fps; AMD's marketing guys will sell it as the fastest architecture ever and gamers will be happy. Plus, it will be faster for GPGPU, compilers will fit it better, and OpenCL will get closer and closer to CUDA. Better yet, the same kernel will show a smaller performance difference between AMD and NVIDIA than today, making code development easier and more general. And on top of that, we will have a price war in almost all segments. What else could people ask for!

                                                                                          • Future HW and SDK
                                                                                            moozoo

                                                                                            I hate the whole "DP is not important for consumer uses" argument.

                                                                                            What if Intel/AMD had taken this to heart and chopped the 80-bit x87 FPU down to 32 bits to make a consumer-level CPU...

                                                                                            The fact is that most software uses the x87 instructions to perform all calculations and only casts back to double for storage.

                                                                                            Excel uses doubles; would people be happy if Microsoft released a consumer version that only used single precision?

                                                                                            All numbers in JavaScript are doubles. Why so, if single is all consumers need?

                                                                                            The fact is that, other than multimedia, games and video compression, every other computation the average PC user does is in double precision.

                                                                                            If OpenCL wants to move out of these areas and into general computing, then double precision is a requirement, not an optional add-on.

                                                                                             

                                                                                    • Future HW and SDK
                                                                                      MicahVillmow
                                                                                      moozoo,
                                                                                      We have had double precision on our high-end consumer cards for the last 4 generations. The problem is not whether it is important for consumer use or not, but the hardware size/cost trade-off. A very small chip gets less double precision (or none at all) compared to the larger chips. A low-end chip with the same single-precision performance plus double precision would cost more, use more power and produce more heat, for double performance that isn't much better than the CPU's.

                                                                                      Most of your examples are software, where doubles can be handled whether the chip supports them or not. The same can be said of OpenCL: someone could write a double-precision library. A better comparison is to extra features like Hyper-Threading or trusted platform modules, which only exist on certain high-end/enthusiast/server parts but not on the 'consumer' parts.

                                                                                      I'm not saying we shouldn't go down that path, as it would make my life easier to be able to develop double-precision code on a laptop, but we do not view DP as being for professional/HPC use only.
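                                                                                      The point that doubles can be emulated in software is real: extended precision can be built from narrower types with error-free transformations. A minimal sketch, using Knuth's two-sum (here Python's native doubles stand in for the narrower hardware type; a GPU library would do the same trick with floats):

```python
# Error-free transformation: split a + b into a rounded sum and the
# exact rounding error, the building block of double-double libraries.
# Python floats are IEEE doubles; on a GPU the same algorithm would be
# applied to single-precision floats to emulate doubles.

def two_sum(a, b):
    s = a + b          # rounded sum
    bb = s - a         # the part of b that made it into s
    err = (a - (s - bb)) + (b - bb)  # exact rounding error
    return s, err

s, err = two_sum(1.0, 1e-20)
print(s)    # 1.0       (1e-20 is lost to rounding...)
print(err)  # 1e-20     (...but recovered exactly in the error term)
```

The error term would normally be carried along as the low half of an extended-precision value, so nothing is actually lost.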


                                                                                        • Future HW and SDK
                                                                                          moozoo

                                                                                           

                                                                                          Originally posted by: MicahVillmow ...but we do not view DP as being for professional/HPC use only.


                                                                                          Thanks Micah.

                                                                                          I guess I'm concerned that high DP performance will be reserved for HPC products, as with NVIDIA. There is a huge price gap between the highest NVIDIA consumer graphics card and the cheapest Tesla.

                                                                                          I fully accept that you (AMD) should try to differentiate your HPC parts. But I feel this should be on the basis of reliability, ECC, thermal design, suitability for packed blade use, fast and detailed support, and additional driver features (InfiniBand performance), etc.

                                                                                           

                                                                                            • Future HW and SDK
                                                                                              Meteorhead

                                                                                              I agree as well. Making a dual-GPU, double-size ECC VRAM, strictly front-to-back-cooled HPC card, with a proper driver (Xorg-independent), fit for close packing (meaning the cooler is 4 mm thinner than a double-width cooling solution), would indeed be welcome and worth the extra money.

                                                                                          • Re: Future HW and SDK
                                                                                            Meteorhead

                                                                                            I cannot seem to find the wishlist topic from the old forum for new SDK features, so let me post one here (and if the old topic exists, feel free to move this post there).

                                                                                             

                                                                                            I have a feature request that I think would be most useful to people. NV's UVA is a really awesome feature; although it will most likely never make it into OpenCL (or will only do so a few years from now), I am not that much interested in it as such. However, it is a compelling feature of CUDA, and I was thinking about how something like it could be possible in OpenCL.

                                                                                             

                                                                                            First, a quick question: how exactly is clEnqueueCopyBuffer implemented? Does it utilize pinned RAM, or does it copy straight from device to device without CPU intervention? Because if the latter is the case, that is really decent and on a par with NV's GPUDirect.

                                                                                             

                                                                                            Secondly, since AMD is really moving towards Fusion (which sadly and ultimately is a sign of discrete graphics disappearing, because even a quad-socket Fusion rack server will never match the computing power of 4 dedicated graphics cards, because of the cooling), I was thinking about an alternate solution. AMD has recently released the Leo demo showcasing Partially Resident Textures, and that gave me the idea:

                                                                                             

                                                                                            Could partially resident buffers be implemented in OpenCL? Although the technology mainly targets read-only textures, it would really rock if this approach could be used for GPGPU. A server with 256 GB of RAM surpasses any dedicated VRAM available in a system and would allow really neat simulations to run, if the data streaming could be implemented efficiently. Since this technique is already used in OpenGL, which also builds on IL and ISA (AFAIK), not much would have to change under the hood (I would expect); only a different interface has to be implemented. This would be something similar to UVA, but it would allow much larger buffers. I know things get complicated when the GPUs not only read but also write this buffer, but cache coherency with the CPU already exists, so why couldn't this be done? Or is it really not possible to get better performance than using host pointers on devices?

                                                                                              • Re: Future HW and SDK
                                                                                                nou

                                                                                                Well, I think partially resident buffers can be implemented with the current API just fine. Imagine creating a HUGE buffer; from this buffer you create sub-buffers and use them in kernel calls. It is just a matter of runtime implementation. Just add something like clCreateSubImage().
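                                                                                                As a sketch of what that host-side tiling might look like: clCreateSubBuffer requires each sub-buffer origin to be a multiple of the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN (reported in bits; the 2048 below is just an assumed typical value). The Python below only computes the (origin, size) regions you would pass to it as CL_BUFFER_CREATE_TYPE_REGION descriptors:

```python
# Carve a huge buffer into aligned tiles, so a runtime (or the
# programmer) can bind only the tile a kernel actually needs.
# base_addr_align_bits=2048 is an assumed device value, not universal.

def tile_regions(buffer_bytes, tile_bytes, base_addr_align_bits=2048):
    """Yield (origin, size) pairs suitable for CL_BUFFER_CREATE_TYPE_REGION.
    tile_bytes must be a multiple of the alignment in bytes."""
    align = base_addr_align_bits // 8
    if tile_bytes % align:
        raise ValueError("tile size must be a multiple of %d bytes" % align)
    regions = []
    origin = 0
    while origin < buffer_bytes:
        size = min(tile_bytes, buffer_bytes - origin)
        regions.append((origin, size))
        origin += tile_bytes
    return regions

# A hypothetical 1 GiB buffer split into 64 MiB tiles:
regions = tile_regions(1 << 30, 64 << 20)
print(len(regions))  # 16
print(regions[0])    # (0, 67108864)
```

Each (origin, size) pair would then be fed to clCreateSubBuffer, and the resulting sub-buffer passed to clSetKernelArg in place of the full buffer.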

                                                                                                  • Re: Future HW and SDK
                                                                                                    Meteorhead

                                                                                                    It is not completely identical, because in this situation the programmer has to keep track of which part of the image is modified by which device. That is not possible when the location of the read is decided at runtime inside the kernel. Partially resident textures reduce VRAM to a cache: the implementation streams the needed parts of the texture into VRAM and hopes that in the next frame (iteration) the same side of the model (system) will be loaded on the given device, so that next time the data is already present in the cache. But if the orientation of the model were decided inside the shader (kernel), I really wouldn't want to pass this back to the host just to be able to load the appropriate sub-image into VRAM.

                                                                                                     

                                                                                                    This is where a vendor extension would come in handy. I hope I was clear. Aside from that, I will consider your idea, because it might be sufficient in my case, but it might turn out that it's just not flexible enough. (In my simulation I have moving borders inside the system, which would correspond to moving the borders of the sub-images around. If I can do this efficiently by not recreating the sub-images, but by doing clEnqueueCopyImageRect (or something like that) and copying just the updated parts of the image, then it will be OK. But I'll have to think about whether that works or not.)