9 Replies Latest reply on Jun 1, 2011 7:00 AM by houyunqing

    AMD hardware, OpenCL and CUDA

    houyunqing
      Questions from a CUDA developer

       

      I'm starting to look at AMD's hardware and I'm surprised by the GFLOPS numbers (for the 6970: 384*4*2*880/1024 = 2640 GFLOPS). Shouldn't the AMD cards be significantly faster than NVIDIA cards for arithmetic-intensive kernels, since the GTX 580 only reaches 1544 GFLOPS (using the same computation method as above)?

      Why is it that all the fastest super-computers in China, Japan and the States are all using NVIDIA's Teslas? Those Tesla cards are many many times more expensive than their AMD counterparts! Does it have anything to do with floating point IEEE-compliance?

      Apart from that, I have a few other questions. Thanks for any help in advance.

      1. NVIDIA's native ISA doesn't have a proper name (I call it the Fermi ISA for the current generation) because NVIDIA does not disclose much information about it. (They only provide a high-level assembly-like language, PTX, for CUDA developers to work with; I think it's much like the AMD IL.) Does AMD provide developers with comprehensive information about their native ISA?

      Also, NVIDIA provides developers with cuobjdump, which disassembles cubins. This program is the only source of information about their native ISA. Does AMD provide developers with a similar disassembler, and perhaps an assembler as well?

      2. What is the length of a native VLIW? Are the stream cores capable of a sustained 4/5 scalar operations per clock? I'm thinking that if the VLIWs are long enough, they might be very demanding on the instruction cache.

      3. How many immediate values can a native VLIW contain? In reality, do the stream cores often issue 4/5 instructions in parallel since many instructions may contain 8-bit, 16-bit or perhaps 32-bit immediate values?

      4. How much information does AMD provide regarding their hardware? Things like caching behaviour (replacement policy, cache-line size, associativity...), ld/st latency, arithmetic latency, memory channel width and so on?

      5. In general, I would also like to know the various differences between CUDA C and OpenCL C. I am aware of the basic OpenCL terminology.

      Again, thanks for your time and help!



        • AMD hardware, OpenCL and CUDA
          galmok

          Tesla cards have more memory, and it is ECC memory. Also, on Tesla, double-precision performance is half the single-precision performance; for normal consumer cards the ratio is 1:8 (artificially limited). Also, Tesla cards don't appear as VGA adapters and as such do not require a desktop or monitor output to be attached.

          It would be nice if AMD had a Tesla counterpart, but I guess they are waiting until their OpenCL implementation is more mature.

            • AMD hardware, OpenCL and CUDA
              laobrasuca

               

              Originally posted by: galmok It would be nice if AMD had a Tesla counterpart


              FireStream?

                • AMD hardware, OpenCL and CUDA
                  galmok

                   

                  Originally posted by: laobrasuca
                  Originally posted by: galmok It would be nice if AMD had a Tesla counterpart


                  FireStream?

                  True, there is the FireStream brand. I had forgotten about that, as there doesn't seem to be anyone in Denmark that sells them. :-/

                  The largest FireStream seems to have 4GB of memory; the Tesla has up to 6GB. Both quite large, but both too small for our use. :-P

                   

                • AMD hardware, OpenCL and CUDA
                  houyunqing

                  ECC is indeed a concern. Thanks for that quick reply.

                  Though it appears the 6970 has a rather impressive (on par with the Teslas) GFLOP number for double as well. Of course, that number could be at least halved if there are certain limitations on the VLIW and the multi-issuing of the stream cores.

                  The Tesla cards cost almost USD 4000 apiece. I think NVIDIA really needs some competition.

                    • AMD hardware, OpenCL and CUDA
                      himanshu.gautam

                      That is a lot of questions.

                      As for the concern about making efficient use of the VLIW, techniques like loop unrolling are suggested. You can look at the ISA of your kernel and how it is scheduled using either the Stream KernelAnalyzer (SKA) or the profiler, or by dumping the ISA. Some documents of interest, which can be easily found, are the AMD APP SDK OpenCL Programming Guide and the ISA documents for the GPUs.

                      Galmok has definitely pointed out some important facts, but work is being done on them.

                  • AMD hardware, OpenCL and CUDA
                    ED1980

                     

                    Originally posted by: houyunqing

                      I'm starting to look at AMD's hardware and I'm surprised by the GFLOP numbers (for the 6970: 384*4*2*880/1024=2640GFLOPS).

                     

                     

                      for the 6970: 384*4*2*0.88 GHz = 2703.36 GFLOPS... Where did the 1024 in your calculation come from?

                    • AMD hardware, OpenCL and CUDA
                      empty_knapsack

                      houyunqing,

                       

                      yes, in raw GFLOPS AMD cards are faster (and cheaper) than NVIDIA ones. But there are some "issues".

                       

                      1.  The AMD ISA is published (check out the documentation section). It's possible to disassemble kernels (written in IL or OpenCL and then compiled to ISA), but assembler support was dropped after the 4XXX family and there are no plans to reimplement it. Actually, there are plans to drop IL support as well (and I have no idea why AMD is doing that).

                      2.  One instruction takes 8 bytes (plus additional bytes for constants, if present), and up to 5 of them can be grouped into one VLIW. Yes, if you have enough independent calculations you can see 5 instructions issued per VLIW. Once your kernel size hits 48 KB you'll face performance issues because of instruction cache overflow.

                      3. There are no 8- or 16-bit immediates on AMD GPUs, only 32-bit ones. When you work with 8-bit values in memory, the access will in reality be transformed into several instructions involving 32-bit AND/OR.

                      4. Not much. There is only one guy from AMD who provides technical information about AMD GPUs on this forum. Unfortunately.

                      5. The main difference is that CUDA is far more mature and bug-free. Also, NVIDIA GPUs are much closer to CPUs (judging by the instruction set), while AMD GPUs are more limited and can be (very) unsuitable for some algorithms.

                       

                      All in all, for some tasks you'll find that AMD GPUs are very fast and really cheap, so NVIDIA GPUs can't compete with them at all. For other tasks (usually more complex algorithms) you'll face the fact that it's simply impossible to use AMD GPUs with the current AMD OpenCL implementation and its driver/SDK "issues".

                        • AMD hardware, OpenCL and CUDA
                          LeeHowes

                           

                          4. Not much. There is only one guy from AMD who provides technical information about AMD GPUs on this forum. Unfortunately.

                           

                          That, and the programming guide with a thorough performance chapter and the publication of the instruction set. 

                           

                          • AMD hardware, OpenCL and CUDA
                            houyunqing

                            Thanks everybody for your kind input!

                            Especially to empty_knapsack, thanks for that information. It's very helpful!

                            I now understand that the one thing the AMD cards need in order to run fast is ILP. NVIDIA's CC 2.0 cards don't rely on multi-issue and thus do not require ILP for full utilization of the cores.

                            I think I'll buy an AMD card at some point in time to gain a better understanding of my alternatives.

                            Thanks again!!