
    Best 8GB AMD GPU for GPGPU OpenCL computational geometry research?

    cesss

      Hi,

       

      I'm doing a GPGPU research project with OpenCL, and I need to purchase a current GPU with the best available OpenCL performance. Getting the highest possible OpenCL performance is the top priority. Another requirement is having 8GB of VRAM.

       

      My research is about computational geometry. It runs fine in FP32, but what I really need is some guarantee that, if my algorithm runs slower than desired, there isn't any other AMD board on the market that would run it faster.

       

      Looking at the possibilities, I see two options which have 8GB:

       

      -The FirePro W8100 (peaks: 4.2 TFLOPS in FP32, 2.1 TFLOPS in FP64).

      -The Radeon R9 390X (peaks: 5.9 TFLOPS in FP32, 1/8th of that in FP64).

       

      If I didn't have the 8GB requirement, I see that the Radeon Fury X achieves an 8.6 TFLOPS peak in FP32, but with the limitation of going down from 8GB to 4GB of VRAM.

       

      I believe I must be missing something here. Does the Radeon Fury X beat all professional FirePro boards in FP32, including even the W9100 and the S9170? Of course, the question is whether I need FP64 performance or not. I had experience with NVIDIA Tesla boards years ago, which offered FP64 performance very close to FP32, so I usually chose FP64 because you almost got it for free on such boards. But I don't really need FP64: my first requirement is very high OpenCL performance rather than good FP64 numbers.

       

      Thanks a lot for any ideas/suggestions,

       

      cesss

        • Re: Best 8GB AMD GPU for GPGPU OpenCL computational geometry research?
          bsp2020

          Can you share a bit more about the nature of your project?

          Is the 8GB requirement a hard requirement? Could you explain the nature of your memory needs? Do you need ECC?

          Also, which OS/development tools do you plan to use?

           

          You are not missing anything. Typically, if you do not need FP64, ECC, or large memory capacity, you can get by using the consumer version of a GPU for compute and get much better bang for the buck. It is the same for NVIDIA products as well: you could have bought a GeForce instead of a Tesla, paid much less, and gotten just as good FP32 performance if FP64/ECC/large memory capacity did not matter.
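          If you're unsure what a given board actually exposes, a quick clGetDeviceInfo pass will tell you whether FP64 and ECC are available and roughly what the FP32 peak works out to. Here is a minimal sketch (it assumes the first GPU device of the first platform, and uses the usual GCN figures of 64 lanes per CU and 2 FLOPs per lane per clock for the peak estimate):

          /* Sketch: query FP64/ECC support, memory size, and estimate the FP32 peak. */
          #include <stdio.h>
          #include <CL/cl.h>

          int main(void) {
              cl_platform_id platform;
              cl_device_id device;
              clGetPlatformIDs(1, &platform, NULL);
              clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

              cl_uint cus = 0, mhz = 0;
              cl_bool ecc = CL_FALSE;
              cl_ulong mem = 0;
              cl_device_fp_config fp64 = 0;
              clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);
              clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(mhz), &mhz, NULL);
              clGetDeviceInfo(device, CL_DEVICE_ERROR_CORRECTION_SUPPORT, sizeof(ecc), &ecc, NULL);
              clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(mem), &mem, NULL);
              clGetDeviceInfo(device, CL_DEVICE_DOUBLE_FP_CONFIG, sizeof(fp64), &fp64, NULL);

              /* Peak FP32 estimate assumes GCN: 64 lanes per CU, 2 FLOPs per clock (FMA). */
              double tflops = (double)cus * 64.0 * 2.0 * (double)mhz / 1e6;
              printf("CUs: %u  Clock: %u MHz  Est. FP32 peak: %.1f TFLOPS\n", cus, mhz, tflops);
              printf("Global memory: %llu MB  ECC: %s  FP64: %s\n",
                     (unsigned long long)(mem >> 20),
                     ecc ? "yes" : "no", fp64 ? "yes" : "no");
              return 0;
          }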

            • Re: Best 8GB AMD GPU for GPGPU OpenCL computational geometry research?
              cesss

              Thanks a lot for the explanation. The 8GB requirement isn't for the development of the algorithm I'm working on right now, but for another project that involves very large satellite imagery, and I wanted to use the same GPU for both projects. However, considering that my #1 priority at the moment is the confidence that the OpenCL performance I'm getting is the fastest I can expect from a single AMD device, my choice should be the R9 Fury X, and I should leave the 8GB project for another purchase. Is this correct, or am I overlooking some TFLOPS monster under the table?

               

              I also looked at the R9 295x2, which many claim is still the card delivering the most TFLOPS at the moment, but looking at LuxMark I found that single-device Fiji GPUs get a higher score than the 295x2 using its 2 devices. Also, if I'm reading the docs correctly, the 295x2's 8GB is seen as 4GB by applications, so it wouldn't give me more memory than the Fury X.

              • Re: Best 8GB AMD GPU for GPGPU OpenCL computational geometry research?
                cesss

                By the way, I forgot to reply to your question about the OS and tools. My system of choice is OS X, but unfortunately it doesn't have an OpenCL profiler, and I really need such a tool for this research because I need to monitor what's happening inside the GPU, which I've never been able to do on OS X. So I'm moving to Linux for this research, both because I need an OpenCL profiler and because I can get more TFLOPS than on OS X (even if I got a Mac Pro with 2x D700).

                 

                If the R9 Fury X has any issues on Linux, or if the AMD OpenCL profiler isn't 100% functional for this GPU on Linux, please tell me, because this is the configuration I'm considering right now.

                 

                Thanks!

                 

                cesss

                  • Re: Best 8GB AMD GPU for GPGPU OpenCL computational geometry research?
                    realhet

                    x2 cards: the two chips have separate memory; they can't access each other's memory directly. An x2 is like buying two cards, just with less power consumption.
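                    You can see this directly from OpenCL: an x2 board enumerates as two devices, each reporting only its own half of the memory. A minimal sketch (first platform assumed):

                    /* Sketch: list the GPU devices; a 295x2 shows up as two devices,
                       each with its own ~4GB of CL_DEVICE_GLOBAL_MEM_SIZE. */
                    #include <stdio.h>
                    #include <CL/cl.h>

                    int main(void) {
                        cl_platform_id platform;
                        cl_device_id devices[8];
                        cl_uint ndev = 0;
                        clGetPlatformIDs(1, &platform, NULL);
                        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &ndev);
                        for (cl_uint i = 0; i < ndev; ++i) {
                            char name[256];
                            cl_ulong mem = 0;
                            clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
                            clGetDeviceInfo(devices[i], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(mem), &mem, NULL);
                            printf("Device %u: %s, %llu MB\n", i, name, (unsigned long long)(mem >> 20));
                        }
                        return 0;
                    }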

                     

                    290 vs. Fury: the Fury is a great card (at the moment it has the best FP32 performance and the best memory bandwidth too). It uses HBM technology, so it's more power efficient and has 2x more memory bandwidth than previous cards.

                     

                    Memory usage vs. math performance: if you issue far more memory operations than math, memory bandwidth will become the bottleneck and you'll end up not using much of the card's TFLOPS. On a pre-Fury card, I'd suggest reading 32 bits for every 32 ALU instructions. Yes, that's not a great ratio, but it's still a GPU, not a CPU. I can only guess that on the Fury this ratio becomes 1 DWord read : 16 ALU instructions.
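                    To make that ratio concrete, here's a toy OpenCL C kernel sketched around the idea: one 32-bit load feeding a few dozen ALU instructions, so the kernel stays compute-bound rather than bandwidth-bound. The polynomial loop is just a placeholder for real per-element work.

                    /* Toy kernel: one DWord read, ~32+ ALU ops, one DWord write. */
                    __kernel void high_intensity(__global const float *in, __global float *out)
                    {
                        size_t i = get_global_id(0);
                        float x = in[i];                  /* single 32-bit load */
                        float acc = 0.0f;
                        for (int k = 0; k < 16; ++k)      /* a few mads/adds per iteration */
                            acc = mad(acc, x, (float)k) + mad(x, x, acc);
                        out[i] = acc;                     /* single 32-bit store */
                    }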

                    But on a GPU it is wise to think about data compression: the data needs less space, and decompressing it uses more ALU, but those ALUs would be idle anyway if you weren't using any compression. Compression has been a trend in graphics too, starting long ago; the latest algorithm is Delta Color Compression. Which scheme is best always depends on your particular data and its precision requirements.
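                    As a trivial example of that trade-off (not DCC, just the general idea): a geometry kernel can keep its coordinates as 16-bit half floats in memory and expand them in registers. The loads shrink to half the bandwidth, and the conversion runs on ALU cycles that would otherwise sit idle in a bandwidth-bound kernel. Whether half precision is acceptable depends entirely on your data, as said above.

                    /* Sketch: halve memory traffic with half-precision storage;
                       vload_half converts to float on the fly. */
                    __kernel void point_lengths(__global const half *xs,
                                                __global const half *ys,
                                                __global float *len)
                    {
                        size_t i = get_global_id(0);
                        float x = vload_half(i, xs);   /* 16-bit load, expanded in registers */
                        float y = vload_half(i, ys);
                        len[i] = sqrt(x * x + y * y);  /* math still done in FP32 */
                    }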

                     

                    Streaming vs. all data in memory: you'd better split the work into 256MB (for example) blocks if you can. You can transfer a block at the same time as the GPU works on the previously transferred block. It might seem easy to upload a very large image in one go, but the GPU needs memory locality to work effectively. If you rearrange the algorithm and/or the data structures for better memory locality, you can get a combined TByte/sec of bandwidth from the caches of all the compute units.
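                    Here's a hedged sketch of that double-buffering pattern in OpenCL host code. It assumes an existing context, a kernel that takes the chunk buffer as its first argument and processes float data, and two separate in-order queues (one for transfers, one for compute); error checking and the remaining kernel arguments are left out.

                    /* Sketch: stream a large image in 256MB chunks, overlapping the
                       upload of chunk c+1 with the kernel working on chunk c. */
                    #include <CL/cl.h>

                    #define CHUNK_BYTES (256u << 20)   /* 256 MB blocks */

                    void stream_image(cl_context ctx, cl_command_queue xfer_q,
                                      cl_command_queue compute_q, cl_kernel kernel,
                                      const char *host_data, size_t total_bytes)
                    {
                        cl_mem buf[2];
                        cl_event done[2] = { NULL, NULL };    /* per-slot "kernel finished" */
                        buf[0] = clCreateBuffer(ctx, CL_MEM_READ_ONLY, CHUNK_BYTES, NULL, NULL);
                        buf[1] = clCreateBuffer(ctx, CL_MEM_READ_ONLY, CHUNK_BYTES, NULL, NULL);

                        size_t nchunks = (total_bytes + CHUNK_BYTES - 1) / CHUNK_BYTES;
                        for (size_t c = 0; c < nchunks; ++c) {
                            int slot = (int)(c & 1);
                            size_t offset = c * CHUNK_BYTES;
                            size_t bytes = (offset + CHUNK_BYTES <= total_bytes)
                                         ? CHUNK_BYTES : total_bytes - offset;

                            /* Don't overwrite a slot until the kernel reading it is done. */
                            cl_event uploaded;
                            clEnqueueWriteBuffer(xfer_q, buf[slot], CL_FALSE, 0, bytes,
                                                 host_data + offset,
                                                 done[slot] ? 1 : 0,
                                                 done[slot] ? &done[slot] : NULL, &uploaded);
                            if (done[slot]) clReleaseEvent(done[slot]);

                            /* The kernel waits only on its own upload, so the next
                               chunk's upload can proceed on xfer_q in parallel. */
                            size_t gsize = bytes / sizeof(float);   /* assumes float data */
                            clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[slot]);
                            clEnqueueNDRangeKernel(compute_q, kernel, 1, NULL, &gsize, NULL,
                                                   1, &uploaded, &done[slot]);
                            clReleaseEvent(uploaded);
                        }
                        clFinish(compute_q);
                        if (done[0]) clReleaseEvent(done[0]);
                        if (done[1]) clReleaseEvent(done[1]);
                        clReleaseMemObject(buf[0]);
                        clReleaseMemObject(buf[1]);
                    }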

                     

                    Which Fury model: not the Nano. It's there for the small form factor, and it prevents overheating by slowing itself down. Choose from the 2 big models: the one with liquid cooling or the one with regular large fans.