3 Replies Latest reply on May 30, 2011 9:51 AM by laobrasuca

    16 core Opteron

    dnorric

      Hi, how would a 16-core Opteron (8356) server compare to a GPU for OpenCL? What I mean is: what GPU setup would be as quick as a 16-core Opteron server, or vice versa?

      Cheers

      Damian

        • 16 core Opteron
          LeeHowes

          A rather open-ended question.

          16 Opteron cores at 2.5 GHz would be 320 GFLOPS single precision and 160 GFLOPS double precision peak throughput (off the top of my head, I may have missed something). The 6970 would be about 2700 and 700 respectively. So in theory that's a factor of about 4 at peak for double. For memory bandwidth, two sockets give something like 30 or 40 GB/s compared with 170 GB/s, so maybe a factor of 5 there. So somewhere around 5x performance for the GPU at peak compared with the CPU, assuming you use OpenCL vectors throughout and use memory efficiently.
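To make that back-of-envelope arithmetic easy to redo with other parts, here is a rough sketch. The FLOPs-per-cycle values (8 single, 4 double per core) are an assumption chosen to reproduce the 320/160 figures above, and 35 GB/s is simply the midpoint of the quoted 30 to 40 GB/s range; the function name is made up for illustration.

```python
# Back-of-envelope peak comparison using the rough figures quoted in the post.
# flops_per_cycle values are ASSUMPTIONS picked to match the 320/160 GFLOPS estimate.

def cpu_peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak: cores x clock (GHz) x FLOPs per core per cycle."""
    return cores * clock_ghz * flops_per_cycle

cpu_single = cpu_peak_gflops(16, 2.5, 8)  # 320.0 GFLOPS
cpu_double = cpu_peak_gflops(16, 2.5, 4)  # 160.0 GFLOPS

# GPU figures as quoted for the 6970 (approximate).
gpu_single, gpu_double = 2700.0, 700.0

# Memory bandwidth in GB/s; 35 is the midpoint of the quoted "30 or 40".
cpu_bw, gpu_bw = 35.0, 170.0

ratios = {
    "single": gpu_single / cpu_single,
    "double": gpu_double / cpu_double,
    "bandwidth": gpu_bw / cpu_bw,
}

for name, ratio in ratios.items():
    print(f"{name}: GPU/CPU peak ratio ~ {ratio:.1f}x")
```

These are peak numbers only; real kernels rarely reach them on either device.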

          If you access memory randomly but consistently hit in the CPU cache, the CPU could easily win. Include the transfer time to the GPU in the calculation and the difference shrinks further.
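The effect of transfer time can be sketched with a very simple model. Everything here is hypothetical: the function name, the 1 s CPU run time, the 2 GB of data moved and the 6 GB/s host-to-device link speed are illustrative inputs, not measurements, and the model assumes copies do not overlap with compute.

```python
def effective_speedup(cpu_time_s, kernel_speedup, bytes_moved, link_gb_per_s):
    """GPU-over-CPU speedup once host<->device copies are counted.

    Assumes transfers and kernel execution do not overlap (a simplification).
    """
    gpu_kernel_s = cpu_time_s / kernel_speedup
    transfer_s = bytes_moved / (link_gb_per_s * 1e9)
    return cpu_time_s / (gpu_kernel_s + transfer_s)

# Hypothetical inputs: a 1 s CPU job, a kernel that is 5x faster on the GPU,
# 2 GB copied in total over a link sustaining about 6 GB/s.
speedup = effective_speedup(1.0, 5.0, 2e9, 6.0)
print(f"effective speedup: {speedup:.3f}x")  # well below the 5x kernel-only figure
```

With these made-up inputs the 5x kernel advantage drops to a little under 2x, which is why keeping data resident on the GPU across kernels matters.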

          On the other hand, adding the GPU to a 16-core Opteron server is extra compute power in the same box. You could add multiple GPUs too. If that's 16 Opteron cores in a *single* socket, then the clock speed and memory bandwidth would drop.

          The reality is it depends on the code you're trying to run and how good you are at vectorising your algorithm.

            • 16 core Opteron
              dnorric

              Thanks. I think I might hold out until June, when Bulldozer is meant to be released, and for the 7000 series at the end of the year.

              Cheers

                • 16 core Opteron
                  laobrasuca

                  But remember that the 16 cores are actually 16 integer cores, not 16 floating-point cores. Each pair of integer cores sits in the same module, which has only one floating-point scheduler, so it will depend on how your code runs on it. Still, the Bulldozer architecture seems far more suitable for OpenCL than the current one (not a big surprise), with the L1, L2 and L3 caches arranged much like private, local and global memory. Actually, I wonder how the L3 cache and system memory will be exposed to the programmer: would the L3 cache be the constant memory and system memory the global memory?

                  [Image: a Bulldozer module]

                  [Image: the Bulldozer architecture]