6 Replies Latest reply on Nov 29, 2013 12:57 AM by daxh

    Is the APU always faster than the GPU?

    daxh

      Hello to everybody.

       

      I am new to OpenCL and I've run into the following problem: all my kernels run up to 10 times faster on the APU. I had always expected the GPU to be faster.

      My platform is an HP Envy laptop running openSUSE 12.3, with an AMD A10-4600M APU at 2.3 GHz (3.2 GHz in Turbo mode) with 4 cores and an integrated AMD Radeon HD 7660G, 8 GB of RAM, and a discrete AMD Radeon HD 7670M GPU with 2048 MB.

       

      The size of the test array is 1024, and I am using

       

      krnlTest.getWorkGroupInfo< CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE > (device)

       

      to query the wavefront size. I got the following results:

       

      Preferred wg size multiple for Devastator is 64.
      Preferred wg size multiple for AMD A10-4600M APU with Radeon(tm) HD Graphics is 1.
      
      
      


      So I decided to split 1024 into at least 4 work groups (and 256 is a multiple of 64). But I got the following results:

       

      work_group_size = 256

       

      --> OpenCL test time is 0.6948ms DEVICE: Devastator

      --> OpenCL test time is 0.0802ms DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics


      work_group_size = 128


      --> OpenCL test time is 0.801ms DEVICE: Devastator

      --> OpenCL test time is 0.0912ms DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics


      work_group_size = 64

      --> OpenCL test time is 0.6846ms DEVICE: Devastator

      --> OpenCL test time is 0.0946ms DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics


      Also, I've just realized that there should probably be three OpenCL devices (the CPU, the integrated GPU and the discrete GPU). Am I missing something here? When I use aticonfig to change the device priority (between the power-saving GPU and the high-performance GPU), I always get just 2 OpenCL devices: one of them is always AMD A10-4600M APU with Radeon(tm) HD Graphics, but the other one may be called Devastator or Tahiti (or something like this).


      This is how I am measuring time:

      
      
      typedef std::chrono::steady_clock::time_point t_time_point;
      typedef std::chrono::duration<int,std::milli> t_millisecs;
      
      inline t_time_point getTimePoint()
      {
          return std::chrono::steady_clock::now();
      }
      
      inline unsigned int getTimeDifference_ms(t_time_point start)
      {
          t_time_point end = getTimePoint();
          t_millisecs duration( std::chrono::duration_cast<t_millisecs>(end-start));
          return duration.count();
      }
      
      // Performing tests
      t_time_point start = getTimePoint();
      for(int iTest=0; iTest<numberOfTests;iTest++)
      {
          // Creating buffers
          cl::Buffer clVectorA = cl::Buffer(context,
                  CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size * sizeof(cl_int),
                  pVectorA);
          cl::Buffer clVectorB = cl::Buffer(context,
                  CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size * sizeof(cl_int),
                  pVectorB);
          cl::Buffer clVectorC = cl::Buffer(context,
                  CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, size * sizeof(cl_int),
                  pVectorC);
      
          // Set kernel args
          SET_KRNL_ARG_4(krnlTest, clVectorA, clVectorB, clVectorC, size)
          krnlTest.setArg(4, size * sizeof(cl_int), NULL);
      
          // Launching
          const int global_size = size;
          const int local_size = size/16;
          queue.enqueueNDRangeKernel(
              krnlTest,
              cl::NullRange,
              cl::NDRange(global_size),
              cl::NDRange(local_size));
      
          // Finishing
          queue.enqueueReadBuffer(clVectorC, CL_FALSE, 0, size * sizeof(cl_int), pVectorC);
          queue.finish();
      
      }
      unsigned int ms = getTimeDifference_ms(start);
      
      std::cout << "\n--> OpenCL test time is " <<((double)ms/numberOfTests) << "ms"
                << "\t\t DEVICE: " << (std::string)(device.getInfo< CL_DEVICE_NAME >());
      
      
      

       

      This is the kernel:

       

      #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
      #pragma OPENCL EXTENSION cl_amd_printf : enable
      
      __kernel void test_kernel(
          __global int* pVectorA,
          __global int* pVectorB,
          __global int* pVectorC,
          int size,
          __local  int* pLocalTmp)
      {
          int gid = get_global_id(0);
          int lid = get_local_id(0);
          int gsize = get_global_size(0);
          int lsize = get_local_size(0);
          int grum = get_num_groups(0);
      
      //    if(gid == 0)printf("gsize=%d\tlsize=%d\tgrum=%d\n", gsize, lsize, grum);
      
          // Index local memory with the local id: each work group gets its own
          // local buffer, so only lid addresses it correctly once there is more
          // than one work group (and the buffer then only needs local_size ints).
          pLocalTmp[lid] = pVectorA[gid] * pVectorB[gid] + 4;
      
          barrier(CLK_LOCAL_MEM_FENCE);
      
          pVectorC[(size-1)-gid] = pLocalTmp[lid];
      }
      
      
      

       

      In the attachment you can find the complete sources (this is a QtCreator project).

        • Re: Is the APU always faster than the GPU?
          himanshu.gautam

          Hi,

          I am not very familiar with the hardware, but I came across a forum thread that discusses CPUs, GPUs and APUs. I hope it helps you understand the concepts:

           

          http://forums.anandtech.com/showthread.php?t=2141007

          • Re: Is the APU always faster than the GPU?
            daxh

            It looks like my whole time-measurement methodology was wrong. I am still reading the "AMD Accelerated Parallel Processing OpenCL Programming Guide" and I found a great example there, the "Parallel Min() function" (the second example in the book, page 41, section 1.10.2). This example also includes a simple benchmarking approach. I have ported it to the OpenCL C++ wrapper (a QtCreator project has been added as an attachment to the original post), and now I can see interesting results.

             

            If I enable power-saving mode in amdcccle, I get:

             

            DEVICE: GPU, Devastator

            COMPUT_UNITS: 6

            GLOBAL_WORK_SIZE: 4096

            LOCAL_WORK_SIZE: 64

            NUM_GROUPS: 64

            TIME: 5.29497

            B/W 6.34 GB/sec, 64 groups, 4096 threads, count 1024, stride 1

            result IS correct

             

            DEVICE: CPU, AMD A10-4600M APU with Radeon(tm) HD Graphics

            COMPUT_UNITS: 4

            GLOBAL_WORK_SIZE: 4

            LOCAL_WORK_SIZE: 1

            NUM_GROUPS: 4

            TIME: 9.10732

            B/W 3.68 GB/sec, 4 groups, 4 threads, count 1048576, stride 4

            result IS correct

             

            And if I enable high-performance mode, I get the following:

             

            DEVICE: GPU, Turks

            COMPUT_UNITS: 6

            GLOBAL_WORK_SIZE: 4096

            LOCAL_WORK_SIZE: 64

            NUM_GROUPS: 64

            TIME: 15.6707

            B/W 2.14 GB/sec, 64 groups, 4096 threads, count 1024, stride 1

            result IS correct

             

            DEVICE: CPU, AMD A10-4600M APU with Radeon(tm) HD Graphics

            COMPUT_UNITS: 4

            GLOBAL_WORK_SIZE: 4

            LOCAL_WORK_SIZE: 1

            NUM_GROUPS: 4

            TIME: 8.82985

            B/W 3.80 GB/sec, 4 groups, 4 threads, count 1048576, stride 4

            result IS correct

             

            So now I see that I really do have 3 OpenCL devices, but due to the driver's behavior (or for some reason I don't know), I can only use two of them at a time.

             

            And the performance is distributed among them as follows:

             

            1) GPU AMD Radeon HD 7660G {Devastator} - the fastest

            2) CPU AMD A10 - in the middle

            3) AMD Radeon HD 7670M {Turks} - the slowest

            • Re: Is the APU always faster than the GPU?
              moozoo

              >Size of test array is 1024, I am using

               

              Try this with a test array of size 100,000.

               

              I believe that with only 1024 elements, the GPU overheads make up the bulk of the time.

              • Re: Is the APU always faster than the GPU?
                daxh

                I have one more question here. What is the "fair" way to measure bandwidth or algorithm running time? I mean, there are two approaches:

                 

                1) Approach_1 - "Fair". Let me use some pseudo-code:

                 

                int numberOfTests = 1000;

                t_time_point start = getTimePoint();

                for(int iTest=0; iTest<numberOfTests;iTest++)

                {

                     // 1 Creating buffers

                     {...}

                 

                     // 2 Setting kernel arguments

                     {...}

                 

                     // 3 EnqueueNDRangeKernel

                     {...}

                 

                     // 4 EnqueueReadBuffer

                     {...}

                 

                     queue.finish();

                }

                unsigned int ms = getTimeDifference_ms(start); // algorithm working time in milliseconds

                float secs = (float)ms/1000; // algorithm working time in seconds

                 

                But according to the Parallel Min() function example I mentioned earlier, we can use another approach.

                 

                2) Approach_2 - "Unfair":

                 

                int numberOfTests = 1000;

                t_time_point start = getTimePoint();

                     // 1 Creating buffers

                     {...}

                 

                     // 2 Setting kernel arguments

                     {...}

                for(int iTest=0; iTest<numberOfTests;iTest++)

                {

                     // 3 EnqueueNDRangeKernel

                     {...}

                }

                     // 4 EnqueueReadBuffer

                     {...}

                queue.finish();

                unsigned int ms = getTimeDifference_ms(start); // algorithm working time in milliseconds

                float secs = (float)ms/1000; // algorithm working time in seconds

                 

                And of course the results are very different:

                 

                FAIR

                 

                Preferred wg size multiple for Devastator is 64.

                DEVICE: Devastator

                TIME: 0.572ms

                B/W: 0.0214825 GB/s

                 

                Preferred wg size multiple for AMD A10-4600M APU with Radeon(tm) HD Graphics is 1.

                DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics

                TIME: 0.088ms

                B/W: 0.139636 GB/s

                 

                NOT FAIR

                 

                Preferred wg size multiple for Devastator is 64.

                DEVICE: Devastator

                TIME: 0.01ms

                B/W: 1.2288 GB/s

                 

                Preferred wg size multiple for AMD A10-4600M APU with Radeon(tm) HD Graphics is 1.

                DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics

                TIME: 0.044ms

                B/W: 0.279273 GB/s

                 

                In the attachment to the original post I've added a new version of MatricesTest (a QtCreator project) that contains the related code. I hope you can help me understand this, because the results are completely different depending on the testing approach.