5 Replies Latest reply on Feb 20, 2013 6:07 AM by himanshu.gautam

    GPU is not faster than CPU


      Hi everyone,


      I created a kernel using OpenCL, but it does not run faster than the CPU version. Can I do anything to make it run faster? The code is here:


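          int r[2800 + 1];
          int i, k;
          int b, d;
          int c = 0;

          for (i = 0; i < 2800; i++) {
              r[i] = 2000;
          }
          r[i] = 0;  /* r[2800] is read on the first pass, so give it a defined value */

          for (k = 2800; k > 0; k -= 14) {
              d = 0;

              i = k;
              for (;;) {
                  d += r[i] * 10000;
                  b = 2 * i - 1;

                  r[i] = d % b;
                  d /= b;
                  i--;  /* step to the next term; without this the loop never ends */
                  if (i == 0) break;
                  d *= i;
              }
              printf("%.4d", c + d / 10000);
              c = d % 10000;
          }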

        • Re: GPU is not faster than CPU

          Hi notooth,

          Can you tell us which part of the code above is the CPU implementation and which is the OpenCL kernel implementation?

          Please post a copy of your code (as zip file) so that we can reproduce here.

          Please include the following details as well.


          1. Platform: win32 / win64 / lin32 / lin64 or something else?

                Windows 7, Windows Vista or Windows 8? Similarly for Linux, your distribution.

          2. Version of driver

          3. CPU or GPU Target?

          4. CPU/GPU details of your hardware

            • Re: GPU is not faster than CPU

              Here is the code:



              Here is my system details:

              Windows 7 64bit

              Driver version 13.1

              CPU Intel Core 2 Duo E4500 2.2Ghz

              GPU AMD Radeon HD 6770

                • Re: GPU is not faster than CPU

                  Hi notooth,

                  I would suggest attaching the code to the forum post itself using the advanced editor. Also, there is no need to attach a large number of unnecessary VS files. I have done it here though.


                  Now let's go into the code:

                  1. How many global threads are you creating in total? It looks like global_threads is equal to stringlength in the code, and the string used has a length of just 1. The commented-out string is also no longer than 10. These are very small numbers for a GPU with hundreds of stream processors. It is generally recommended to have 4 times as many threads as you have stream processors in your GPU to achieve good occupancy.

                  2. The kernel has a complicated-looking while loop which, by the way, does not affect the output of the kernel. So I guess that may just be dummy code. In that case it is hard to say how much of your loop will be optimized out. So make sure you actually use the results of the while loop in the kernel's output.

                  3. And yes, assuming nothing is optimized out, you have a big private array of 2800 ints per work-item, which is horribly slow. The big while loop is run 1000 times by each work-item. Can you explain how you are parallelizing the total work to be done here?

                  4. You could look into some more samples. Hello World is only meant to explain the basics of writing OpenCL programs. Also read the AMD OpenCL Programming Guide to learn what kinds of algorithms can be accelerated on GPUs.
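
The sizing rule of thumb in point 1 can be turned into a small host-side calculation. The following is a minimal sketch in plain C, not code from the attached project: `round_up` and `suggested_global_size` are hypothetical helper names, and the stream-processor count and work-group size are values you would have to supply for your own device.

```c
#include <stddef.h>

/* Round n up to the next multiple of m (m must be > 0). */
static size_t round_up(size_t n, size_t m) {
    return (n + m - 1) / m * m;
}

/* Hypothetical helper: a lower bound for the global work size, using the
 * "4 work-items per stream processor" rule of thumb from the thread,
 * rounded up to a whole number of work-groups of local_size items. */
static size_t suggested_global_size(size_t stream_processors,
                                    size_t local_size) {
    return round_up(4 * stream_processors, local_size);
}
```

For example, assuming a device with 800 stream processors and a work-group size of 64, `suggested_global_size(800, 64)` returns 3200. A global size of 1 or 10, as derived from the string length, is far below that.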

              • Re: GPU is not faster than CPU

                  int r[2800 + 1];


                Your kernel is allocating a huge private array.

                On top of that, you are not using constant indices to access it.

                So the compiler will place it in global memory or something similar, which will slow it down very badly.
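
To put a number on "huge", the footprint of `r` can be worked out directly. This is a rough sketch in plain C under two assumptions: an OpenCL `int` is 32 bits (modelled here with `int32_t`), and every work-item keeps its own private copy. The helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { R_LEN = 2800 + 1 };  /* number of elements in r */

/* Bytes of private storage a single work-item needs for r. */
static size_t r_bytes_per_work_item(void) {
    return (size_t)R_LEN * sizeof(int32_t);
}

/* Bytes needed when work_items work-items each hold a private copy. */
static size_t r_bytes_total(size_t work_items) {
    return work_items * r_bytes_per_work_item();
}
```

That is 11204 bytes per work-item, and 717056 bytes (about 700 KiB) for a group of 64 work-items, all of it accessed with indices that are only known at run time.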

                • Re: GPU is not faster than CPU



                  Try using local memory (the local data store):

                  local int r[2800 + 1];


                  Also restrict printf() to one specific thread, because right now there is a lot of work involved in interleaving the printf output of all your threads.

                  if (get_global_id(0) == InspectedThreadID) printf();


                  Also, I see this is test code and all the threads calculate the same data, but it does the same work at least 64x in parallel (or whatever the number of threads in your NDRange is).
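
Two pieces of advice in this thread, keeping printf out of the inner work and making sure the loop's results end up in the output, can be sketched on the CPU side by writing the digits into a buffer instead of printing them. This is a minimal sketch in plain C, not the poster's kernel: `pi_digits`, `TERMS` and `DIGITS` are hypothetical names, and setting `r[TERMS]` to 0 is an assumption, since the posted code never initializes that element.

```c
#include <stdio.h>

enum { TERMS = 2800, DIGITS = TERMS / 14 * 4 };  /* 200 groups of 4 digits */

/* Writes DIGITS decimal digits of pi (no decimal point) followed by a
 * terminating NUL into out. out must have room for DIGITS + 1 chars. */
static void pi_digits(char *out) {
    int r[TERMS + 1];
    int i, k, b, d;
    int c = 0;
    char *p = out;

    for (i = 0; i < TERMS; i++) {
        r[i] = 2000;
    }
    r[TERMS] = 0;  /* assumption: the last element starts at zero */

    for (k = TERMS; k > 0; k -= 14) {
        d = 0;
        i = k;
        for (;;) {
            d += r[i] * 10000;
            b = 2 * i - 1;
            r[i] = d % b;
            d /= b;
            i--;
            if (i == 0) break;
            d *= i;
        }
        /* Store four digits; the size limit keeps the write in bounds. */
        snprintf(p, 5, "%.4d", c + d / 10000);
        p += 4;
        c = d % 10000;
    }
}
```

A caller would declare `char buf[DIGITS + 1];`, call `pi_digits(buf)`, and print or compare the buffer once at the end. In a kernel, the equivalent would be writing the result to an output buffer that the host reads back.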