3 Replies Latest reply on Feb 27, 2014 11:32 AM by nou

    Why does this code run faster on the CPU than on the GPU?

    carter45

      Hello everyone,

       

      I am currently trying to get familiar with JOCL and learn the basics.

      For that I tried a basic sample in which I fill an array representing an image with shades of blue, so that every work-item has its own intensity value for the blue component.

      Here's the example:

       

      __kernel void sampleKernel(__global float *intensitys, __global float *picture)
      {
          int gid = get_global_id(0);   // one work-item per image row
          int width = 1800;
          int height = 1000;
          // Repeat the fill 2000 times purely to lengthen the run for benchmarking.
          for (int j = 0; j < 2000; j++) {
              int position = (height - gid - 1) * width;   // start of this work-item's row
              for (int i = 0; i < width; i++) {
                  picture[position + i] = 255 * intensitys[gid];
              }
          }
      }

       

      I added the 2000-iteration loop only to increase the computation time so that I can benchmark it better. It has no influence on the final image.

       

      My problem is that the execution time on the GPU is longer than on the CPU.

      I use a global_work_size of 1000, one work-item for every line of the image:

      local_work_size 64 for GPU     execution time: 540 ms

      local_work_size 4 for CPU       execution time: 387 ms

       

      I tried several values for local_work_size, but the GPU was always slower.

      I thought it could be the IO between GPU and CPU, but removing the 2000-iteration loop results in computation times of nearly 0 ms for both GPU and CPU.

      Doubling the loop count to 4000 doubles the computation time, so the IO has no big influence on the computation time.

      I really don't know why this happens; with its 1000 shaders, the GPU should perform much better than the CPU with its 4 cores.

       

      I appreciate every hint. Thanks for your help in advance!

       

      The code is in the appendix

        • Re: Why does this code run faster on the CPU than on the GPU?
          ravkum

          Hi,

           

          The kernel is memory-bound, not compute-bound.
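To put a rough number on "memory-bound", here is a minimal back-of-the-envelope sketch in plain C. It uses the image size (1800 x 1000 floats), the 2000 repetitions, and the timings quoted in the question. The function names are made up for illustration, and the calculation assumes that every store in every repetition really reaches memory, which may not hold if the compiler removes some of them.

```c
#include <stdint.h>

/* Bytes written to `picture` by one full pass over the image
   (one float per pixel). */
uint64_t bytes_per_pass(uint64_t width, uint64_t height) {
    return width * height * sizeof(float);
}

/* Total bytes written if every one of `passes` repetitions is executed. */
uint64_t bytes_total(uint64_t width, uint64_t height, uint64_t passes) {
    return bytes_per_pass(width, height) * passes;
}

/* Effective write rate in GB/s (1 GB = 1e9 bytes) for a run that
   took `ms` milliseconds. */
double effective_gb_per_s(uint64_t bytes, double ms) {
    return (double)bytes / 1e9 / (ms / 1000.0);
}
```

Under that assumption, `bytes_total(1800, 1000, 2000)` is 14.4 GB of stores for a single multiply per store, and the measured 540 ms and 387 ms would correspond to roughly 27 GB/s and 37 GB/s. With so little arithmetic per byte, the way memory is accessed is likely to matter more than the number of shaders.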

           

          The reason the 2000-iteration loop doesn't change the performance numbers is most likely that the compiler is optimizing it away. It sees that only the results of the last iteration are saved in the output and that everything else is overwritten or unused, so the loop is removed. You can try changing

               picture[position+i] = 255 * intensitys[gid];

           

          to

           

               picture[position+i] += 255 * intensitys[gid];


          which makes sure that all the loop iterations matter.


          Have you tried using a global size of 1800 x 1000 instead of 1000, that is, one work-item for every pixel?
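A minimal sketch of what "one work-item for every pixel" could look like, written as plain C so it can be compiled without an OpenCL runtime. `fill_pixel` and `run_grid` are hypothetical names. In a real OpenCL C kernel the body of `fill_pixel` would be the kernel itself, with `x` and `y` taken from `get_global_id(0)` and `get_global_id(1)`, and the two loops in `run_grid` would be replaced by the 2D global work size.

```c
#include <stddef.h>

/* Work done by a single work-item (x, y). In OpenCL C, x and y would come
   from get_global_id(0) and get_global_id(1). */
void fill_pixel(size_t x, size_t y, size_t width, size_t height,
                const float *intensities, float *picture) {
    size_t row = height - y - 1;   /* same vertical flip as the original kernel */
    picture[row * width + x] = 255.0f * intensities[y];
}

/* Host-side stand-in for a width x height global work size:
   calls fill_pixel exactly once per pixel. */
void run_grid(size_t width, size_t height,
              const float *intensities, float *picture) {
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            fill_pixel(x, y, width, height, intensities, picture);
        }
    }
}
```

With this layout the inner `for` loop over `i` disappears from the kernel, and each work-item performs a single store.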


          Also, you may want to read the AMD APP Programming Guide to learn more about OpenCL optimization.


          Regards,

            • Re: Why does this code run faster on the CPU than on the GPU?
              carter45

              Hi,

               

              Thanks for your reply.

               

              Changing the dimension from

                  long global_work_size[] = new long[]{1000};

              to

                  long global_work_size[] = new long[2];
                  global_work_size[0] = 1800;
                  global_work_size[1] = 1000;

              did the trick. It now runs about 70 to 100 times faster on the GPU.

               

              After changing the assignment to

                  picture[position+i] += 255 * intensitys[gid];

              the GPU was still 40% slower.

              I think the 2000-iteration loop did influence the performance, because when I doubled it to 4000 the computation time also doubled.

               

              Thanks for the help.

               

              But I still wonder why the first solution didn't work. Having 1800*1000 work-items is certainly more parallel than having only 1000, but shouldn't the GPU still perform better with only 1000?
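One possible factor, not confirmed anywhere in this thread, is the memory access pattern. The sketch below, in plain C with hypothetical function names, compares how far apart in memory two neighbouring work-items write at the same moment in the two layouts. It assumes the 2D version maps `get_global_id(0)` to the column and `get_global_id(1)` to the row.

```c
#include <stddef.h>

/* Element index written by work-item `gid` at inner-loop step `i`
   in the original one-work-item-per-row kernel. */
size_t index_row_per_item(size_t gid, size_t i, size_t width, size_t height) {
    return (height - gid - 1) * width + i;
}

/* Element index written by work-item (x, y) in a one-work-item-per-pixel
   layout (assumed mapping: x = column, y = row). */
size_t index_pixel_per_item(size_t x, size_t y, size_t width, size_t height) {
    return (height - y - 1) * width + x;
}

/* Distance in bytes between two float element indices. */
size_t byte_distance(size_t a, size_t b) {
    return (a > b ? a - b : b - a) * sizeof(float);
}
```

For the 1800 x 1000 image, work-items 0 and 1 of the original kernel write 7200 bytes apart at every step of the inner loop, while neighbouring work-items in the per-pixel layout write 4 bytes apart. GPUs generally combine stores from neighbouring work-items only when the addresses are adjacent, so the per-row layout would be expected to waste most of the memory bandwidth. In addition, 1000 work-items may simply be too few to keep around 1000 shaders busy while they wait on memory.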