
carter45
Journeyman III

Why does this code run faster on the CPU than on the GPU?

Hello everyone,

I am currently trying to get familiar with JOCL and learn the basics.

To do that, I tried a basic sample in which I fill an array representing an image with shades of blue, so that every work-item writes its own intensity value for the blue component.

Here is the kernel:

    __kernel void sampleKernel(__global float *intensitys, __global float *picture)
    {
        int gid = get_global_id(0);
        int width = 1800;
        int height = 1000;
        for (int j = 0; j < 2000; j++) {
            int position = (height - gid - 1) * width;
            for (int i = 0; i < width; i++) {
                picture[position + i] = 255 * intensitys[gid];
            }
        }
    }

I added the 2000-iteration outer loop only to increase the computation time so that I can benchmark it better; it has no influence on the final image.

My problem is that the execution time on the GPU is longer than on the CPU.

I use a global_work_size of 1000, one work-item per line of the image:

    local_work_size 64 on the GPU: execution time 540 ms
    local_work_size 4 on the CPU:  execution time 387 ms

I tried several local_work_sizes, but the GPU was always slower.

I thought it could be the I/O between GPU and CPU, but removing the 2000-iteration loop results in nearly 0 ms computation time on both GPU and CPU, and doubling the loop to 4000 doubles the computation time, so the I/O has no big influence on the measurements.

I really don't know why; the GPU, with its 1000 shaders, should perform much better than the CPU with its 4 cores.

I appreciate every hint. Thanks for your help in advance!

The code is in the appendix.

3 Replies
ravkum
Staff

Hi,

The kernel is memory-bound, not compute-bound.

The reason the 2000-iteration loop count doesn't change the performance numbers is that the compiler is most likely optimizing it away: it sees that only the last iteration's results are saved in the output, and everything else is overwritten or unused, so the loop is eliminated as dead code. You can try changing

     picture[position+i] = 255 * intensitys[gid];

to

     picture[position+i] += 255 * intensitys[gid];


This makes sure that all the loop iterations matter.


Have you tried using a global size of 1800 × 1000 instead of 1000, i.e. one work-item for every pixel?


Also you may want to read AMD APP Programming guide to learn more about OpenCL Optimization.


Regards,


Hi,

thanks for your reply.

Changing the dimensions from

    long global_work_size[] = new long[]{1000};

to

    long global_work_size[] = new long[]{1800, 1000};

did the trick. It now runs roughly 70 to 100 times faster on the GPU.

With only the change to

    picture[position+i] += 255 * intensitys[gid];

the GPU was still 40% slower.

I think the 2000-iteration loop did influence the performance, because doubling it to 4000 also doubled the computation time.

Thanks for the help.

But I still ask myself why the first setup didn't work: having 1800 × 1000 work-items is certainly more parallel than only 1000, but even with only 1000 the GPU should still perform better than the CPU!?


A modern GPU like the Radeon R9 290X has 2816 stream processors, so you need at least 2816 work-items (in practice many more, to hide memory latency) just to occupy the hardware. A GPU also has a higher latency to start a task than a CPU, so the task must run long enough that the launch latency is not a major part of the execution time.
