
Why does this code run faster on the CPU than on the GPU?

Question asked by carter45 on Feb 26, 2014
Latest reply on Feb 27, 2014 by nou

Hello to everyone,


I am currently trying to get familiar with JOCL and learn the basics.


For that I wrote a basic sample that fills an array representing an image with shades of blue, so that every work-item has its own intensity value for the blue component.

Here is the example:


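__kernel void sampleKernel(__global float *intensitys, __global float *picture)
{
    // One work-item per image row.
    int gid = get_global_id(0);
    int width = 1800;
    int height = 1000;

    // The outer loop only repeats the same work 2000 times to lengthen the run time.
    for (int j = 0; j < 2000; j++) {
        // Start index of this work-item's row (work-item 0 writes the last row of the array).
        int position = (height - gid - 1) * width;
        for (int i = 0; i < width; i++) {
            picture[position + i] = 255 * intensitys[gid];
        }
    }
}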

I added the outer loop of 2000 iterations only to increase the computation time, so that I can benchmark it better. It has no influence on the final image.
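As a host-side illustration only (plain C, not part of the original code; the helper names `element_index` and `neighbour_stride_bytes` are made up for this sketch), the kernel's index arithmetic can be written out as below. It shows which element of `picture` a given work-item writes at a given inner-loop step, and how far apart the writes of two neighbouring work-items are at the same step. That distance may be worth looking at when comparing devices.

```c
#include <stddef.h>

/* Index into `picture` written by work-item `gid` at inner-loop step `i`,
   mirroring the kernel's `position + i` computation. */
static size_t element_index(int gid, int i, int width, int height)
{
    size_t position = (size_t)(height - gid - 1) * (size_t)width;
    return position + (size_t)i;
}

/* Distance in bytes between the elements that work-items `gid` and `gid + 1`
   write at the same inner-loop step. Requires gid + 1 < height. */
static size_t neighbour_stride_bytes(int gid, int i, int width, int height)
{
    size_t a = element_index(gid, i, width, height);     /* larger index  */
    size_t b = element_index(gid + 1, i, width, height); /* one row lower */
    return (a - b) * sizeof(float);
}
```

With width 1800 and height 1000, `element_index(0, 0, 1800, 1000)` is 1798200, and neighbouring work-items write a full row apart, which is 7200 bytes assuming a 4-byte float.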


My problem is that the execution time on the GPU is longer than on the CPU.

I use a global_work_size of 1000, one work-item for every line of the image:

local_work_size 64 for the GPU: execution time 540 ms

local_work_size 4 for the CPU: execution time 387 ms


I tried several values of local_work_size, but the GPU was always slower.


I thought it could be the I/O between GPU and CPU, but removing the 2000-loop results in computation times of nearly 0 ms for both GPU and CPU.


Doubling the loop to 4000 iterations doubles the computation time, so the I/O has no big influence on it.
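To put the measured times in perspective, here is a rough back-of-the-envelope sketch in plain C. The function names are hypothetical, and it assumes 4-byte floats and that every store in the loop really reaches memory, which a compiler or cache may not honor. It counts the stores the kernel issues and turns a measured time into an effective store rate.

```c
/* Number of float stores the kernel issues in total:
   one per pixel of each row, repeated `repeats` times. */
static long long total_stores(int work_items, int width, int repeats)
{
    return (long long)work_items * width * repeats;
}

/* Effective store rate in GB/s (10^9 bytes per second), assuming 4-byte floats
   and that no store is optimized away or absorbed by a cache. */
static double effective_gb_per_s(long long stores, double milliseconds)
{
    double bytes = (double)stores * 4.0;
    return bytes / (milliseconds / 1000.0) / 1e9;
}
```

For 1000 work-items, a width of 1800 and 2000 repetitions this gives 3.6 billion stores, which works out to roughly 26.7 GB/s for the 540 ms GPU run and roughly 37.2 GB/s for the 387 ms CPU run. These are only rough figures under the assumptions stated above.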


I really don't know why. With its 1000 shaders, the GPU should perform much better than the CPU with its 4 cores.


I appreciate every hint. Thanks for your help in advance!


The code is in the attachment.