
shingoxlf
Journeyman III

Using a single OpenCL thread on an AMD CPU runs slower than sequential code

Hi, I was trying to parallelize an optical flow algorithm using OpenCL.

I have a sequential version running on the CPU, and an OpenCL version that uses only one thread, running on an AMD CPU.

However, the time spent in the kernel for the OpenCL version is about 10 times more than the sequential one, even though they do exactly the same thing.

Can anyone tell me why the OpenCL version is so slow?

0 Likes
5 Replies
timattox
Adept I

You have not given us enough detail to comment on why your OpenCL code is running slowly on a CPU.  However, here are some generalities as best I know them:

  1. Running OpenCL on a CPU is not its primary target market, so it has not received as much attention for optimizations.
  2. When running on a CPU, it might have to emulate some hardware features (such as the functions for working with image buffers).
  3. The memory system of a typical CPU has very different performance tradeoffs than that of a GPU, so some algorithm/data-structure constructs you would write for a GPU won't run as well on a CPU (and vice versa).
Please post some of your code so the community can give more specific feedback.  OpenCL is meant to be functionally compatible across various architectures (Cell, CPU, GPU), but there is no promise, or even expectation, that OpenCL code will be performance portable.  At least for my code I see fairly nice performance portability across GPU vendors, and that is the best I could hope for.  And several things I've done to my OpenCL code as I've developed it made it run much faster on a GPU, while simultaneously making it slower when run on a CPU.
For instance, my original serial C code runs about twice as fast as my current GPU-tuned OpenCL code if the OpenCL code is run with a single CPU thread. However, once I give my OpenCL code more than one CPU core, it runs faster than the original serial C code, and it scales fairly efficiently to 12 to 16 CPU threads.  It runs even better on real GPU hardware.



Hi, I just realized the time-consuming part is the memory copy; the kernel actually runs really fast. Here is the main code:

 

for (int l = LEVELS-1; l >= 0; l--) {
    int memcpy_time = getTimeNow();

    size_t volumeSize[] = {pyr_w, pyr_h};
    float *temp1 = (float *)malloc(sizeof(float)*pyr_w*pyr_h);
    float *temp2 = (float *)malloc(sizeof(float)*pyr_w*pyr_h);

    errcode = clEnqueueReadBuffer(clCommandQue, gpu_img_pyramid_prev, CL_TRUE, 0, sizeof(float)*pyr_w*pyr_h, temp1, 0, NULL, NULL);
    if (errcode != CL_SUCCESS) printf("Error in reading gpu_img_pyramid_prev %d\n", l);

    errcode = clEnqueueReadBuffer(clCommandQue, gpu_img_pyramid_cur, CL_TRUE, 0, sizeof(float)*pyr_w*pyr_h, temp2, 0, NULL, NULL);
    if (errcode != CL_SUCCESS) printf("Error in reading gpu_img_pyramid_cur %d\n", l);

    texRef_pyramid_prev = initTexture2D(temp1, volumeSize);
    texRef_pyramid_cur  = initTexture2D(temp2, volumeSize);

    errcode = clFlush(clCommandQue);

    free(temp1);
    free(temp2);

    printf("Tracking level %d memcpy time: %d ms\n", l, getTimeNow() - memcpy_time);

    char initGuess = (l == LEVELS-1);
    int l_temp = l;

    errcode  = clSetKernelArg(clKernel_track, 0, sizeof(int), (void *)&w);
    errcode |= clSetKernelArg(clKernel_track, 1, sizeof(int), (void *)&h);
    errcode |= clSetKernelArg(clKernel_track, 2, sizeof(int), (void *)&pyr_w);
    errcode |= clSetKernelArg(clKernel_track, 3, sizeof(int), (void *)&pyr_h);
    errcode |= clSetKernelArg(clKernel_track, 4, sizeof(float), (void *)&scaling);
    errcode |= clSetKernelArg(clKernel_track, 5, sizeof(int), (void *)&l_temp);
    errcode |= clSetKernelArg(clKernel_track, 6, sizeof(char), (void *)&initGuess);
    errcode |= clSetKernelArg(clKernel_track, 7, sizeof(cl_mem), (void *)&gpu_dx);
    errcode |= clSetKernelArg(clKernel_track, 8, sizeof(cl_mem), (void *)&gpu_dy);
    errcode |= clSetKernelArg(clKernel_track, 9, sizeof(cl_mem), (void *)&gpu_status);
    errcode |= clSetKernelArg(clKernel_track, 10, sizeof(cl_mem), (void *)&texRef_pyramid_prev);
    errcode |= clSetKernelArg(clKernel_track, 11, sizeof(cl_mem), (void *)&texRef_pyramid_cur);
    errcode |= clSetKernelArg(clKernel_track, 12, sizeof(cl_sampler), (void *)&volumeSamplerLinear);
    if (errcode != CL_SUCCESS) printf("Error in setting arguments track");

    // Execute the OpenCL kernel
    int kernel_time = getTimeNow();
    errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel_track, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
    if (errcode != CL_SUCCESS) printf("Error in launching kernel track");
    errcode = clFlush(clCommandQue);
    printf("Tracking level %d : %d ms\n", l, getTimeNow() - kernel_time);
}



 

 

The timing results for this part of the code are below (the "Time for texture creation" lines come from initTexture2D):

 

 

Time for texture creation: 0 ms
Time for texture creation: 0 ms
Tracking level 2 memcpy time: 0 ms
Tracking level 2 : 0 ms
Time for texture creation: 1 ms
Time for texture creation: 0 ms
Tracking level 1 memcpy time: 18338 ms
Tracking level 1 : 0 ms
Time for texture creation: 3 ms
Time for texture creation: 3 ms
Tracking level 0 memcpy time: 11615 ms
Tracking level 0 : 0 ms
Tracking: 29966 ms

 

So I am wondering: is there any way to reduce this memory copy time?



 


Originally posted by: shingoxlf Hi, I just realized the time-consuming part is the memory copy; the kernel actually runs really fast. Here is the main code:

So I am wondering: is there any way to reduce this memory copy time?

Please go through section 4 of the OpenCL programming guide.


What puzzles me is that the memcpy of a 300 x 300 float buffer costs 18338 ms, as shown in the running results. Do you have any clue about this expensive memory operation?

Originally posted by: genaganna Please go through section 4 of the OpenCL programming guide.

Your host code isn't giving you useful numbers: you need to use clFinish() if you want to completely separate the timing intervals and wait for quiescence before continuing.  Since the only synchronisation points are actually the blocking memory copies (which perform an implicit clFinish()), they are the only thing that will appear to take any time.

And just what are you doing with the image pyramids?  Reading them from device buffers to the host, then copying them back to the device as a texture?  Weird ...

Even on the CPU you might need a different design for the implementation: but you don't give enough info to suggest what might be wrong.
