
Concurrent kernel execution between CPU and GPU

Question asked by hschu on May 6, 2013
Latest reply on May 13, 2013 by himanshu.gautam

Hi. I am trying to verify simple heterogeneous computing with a CPU and a GPU using OpenCL. The kernel is a simple BLAS level 1 saxpy (single-precision scalar multiplication and vector addition), and I assign "n" elements to the CPU and "nn-n" elements to the GPU, where "nn" is the vector length. By varying "n", I want to find the splitting point that minimizes the total computation time.
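
The partitioning and the split-point search can be sketched in plain C on the host, without OpenCL. This is only an illustration under stated assumptions: the names saxpy_range, saxpy_split and best_split are hypothetical and not from the actual program, and the cost model (per-element costs cpu_cost and gpu_cost plus a fixed GPU overhead gpu_fixed) is an assumption, not a measurement.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [begin, end). */
void saxpy_range(float a, const float *x, float *y,
                 size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i)
        y[i] = a * x[i] + y[i];
}

/* Split one saxpy of length nn at index n: the first n elements stand in
 * for the CPU share, the remaining nn - n for the GPU share. Both parts
 * run one after the other on the host here; only the indexing is shown. */
void saxpy_split(float a, const float *x, float *y, size_t nn, size_t n)
{
    assert(n <= nn);
    saxpy_range(a, x, y, 0, n);   /* "CPU" part */
    saxpy_range(a, x, y, n, nn);  /* "GPU" part */
}

/* Hypothetical cost model: if the two devices really overlap, the wall
 * time is the maximum of the two device times, not their sum.
 * cpu_cost and gpu_cost are assumed per-element times; gpu_fixed is an
 * assumed fixed transfer/launch overhead, paid only if the GPU gets work.
 * Returns the smallest n that minimizes the modeled wall time. */
size_t best_split(size_t nn, double cpu_cost, double gpu_cost,
                  double gpu_fixed)
{
    size_t best_n = 0;
    double best_t = -1.0;
    for (size_t n = 0; n <= nn; ++n) {
        double t_cpu = cpu_cost * (double)n;
        double t_gpu = (n < nn) ? gpu_fixed + gpu_cost * (double)(nn - n)
                                : 0.0;
        double t = (t_cpu > t_gpu) ? t_cpu : t_gpu;
        if (best_t < 0.0 || t < best_t) {
            best_t = t;
            best_n = n;
        }
    }
    return best_n;
}
```

For example, with assumed costs of 4 time units per element on the CPU and 1 on the GPU, and no fixed overhead, best_split(100, 4.0, 1.0, 0.0) returns 20, because 4*20 = 100 - 20 = 80 balances both devices. If the devices do not actually overlap, the total is the sum of both times instead and this model no longer applies.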


To find the ideal splitting point, OpenCL has to execute the CPU and GPU kernels concurrently on the heterogeneous system. So I tried to verify that concurrency with a simple test program, as follows.


   CPerfCounter t1;

   // Enqueue non-blocking writes of the target vectors x and y to GPU global memory.
   clEnqueueWriteBuffer(cqCommandQueue_gpu, cl_x, CL_FALSE, 0, sizeof(FLOAT)*(nn-n), x, 0, NULL, NULL);
   clEnqueueWriteBuffer(cqCommandQueue_gpu, cl_y, CL_FALSE, 0, sizeof(FLOAT)*(nn-n), y, 0, NULL, NULL);

   // Enqueue the NDRange kernel on the CPU queue.
   err = clEnqueueNDRangeKernel(cqCommandQueue_cpu, ckKernel[1], 1, NULL, &GWS2, &LWS2, 0, NULL, NULL);

   // Enqueue the NDRange kernel on the GPU queue.
   clEnqueueNDRangeKernel(cqCommandQueue_gpu, ckKernel[0], 1, NULL, &GWS, &LWS, 0, NULL, NULL);

   // Enqueue a non-blocking read of the result vector back to host memory.
   clEnqueueReadBuffer(cqCommandQueue_gpu, cl_y, CL_FALSE, 0, sizeof(FLOAT)*(nn-n), z, 0, NULL, NULL);

I intentionally removed "clFlush(cqCommandQueue_gpu)" since it made no big difference to the results. Below is the profile information from the AMD profiler. I found some strange results.


Case 1. Not executed in parallel

Case 2. Working properly

Case 3. Strangely, PCI Express holds the data while the CPU computes

How can I analyze these results?

Thanks in advance.


------------- My system information

Windows 7 64-bit, VS 2010

CPU : FX 8120

GPU : Radeon 7970