Hi. I am trying to verify simple heterogeneous computing with a CPU and a GPU using OpenCL. The kernel is a simple BLAS level 1 saxpy (single-precision scalar multiplication and vector addition), and I assign "n" elements to the CPU and "nn-n" to the GPU, where "nn" is the vector length. By varying n, I want to find the splitting point "n" that minimizes the total computation time.
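To make the split concrete, here is a minimal host-only sketch in plain C. It is not the OpenCL kernel itself. The function names are hypothetical, and it assumes the CPU share is the first n elements and the GPU share is the remaining nn-n; which end of the vector goes to which device is only an assumption for illustration. Both parts run sequentially here, so it shows the partitioning, not any concurrency.

```c
#include <assert.h>
#include <stddef.h>

/* Reference saxpy on a sub-range: y[i] = a * x[i] + y[i] for i in [begin, end). */
static void saxpy_range(float a, const float *x, float *y, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i)
        y[i] = a * x[i] + y[i];
}

/* Split nn elements at n (hypothetical helper):
 * [0, n)  stands in for the share given to the CPU,
 * [n, nn) stands in for the share given to the GPU.
 * Both shares are computed on the host, one after the other. */
static void saxpy_split(float a, const float *x, float *y, size_t nn, size_t n)
{
    assert(n <= nn);              /* the split point must lie inside the vector */
    saxpy_range(a, x, y, 0, n);   /* "CPU" share */
    saxpy_range(a, x, y, n, nn);  /* "GPU" share */
}
```

Calling saxpy_split(a, x, y, nn, n) gives the same y for every n between 0 and nn, so a reference like this could be used to check the combined CPU and GPU result while n is being varied.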
To find the ideal splitting point, the CPU and the GPU have to actually run concurrently under OpenCL. So I tried to verify that concurrency with a simple test program, as follows.
CPerfCounter t1;
...
t1.Reset();
t1.Start();
// Enqueue to write the target vectors x and y to GPU Global memory.
clEnqueueWriteBuffer(cqCommandQueue_gpu, cl_x, CL_FALSE, 0, sizeof(FLOAT)*(nn-n), x, 0, NULL, NULL);
clEnqueueWriteBuffer(cqCommandQueue_gpu, cl_y, CL_FALSE, 0, sizeof(FLOAT)*(nn-n), y, 0, NULL, NULL);
// Enqueue NDRange to CPU
err = clEnqueueNDRangeKernel(cqCommandQueue_cpu, ckKernel[1], 1, NULL, &GWS2, &LWS2, 0, NULL, NULL);
// Enqueue NDRange to GPU
clEnqueueNDRangeKernel(cqCommandQueue_gpu, ckKernel[0], 1, NULL, &GWS, &LWS, 0, NULL, NULL);
// Enqueue to read the result vector to Host memory
clEnqueueReadBuffer(cqCommandQueue_gpu, cl_y, CL_FALSE, 0, sizeof(FLOAT)*(nn-n), z, 0, NULL, NULL);
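// clFlush(cqCommandQueue_gpu);  // intentionally left out, see below
clFlush(cqCommandQueue_cpu);
// Block until both queues have drained before stopping the timer.
clFinish(cqCommandQueue_gpu);
clFinish(cqCommandQueue_cpu);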
t1.Stop();
I intentionally removed "clFlush(cqCommandQueue_gpu)" since it made no big difference to the results. Here is the profile information from the AMD Profiler. I found some strange results.
Case 1. Not executed in parallel
Case 2. Working properly
Case 3. Strangely, PCI Express holds the data while the CPU computes
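One way to put a number on these cases would be to compute how long the CPU and GPU work actually overlapped. The following is only a sketch under assumptions: the interval_t type and overlap_ns function are hypothetical, and the start/end timestamps are assumed to be available in nanoseconds on a common time base. In OpenCL such timestamps could come from clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on queues created with CL_QUEUE_PROFILING_ENABLE, but timestamps from two different devices are not guaranteed to share a clock, so that part would need checking.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record of when the work on one device started and ended,
 * in nanoseconds on a common time base (an assumption). */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} interval_t;

/* Length of time during which both intervals were active; 0 if they are disjoint. */
static uint64_t overlap_ns(interval_t a, interval_t b)
{
    assert(a.start_ns <= a.end_ns && b.start_ns <= b.end_ns);
    uint64_t lo = a.start_ns > b.start_ns ? a.start_ns : b.start_ns; /* later start */
    uint64_t hi = a.end_ns < b.end_ns ? a.end_ns : b.end_ns;         /* earlier end */
    return hi > lo ? hi - lo : 0;
}
```

With this, a run like Case 1 would show an overlap of about 0, while a run like Case 2 would show an overlap close to the shorter of the two intervals.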
How can I analyze these results?
Thanks in advance.
------------- My information
Windows 7 64-bit, VS 2010
CPU : FX 8120
GPU : Radeon 7970