Archives Discussions

chrisgregg · ‎11-17-2009

I've written a kernel that I would like to run on both the GPU and the CPU concurrently. First, I launch the GPU kernel, and then immediately launch the CPU kernel. The timers I have indicate that the GPU kernel doesn't block and the CPU kernel is indeed queued up immediately.

Unfortunately, it seems that the kernels don't run concurrently, or else I'm timing something incorrectly. What I'd like to see is the following (for instance):

Time for a single kernel to run on the GPU = 5 sec

Time for a single kernel to run on the CPU = 8 sec

Time for both, running concurrently = around 8 sec (but I'm seeing 13 sec)

I guess my question is about whether or not clEnqueueNDRangeKernel() will forward a kernel on to a second processor in a two-processor system if the first processor is already running a kernel. Thanks!

Edit: A related question would be: is it possible to have two queues, or is there only one queue because there is only one global clEnqueueNDRangeKernel() ?

nou · ‎11-18-2009

The specification says: "command_queue is a valid command-queue. The kernel will be queued for execution on the device associated with command_queue."
So you need two command queues, one for each device.

chrisgregg · ‎11-18-2009

Thanks for the reply. I did make two command queues, one for each device. I've tried the experiment a couple of different ways: first, I implemented both kernel calls within one program, and that led to the results I mentioned above (the cpu kernel seems to wait for the gpu kernel to be done before running).

Second, I ran a shell script that ran two instances of the same program, with a flag to select the processor. The first line of the script runs a GPU kernel, and the second line runs a CPU kernel. The results do show that the CPU kernel runs concurrently, but there is definitely some contention between the two. I will investigate whether it involves memory or something else.

chrisgregg · ‎11-18-2009

Further information: I checked the

CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START, and CL_PROFILING_COMMAND_END

for each kernel, and the second kernel (the CPU kernel) is indeed waiting until the first kernel (the GPU kernel) finishes before it gets submitted. Both end up in the queue immediately (and there are two command queues), but the second doesn't get submitted until the first finishes.
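The check described above can be sketched in a few lines. This is only an illustration, not the code from the program: the ProfileTimes struct and the two helper functions are hypothetical names, and the four values are assumed to have been read already with clGetEventProfilingInfo() (which reports nanoseconds). It also assumes the timestamps of the two devices are on a comparable clock, which may not hold for every implementation.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical container for the four profiling timestamps of one event,
// in nanoseconds, as returned for CL_PROFILING_COMMAND_QUEUED, _SUBMIT,
// _START and _END.
struct ProfileTimes {
    std::uint64_t queued;
    std::uint64_t submit;
    std::uint64_t start;
    std::uint64_t end;
};

// Number of nanoseconds during which both commands were executing.
// Returns 0 if the two [start, end] intervals do not overlap.
std::uint64_t overlapNs(const ProfileTimes& a, const ProfileTimes& b)
{
    const std::uint64_t lo = std::max(a.start, b.start);
    const std::uint64_t hi = std::min(a.end, b.end);
    return hi > lo ? hi - lo : 0;
}

// True if the second command was only submitted after the first one had
// finished executing, i.e. the pattern described in the post above.
bool submittedAfterEnd(const ProfileTimes& first, const ProfileTimes& second)
{
    return second.submit >= first.end;
}
```

With the behaviour I am seeing, overlapNs(gpu, cpu) would come out as 0 and submittedAfterEnd(gpu, cpu) as true; with truly concurrent execution I would expect an overlap close to the GPU kernel's run time.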

AndreasStahl · ‎11-18-2009

Could you post the scheduling portion of your host code?

chrisgregg · ‎11-18-2009

Originally posted by: AndreasStahl Could you post the scheduling portion of your host code?

Okay, I'll try to upload the part that matters. The code is based on the sample code from ATI, so there is a lot of stuff going on.

Here's what I do: I have a class, called EP2, which is derived from SDKSample. I've broken the runCLKernels() function into two functions, the first of which launches the kernels. I've attached both functions.

In main(), I instantiate two objects, clEP2 and clEP2b. Then I initialize each of them, set up each of them, and run each of them, which calls runCLKernels1() for each kernel. Then I call runCLKernels2() for each kernel, which waits for the kernel to finish, then waits for the read buffer to finish, and returns. Finally, I run the cleanup routines. Cheers!

int EP2::runCLKernels1(void)
{
    cl_int status;

    size_t globalThreads[1] = {numberOfFloats};
    size_t localThreads[1]  = {blockSize};

    status = clGetKernelWorkGroupInfo(
        kernel, devices[0], CL_KERNEL_LOCAL_MEM_SIZE,
        sizeof(cl_ulong), &usedLocalMemory, NULL);
    if(!sampleCommon->checkVal(status, CL_SUCCESS,
            "clGetKernelWorkGroupInfo failed.(usedLocalMemory)"))
        return SDK_FAILURE;

    availableLocalMemory = totalLocalMemory - usedLocalMemory;
    neededLocalMemory    = blockSize*blockSize*sizeof(cl_float);

    if(neededLocalMemory > availableLocalMemory)
    {
        std::cout << "Unsupported: Insufficient local memory on device." << std::endl;
        return SDK_SUCCESS;
    }

    // std::cout << "globalThreads:" << globalThreads[0] << std::endl
    //           << "localThreads:" << localThreads[0] << std::endl;

    if(localThreads[0] > maxWorkItemSizes[0] ||
       localThreads[0] > maxWorkGroupSize)
    {
        std::cout << "Unsupported: Device does not support requested number of work items." << std::endl;
        return SDK_SUCCESS;
    }

    /*** Set appropriate arguments to the kernel ***/

    /* 1st kernel argument - output */
    status = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&outputBuffer);
    if(!sampleCommon->checkVal(status, CL_SUCCESS,
            "clSetKernelArg failed. (outputBuffer)"))
        return SDK_FAILURE;

    /* 2nd kernel argument - input */
    status = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&inputBuffer);
    if(!sampleCommon->checkVal(status, CL_SUCCESS,
            "clSetKernelArg failed. (inputBuffer)"))
        return SDK_FAILURE;

    /* 3rd kernel argument - block of blockSize x blockSize floats */
    status = clSetKernelArg(kernel, 2, sizeof(cl_float)*blockSize*blockSize, NULL);
    if(!sampleCommon->checkVal(status, CL_SUCCESS,
            "clSetKernelArg failed. (block)"))
        return SDK_FAILURE;

    /* 4th kernel argument - numberOfFloats */
    status = clSetKernelArg(kernel, 3, sizeof(cl_int), (void*)&numberOfFloats);
    if(!sampleCommon->checkVal(status, CL_SUCCESS,
            "clSetKernelArg failed. (numberOfFloats)"))
        return SDK_FAILURE;

    /* 5th kernel argument - blockSize */
    status = clSetKernelArg(kernel, 4, sizeof(cl_int), (void*)&blockSize);
    if(!sampleCommon->checkVal(status, CL_SUCCESS,
            "clSetKernelArg failed. (blockSize)"))
        return SDK_FAILURE;

    /*
     * Enqueue a kernel run call.
     */
    status = clEnqueueNDRangeKernel(
        commandQueue, kernel, 1, NULL,
        globalThreads, localThreads,
        0, NULL, &events[0]);
    if(!sampleCommon->checkVal(status, CL_SUCCESS,
            "clEnqueueNDRangeKernel failed."))
        return SDK_FAILURE;

    /* the posted code fell off the end of a non-void function here */
    return SDK_SUCCESS;
}

int EP2::runCLKernels2(void)
{
    long long kernelsEndTime;
    cl_int status;

    /* wait for the kernel call to finish execution */
    status = clWaitForEvents(1, &events[0]);
    if(!sampleCommon->checkVal(status, CL_SUCCESS,
            "clWaitForEvents failed0."))
        return SDK_FAILURE;

    if(timing)
    {
        status = clGetEventProfilingInfo(
            events[0], CL_PROFILING_COMMAND_START,
            sizeof(long long), &kernelsStartTime, NULL);
        if(!sampleCommon->checkVal(status, CL_SUCCESS,
                "clGetEventProfilingInfo failed."))
            return SDK_FAILURE;

        status = clGetEventProfilingInfo(
            events[0], CL_PROFILING_COMMAND_END,
            sizeof(long long), &kernelsEndTime, NULL);
        if(!sampleCommon->checkVal(status, CL_SUCCESS,
                "clGetEventProfilingInfo failed."))
            return SDK_FAILURE;

        /* status of the next two queries is not checked */
        cl_ulong kernelsQueuedTime, kernelsSubmitTime;
        status = clGetEventProfilingInfo(
            events[0], CL_PROFILING_COMMAND_QUEUED,
            sizeof(long long), &kernelsQueuedTime, NULL);
        status = clGetEventProfilingInfo(
            events[0], CL_PROFILING_COMMAND_SUBMIT,
            sizeof(long long), &kernelsSubmitTime, NULL);

        std::cout << "Queued: " << kernelsQueuedTime/1e9
                  << " Submitted: " << kernelsSubmitTime/1e9
                  << " Start: " << kernelsStartTime/1e9
                  << " End: " << kernelsEndTime/1e9 << std::endl;

        /* Compute total time (also convert from nanoseconds to seconds) */
        totalKernelTime = (double)(kernelsEndTime - kernelsStartTime)/1e9;
    }

    clReleaseEvent(events[0]);

    /* Enqueue readBuffer */
    status = clEnqueueReadBuffer(
        commandQueue, outputBuffer, CL_TRUE, 0,
        numberOfFloats * sizeof(cl_float), output,
        0, NULL, &events[1]);
    if(!sampleCommon->checkVal(status, CL_SUCCESS,
            "clEnqueueReadBuffer failed."))
        return SDK_FAILURE;

    /* Wait for the read buffer to finish execution */
    status = clWaitForEvents(1, &events[1]);
    if(!sampleCommon->checkVal(status, CL_SUCCESS,
            "clWaitForEvents failed1."))
        return SDK_FAILURE;

    clReleaseEvent(events[1]);

    return SDK_SUCCESS;
}

richi · ‎12-05-2009

I'm having a problem similar to the one described here.

I have 2 ATI 4870 video cards in my computer, and I need to do some time-consuming calculations. I'm using pyopencl, and both of my video cards are detected. In my program, I create one context for each device and one queue for each context. Using the same input data, when I run a program on both devices, the complete process takes twice as long as when I run it on just one (I expected one device or two devices to take the same amount of time). It looks like one device is waiting for the other to finish before it starts processing.

Is there any information available on how to do this?

Thanks

Possible to run OpenCL code on GPU and CPU concurrently?