Archives Discussions

jasno · ‎11-30-2011

Hi,

I have the following situation, a machine with 2 AMD GPUs and 1 CPU all of which are presented as devices. I have created queues to all 3 devices in "out-of-order" mode in the same Context. I have 3 kernels (2 OpenCL, one Native). I have 2 memory buffers, the first is created using CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR to create and copy the data in one step.

I pass the first buffer as input to the 2 GPU kernels, and I queue the 2 OpenCL kernels, one to each GPU, with the event from the first passed as a wait event to the second. I queue the Native kernel onto the CPU queue with a wait event of the 2nd of the GPU kernels, I also pass both memory buffers to the Native kernel.

I understood it to be that the Context managed the movement of memory buffers as required and since I have used events to synchronise the 3 kernels I should get the correct result. However, although the kernels start one after the other (I have looked at the start and stop times for the 3 kernel events and the times do not overlap and are in the correct order) the data from the second kernel is not propagated to the third kernel before it executes so it uses out of date data (seemingly the data after just hte first kernel).

If I put a readBuffer command before the 3rd kernel launch so that the data is forced back to the host then all is well, but I thought that OpenCL did this for me without me having to manage it. Have I missunderstood this ?

--

jason

jasno · ‎12-01-2011

OK, so to follow up my own post. I have created an OpenCL version of my NativeKernel and, using the same queue, I have queued the OpenCL kernel instead of the native kernel and used the same set of events to create the dependencies. In this case the answer comes out as I expect it to (i.e. OpenCL manages the movement of the data buffers as I think it is supposed to do).

Does this mean there is a bug in the AMD OpenCL implementation when it comes to using NativeKernels ? I am using version ati-app-sdk/lnx64/2.5/

--

jason

nou · ‎12-01-2011

best what you can do is make a test case which show this behaviour and send it to AMD.

himanshu_gautam · ‎12-03-2011

jasno,

Please attch a testcase.

jasno · ‎12-07-2011

Hi,

sorry for disappearing, just got back to this. I have created a stripped down version that I think still shows the problem. Below is the ouput I get, the first set is incorrect using the native kernel, the second set is what I would have exected using the OpenCL kernel.

./simple
Data passed to native kernel
n, x = 63999996 11.811909
n, x = 63999997 11.811909
n, x = 63999998 11.811909
n, x = 63999999 11.811909
Data from b_d
n, x = 63999995 23.623817
n, x = 63999996 23.623817
n, x = 63999997 23.623817
n, x = 63999998 23.623817
n, x = 63999999 23.623817
Data from a_d, should be half of data from b_d
n, x = 63999995 16.811909
n, x = 63999996 16.811909
n, x = 63999997 16.811909
n, x = 63999998 16.811909
n, x = 63999999 16.811909
Kernel 2 started at 0, finished at 2271606240
Kernel 1 started at 3286791740, finished at 3523847240
Kernel 3 started at 3524036240, finished at 4013161240

./simple
Data from b_d
n, x = 63999995 33.623817
n, x = 63999996 33.623817
n, x = 63999997 33.623817
n, x = 63999998 33.623817
n, x = 63999999 33.623817
Data from a_d, should be half of data from b_d
n, x = 63999995 16.811909
n, x = 63999996 16.811909
n, x = 63999997 16.811909
n, x = 63999998 16.811909
n, x = 63999999 16.811909
Kernel 2 started at 0, finished at 2271096840
Kernel 1 started at 3286895680, finished at 3524155840
Kernel 3 started at 3830337840, finished at 3962091840

--

jason

#include <stdlib.h> #include <stdio.h> #include <CL/opencl.h> #define DEVID 0 #define DEVID2 1 // Program source const char* KernelSource = " __kernel void simple( __global float* A) { \n" " int tid = get_global_id(0); \n" " A[tid] = A[tid] + 5; \n" "} \n" " \n" "__kernel void simple2(__global float *A, int nLoops) { \n" " \n" " int tid = get_global_id(0); \n" " for (int i = 0; i < nLoops; i++) { \n" " A[tid] = sqrt(A[tid]) + (0.5 * A[tid]) + log(A[tid]); \n" " } \n" "} \n" "__kernel void simple3(__global float *A, __global float *B, int entries) { \n" " \n" " int tid = get_global_id(0); \n" " if (tid < entries) \n" " B[tid] = 2 * A[tid]; \n" " \n" "} \n"; // native kernel data structure struct myNativeKernelData { float *inData; float *outData; int size; }; // native kernel void myNativeKernel(void *data) { struct myNativeKernelData *NKData = (struct myNativeKernelData*)data; printf("Data passed to native kernel\n"); for (int i = 0; i< NKData->size; i++) { if (i > (NKData->size -5)) printf("n, x = %d %f\n",i,NKData->inData); NKData->outData = 2*NKData->inData; } }// Print the kernel times void printKernelTimes(cl_event event,char *eventName) { static cl_ulong baseTime = -1; cl_ulong sTime,eTime; cl_int err = clGetEventProfilingInfo(event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&sTime,NULL); err |= clGetEventProfilingInfo(event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&eTime,NULL); // Set the base time and then subtract it to make the numbers easier to read if (baseTime == -1) baseTime = sTime; if (err != CL_SUCCESS) { printf("Failed to get times for %s\n",eventName); } else { printf("%s started at %ld, finished at %ld\n",eventName,(sTime - baseTime),(eTime - baseTime)); } } // Check error code void OCLError(cl_int err, int line) { if (err != CL_SUCCESS) { printf("Error at line %d\n",line); exit(1); } } int main() { // OpenCL Platform, Device etc vars cl_device_id *Devices; // OpenCL device cl_context Context; // OpenCL context static cl_platform_id Platform; // OpenCL platform cl_program Program; cl_command_queue *CommandQs; // OpenCL command queues cl_int err; // Get platform IDs err = clGetPlatformIDs( 1, &Platform, NULL); OCLError(err,__LINE__); // get all devices cl_uint nDevices = -1; err = clGetDeviceIDs(Platform, CL_DEVICE_TYPE_ALL, 0, NULL, &nDevices); OCLError(err,__LINE__); if (nDevices < 3) { printf("Num Devices less than 3\n"); exit(1); } Devices = (cl_device_id *)malloc(nDevices * sizeof(cl_device_id)); err = clGetDeviceIDs(Platform, CL_DEVICE_TYPE_ALL, nDevices, Devices, NULL); OCLError(err,__LINE__); // Create a compute context Context = clCreateContext(0, nDevices, Devices, NULL, NULL, &err); OCLError(err,__LINE__); // Create a command queue CommandQs = (cl_command_queue *)malloc(nDevices*sizeof(cl_command_queue)); for (int i = 0; i < nDevices; i++) { CommandQs = clCreateCommandQueue(Context, Devices, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE , &err); OCLError(err,__LINE__); } // Create the compute program from the source buffer Program = clCreateProgramWithSource(Context, 1, (const char **) & KernelSource, NULL, &err); OCLError(err,__LINE__); err = clBuildProgram(Program, nDevices, Devices, NULL, NULL, NULL); OCLError(err,__LINE__); // Create the compute kernel from the program cl_kernel Kernel = clCreateKernel(Program, "simple", &err); OCLError(err,__LINE__); // kernel data vars float *a_h; // host memory used for original data and return data cl_mem a_d; // device memory used for input/output cl_mem b_d; // device memory used for output size_t entries = 64000000; // number of entries in the array size_t local = 1; int count = 100; a_h = (float *)malloc(entries*sizeof(float)); for (int i =0; i < entries; i++) a_h = (1 << 25) + 6; a_d = clCreateBuffer(Context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(float) * entries, a_h, &err); OCLError(err,__LINE__); b_d = clCreateBuffer(Context, CL_MEM_WRITE_ONLY, sizeof(float) * entries, NULL, &err); OCLError(err,__LINE__); cl_kernel Kernel2 = clCreateKernel(Program, "simple2", &err); OCLError(err,__LINE__); clSetKernelArg(Kernel, 0, sizeof(cl_mem), &a_d); OCLError(err,__LINE__); clSetKernelArg(Kernel2, 0, sizeof(cl_mem), &a_d); OCLError(err,__LINE__); clSetKernelArg(Kernel2, 1, sizeof(int), &count); OCLError(err,__LINE__); clGetKernelWorkGroupInfo (Kernel,Devices[DEVID2],CL_KERNEL_WORK_GROUP_SIZE,sizeof(size_t),&local,NULL); cl_event k2Event[1]; cl_event k1Event[1]; err = clEnqueueNDRangeKernel(CommandQs[DEVID], Kernel2, 1, NULL, &entries, NULL, 0, NULL, k2Event); OCLError(err, __LINE__); err = clEnqueueNDRangeKernel(CommandQs[DEVID2], Kernel, 1, NULL, &entries, &local, 1, k2Event, k1Event); OCLError(err, __LINE__); // // Now try enqueueing a native kernel // // Find a queue that is associated with a CPU device (assumes 1 queue per device) cl_command_queue CPUQueue = NULL; for (int i = 0; i < nDevices; i++) { cl_device_id qDevID; cl_device_type DevType; cl_device_exec_capabilities DevExecCap; // get the device attached to this queue err = clGetCommandQueueInfo (CommandQs ,CL_QUEUE_DEVICE, sizeof(qDevID),&qDevID,NULL); OCLError(err, __LINE__); // get the type of this device err = clGetDeviceInfo (qDevID, CL_DEVICE_TYPE, sizeof(DevType), &DevType,NULL); OCLError(err, __LINE__); err = clGetDeviceInfo (qDevID, CL_DEVICE_EXECUTION_CAPABILITIES , sizeof(DevExecCap), &DevExecCap,NULL); OCLError(err, __LINE__); // compare the device_type for this device if ((DevType == CL_DEVICE_TYPE_CPU) && ((DevExecCap == CL_EXEC_NATIVE_KERNEL) || (DevExecCap == CL_EXEC_NATIVE_KERNEL|CL_EXEC_KERNEL))) { CPUQueue = CommandQs; } } cl_event k3Event[1]; cl_event eventLists[2]; eventLists[0] = k2Event[0]; eventLists[1] = k1Event[0]; if (CPUQueue != NULL) { #ifndef NATIVEK cl_kernel Kernel3 = clCreateKernel(Program, "simple3", &err); OCLError(err, __LINE__); clSetKernelArg(Kernel3, 0, sizeof(cl_mem), &a_d); OCLError(err,__LINE__); clSetKernelArg(Kernel3, 1, sizeof(cl_mem), &b_d); OCLError(err,__LINE__); clSetKernelArg(Kernel3, 2, sizeof(int), &entries); OCLError(err,__LINE__); err = clEnqueueNDRangeKernel(CPUQueue, Kernel3, 1, NULL, &entries, &local, 2, eventLists, k3Event); OCLError(err, __LINE__); #else // Allocate and populate the data structure for the nativeKernel struct myNativeKernelData NKData; NKData.size = entries; cl_mem memList[2]; memList[0] = a_d; memList[1] = b_d; void *memLocs[2]; memLocs[0] = &(NKData.inData); memLocs[1] = &(NKData.outData); // Enqueue the kernel err = clEnqueueNativeKernel (CPUQueue, &myNativeKernel, (void *)&NKData, sizeof(NKData),2,memList, (const void **)memLocs, 2, eventLists, k3Event); OCLError(err, __LINE__); #endif err = clEnqueueReadBuffer(CPUQueue, b_d, CL_TRUE, 0, sizeof(float) * entries, a_h, 1, k3Event, NULL); OCLError(err, __LINE__); printf("Data from b_d\n"); for (int i = (entries - 5); i < entries; i++) printf("n, x = %d %f\n",i,a_h); } else { printf("No CPU device found\n"); } err = clEnqueueReadBuffer(CommandQs[DEVID], a_d, CL_TRUE, 0, sizeof(float) * entries, a_h, 1, k1Event, NULL); OCLError(err, __LINE__); printf("Data from a_d, should be half of data from b_d\n"); for (int i = (entries - 5); i < entries; i++) printf("n, x = %d %f\n",i,a_h); printKernelTimes(k2Event[0],"Kernel 2"); printKernelTimes(k1Event[0],"Kernel 1"); printKernelTimes(k3Event[0],"Kernel 3"); // tidy up free(Devices); free(CommandQs); free(a_h); clReleaseMemObject(a_d); clReleaseMemObject(b_d); }

Archives Discussions

Events, Multiple Queues and Memory Buffers