
himanshu_gautam
Grandmaster

maxworkitems for clEnqueueNDRangeKernel(...)

Oh.

That is a bug.

Thanks for pointing it out. This will be fixed in the next SDK release.

atata
Journeyman III

maxworkitems for clEnqueueNDRangeKernel(...)

Well, I tried to run my simple program on different computers, and everywhere it crashes if I set globalThreads[0] > 256. I removed the success checks and some other stuff (printing results, time checking, etc.) from the code; could someone please take a look at this and say what the problem might be? This is just a slightly modified 'Template' example from the AMD APP SDK; there are no functions except the ones starting and releasing the kernel (well, plus one function verifying the result), and the kernel code is pretty simple. The program takes two vectors input1, input2 (both of length = 2*width, width is given) and some integer multiplier a, and returns the resulting vector output = a*input1 + input2. In the attached code, first comes the .cpp source, then the .cl kernel code, then the .hpp variable definitions. I am really confused by this, because I didn't change much from the 'Template' example; maybe I did something wrong setting the kernel arguments or something. The code works fine with globalThreads[0] <= 256 (I always set localThreads to 256 as well).

Thanks much.

//Template.cpp
#include "Template.hpp"

int initializeHost(void)
{
    width = 2000000; // vector dimensions = 2 * width
    input = NULL;
    input2 = NULL;
    output = NULL;
    multiplier = 2;
    cl_uint sizeInBytes = 2 * width * sizeof(cl_uint);
    input  = (cl_uint *) malloc(sizeInBytes);
    input2 = (cl_uint *) malloc(sizeInBytes);
    output = (cl_uint *) malloc(sizeInBytes);
    // input values
    for(cl_uint i = 0; i < 2 * width; i++)
    {
        input[i]  = i;
        input2[i] = 3 * i;
    }
    return 0;
}

std::string convertToString(const char *filename)
{
    //some code converting file text to string
}

int initializeCL(void)
{
    cl_int status = 0;
    size_t deviceListSize;
    cl_uint numPlatforms;
    cl_platform_id platform = NULL;

    status = clGetPlatformIDs(0, NULL, &numPlatforms);
    if(numPlatforms > 0)
    {
        cl_platform_id *platforms = new cl_platform_id[numPlatforms];
        status = clGetPlatformIDs(numPlatforms, platforms, NULL);
        for(unsigned int i = 0; i < numPlatforms; ++i)
        {
            char pbuff[100];
            status = clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR,
                                       sizeof(pbuff), pbuff, NULL);
            platform = platforms[i];
        }
        delete[] platforms;
    }

    cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM,
                                     (cl_context_properties)platform, 0 };
    context = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);
    status = clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &deviceListSize);
    devices = (cl_device_id *) malloc(deviceListSize);
    status = clGetContextInfo(context, CL_CONTEXT_DEVICES, deviceListSize, devices, NULL);
    commandQueue = clCreateCommandQueue(context, devices[0], 0, &status);

    inputBuffer  = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                  sizeof(cl_uint) * 2 * width, input, &status);
    input2Buffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                  sizeof(cl_uint) * 2 * width, input2, &status);
    outputBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                  sizeof(cl_uint) * 2 * width, output, &status);

    const char *filename = "Template_Kernels.cl";
    std::string sourceStr = convertToString(filename);
    const char *source = sourceStr.c_str();
    size_t sourceSize[] = { strlen(source) };
    program = clCreateProgramWithSource(context, 1, &source, sourceSize, &status);
    status = clBuildProgram(program, 1, devices, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "templateKernel", &status);
    return 0;
}

int runCLKernels(void)
{
    cl_int status;
    cl_uint maxDims;
    cl_event events[2];
    size_t globalThreads[1] = {256};
    size_t localThreads[1]  = {256};

    status = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&outputBuffer);
    status = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&inputBuffer);
    status = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&input2Buffer);
    status = clSetKernelArg(kernel, 3, sizeof(cl_uint), (void *)&multiplier);
    status = clSetKernelArg(kernel, 4, sizeof(cl_uint), (void *)&width);

    status = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,
                                    globalThreads, localThreads, 0, NULL, &events[0]);
    status = clWaitForEvents(1, &events[0]);
    status = clReleaseEvent(events[0]);

    status = clEnqueueReadBuffer(commandQueue, outputBuffer, CL_TRUE, 0,
                                 2 * width * sizeof(cl_uint), output, 0, NULL, &events[1]);
    status = clWaitForEvents(1, &events[1]);
    status = clReleaseEvent(events[1]);
    return 0;
}

int cleanupCL(void)
{
    cl_int status;
    status = clReleaseKernel(kernel);
    status = clReleaseProgram(program);
    status = clReleaseMemObject(inputBuffer);
    status = clReleaseMemObject(input2Buffer);
    status = clReleaseMemObject(outputBuffer);
    status = clReleaseCommandQueue(commandQueue);
    status = clReleaseContext(context);
    return 0;
}

void cleanupHost(void)
{
    if(input != NULL)   { free(input);   input = NULL; }
    if(input2 != NULL)  { free(input2);  input2 = NULL; }
    if(output != NULL)  { free(output);  output = NULL; }
    if(devices != NULL) { free(devices); devices = NULL; }
}

void verify()
{
    bool passed = true;
    for(unsigned long i = 0; i < 2 * width; ++i)
        if(input[i] * multiplier + input2[i] != output[i])
            passed = false;
    if(passed == true)
        std::cout << "Passed!\n";
    else
        std::cout << "Failed!\n";
}

int main(int argc, char *argv[])
{
    if(initializeHost() == 1) return 1;
    if(initializeCL() == 1)   return 1;
    if(runCLKernels() == 1)   return 1;
    verify();
    if(cleanupCL() == 1)      return 1;
    cleanupHost();
    getchar();
    return 0;
}

//Template_Kernels.cl
__kernel void templateKernel(__global unsigned int *output,
                             __global unsigned int *input,
                             __global unsigned int *input2,
                             const unsigned int multiplier,
                             const unsigned int width)
{
    uint tid = get_global_id(0);
    uint i;
    // this is code for 256 threads; for correct results with more threads,
    // 256 must be changed to the number of threads
    for(i = tid * (width * 2) / 256; i < (tid + 1) * (width * 2) / 256; i++)
        output[i] = multiplier * input[i] + input2[i];
}

//Template.hpp
#ifndef TEMPLATE_H_
#define TEMPLATE_H_

#include <CL/cl.h>
#include <string.h>
#include <cstdlib>
#include <iostream>
#include <string>
#include <fstream>
#include "windows.h"
#include <stdio.h>

cl_uint *input;
cl_uint *input2;
cl_uint *output;
cl_uint multiplier;
cl_ulong width;
cl_mem inputBuffer;
cl_mem input2Buffer;
cl_mem outputBuffer;
cl_context context;
cl_device_id *devices;
cl_command_queue commandQueue;
cl_program program;
cl_kernel kernel;

int initializeCL(void);
std::string convertToString(const char *filename);
int runCLKernels(void);
int cleanupCL(void);
void cleanupHost(void);

#endif /* #ifndef TEMPLATE_H_ */

richeek_arya
Journeyman III

maxworkitems for clEnqueueNDRangeKernel(...)

Try running it with localThreads = NULL in the clEnqueueNDRangeKernel call, like this:

status = clEnqueueNDRangeKernel(
             commandQueue,
             kernel,
             1,
             NULL,
             globalThreads,
             NULL,
             0,
             NULL,
             &events[0]);


atata
Journeyman III

maxworkitems for clEnqueueNDRangeKernel(...)

richeek.arya, thank you! Now it works. But I can't see any speedup: for example, if I set the vectors x, y, z to length 16*10^6 and run 16*10^6 work-items, then z = a*x + y takes 0.25 seconds (I time only the runCLKernels(...) function), but a sequential program for the same task on the CPU (Core i5 430M) takes 0.125 seconds, i.e. it is 2 times faster. For smaller vector dimensions the difference is even more than 2 times. Is that OK, and is my GPU, a Mobility Radeon HD 5870 (a desktop 5770 with lowered frequencies), supposed to run this task slower than my CPU?

Thanks.

richeek_arya
Journeyman III

maxworkitems for clEnqueueNDRangeKernel(...)

I am not sure about that, but since you are running a massive number of threads, not all of them run concurrently; the GPU may not have that many resources. You can try with fewer threads, like 1024 or 2048, and see whether the GPU is still slower even then. You can also try the Visual Profiler on both the GPU and the CPU.
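
For the timing itself, OpenCL event profiling isolates the kernel time from the rest of runCLKernels(). A minimal sketch against the Template code above (it assumes the existing commandQueue is recreated with profiling enabled; everything else is from the posted code):

    // the queue must be created with profiling enabled
    commandQueue = clCreateCommandQueue(context, devices[0],
                                        CL_QUEUE_PROFILING_ENABLE, &status);

    // enqueue as before, then read the timestamps off the kernel's event
    status = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,
                                    globalThreads, NULL, 0, NULL, &events[0]);
    status = clWaitForEvents(1, &events[0]);

    cl_ulong start, end;
    clGetEventProfilingInfo(events[0], CL_PROFILING_COMMAND_START,
                            sizeof(cl_ulong), &start, NULL);
    clGetEventProfilingInfo(events[0], CL_PROFILING_COMMAND_END,
                            sizeof(cl_ulong), &end, NULL);
    printf("kernel time: %f ms\n", (end - start) * 1e-6); // timestamps are in nanoseconds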

All the best!

himanshu_gautam
Grandmaster

maxworkitems for clEnqueueNDRangeKernel(...)

atata,

if it is working with the local work size set to NULL, the problem is generally that the global work size is not exactly divisible by the local work size. The spec requires this.
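
A minimal sketch of the usual fix (illustrative, not from the SDK sample; it assumes one element per work-item and an element count n = 2*width passed as a kernel argument): round the global size up to a multiple of the local size on the host, and guard the padded work-items in the kernel.

    // host side: round the global size up to the next multiple of the local size
    size_t elements         = 2 * width;   // work-items actually needed
    size_t localThreads[1]  = {256};
    size_t globalThreads[1] =
        {((elements + localThreads[0] - 1) / localThreads[0]) * localThreads[0]};

    // kernel side: padded work-items past 'n' must do nothing
    __kernel void templateKernel(__global uint *output, __global uint *input,
                                 __global uint *input2, uint multiplier, uint n)
    {
        uint gid = get_global_id(0);
        if(gid < n)   // guard against the rounded-up tail
            output[gid] = multiplier * input[gid] + input2[gid];
    }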

I think richeek's suggestions are worth trying. Also, as I remember, this sample is memory bound: there are two fetches and one write for just two practical arithmetic operations, so if it is slower than the CPU, I guess that is understandable.
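
As a rough back-of-the-envelope check using the numbers reported above: with vectors of 16*10^6 uints, one kernel pass moves about 3 * 16*10^6 * 4 bytes = 192 MB (two reads plus one write) for only 2 arithmetic operations per element, i.e. roughly 0.17 ops/byte. At the reported 0.25 s that is under 1 GB/s of effective throughput, so the measured time is dominated by memory traffic (and, since the buffers are created with CL_MEM_USE_HOST_PTR, possibly by transfers from host memory), not by arithmetic.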

One suggestion is to pass the flag "-fno-alias" when building the kernel. This should enable the use of caches, and you might get some more performance.
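
The flag goes in the options string of clBuildProgram. A minimal sketch against the code above (the build-log dump on failure is an illustrative addition, not part of the Template sample):

    const char *buildOptions = "-fno-alias"; // promise that kernel pointers do not alias
    status = clBuildProgram(program, 1, devices, buildOptions, NULL, NULL);
    if(status != CL_SUCCESS)
    {
        // fetch and print the build log for the first device
        size_t logSize;
        clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG,
                              0, NULL, &logSize);
        char *log = (char *) malloc(logSize);
        clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG,
                              logSize, log, NULL);
        printf("%s\n", log);
        free(log);
    }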


laobrasuca
Journeyman III

maxworkitems for clEnqueueNDRangeKernel(...)

What about GLSL? When you run programs with it, you can't choose how work is dispatched to the shaders, so my question is: how is the work dispatched? Is it like what OpenCL does when you don't specify the local work size? Even though this is not an OpenGL forum, I'd also like to ask whether OpenGL 4.2 will let one choose such a parameter with its new compute pipeline. Will it be able to replace OpenCL for CL/GL interoperability?
