16 Replies Latest reply on Jun 3, 2011 12:41 PM by laobrasuca

    maxworkitems for clEnqueueNDRangeKernel(...)

    atata
      error when setting large number

      Hi everyone.

      I recently started learning OpenCL and first of all I tryed to modify an example program 'Template' from OpenCL examples package: that program was mulitplying a vector by a number and I wanted to find a linear combination of 2 vectors: b*x + y, where x and y are (complex) vectors and b is a real number. It doesnt really matter the vectors are complex; I just enter "width" and make calculations with vectors with lenght = 2*width, assuming first width components represent real part and last width components represent imaginary part.

        In 'Template' source file there were many different checks (if memory is allocated correctly etc) including the code attached. As I understand, globalThreads[0] I am passing to the clEnqueueNDRangeKernel(...) is a number of work items (threads) I want to run, but what for is that check followed by  clEnqueueNDRangeKernel(...)? According to that check, if I am trying to run a number of threads greater then maxWorkItemSizes then program terminates, but that makes no sense for me. Moreover, if I check the value of maxWorkItemSizes[0] then its equal to 256 (and maxWorkGroupSize is also equal to 256), so that means I can't run more then 256 threads? If I comment that check and run clEnqueueNDRangeKernel(...) with globalThreads > 256 then I get BSOD or some "videodriver was broken and restored or smth" Windows message and Visual Studio closes. I just want to run my program with some adequate number or threads (work items) but I can't understand what's going wrong here.  The 5-th argument of clEnqueueNDRangeKernel(...) is a number of work items I want to run, right? What's that check followed by it then? I didnt attach all the code, but I can if neccessary (as I said before, most part of the code consists of different checks, I didnt really change much in the algorithm). In 'Template' example there was some number like 64 for GlobalThreads[0] before I started modifying it.

      I am using Win7 x64, MS VS 2010, gpu radeon5870 hd mobility (its the same as desktop 5770 with lowered frequencies). I installed last version of SDK and 11.4 drivers version (I had 11.5 before, but reinstalled 11.4 because there is no info about adequate support of 11.4 for current sdk version).

      Thanks in advance.

       

       

       

       

       

      size_t globalThreads[1]; size_t localThreads[1]; size_t maxWorkGroupSize; size_t maxWorkItemSizes[3]; /** * Query device capabilities. Maximum * work item dimensions and the maximum * work item sizes */ status = clGetDeviceInfo( devices[0], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), (void*)&maxWorkGroupSize, NULL); if(status != CL_SUCCESS) { std::cout<<"Error: Getting Device Info. (clGetDeviceInfo)\n"; getchar(); return 1; } status = clGetDeviceInfo( devices[0], CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, sizeof(cl_uint), (void*)&maxDims, NULL); if(status != CL_SUCCESS) { std::cout<<"Error: Getting Device Info. (clGetDeviceInfo)\n"; getchar(); return 1; } status = clGetDeviceInfo( devices[0], CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(size_t)*maxDims, (void*)maxWorkItemSizes, NULL); if(status != CL_SUCCESS) { std::cout<<"Error: Getting Device Info. (clGetDeviceInfo)\n"; getchar(); return 1; } //those 2 numbers are chosen by user globalThreads[0] = 256; LocalThreads[0] = 256; if(globalThreads[0] > maxWorkItemSizes[0] || localThreads[0] > maxWorkGroupSize) { std::cout<<"Unsupported: Device does not support requested number of work items."; return 1; } // some code setting kernel arguments status = clEnqueueNDRangeKernel( commandQueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &events[0]); if(status != CL_SUCCESS) { std::cout<< "Error: Enqueueing kernel onto command queue. \ (clEnqueueNDRangeKernel)\n"; }

        • maxworkitems for clEnqueueNDRangeKernel(...)
          mikewolf_gkd

          maxWorkItemSizes should be max workitem in a workgroup.

          but you can define many workgroup.

          for example:

          globalThreads[0] = 1024; 
          LocalThreads[0]  = 256;

          thus , you have 4 workgroup

            • maxworkitems for clEnqueueNDRangeKernel(...)
              himanshu.gautam

              Hi atata,

              The function clEnqueueNDRangeKernel takes importantly two parameters which are confusing you( see the spec for details about other params),

              globalWorkItemSize: You can specify a 1D,2D or 3D vector size for which you want to run your kernel. This can be virtually any number however high.

              localWorkItemSize(workGroup Size): This is the size in which GPU divides your problem. AMD GPUs can divide the problem into anything less than equal to 256. The idea is that all workItems in one workgroup executes together and you can check for sync between them.

              I recommend to read the OpenCL Programming Guide(Chapter 1) to better understand the concept of Workgroups.

              Thanks

                • maxworkitems for clEnqueueNDRangeKernel(...)
                  atata

                  mikewolf_gkd, himanshu.gautam, thanks for your answers.

                  As I understood after reading documentation and your answers, in the function clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &events[0])

                  localThreads is a number which defines maximum threads in one workgroup, localThreads is less or equal to 256 for my GPU and I have no problem with it. If i am running 1024 total threads (work items) and 256 threads in one group then I have 4 total groups and I can sync work items inside each of those groups.

                   globalThreads is a total number of work items (threads); ok, we can use  an array with 1,2 or 3 elements representing the number of workitems for each dimension - for example, if globalThreads[3] = {5, 10, 2}, then total number of threads we are running is 5*10*2 = 100, and if globalThreads[1] = 1000, then we have 1000 threads (work items) running total, right?

                  So if I want to run my program with, for example, 10 000 threads divided into 5000 groups, then I call clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &events[0]) with globalThreads[0] = 10 000, localThreads[0] = 2, where globalThreads and localThreads are both arrays containing 1 element each?

                  The problem is: when I set globalThreads[0] (its an array with 1 element here, size_t globalThreads[1]) to some number more then 256, I get BSOD or some videodriver error and program terminates). If things I wrote above are correct, then it means I can't run more then 256 threads with my GPU!

                   Or if I want to have 10 000 threads then I should use 3d vector for globalThreads, for example, if I set size_t globalThreads[3] =  { 100, 50, 2}, then I have 100*50*2 =10 000 total work items (threads) ? So that means if CL_DEVICE_MAX_WORK_ITEM_SIZES for my GPU is equal to 256 (the result I get by running a check from the 1st post) then I can't have more then 256 elements in any of 3 dimensions for globalThreads, and maximum number of work sizes (threads) I can run is 256*256*256 ? If not - what arguments should I pass to clEnqueueNDRangeKernel if I want, for example, 10 000 work items (threads) ? I am sorry for those questions may look silly, but that confuses me.

                  Thanks.

                   

                   

                   

                    • maxworkitems for clEnqueueNDRangeKernel(...)
                      himanshu.gautam

                      The facts you wrote about localthreads and globalThreads are correct.

                      But you are allowed to set any number in any dimension of globalThreads theoritically. So your example of globalThreads[0] = 10000 should work.

                       

                      Large global sizes are being used in many SDK samples. Try to compare your code from samples.

                        • maxworkitems for clEnqueueNDRangeKernel(...)
                          atata

                          himanshu.gautam, thanks again. I checked some SDK samples and all of them work fine with globalThreads > 256. I am still confused with this, I will try to find how to fix that, but, to be honest, I have no idea what's the reason.

                          • maxworkitems for clEnqueueNDRangeKernel(...)
                            richeek.arya

                             

                            Originally posted by: himanshu.gautam

                             

                            But you are allowed to set any number in any dimension of globalThreads theoritically. So your example of globalThreads[0] = 10000 should work.

                             

                              Large global sizes are being used in many SDK samples. Try to compare your code from samples.

                             

                            Hi Himanshu,

                            I have a small confusion. Since we can specify any number of global threads in any dimension then:

                            1. Is there any advantage of having three dimensions of global threads since any dimension can have any number of threads?

                            2. What is the significance of CL_DEVICE_MAX_WORK_ITEM_SIZES flag in the OpenCL since we can specify any number of work items?

                            Thanks,

                            Richeek

                             

                              • maxworkitems for clEnqueueNDRangeKernel(...)
                                himanshu.gautam

                                1. Three dimensions are provided for logical clearity. There are many cases when we deal with 3D arrays in C  1D array can always do the same work.

                                 

                                2. CL_DEVICE_MAX_WORK_ITEM_SIZES tells the number of workitems in each dimension that you can have inside a WORKGROUP and not globally.

                                 

                                Thanks

                                • maxworkitems for clEnqueueNDRangeKernel(...)
                                  nou

                                  if you process some volume 3D data i think it is convient use 3D NDRange.

                                  CL_DEVICE_MAX_WORK_ITEM_SIZES return maximum sizes of local work group which is indeed limited. on AMD GPU it is 256x256x256 and CPU it is 1024x1024x1024.

                                    • maxworkitems for clEnqueueNDRangeKernel(...)
                                      richeek.arya

                                      Himanshu and Nou, thanks for your replies. I understand what you are saying. I just want one more clarification:

                                      In the AMD SDK open CL example "template"  there is a check performed:

                                      if(globalThreads[0] > maxWorkItemSizes[0] ||
                                              localThreads[0] > maxWorkGroupSize)

                                      {
                                              std::cout<<"Unsupported: Device does not support requested number of work items.";
                                              return 1;
                                          }

                                      Is there any reason for checking (globalThreads[0] > maxWorkItemSizes[0])?

                                      I ran this example with globalThreads[0] = 1000 and it ran just fine as expected (with the if clause commented out ofcourse)

                                      Thanks,

                                      Richeek

                                        • maxworkitems for clEnqueueNDRangeKernel(...)
                                          himanshu.gautam

                                          oh.

                                          That is a bug.

                                          Thanks for pointing it out. This will be fixed in next SDK release.

                                            • maxworkitems for clEnqueueNDRangeKernel(...)
                                              atata

                                              Well, I tryed to run my simple program from different computers and everywhere it crashes if I set globalThreads[0] > 256. I removed success checks some other stuff (printing results, time checking etc) from the code - could someone please take a look at this and say what can be the problem? This is just a bit modified 'Template' example from amd app sdk; there are no functions except starting and releasing kernel ones (well, only one function verifying result), and the kernel code is pretty simple. This program takes two vectors input1, input2 (both of lenght = 2*width, width is given) and some integer multiplier and returns resulting vector output = a*input1+input2. In this attached code first goes .cpp source code, then .cl kernel code, then .hpp variables definition. I am really confused with this because I didnt change much from 'Template' example - maybe, I did something wrong setting kernel arguments or something. This code works fine with GlobalThreads[0] <= 256 (I always set localThreads to 256 also).

                                              Thanks much.

                                              //Template.cpp include "Template.hpp" int initializeHost(void) { width = 2000000; // vectors dimensions = 2 * width input = NULL; input2 = NULL; output = NULL; multiplier = 2; cl_uint sizeInBytes = 2 * width * sizeof(cl_uint); input = (cl_uint *) malloc(sizeInBytes); input2 = (cl_uint *) malloc(sizeInBytes); output = (cl_uint *) malloc(sizeInBytes); // input values for(cl_uint i = 0; i < 2 * width; i++) { input[i] = i; input2[i] = 3*i; } return 0; } std::string convertToString(const char *filename) { //some code converting file text to string } int initializeCL(void) { cl_int status = 0; size_t deviceListSize; cl_uint numPlatforms; cl_platform_id platform = NULL; status = clGetPlatformIDs(0, NULL, &numPlatforms); if(numPlatforms > 0) { cl_platform_id* platforms = new cl_platform_id[numPlatforms]; status = clGetPlatformIDs(numPlatforms, platforms, NULL); for(unsigned int i=0; i < numPlatforms; ++i) { char pbuff[100]; status = clGetPlatformInfo( platforms[i], CL_PLATFORM_VENDOR, sizeof(pbuff), pbuff, NULL); platform = platforms[i]; } delete platforms; } cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0 }; context = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status); status = clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &deviceListSize); devices = (cl_device_id *)malloc(deviceListSize); status = clGetContextInfo( context, CL_CONTEXT_DEVICES, deviceListSize, devices, NULL); commandQueue = clCreateCommandQueue( context, devices[0], 0, &status); inputBuffer = clCreateBuffer( context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * 2 * width, input, &status); input2Buffer = clCreateBuffer( context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * 2 * width, input2, &status); outputBuffer = clCreateBuffer( context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * 2 * width, output, &status); const char * filename = "Template_Kernels.cl"; std::string sourceStr = convertToString(filename); const char * source = sourceStr.c_str(); size_t sourceSize[] = { strlen(source) }; program = clCreateProgramWithSource( context, 1, &source, sourceSize, &status); status = clBuildProgram(program, 1, devices, NULL, NULL, NULL); kernel = clCreateKernel(program, "templateKernel", &status); return 0; } int runCLKernels(void) { cl_int status; cl_uint maxDims; cl_event events[2]; size_t globalThreads[1] = {256}; size_t localThreads[1] = {256}; status = clSetKernelArg( kernel, 0, sizeof(cl_mem), (void *)&outputBuffer); status = clSetKernelArg( kernel, 1, sizeof(cl_mem), (void *)&inputBuffer); status = clSetKernelArg( kernel, 2, sizeof(cl_mem), (void *)&input2Buffer); status = clSetKernelArg( kernel, 3, sizeof(cl_uint), (void *)&multiplier); status = clSetKernelArg( kernel, 4, sizeof(cl_uint), (void *)&width); status = clEnqueueNDRangeKernel( commandQueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &events[0]); status = clWaitForEvents(1, &events[0]); status = clReleaseEvent(events[0]); status = clEnqueueReadBuffer( commandQueue, outputBuffer, CL_TRUE, 0, 2 * width * sizeof(cl_uint), output, 0, NULL, &events[1]); status = clWaitForEvents(1, &events[1]); status = clReleaseEvent(events[1]); return 0; } int cleanupCL(void) { cl_int status; status = clReleaseKernel(kernel); status = clReleaseProgram(program); status = clReleaseMemObject(inputBuffer); status = clReleaseMemObject(outputBuffer); status = clReleaseCommandQueue(commandQueue); status = clReleaseContext(context); return 0; } void cleanupHost(void) { if(input != NULL) { free(input); input = NULL; } if(output != NULL) { free(output); output = NULL; } if(devices != NULL) { free(devices); devices = NULL; } } void verify() { bool passed = true; for(unsigned long i = 0; i < 2 * width; ++i) if(input[i] * multiplier + input2[i] != output[i]) passed = false; if(passed == true) std::cout << "Passed!\n"; else std::cout << "Failed!\n"; } int main(int argc, char * argv[]) { if(initializeHost()==1) return 1; if(initializeCL()==1) return 1; if(runCLKernels()==1) return 1; verify(); if(cleanupCL()==1) return 1; cleanupHost(); getchar(); return 0; } //Template_Kernels.cl __kernel void templateKernel(__global unsigned int * output, __global unsigned int * input, __global unsigned int * input2, const unsigned int multiplier, const unsigned int width) { uint tid = get_global_id(0); uint i; //this is code for 256 threads; for correct results for more threads 256 must be changed to number of threads for (i = tid * (width * 2 ) / 256 ; i < ( tid + 1 ) * (width * 2 ) / 256; i++) output[i] = multiplier * input[i] + input2[i]; } //Template.hpp #ifndef TEMPLATE_H_ #define TEMPLATE_H_ #include <CL/cl.h> #include <string.h> #include <cstdlib> #include <iostream> #include <string> #include <fstream> #include "windows.h" #include<stdio.h> cl_uint *input; cl_uint *input2; cl_uint *output; cl_uint multiplier; cl_ulong width; cl_mem inputBuffer; cl_mem input2Buffer; cl_mem outputBuffer; cl_context context; cl_device_id *devices; cl_command_queue commandQueue; cl_program program; cl_kernel kernel; int initializeCL(void); std::string convertToString(const char * filename); int runCLKernels(void); int cleanupCL(void); void cleanupHost(void); #endif /* #ifndef TEMPLATE_H_ */

                                                • maxworkitems for clEnqueueNDRangeKernel(...)
                                                  richeek.arya

                                                  Try running it with localThreads = NULL in clENqueueNDRangeKernel function...like this:

                                                  status = clEnqueueNDRangeKernel(
                                                                commandQueue,
                                                                   kernel,
                                                                   1,
                                                                   NULL,
                                                                   globalThreads,
                                                                   NULL,  
                                                               0,
                                                                   NULL,
                                                                   &events[0]);

                                                   

                                                    • maxworkitems for clEnqueueNDRangeKernel(...)
                                                      atata

                                                      richeek.arya, thank you! Now it works. But I can't see any speedup : for example, if I set vectors x,y,z lenght to 16*10^6 and run 16*10^6 work items, then z = a*x+y takes 0.25 secs (I check only RunKernels(...) function's time), but if I run sequential program for the same task on CPU (core i5 430m), then it takes 0.125 secs, i.e. 2 times faster. For less vector dimensions the difference is even more then 2 times. Is that ok and my GPU radeon 5870 mob. (desktop 5770 with lowered frequencies) is supposed to run this task slower then my CPU?

                                                      Thanks.

                                                        • maxworkitems for clEnqueueNDRangeKernel(...)
                                                          richeek.arya

                                                          I am not sure about that but since you are running massive amount of threads not all of them are running concurrently since GPU may not have that many resources. You can try with less number of threads like 1024, 2048 and see if even in that case GPU is slower. You can try Visual profiler too on both GPU and CPU.

                                                          All the best!

                                                            • maxworkitems for clEnqueueNDRangeKernel(...)
                                                              himanshu.gautam

                                                              atata,

                                                              if it is working with localworksize set to null, geenrally the problem should be that globalworksize is not exactly divisible by local worksize. This is a must condition.

                                                              I think richeek's suggestion are worth trying. Also as I remember this sample is memory bound as there are two fetched and 1 write for just 2 practical arithmetic operations, so if it is slower than CPU, i guess it is understandable.

                                                              One suggestion is to use the flag "-fno-alias" while buidling the kernel. This should enable the use of caches and you might get some more performance.

                                                               

                                                               

                                                                • maxworkitems for clEnqueueNDRangeKernel(...)
                                                                  laobrasuca

                                                                  what about GLSL, when you run programs with it you can't choose how work will be dispatched on the shaders. So my question is how works is dispatched. Is it like OpenCL does when you don't specify localworksize? Even if here is not a forum about OpenGL, I'd like to ask if OpenGL 4.2 will allow one to choose such parameter with its new computing pipeline. Will it be able replace OpenCL for CL/GL interoperability?