

atata
Journeyman III

maxworkitems for clEnqueueNDRangeKernel(...)

error when setting large number

Hi everyone.

I recently started learning OpenCL, and first of all I tried to modify the 'Template' example program from the OpenCL examples package. That program multiplies a vector by a number, and I wanted to compute a linear combination of two vectors, b*x + y, where x and y are (complex) vectors and b is a real number. It doesn't really matter that the vectors are complex: I just enter "width" and do the calculations on vectors of length 2*width, where the first width components hold the real parts and the last width components hold the imaginary parts.
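For reference, the computation described above can be written as a plain CPU loop. This is only a minimal C++ sketch of the data layout, not the kernel from the 'Template' sample; the function name axpySplitComplex is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for b*x + y on "split" complex vectors:
// elements [0, width) hold real parts, [width, 2*width) hold imaginary parts.
// Because b is real, both halves are treated identically.
std::vector<float> axpySplitComplex(float b,
                                    const std::vector<float>& x,
                                    const std::vector<float>& y,
                                    std::size_t width) {
    assert(x.size() == 2 * width && y.size() == 2 * width);
    std::vector<float> out(2 * width);
    for (std::size_t i = 0; i < 2 * width; ++i) {
        out[i] = b * x[i] + y[i];  // one loop iteration ~ one work item
    }
    return out;
}
```

On the GPU, each loop iteration would correspond to one work item, so the natural global size for this problem is 2*width.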

The 'Template' source file contains many different checks (whether memory is allocated correctly, etc.), including the code attached. As I understand it, the globalThreads[0] I am passing to clEnqueueNDRangeKernel(...) is the number of work items (threads) I want to run, but what is the check that precedes clEnqueueNDRangeKernel(...) for? According to that check, if I try to run a number of threads greater than maxWorkItemSizes, the program terminates, but that makes no sense to me. Moreover, if I check the value of maxWorkItemSizes[0], it is equal to 256 (and maxWorkGroupSize is also equal to 256), so does that mean I can't run more than 256 threads? If I comment out that check and run clEnqueueNDRangeKernel(...) with globalThreads > 256, I get a BSOD or a Windows message along the lines of "the video driver stopped responding and was restored", and Visual Studio closes. I just want to run my program with some adequate number of threads (work items), but I can't understand what's going wrong here. The 5th argument of clEnqueueNDRangeKernel(...) is the number of work items I want to run, right? What is the check before it for, then? I didn't attach all the code, but I can if necessary (as I said before, most of the code consists of different checks; I didn't really change much in the algorithm). In the 'Template' example, globalThreads[0] was set to some number like 64 before I started modifying it.

I am using Win7 x64, MS VS 2010, and a Mobility Radeon HD 5870 GPU (it's the same as the desktop 5770 with lowered frequencies). I installed the latest version of the SDK and driver version 11.4 (I had 11.5 before, but reinstalled 11.4 because there is no info about adequate support of 11.4 for the current SDK version).

Thanks in advance.


size_t globalThreads[1];
size_t localThreads[1];
size_t maxWorkGroupSize;
size_t maxWorkItemSizes[3];
cl_uint maxDims;

/**
 * Query device capabilities. Maximum
 * work item dimensions and the maximum
 * work item sizes
 */
status = clGetDeviceInfo(
    devices[0],
    CL_DEVICE_MAX_WORK_GROUP_SIZE,
    sizeof(size_t),
    (void*)&maxWorkGroupSize,
    NULL);
if(status != CL_SUCCESS)
{
    std::cout<<"Error: Getting Device Info. (clGetDeviceInfo)\n";
    getchar();
    return 1;
}

status = clGetDeviceInfo(
    devices[0],
    CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS,
    sizeof(cl_uint),
    (void*)&maxDims,
    NULL);
if(status != CL_SUCCESS)
{
    std::cout<<"Error: Getting Device Info. (clGetDeviceInfo)\n";
    getchar();
    return 1;
}

status = clGetDeviceInfo(
    devices[0],
    CL_DEVICE_MAX_WORK_ITEM_SIZES,
    sizeof(size_t)*maxDims,
    (void*)maxWorkItemSizes,
    NULL);
if(status != CL_SUCCESS)
{
    std::cout<<"Error: Getting Device Info. (clGetDeviceInfo)\n";
    getchar();
    return 1;
}

// those 2 numbers are chosen by user
globalThreads[0] = 256;
localThreads[0] = 256;

if(globalThreads[0] > maxWorkItemSizes[0] ||
   localThreads[0] > maxWorkGroupSize)
{
    std::cout<<"Unsupported: Device does not support requested number of work items.";
    return 1;
}

// some code setting kernel arguments

status = clEnqueueNDRangeKernel(
    commandQueue,
    kernel,
    1,
    NULL,
    globalThreads,
    localThreads,
    0,
    NULL,
    &events[0]);
if(status != CL_SUCCESS)
{
    std::cout<<"Error: Enqueueing kernel onto command queue. (clEnqueueNDRangeKernel)\n";
}

16 Replies
mikewolf_gkd
Journeyman III

maxWorkItemSizes is the maximum number of work items in one work-group, but you can define many work-groups.

For example:

globalThreads[0] = 1024;
localThreads[0]  = 256;

Thus, you have 4 work-groups.
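The arithmetic behind this example can be sketched in a few lines of C++. The helper name numWorkGroups is made up for illustration and is not part of the OpenCL API; the sketch assumes the global size is an exact multiple of the local size, which OpenCL 1.x requires whenever a local size is passed explicitly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of work-groups along one dimension.
// Assumes globalSize is an exact multiple of localSize.
std::size_t numWorkGroups(std::size_t globalSize, std::size_t localSize) {
    assert(localSize > 0 && globalSize % localSize == 0);
    return globalSize / localSize;
}
```

With globalSize = 1024 and localSize = 256 this gives the 4 work-groups mentioned above.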

himanshu_gautam
Grandmaster

Hi atata,

The function clEnqueueNDRangeKernel takes two important parameters that are confusing you (see the spec for details about the other parameters):

globalWorkItemSize (global_work_size in the spec): You can specify a 1D, 2D or 3D size over which you want to run your kernel. Each dimension can be virtually any number, however high.

localWorkItemSize (the work-group size; local_work_size in the spec): This is the size of the chunks into which the GPU divides your problem. On AMD GPUs a work-group can hold at most 256 work items. The idea is that all work items in one work-group execute together, and you can synchronize between them.

I recommend reading the OpenCL Programming Guide (Chapter 1) to better understand the concept of work-groups.

Thanks

atata
Journeyman III

mikewolf_gkd, himanshu.gautam, thanks for your answers.

As I understand it after reading the documentation and your answers, in the call clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &events[0]):

localThreads defines the number of threads in one work-group. It must be less than or equal to 256 on my GPU, and I have no problem with it. If I run 1024 threads (work items) in total with 256 threads per group, then I have 4 groups in total and I can sync the work items inside each of those groups.

globalThreads is the total number of work items (threads). We can use an array with 1, 2 or 3 elements giving the number of work items in each dimension. For example, if size_t globalThreads[3] = {5, 10, 2}, then the total number of threads we are running is 5*10*2 = 100, and if size_t globalThreads[1] = {1000}, then we have 1000 threads (work items) running in total, right?
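The counting rule in the last paragraph is just a product over the dimensions. A minimal C++ sketch, with a made-up helper name totalWorkItems that is not part of the OpenCL API:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: total number of work items in an NDRange
// is the product of the global sizes over all used dimensions.
std::size_t totalWorkItems(const std::size_t* globalSizes, unsigned workDim) {
    assert(workDim >= 1 && workDim <= 3);
    std::size_t total = 1;
    for (unsigned d = 0; d < workDim; ++d) {
        total *= globalSizes[d];
    }
    return total;
}
```

Called with {5, 10, 2} and workDim = 3 it returns 100; called with {1000} and workDim = 1 it returns 1000.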

So if I want to run my program with, for example, 10 000 threads divided into 5000 groups, then I call clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &events[0]) with globalThreads[0] = 10 000 and localThreads[0] = 2, where globalThreads and localThreads are both arrays containing 1 element each?

The problem is: when I set globalThreads[0] (an array with 1 element here, size_t globalThreads[1]) to any number greater than 256, I get a BSOD or a video driver error and the program terminates. If what I wrote above is correct, then it means I can't run more than 256 threads on my GPU!

Or, if I want to have 10 000 threads, should I use a 3D vector for globalThreads? For example, if I set size_t globalThreads[3] = {100, 50, 2}, then I have 100*50*2 = 10 000 work items (threads) in total? Does that mean that if CL_DEVICE_MAX_WORK_ITEM_SIZES for my GPU is equal to 256 (the result I get by running the check from the 1st post), then I can't have more than 256 elements in any of the 3 dimensions of globalThreads, and the maximum number of work items (threads) I can run is 256*256*256? If not, what arguments should I pass to clEnqueueNDRangeKernel if I want, for example, 10 000 work items (threads)? I am sorry if these questions look silly, but this confuses me.

Thanks.


himanshu_gautam
Grandmaster

The facts you wrote about localThreads and globalThreads are correct.

But in theory you are allowed to set any number in any dimension of globalThreads. So your example of globalThreads[0] = 10000 should work.

Large global sizes are used in many SDK samples. Try comparing your code with the samples.
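One practical detail when picking a large global size: in OpenCL 1.x, if a local size is passed explicitly, the global size has to be a multiple of it, so 10 000 works with a local size of 2 but not with 256. A common approach is to round the global size up to the next multiple. The helper name roundUpGlobalSize below is made up for illustration and is not part of the OpenCL API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: smallest multiple of localSize that is >= workItems.
// The result is suitable as a global size for the given local size.
std::size_t roundUpGlobalSize(std::size_t workItems, std::size_t localSize) {
    assert(localSize > 0);
    return ((workItems + localSize - 1) / localSize) * localSize;
}
```

For 10 000 work items and a local size of 256 this gives 10 240. The extra 240 work items then need a bounds guard in the kernel (for example, returning early when get_global_id(0) is not below the real element count, which would have to be passed in as an additional kernel argument), otherwise they read and write past the end of the buffers.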

atata
Journeyman III

himanshu.gautam, thanks again. I checked some SDK samples and all of them work fine with globalThreads > 256. I am still confused with this, I will try to find how to fix that, but, to be honest, I have no idea what's the reason.

richeek_arya
Journeyman III

Originally posted by: himanshu.gautam

But in theory you are allowed to set any number in any dimension of globalThreads. So your example of globalThreads[0] = 10000 should work.

Large global sizes are used in many SDK samples. Try comparing your code with the samples.

Hi Himanshu,

I am a little confused. Since we can specify any number of global threads in any dimension:

1. Is there any advantage to having three dimensions of global threads, since any dimension can have any number of threads?

2. What is the significance of the CL_DEVICE_MAX_WORK_ITEM_SIZES query in OpenCL, since we can specify any number of work items?

Thanks,

Richeek

 

himanshu_gautam
Grandmaster

1. Three dimensions are provided for logical clarity. There are many cases where we deal with 3D arrays in C, although a 1D array can always do the same work.

2. CL_DEVICE_MAX_WORK_ITEM_SIZES tells you the number of work items in each dimension that you can have inside a WORKGROUP, not globally.
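To illustrate point 1, here is a minimal C++ sketch of how a 3D index maps onto a 1D index and back, which is why a 1D range can always do the same work. The names flattenIndex, unflattenIndex and Index3 are made up for illustration; x is assumed to be the fastest-varying dimension.

```cpp
#include <cassert>
#include <cstddef>

struct Index3 {
    std::size_t x, y, z;
};

// Hypothetical helper: map (x, y, z) in a sizeX * sizeY * sizeZ volume
// to a single linear index, with x varying fastest.
std::size_t flattenIndex(Index3 i, std::size_t sizeX, std::size_t sizeY) {
    return i.x + sizeX * (i.y + sizeY * i.z);
}

// Inverse mapping: recover (x, y, z) from the linear index.
Index3 unflattenIndex(std::size_t linear, std::size_t sizeX, std::size_t sizeY) {
    assert(sizeX > 0 && sizeY > 0);
    Index3 i;
    i.x = linear % sizeX;
    i.y = (linear / sizeX) % sizeY;
    i.z = linear / (sizeX * sizeY);
    return i;
}
```

In a kernel, the same choice shows up as either reading get_global_id(0), get_global_id(1) and get_global_id(2) from a 3D range, or computing the three coordinates from get_global_id(0) in a 1D range.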

 

Thanks

nou
Exemplar

If you process some 3D volume data, I think it is convenient to use a 3D NDRange.

CL_DEVICE_MAX_WORK_ITEM_SIZES returns the maximum sizes of a local work-group, which are indeed limited. On AMD GPUs it is 256x256x256, and on CPUs it is 1024x1024x1024.
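Put together, the two device limits constrain the local size only: each local dimension is bounded by the matching entry of CL_DEVICE_MAX_WORK_ITEM_SIZES, and the product of the local dimensions is bounded by CL_DEVICE_MAX_WORK_GROUP_SIZE. A minimal C++ sketch of such a check, assuming both values have already been queried with clGetDeviceInfo; the function name localSizeIsValid is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: validate a local (work-group) size against the
// per-dimension limits and the total work-group size limit of a device.
// The global size is deliberately not compared against these limits.
bool localSizeIsValid(const std::size_t* localSizes,
                      unsigned workDim,
                      const std::size_t* maxWorkItemSizes,
                      std::size_t maxWorkGroupSize) {
    assert(workDim >= 1 && workDim <= 3);
    std::size_t groupSize = 1;
    for (unsigned d = 0; d < workDim; ++d) {
        if (localSizes[d] == 0 || localSizes[d] > maxWorkItemSizes[d]) {
            return false;  // a single dimension exceeds its limit
        }
        groupSize *= localSizes[d];
    }
    return groupSize <= maxWorkGroupSize;  // whole group must fit as well
}
```

With per-dimension limits of 256x256x256 and a maximum work-group size of 256, a local size of {16, 16, 1} passes, while {16, 16, 2} fails because the group would hold 512 work items.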

richeek_arya
Journeyman III

Himanshu and Nou, thanks for your replies. I understand what you are saying. I just want one more clarification:

In the AMD SDK OpenCL example "Template" there is a check performed:

if(globalThreads[0] > maxWorkItemSizes[0] ||
   localThreads[0] > maxWorkGroupSize)
{
    std::cout<<"Unsupported: Device does not support requested number of work items.";
    return 1;
}

Is there any reason for checking (globalThreads[0] > maxWorkItemSizes[0])?

I ran this example with globalThreads[0] = 1000 and it ran just fine, as expected (with the if clause commented out, of course).

Thanks,

Richeek
