OpenCL

pontiacgtx · ‎07-16-2020

I have found a problem when executing a kernel the second dimesion of the work units get_global_id(1) get limited to around 120 whille the first dimension can execute every unit until the max set on the first dimension on the host I wonder if there is some kind of limit for each work dimension (some length limit)

for testing this I just printed the second dimension form the kernel

__kernel void Interpolation(__global struct Color* source,__global struct Color* target,uint64 width,uint64 height,uint64 ratio,uint64 limit, uint64 originalHeight)
        {
            __private fp32 wIndex = (int64)get_global_id(0);
            __private fp32 hIndex = (int64)get_global_id(1);

            printf("%d",get_global_id(1));

        
        }‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

the host looks like

void* source = imageObj->originalPixels->data();
        void* target = imageObj->processedPixels->data();



      

        cl_mem originalPixelsBuffer = clCreateBuffer(p1.context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(Color) * imageObj->SourceLength(), source, &p1.status);
        CheckErrorCode(p1.status, p1.program, p1.devices[0], "Failed to Create buffer 0");


        cl_mem targetBuffer = clCreateBuffer(p1.context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(Color) * imageObj->OutputLength(), target, &p1.status);
        CheckErrorCode(p1.status, p1.program, p1.devices[0], "Failed to Create buffer 1");





        cl_kernel interpolationKernel = NULL;
        p1.CreateKernel(interpolationKernel, "Interpolation");

    



       p1.status = clSetKernelArg(interpolationKernel, 0, sizeof(cl_mem), (void*)&originalPixelsBuffer);
       
        CheckErrorCode(p1.status, p1.program, p1.devices[p1.deviceIndex], "It Couldn't set Argument 0 for kernel");
        p1.status = clSetKernelArg(interpolationKernel, 1, sizeof(cl_mem), (void*)&targetBuffer);
       
        CheckErrorCode(p1.status, p1.program, p1.devices[p1.deviceIndex], "It Couldn't set Argument 1 for kernel");
        p1.status = clSetKernelArg(interpolationKernel, 2, sizeof(width), (void*)&width);
        CheckErrorCode(p1.status, p1.program, p1.devices[p1.deviceIndex], "It Couldn't set Argument 2 for kernel");
        p1.status = clSetKernelArg(interpolationKernel, 3, sizeof(height), (void*)&height);
        CheckErrorCode(p1.status, p1.program, p1.devices[p1.deviceIndex], "It Couldn't set Argument 3 for kernel);
        p1.status = clSetKernelArg(interpolationKernel, 4, sizeof(ratio), (void*)&ratio);
        CheckErrorCode(p1.status, p1.program, p1.devices[p1.deviceIndex], "It Couldn't set Argument 4 for kernel);
        p1.status = clSetKernelArg(interpolationKernel, 5, sizeof(limit), &limit);
        CheckErrorCode(p1.status, p1.program, p1.devices[p1.deviceIndex], "It Couldn't set Argument 5 for kernel);
        p1.status = clSetKernelArg(interpolationKernel, 6, sizeof(orgHeight), &orgHeight);
        CheckErrorCode(p1.status, p1.program, p1.devices[p1.deviceIndex],"It Couldn't set Argument 6 for kernel);



        StartTimer();
        ////Pass the buffer to the GPU to read


        p1.status = clEnqueueWriteBuffer(p1.commandQueue, originalPixelsBuffer, CL_FALSE, 0, sizeof(Color) * imageObj->SourceLength(), source, 0, NULL, NULL);
        CheckErrorCode(p1.status, p1.program, p1.devices[0], "Failed to write buffer 0");

        ////Pass the buffer to the GPU to write

        p1.status = clEnqueueWriteBuffer(p1.commandQueue, targetBuffer, CL_TRUE, 0, sizeof(Color) * imageObj->OutputLength(), target, 0, NULL, NULL);
        CheckErrorCode(p1.status, p1.program, p1.devices[0], "Failed to write buffer 1");

                                        
        size_t  globalWorkSize[2] = { imageObj->originalWidth * 4, imageObj->originalHeight * 4 };
        size_t localWorkSize[2]{ 64,64 };
        SetLocalWorkSize(IsDivisibleBy64(localWorkSize[0]), localWorkSize);
        
        p1.status = clEnqueueNDRangeKernel(p1.commandQueue, interpolationKernel, 2, NULL, globalWorkSize, IsDisibibleByLocalWorkSize(globalWorkSize, localWorkSize) ? localWorkSize : NULL, 0, NULL, NULL);
        CheckErrorCode(p1.status, p1.program, p1.devices[0], "Failed to clEnqueueDRangeKernel");

        //size_t  globalWorkSize[1] = { imageObj->originalWidth * imageObj->originalHeight };




        p1.status = clEnqueueReadBuffer(p1.commandQueue, targetBuffer, CL_TRUE, 0, sizeof(Color) * imageObj->OutputLength(), target, 0, NULL, NULL);
        CheckErrorCode(p1.status, p1.program, p1.devices[0], "Failed to write buffer 2");‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

attached the source code from a visual studio project file

now I wonder how woould change the code if I used a SVM instead the cl_mem model from opencl 1.2? would it change how the local work size determines the work group size? if not then what can i do to make it work with this global work size with 2 dimensions?

dipak · ‎07-16-2020

Thank you for reporting it. I've whitelisted you and moved this post to OpenCL forum.

I wonder if there is some kind of limit for each work dimension (some length limit)

There is a limit for work-group. To query the maximum number of work-items that can be specified in each dimension of the work-group, you can call clGetDeviceInfo() with parameter CL_DEVICE_MAX_WORK_ITEM_SIZES. You may also check clinfo output (see "Max work items dimensions" ) to find this limit.

I tried to reproduce the above with a simple test-case, but could not reproduce it. For example, when I ran the below kernel to check get_global_id(1) values, I got expected output.

[note: only few large values are printed to avoid unnecessary print messages]

__kernel void test_kernel()
{
// print the limits to verify (only for first work-item)
if(get_global_id(0) == 0 && get_global_id(1) == 0) {
printf("global size: [%d %d]", get_global_size(0), get_global_size(1));
printf("local size: [%d %d]", get_local_size(0), get_local_size(1));
}

// only for few large values to avoid unnecessary print messages
if(get_global_id(0) > get_global_size(0) - 5 && get_global_id(1) > get_global_size(1) - 5) {
printf("work-item id: [%d %d]", get_global_id(0), get_global_id(1));
}
}
-----------------------------------------------------------------
// HOST-SIDE KERNEL CALL
size_t global_size[] = { 1024, 1024 };
size_t local_size[] = { 8, 8 };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size, 0, NULL, NULL);

Could you please try the above test-case and share your findings? If you still don't see the expected output, please share your setup details and attach the clinfo output.

P.S. I ran the attached code, but didn't see any print output or error message.

pontiacgtx · ‎07-16-2020

I tried but it doesnt print anything..neither it displays any compilation error or message, now when i print the code I had on the project it print the following (2nd dimension) work item index

if then Global work items are limited to 1024 how it is possible that a 2d image has a limit of 16384x16384, then someone could access an image from a kernel with indeces up to 16384?

dipak · ‎07-16-2020

I tried but it doesnt print anything..neither it displays any compilation error or message

Can you please provide the complete code that you tried?

Also please share your setup details (like OS, driver, graphics card etc. ) and attach the clinfo output.

Thanks.

dipak · ‎07-16-2020

if then Global work items are limited to 1024

The above mentioned limit is for work-group. I've edited my previous statements related to this. Sorry for the misunderstanding.

dipak · ‎07-16-2020

I just tried the above code snippet with large global size and it worked fine on my setup. Please find the kernel output when I set "global_size[] = { 16384, 16384 }":

global size: [16384 16384]
local size: [8 8]
work-item id: [16380 16380]
work-item id: [16381 16380]
work-item id: [16382 16380]
work-item id: [16383 16380]
work-item id: [16380 16382]
work-item id: [16381 16382]
work-item id: [16382 16382]
work-item id: [16383 16382]
work-item id: [16380 16381]
work-item id: [16381 16381]
work-item id: [16382 16381]
work-item id: [16383 16381]
work-item id: [16380 16383]
work-item id: [16381 16383]
work-item id: [16382 16383]
work-item id: [16383 16383]

pontiacgtx · ‎07-16-2020

when running (your suggested kernel) it output the following text in the image

__kernel void test_kernel()
{
     // print the limits to verify (only for first work-item)
     if(get_global_id(0) == 0 && get_global_id(1) == 0)  {
          printf("global size: [%d %d]", get_global_size(0), get_global_size(1));
          printf("local size: [%d %d]", get_local_size(0), get_local_size(1));
     }
 
   // only for few large values to avoid unnecessary print messages
    if(get_global_id(0) > get_global_size(0) - 5 && get_global_id(1) > get_global_size(1) - 5)   {
         printf("work-item id: [%d %d]", get_global_id(0), get_global_id(1));
   }
}‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍


	p1.SetPlatformIndex(0);
	p1.InitializeProgamOCL();
	cl_kernel kernel = NULL;
	p1.CreateKernel(kernel, "test_kernel");
	size_t global_size[]{ 16384,16384 };
	size_t local_size[]{ 8, 8 };

	auto error = clEnqueueNDRangeKernel(p1.commandQueue, kernel, 2, NULL, global_size, local_size, 0, NULL, NULL);
	Sleep(20000);

‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

now the problem with my project is that the local work size is not a divider of global work item size, because each size for each dimension on the global work size is width and height, so I read that when you have a case like this you set the local work size as NULL, in theory works but only for a 1 dimension, but for the second it doesnt work and setting a local_size of { NULL , NULL} throws error -54 on runtime

I had read some time ago that using shared virtual memory could allow me to use a number that necesarily doesnt need to be divider of the global work size? would this fix my problem (where my second dimension doesnt get proper index for work work item) ,what could be an example for using SVM ?

System information

Windows 10 1809 64 bits

AMD Driver 20.3.1

Name: gfx900 (Vega 56)

Device OpenCL C version:                       OpenCL C 2.0
Driver version:                                3004.8 (PAL,HSAIL)
Profile:                                       FULL_PROFILE
Version:                                       OpenCL 2.0 AMD-APP (3004.8)

dipak · ‎07-17-2020

System information
Windows 10 1809 64 bits
AMD Driver 20.3.1
Name: gfx900 (Vega 56)

As I know, Adrenalin 20.3.1 had some issue if there are multiple printf statements in the kernel ( Strange printf behaviour on Vega ). That might be reason for the above output. The issue was fixed in Adrenalin 20.5.1. Please try the latest driver (Adrenalin 20.7.2) available here: https://www.amd.com/en/support/graphics/radeon-rx-vega-series/radeon-rx-vega-series/radeon-rx-vega-5...

now the problem with my project is that the local work size is not a divider of global work item size...

As per OpenCL 1.x spec, if local_work_size is explicitly specified, the values specified in global_work_size[0],... global_work_size[work_dim - 1] must be evenly divisible by the corresponding values specified in local_work_size[0],... local_work_size[ work_dim - 1]. If local_work_size is NULL, then the OpenCL implementation will determine the appropriate work-group size.

For example, below is the output when I didn't specify any local size (i.e. local_work_size is set as NULL).

// HOST-SIDE KERNEL CALL
size_t global_size[] = { 16384, 16384};
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, NULL, 0, NULL, NULL);
--------------------------------------------
Output:
global size: [16384 16384]
local size: [256 1]
work-item id: [16380 16380]
work-item id: [16381 16380]
work-item id: [16382 16380]
....

I had read some time ago that using shared virtual memory could allow me to use a number that necessarily doesn't need to be divider of the global work size?

Yes, OpenCL 2.x doesn't have the above mentioned restriction. So, in OpenCL 2.x, clEnqueueNDRangeKernel can be called even though the global work size is not evenly divisible by the local work size. However, it is not related to SVM or any specific buffer/memory. For more information, please see: clEnqueueNDRangeKernel

Thanks.

pontiacgtx · ‎07-17-2020

I see so it is OpenCL 2.0 model which allows the kernel to have non divisble work item sizes between global and local size, but I wonder why when callingclEnqueueNDRangeKernel with 20,40 for each dimension where it returns CL_INVALID_WORK_GROUP_SIZE application's kernel says it would compile in OpenCL2.2 do I need to do something to use non divisible local work items (or at least non divisble between the work group size)?

size_t  globalWorkSize[2] = { imageObj->originalWidth * 4, imageObj->originalHeight * 4 };
        size_t localWorkSize[2]{ 20,40 };

        
        p1.status = clEnqueueNDRangeKernel(p1.commandQueue, interpolationKernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
        CheckErrorCode(p1.status, p1.program, p1.devices[0], "Failed to clEnqueueDRangeKernel");‍‍‍‍‍‍‍‍‍‍‍‍

dipak · ‎07-17-2020

size_t localWorkSize[2]{ 20,40 };

I believe the above local work group size (20*40 = 800) is greater than the max. work group size supported by the device (which is usually 256 for AMD devices). You can check this value from clinfo output("max. work group size").

Please note, value of "local_work_size[0] * local_work_size[1] * local_work_size[2]" should be less than or equal to "max. work group size".

pontiacgtx · ‎07-17-2020

I assumed the limit was 256 for each dimension, It is good to know that

OpenCL

AMD's OCL global work size with 2d work dimensions limit