cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

dwootton
Journeyman III

Missing loop condition test in OpenCL for loop

Generated code in for loop for CPU device missing loop conditon test

I am trying to learn about OpenCL and have a kernel that isn't working correctly. I'm running a program compiled with Visual Studio 2008 on Windows 7 using the 12/21/09 OpenCL version I downloaded.

When I invoke my program in debug mode I get an exception in the dll generated for my code. When I look at the disassembly view in Visual Studio, I see code for the for loop which seems to be updating the arrays correctly. I can also see where I is incremented, where it's been optimized into register eax and eax is incremented by 4 each loop iteration. However, I don't see any loop test in the generated code, which I think explains the exception since eax no longer points to valid memory.

The code is attached


I'm invoking this via clEnqueueTask.

The odd thing is I thought I had code where the loop test was present before I rewrote this code to get rid of some pointers, but now I can't get the compiler to generate any loop test.

 

Am I doing something wrong?

Thanks

 

__kernel void computeHistogram(__global unsigned int *imageData, global int *redCounts, global int *greenCounts, global int *blueCounts, int globalCount, int threads, int index) { int count; int threads; int index; int histogramBase; int imageOffset; count = globalCount / threads; histogramBase = index * 256; imageOffset = count * index; for (int i = 0; i < count; i++) { unsigned char r; unsigned char g; unsigned char b; unsigned int pixel; pixel = imageData[imageOffset + i++]; r = pixel >> 16; g = pixel >> 8; b = pixel; redCounts[histogramBase + r]++; greenCounts[histogramBase + g]++; blueCounts[histogramBase + b]++; } }

0 Likes
10 Replies
dwootton
Journeyman III

I spotted a typo in my code just after posting. The line

                pixel = imageData[imageOffset + i++];

should be

                pixel = imageData[imageOffset + i];

The i++ was an attempt to rewrite the loop as a while loop to try and get it to work. That didn't work either.

0 Likes

Don't you need to mask and cast your pixel to extract the channel values?

(I am also an OpenCL newbie, so I can't yet answer authoritatively on the real meat of your question 🙂 )

 

r = (unsigned char )((pixel >> 16) & 0xFF); g = (unsigned char )((pixel >> 😎 & 0xFF); b = (unsigned char )(pixel & 0xFF);

0 Likes

I don't think I need the cast or the mask. I'm pretty sure the compiler is properly implicitly casting unsigned int to unsigned char. Since unsigned char is 8 bits, the 0xff mask is not needed since the store into the unsigned char variable truncates the higher order bits. I have another (data parallel) kernel where I do the same thing and that works fine.

0 Likes

dwootton,
Your code won't work because you need to use atomics on global memory to correctly do the increment in this fashion.
Also, index is the same for all threads in a kernel, please use get_global_id(0) for a 1D array to get a unique index per thread.
Please see the work-item functions in section 6.11.1 for other helpful functions.
0 Likes

Micah, thanks for the reply. I should clarify what my C program that does this is doing since I don't follow the reasoning behind your answer.

At initialization time, I call clGetDeviceInfo with the 2nd parameter CL_DEVICE_MAX_COMPUTE_UNITS, which since I'm running a dual core CPU tells me 2. This value gets passed in to the kernal as the 'threads' parameter.

At the point where I want to invoke the kernel, I set up all the kernel parameters except index with the same value. Then I invoke a loop which sets the index parameter to a unique index (0 and 1 in the case of two compute units) and invokes clEnqueueTask, with the result that index is 0 for the first invocation of clEnqueueTask and 1 for the second invocation of clEnqueueTask.

My understanding is that I then have two threads running.

Each of the parameters redCounts, greenCounts and blueCounts is a single dimension array with 256*CL_DEVICE_MAX_COMPUTE_UNITS integers. My undestanding is that histogramBase*256 gives me a base offset into each array of 0 for thread 0 and 256 for thread 1, so I shouldn't have two threads updating the same array element and thus no need for atomic operations.

If I call get_global_id() in each kernel instance, am I going to get unique values (0 and 1) for each kernel instance?If I was using a data parallel model, I would expect to get unique ids per thread.

Does this have anything to do with why the generated code for the for loop has no test for the loop upper limit?

 

This is the code Visual Studio disassembly view shows me at the point of the exception

005D104E  nop              
005D104F  nop              
005D1050  push        ebx  
005D1051  push        edi  
005D1052  push        esi  
005D1053  mov         eax,dword ptr [esp+10h]
005D1057  mov         ecx,dword ptr [eax+0Ch]
005D105A  mov         edx,dword ptr [eax+8]
005D105D  mov         esi,dword ptr [eax+4]
005D1060  mov         eax,dword ptr [eax]
005D1062  jmp          005D1070
005D1064  lea           esi,[esi]
005D106A  lea           edi,[edi]
005D1070  mov         edi,dword ptr [eax]
005D1072  mov         ebx,edi
005D1074  shr          edi,0Eh
005D1077  and         edi,3FCh
005D107D  inc           dword ptr [esi+edi]
005D1080  movzx      edi,bh
005D1083  inc           dword ptr [edx+edi*4]
005D1086  movzx     edi,bl
005D1089  inc          dword ptr [ecx+edi*4]
005D108C  add         eax,4
005D108F  jmp         005D1070
005D1091  nop              
005D1092  nop

0 Likes

Well, part of my problem is that I had the variables threads and index declared both as parameters in the kernel invocation and as variables in the kernel body. Once I removed the definitions in the kernel body, I no longer get the exception.

Shouldn't the compiler flag this as an error?

0 Likes

Originally posted by: dwootton Well, part of my problem is that I had the variables threads and index declared both as parameters in the kernel invocation and as variables in the kernel body. Once I removed the definitions in the kernel body, I no longer get the exception.

 

Shouldn't the compiler flag this as an error?

 

dwootton,

            Could you please give us complete test case(kernel code and runtime code) which shows your problem?

0 Likes

genaganna

Here's a test case that causes the access viloation. I realize this is ultimately a user error, but it took me a long while to figure out that the duplicate variables were causing the problem. I didn't associate duplicate variables with the resulting missing loop conditon test.

Also, I'm compiling the kernel source using OpenCL functions clCreateProgramFromSource and clBuildProgram which both return CL_SUCCESS because the diagnostic flagging the duplicate variables is a warning, and so I therefore have no notification that there is a warning diagnostic to look at. Should this diagnostic be an error or are there C/C++ name scoping rules where shadowing a parameter is valid usage?

 

/* /* TaskParallel.cpp /* #include <assert.h> #include <stdio.h> #include "TaskParallel.h" #ifndef NULL #define NULL 0 #endif const char *TaskParallel::kernelSource = " __kernel void" " computeHistogram(__global unsigned int *imageData, global int *redCounts, global int *greenCounts, global int *blueCounts, int globalCount, int threads, int index)" " {" " int histogramBase;" " int imageOffset;" " int count;" " int threads;" // Remove these two lines redeclaring threads and index " int index;" // and the problem goes away. " " " count = globalCount / threads;" " histogramBase = index * 256;" " imageOffset = count * index;" " for (int i = 0; i < count; i++) {" " unsigned char r;" " unsigned char g;" " unsigned char b;" " unsigned int pixel;" " " " pixel = imageData[imageOffset + i];" " r = pixel >> 16;" " g = pixel >> 8;" " b = pixel;" " redCounts[histogramBase + r]++;" " greenCounts[histogramBase + g]++;" " blueCounts[histogramBase + b]++;" " }" " }"; TaskParallel::TaskParallel(void) { } TaskParallel::~TaskParallel(void) { } void TaskParallel::runTest() { cl_int status; cl_platform_id platform; cl_device_id device; cl_context_properties properties[4]; cl_command_queue queue; cl_context context; cl_program program; cl_kernel kernel; cl_mem image; cl_mem histogram[3]; cl_event event[2]; int elementCount; int numThreads; int threadIndex; status = clGetPlatformIDs(1, &platform, NULL); assert(status == CL_SUCCESS); status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL); assert(status == CL_SUCCESS); properties[0] = CL_CONTEXT_PLATFORM; properties[1] = (cl_context_properties) platform; properties[2] = 0; properties[3] = 0; context = clCreateContext(properties, 1, &device, NULL, NULL, &status); assert (status == CL_SUCCESS); queue = clCreateCommandQueue(context, device, 0, &status); assert (status == CL_SUCCESS); program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, &status); assert (status == CL_SUCCESS); status = clBuildProgram(program, 1, &device, NULL, NULL, NULL); assert (status == CL_SUCCESS); kernel = clCreateKernel(program, "computeHistogram", &status); assert (status == CL_SUCCESS); image = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, 100000 * sizeof(int), NULL, &status); // Assume two cores in CPU device assert (status == CL_SUCCESS); for (int i = 0; i < 3; i++) { histogram = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, 256 * 2 * sizeof(int), NULL, &status); assert (status == CL_SUCCESS); } elementCount = 100000; numThreads = 2; status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &image); assert (status == CL_SUCCESS); status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &histogram[0]); assert (status == CL_SUCCESS); status = clSetKernelArg(kernel, 2, sizeof(cl_mem), &histogram[1]); assert (status == CL_SUCCESS); status = clSetKernelArg(kernel, 3, sizeof(cl_mem), &histogram[2]); assert (status == CL_SUCCESS); status = clSetKernelArg(kernel, 4, sizeof(int), &elementCount); assert (status == CL_SUCCESS); status = clSetKernelArg(kernel, 5, sizeof(int), &numThreads); assert (status == CL_SUCCESS); // Don't initialize data since data is irreleveant // Run two threads, each process 50000 elements threadIndex = 0; status = clSetKernelArg(kernel, 6, sizeof(int), &threadIndex); assert (status == CL_SUCCESS); status = clEnqueueTask(queue, kernel, 0, NULL, &event[0]); threadIndex = 1; status = clSetKernelArg(kernel, 6, sizeof(int), &threadIndex); assert (status == CL_SUCCESS); status = clEnqueueTask(queue, kernel, 0, NULL, &event[1]); assert (status == CL_SUCCESS); status = clWaitForEvents(2, event); assert(status == CL_SUCCESS); } int main(int argc, char *argv[]) { TaskParallel testCase; testCase.runTest(); printf("Done\n"); } /* * TaskParallel.h */ #pragma once #include <CL/cl.h> class TaskParallel { public: static const char *kernelSource; TaskParallel(void); ~TaskParallel(void); void runTest(); };

0 Likes

Originally posted by: dwootton genaganna

 

Here's a test case that causes the access viloation. I realize this is ultimately a user error, but it took me a long while to figure out that the duplicate variables were causing the problem. I didn't associate duplicate variables with the resulting missing loop conditon test.

 

Also, I'm compiling the kernel source using OpenCL functions clCreateProgramFromSource and clBuildProgram which both return CL_SUCCESS because the diagnostic flagging the duplicate variables is a warning, and so I therefore have no notification that there is a warning diagnostic to look at. Should this diagnostic be an error or are there C/C++ name scoping rules where shadowing a parameter is valid usage?

 

 



Dwootton,

              Thanks for reporting this issue. I feel compiler should give error in this case.  I reported this issue to developers.

 

0 Likes

dwooton,
There are the warnings I get from the compiler when I run your kernel.
wacky.cl(5): warning: declaration of "threads" hides function parameter
int threads;
^

wacky.cl(6): warning: declaration of "index" hides function parameter
int index;
^

wacky.cl(10): warning: variable "threads" is used before its value is set
count = globalCount / threads;
^

wacky.cl(11): warning: variable "index" is used before its value is set
histogramBase = index * 256;
^

Make sure you check the compile log via clGetProgramBuildInfo.
0 Likes