I am trying to learn about OpenCL and have a kernel that isn't working correctly. I'm running a program compiled with Visual Studio 2008 on Windows 7 using the 12/21/09 OpenCL version I downloaded.
When I invoke my program in debug mode I get an exception in the dll generated for my code. When I look at the disassembly view in Visual Studio, I see code for the for loop which seems to be updating the arrays correctly. I can also see where I is incremented, where it's been optimized into register eax and eax is incremented by 4 each loop iteration. However, I don't see any loop test in the generated code, which I think explains the exception since eax no longer points to valid memory.
The code is attached
I'm invoking this via clEnqueueTask.
The odd thing is I thought I had code where the loop test was present before I rewrote this code to get rid of some pointers, but now I can't get the compiler to generate any loop test.
Am I doing something wrong?
Thanks
__kernel void computeHistogram(__global unsigned int *imageData, global int *redCounts, global int *greenCounts, global int *blueCounts, int globalCount, int threads, int index) { int count; int threads; int index; int histogramBase; int imageOffset; count = globalCount / threads; histogramBase = index * 256; imageOffset = count * index; for (int i = 0; i < count; i++) { unsigned char r; unsigned char g; unsigned char b; unsigned int pixel; pixel = imageData[imageOffset + i++]; r = pixel >> 16; g = pixel >> 8; b = pixel; redCounts[histogramBase + r]++; greenCounts[histogramBase + g]++; blueCounts[histogramBase + b]++; } }
I spotted a typo in my code just after posting. The line
pixel = imageData[imageOffset + i++];
should be
pixel = imageData[imageOffset + i];
The i++ was an attempt to rewrite the loop as a while loop to try and get it to work. That didn't work either.
Don't you need to mask and cast your pixel to extract the channel values?
(I am also an OpenCL newbie, so I can't yet answer authoritatively on the real meat of your question 🙂 )
r = (unsigned char )((pixel >> 16) & 0xFF); g = (unsigned char )((pixel >> 😎 & 0xFF); b = (unsigned char )(pixel & 0xFF);
I don't think I need the cast or the mask. I'm pretty sure the compiler is properly implicitly casting unsigned int to unsigned char. Since unsigned char is 8 bits, the 0xff mask is not needed since the store into the unsigned char variable truncates the higher order bits. I have another (data parallel) kernel where I do the same thing and that works fine.
Micah, thanks for the reply. I should clarify what my C program that does this is doing since I don't follow the reasoning behind your answer.
At initialization time, I call clGetDeviceInfo with the 2nd parameter CL_DEVICE_MAX_COMPUTE_UNITS, which since I'm running a dual core CPU tells me 2. This value gets passed in to the kernal as the 'threads' parameter.
At the point where I want to invoke the kernel, I set up all the kernel parameters except index with the same value. Then I invoke a loop which sets the index parameter to a unique index (0 and 1 in the case of two compute units) and invokes clEnqueueTask, with the result that index is 0 for the first invocation of clEnqueueTask and 1 for the second invocation of clEnqueueTask.
My understanding is that I then have two threads running.
Each of the parameters redCounts, greenCounts and blueCounts is a single dimension array with 256*CL_DEVICE_MAX_COMPUTE_UNITS integers. My undestanding is that histogramBase*256 gives me a base offset into each array of 0 for thread 0 and 256 for thread 1, so I shouldn't have two threads updating the same array element and thus no need for atomic operations.
If I call get_global_id() in each kernel instance, am I going to get unique values (0 and 1) for each kernel instance?If I was using a data parallel model, I would expect to get unique ids per thread.
Does this have anything to do with why the generated code for the for loop has no test for the loop upper limit?
This is the code Visual Studio disassembly view shows me at the point of the exception
005D104E nop
005D104F nop
005D1050 push ebx
005D1051 push edi
005D1052 push esi
005D1053 mov eax,dword ptr [esp+10h]
005D1057 mov ecx,dword ptr [eax+0Ch]
005D105A mov edx,dword ptr [eax+8]
005D105D mov esi,dword ptr [eax+4]
005D1060 mov eax,dword ptr [eax]
005D1062 jmp 005D1070
005D1064 lea esi,[esi]
005D106A lea edi,[edi]
005D1070 mov edi,dword ptr [eax]
005D1072 mov ebx,edi
005D1074 shr edi,0Eh
005D1077 and edi,3FCh
005D107D inc dword ptr [esi+edi]
005D1080 movzx edi,bh
005D1083 inc dword ptr [edx+edi*4]
005D1086 movzx edi,bl
005D1089 inc dword ptr [ecx+edi*4]
005D108C add eax,4
005D108F jmp 005D1070
005D1091 nop
005D1092 nop
Well, part of my problem is that I had the variables threads and index declared both as parameters in the kernel invocation and as variables in the kernel body. Once I removed the definitions in the kernel body, I no longer get the exception.
Shouldn't the compiler flag this as an error?
Originally posted by: dwootton Well, part of my problem is that I had the variables threads and index declared both as parameters in the kernel invocation and as variables in the kernel body. Once I removed the definitions in the kernel body, I no longer get the exception.
Shouldn't the compiler flag this as an error?
dwootton,
Could you please give us complete test case(kernel code and runtime code) which shows your problem?
genaganna
Here's a test case that causes the access viloation. I realize this is ultimately a user error, but it took me a long while to figure out that the duplicate variables were causing the problem. I didn't associate duplicate variables with the resulting missing loop conditon test.
Also, I'm compiling the kernel source using OpenCL functions clCreateProgramFromSource and clBuildProgram which both return CL_SUCCESS because the diagnostic flagging the duplicate variables is a warning, and so I therefore have no notification that there is a warning diagnostic to look at. Should this diagnostic be an error or are there C/C++ name scoping rules where shadowing a parameter is valid usage?
/* /* TaskParallel.cpp /* #include <assert.h> #include <stdio.h> #include "TaskParallel.h" #ifndef NULL #define NULL 0 #endif const char *TaskParallel::kernelSource = " __kernel void" " computeHistogram(__global unsigned int *imageData, global int *redCounts, global int *greenCounts, global int *blueCounts, int globalCount, int threads, int index)" " {" " int histogramBase;" " int imageOffset;" " int count;" " int threads;" // Remove these two lines redeclaring threads and index " int index;" // and the problem goes away. " " " count = globalCount / threads;" " histogramBase = index * 256;" " imageOffset = count * index;" " for (int i = 0; i < count; i++) {" " unsigned char r;" " unsigned char g;" " unsigned char b;" " unsigned int pixel;" " " " pixel = imageData[imageOffset + i];" " r = pixel >> 16;" " g = pixel >> 8;" " b = pixel;" " redCounts[histogramBase + r]++;" " greenCounts[histogramBase + g]++;" " blueCounts[histogramBase + b]++;" " }" " }"; TaskParallel::TaskParallel(void) { } TaskParallel::~TaskParallel(void) { } void TaskParallel::runTest() { cl_int status; cl_platform_id platform; cl_device_id device; cl_context_properties properties[4]; cl_command_queue queue; cl_context context; cl_program program; cl_kernel kernel; cl_mem image; cl_mem histogram[3]; cl_event event[2]; int elementCount; int numThreads; int threadIndex; status = clGetPlatformIDs(1, &platform, NULL); assert(status == CL_SUCCESS); status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL); assert(status == CL_SUCCESS); properties[0] = CL_CONTEXT_PLATFORM; properties[1] = (cl_context_properties) platform; properties[2] = 0; properties[3] = 0; context = clCreateContext(properties, 1, &device, NULL, NULL, &status); assert (status == CL_SUCCESS); queue = clCreateCommandQueue(context, device, 0, &status); assert (status == CL_SUCCESS); program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, &status); assert (status == CL_SUCCESS); status = clBuildProgram(program, 1, &device, NULL, NULL, NULL); assert (status == CL_SUCCESS); kernel = clCreateKernel(program, "computeHistogram", &status); assert (status == CL_SUCCESS); image = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, 100000 * sizeof(int), NULL, &status); // Assume two cores in CPU device assert (status == CL_SUCCESS); for (int i = 0; i < 3; i++) { histogram = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, 256 * 2 * sizeof(int), NULL, &status); assert (status == CL_SUCCESS); } elementCount = 100000; numThreads = 2; status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &image); assert (status == CL_SUCCESS); status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &histogram[0]); assert (status == CL_SUCCESS); status = clSetKernelArg(kernel, 2, sizeof(cl_mem), &histogram[1]); assert (status == CL_SUCCESS); status = clSetKernelArg(kernel, 3, sizeof(cl_mem), &histogram[2]); assert (status == CL_SUCCESS); status = clSetKernelArg(kernel, 4, sizeof(int), &elementCount); assert (status == CL_SUCCESS); status = clSetKernelArg(kernel, 5, sizeof(int), &numThreads); assert (status == CL_SUCCESS); // Don't initialize data since data is irreleveant // Run two threads, each process 50000 elements threadIndex = 0; status = clSetKernelArg(kernel, 6, sizeof(int), &threadIndex); assert (status == CL_SUCCESS); status = clEnqueueTask(queue, kernel, 0, NULL, &event[0]); threadIndex = 1; status = clSetKernelArg(kernel, 6, sizeof(int), &threadIndex); assert (status == CL_SUCCESS); status = clEnqueueTask(queue, kernel, 0, NULL, &event[1]); assert (status == CL_SUCCESS); status = clWaitForEvents(2, event); assert(status == CL_SUCCESS); } int main(int argc, char *argv[]) { TaskParallel testCase; testCase.runTest(); printf("Done\n"); } /* * TaskParallel.h */ #pragma once #include <CL/cl.h> class TaskParallel { public: static const char *kernelSource; TaskParallel(void); ~TaskParallel(void); void runTest(); };
Originally posted by: dwootton genaganna
Here's a test case that causes the access viloation. I realize this is ultimately a user error, but it took me a long while to figure out that the duplicate variables were causing the problem. I didn't associate duplicate variables with the resulting missing loop conditon test.
Also, I'm compiling the kernel source using OpenCL functions clCreateProgramFromSource and clBuildProgram which both return CL_SUCCESS because the diagnostic flagging the duplicate variables is a warning, and so I therefore have no notification that there is a warning diagnostic to look at. Should this diagnostic be an error or are there C/C++ name scoping rules where shadowing a parameter is valid usage?
Dwootton,
Thanks for reporting this issue. I feel compiler should give error in this case. I reported this issue to developers.