I am seeing inconsistency when reading the histogram buffers (rhist, ghist, bhist) back from the kernel. For the same input data I see variations in the values in these buffers.
The code in the kernel is shown below.
Do the unary increment operators behave correctly in an OpenCL kernel?
rhist[ output[index + 0] ]++;
ghist[ output[index + 1] ]++;
bhist[ output[index + 2] ]++;
For the first run:
when i=1 rhist: 3935 ghist: 3060 bhist: 2884
i=2 rhist: 7533 ghist: 8436 bhist: 6656
For the second run, I am seeing these inconsistencies:
when i=1 rhist: 3935 ghist: 3062 bhist: 2885
i=2 rhist: 7532 ghist: 8438 bhist: 6656
**********************************
Please find the application code below.
1. Creating the buffers:
rhistBuffer = clCreateBuffer(context,CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,sizeof(cl_int) * 256 ,rhist,&status);
ghistBuffer = clCreateBuffer(context,CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,sizeof(cl_int) * 256 ,ghist,&status);
bhistBuffer = clCreateBuffer(context,CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,sizeof(cl_int) * 256 ,bhist,&status);
2. Setting the kernel arguments:
/* the rhist array to the kernel */
status = clSetKernelArg(kernel,6, sizeof(cl_mem),(void *)&rhistBuffer);
/* the ghist array to the kernel */
status = clSetKernelArg(kernel,7, sizeof(cl_mem),(void *)&ghistBuffer);
/* the bhist array to the kernel */
status = clSetKernelArg(kernel,8, sizeof(cl_mem),(void *)&bhistBuffer);
3. Reading back from the GPU:
status = clEnqueueReadBuffer(commandQueue,rhistBuffer,CL_TRUE, 0,256 * sizeof(cl_int),rhist, 0, NULL, &events[1]);
status = clWaitForEvents(1, &events[1]);
clReleaseEvent(events[1]);
status = clEnqueueReadBuffer(commandQueue,ghistBuffer,CL_TRUE, 0,256 * sizeof(cl_int),ghist, 0, NULL, &events[1]);
status = clWaitForEvents(1, &events[1]);
clReleaseEvent(events[1]);
status = clEnqueueReadBuffer(commandQueue,bhistBuffer,CL_TRUE, 0,256 * sizeof(cl_int),bhist, 0, NULL, &events[1]);
status = clWaitForEvents(1, &events[1]);
clReleaseEvent(events[1]);
Please let me know where I am going wrong. I have verified the code, but I am not getting the correct values from the three buffers.
Micah Villmow,
Thanks, but my GPU doesn't support the atomics extension cl_khr_global_int32_base_atomics (found via ./CLInfo).
I am getting the following error in the GPU build:
error: bad argument type to opencl atom op: expected pointer to int/uint with addrSpace global/local atom_inc(rhist[output[aa + 0]]);
For the GPU, how should I solve this race condition, since the extension is not available?
On the CPU, outputs are consistent with the atom_inc functions. I modified the code to:
atom_inc(&rhist[output[index + 0]]);
atom_inc(&ghist[output[index + 1]]);
atom_inc(&bhist[output[index + 2]]);
But I am still unsure how to do this on the GPU.
Hi All
Please give me some pointers on a solution. I am also not getting correct values from this operation on the GPU. How can I avoid the race condition on the GPU without the atom_inc functions?
rhist[ output[index + 0] ]++;
ghist[ output[index + 1] ]++;
bhist[ output[index + 2] ]++;
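One standard way to avoid atomics entirely is privatization: give every work-item (or work-group) its own partial histogram, so no two work-items ever write the same location, and merge the partials afterwards in a second kernel or on the host. A kernel-side sketch only, not a drop-in replacement: `output`, the 256-bin layout, and the `pixels_per_item` partitioning are assumptions based on your snippets, and only the red channel is shown (green and blue work the same way with their own scratch buffers):

```c
/* Sketch: race-free histogram without atomics via privatization.
 * partial_r must hold get_global_size(0) * 256 uints, zero-initialized. */
__kernel void partial_hist(__global const uchar *output,
                           __global uint *partial_r,
                           uint pixels_per_item)
{
    uint gid = get_global_id(0);
    __global uint *my_bins = partial_r + gid * 256; /* private slice */

    for (uint p = 0; p < pixels_per_item; ++p) {
        uint index = (gid * pixels_per_item + p) * 3; /* RGB triplets */
        my_bins[output[index + 0]]++; /* no race: slice is per-work-item */
    }
}
```

The trade-off is memory: the scratch buffer grows with the number of work-items, so this is usually combined with a small global size where each work-item loops over many pixels, followed by a reduction pass that sums the slices into the final 256-bin histogram.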
Hi Micah Villmow,
Thanks for the clarification. I will pull this code back to the host side, but at the cost of a performance hit.
Regards
Pavan
Hi Micah Villmow,
I am new to parallel programming and couldn't follow the approach. Can you please list the steps in terms of OpenCL functions?
Thanks
Pavan