

Journeyman III

Help with locks and global memory

Global does not update correctly

Hi all, I am looking for some help with this code:


__kernel void ReduceBoundsFromGBuffer(
  __global int   *p_Semaphors,
  __global PartitionBoundsC* p_PartionBounds )
{
  uint2 groupId;            // SV_GroupID.xy
  uint2 groupThreadId;      // SV_GroupThreadID.xy
  uint  groupIndex;         // SV_GroupIndex

  groupId.x = get_group_id(0);
  groupId.y = get_group_id(1);

  groupThreadId.x = get_local_id(0);
  groupThreadId.y = get_local_id(1);

  groupIndex = get_local_size(0)*groupThreadId.y + groupThreadId.x;

  // Now write out the result from this pass
  if (groupIndex == 0)
  {
   // p_Semaphors[0] += 1.0f;
      p_PartionBounds[1].c_MinCoord.y += 1.0f;
      p_PartionBounds[2].c_MinCoord.y += 1.0f;
      p_PartionBounds[3].c_MinCoord.y += 1.0f;
      p_PartionBounds[0].c_MinCoord.y += 1.0f;
  }
}

If I run this on CPUs only, it works correctly:

p_PartionBounds[1].c_MinCoord.y holds the number of threads run.


If I run it on a GPU (Radeon HD 5870):

p_PartionBounds[1].c_MinCoord.y holds 1;

all the other y values are correct.


If I uncomment the line

//  p_Semaphors[0] += 1.0f;

p_PartionBounds[1].c_MinCoord.y has the correct value


Am I doing something obviously wrong?


PS: the semaphore code seems to be OK.

10 Replies

A follow-up:

I have now run this code on an NVIDIA card and it executes as expected. I am starting to think this is either a bug in the compiler or in the 5870.


You're leaving me a little short of information. Is there one work item per group here? You're only adding 1 each time rather than the size of the group. Only the first work item in the group should enter that block, I think, so you only expect one work item in each group to hit the critical section - so you don't have any races within a work group. If your groups only had one work item that would make perfect sense anyway.

Other than that assuming your semaphore code is right it should be ok... without seeing your semaphore code that's hard to say, of course. How do you know it seems to be ok? What kind of test have you done?

The other thing I can think of straight off is that the compiler or hardware might be reordering reads and writes. Because they don't know your semaphore calls define a critical section, they don't know they can't hoist code out of that block: the += 1 is not dependent on the semaphore reads and writes, and there is no control-flow dependency there, so there is no obvious ordering constraint. Try adding global fences before and after both semaphore calls (or put them inside the semaphore functions). Maybe that will help.

The code is just a cut down version of a reduction program that was not working.

I expect the count to be the number of work-groups, which it is for all but the first entry; that one is only incremented once, as if only one work-group ran. I have tried 1 to 6000+ work-groups, each with 8x8 work items. The first count is always 1.

I guessed it might be a reordering problem. I tried a fence after the GetSemaphor call and also moved the semaphore code inline.

Is there a way to disable optimisation?

As I said, the code runs fine on an NVIDIA Quadro 5000, but the Quadro is very touchy about errors in my OpenCL and often locks the machine. The 5870 is a lot more forgiving, so I tend to debug on it first.

Long term the code I am writing has to run on both AMD and nVidia which is why I am using OpenCL.

Thanks for your thoughts.

void GetSemaphor(__global int * semaphor)
{
  int occupied = atom_cmpxchg(semaphor, 0, 1);
  while (occupied > 0)
  {
    occupied = atom_cmpxchg(semaphor, 0, 1);
  }
  return;
}

void ReleaseSemaphor(__global int * semaphor)
{
  int prevVal = atom_xchg(semaphor, 0);
}


Micah may be able to give an update, but as far as I know there's still no significant control over optimisation. The problem is that you have multiple compilers, all doing heavy-duty optimisations.

Does the generated ISA code look like it's doing what you expect?


How do I look at the ISA code?


Use the Stream KernelAnalyzer that should be installed with the APP SDK.


There is nothing that I can see that would stop the compiler from moving your increments/stores outside of the critical section. There is no dependence at all between those two code blocks. Try inserting memory fences to see if that forces the ordering correctly.

Hi Micah,

I have tried the attached code and array items 0 and 3 both have a count of ; items 1 and 2 have 6300.

I tried using the analyser but it gave me a compile error:

"OpenCL Compile Error: Can't find the IL for Cypress."

I tried googling this but can't find anything.


__kernel void ClearPartitionBounds(
  __global __write_only PartitionBoundsC* p_PartionBounds )
{
  uint groupIndex = get_global_id(0);

  float maxFloat = 0.0f; // as_float(0x7F7FFFFF); // Float max

  p_PartionBounds[groupIndex].c_MinCoord = (float4)(maxFloat, maxFloat, maxFloat, 0);
  p_PartionBounds[groupIndex].c_MaxCoord = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
}

__kernel void ReduceBoundsFromGBuffer(
  __global int   *p_Semaphors,
  __global PartitionBoundsC* p_PartionBounds )
{
  uint2 groupThreadId;      // SV_GroupThreadID.xy
  uint  groupIndex;         // SV_GroupIndex

  groupThreadId.x = get_local_id(0);
  groupThreadId.y = get_local_id(1);

  groupIndex = get_local_size(0)*groupThreadId.y + groupThreadId.x;

  // Now write out the result from this pass
  if (groupIndex == 0)
  {
    int occupied = atom_cmpxchg(&p_Semaphors[2], 0, 1);
    while (occupied > 0)
    {
      occupied = atom_cmpxchg(&p_Semaphors[2], 0, 1);
    }

    mem_fence(CLK_GLOBAL_MEM_FENCE);
    float temp = p_PartionBounds[3].c_MinCoord.x;
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    temp = temp + 1;
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    p_PartionBounds[3].c_MinCoord.x = temp;
    p_PartionBounds[0].c_MinCoord.x = temp;
    p_PartionBounds[1].c_MinCoord.x += 1.0f;
    p_PartionBounds[2].c_MinCoord.x += 1.0f;

    int prevVal = atom_xchg(&p_Semaphors[2], 0);
  }
}

{
  cl_int status;
  C3D_OpenCL_Device* device = &c_OpenCL_Device;

  unsigned int threads = 16;
  size_t localMemBuffer = 0;
  do
  {
    threads /= 2;
    localMemBuffer = threads * threads * sizeof(PartitionBoundsC) * PARTITIONS;
  } while (localMemBuffer*2 > device->c_LocalMemSize);
  assert(localMemBuffer*2 < device->c_LocalMemSize);

  // setup parameter values
  int arg = 0;
  clSetKernelArg(c_kernel_ReduceBoundsFromGBuffer, arg++, sizeof(cl_mem), (void *)&c_mem_Semaphor);
  clSetKernelArg(c_kernel_ReduceBoundsFromGBuffer, arg++, sizeof(cl_mem), (void *)&c_mem_PartitionBounds);

  int dispatchWidth  = (800 + threads - 1) / threads;
  int dispatchHeight = (500 + threads - 1) / threads;

  // execute kernel
  size_t globalWorkSize[2] = {dispatchWidth*threads, dispatchHeight*threads};
  size_t localWorkSize[2]  = {threads, threads};
  status = clEnqueueNDRangeKernel(c_OpenCL_Device.c_CommandQueue, c_kernel_ReduceBoundsFromGBuffer,
                                  2, 0, globalWorkSize, localWorkSize, 0, 0, 0);
  if (CL_CheckError(status)) { return false; }

#ifdef _DEBUG
  // copy results from device back to host
  status = clEnqueueReadBuffer(c_OpenCL_Device.c_CommandQueue, c_mem_PartitionBounds, CL_TRUE, 0,
                               c_Partitions * sizeof(PartitionBoundsC), c_PartitionBound, 0, 0, 0);
#endif

  clFlush(c_OpenCL_Device.c_CommandQueue);
  return true;
}



Refer to for the SKA issue.


Thanks Himanshu,

I have tried that and I now have the p-code. It did not mean much to me 😞 but the code works if I put a dummy statement after the lock.

I don't know why the mem_fence calls don't work.