10 Replies Latest reply on Jul 21, 2011 12:30 PM by tonyo_au

    Help with locks and global memory

    tonyo_au
      Global does not update correctly

      Hi All, I am looking for some help with this code

       

      __kernel void ReduceBoundsFromGBuffer(
        __global int   *p_Semaphors,
        __global PartitionBoundsC* p_PartionBounds
        )
      {
        uint2 groupId;            // SV_GroupID.xy
        uint2 groupThreadId;      // SV_GroupThreadID.xy
        uint  groupIndex ;        // SV_GroupIndex

        groupId.x = get_group_id(0);
        groupId.y = get_group_id(1);

        groupThreadId.x = get_local_id(0);
        groupThreadId.y = get_local_id(1);

        groupIndex = get_local_size(0)*groupThreadId.y+groupThreadId.x;

        // Now write out the result from this pass
        if (groupIndex == 0)
        {
             GetSemaphor(&p_Semaphors[2]);
         {
           //  p_Semaphors[0] += 1.0f;
            p_PartionBounds[1].c_MinCoord.y += 1.0;
            p_PartionBounds[2].c_MinCoord.y  += 1.0f;
            p_PartionBounds[3].c_MinCoord.y  += 1.0f;
            p_PartionBounds[0].c_MinCoord.y  += 1.0f;
          }
           ReleaseSemaphor(&p_Semaphors[2]);
        }
      }

      If I run this on cpu's only it works correctly

      p_PartionBounds[1].c_MinCoord.y has number of threads run.

       

      If I run it on a GPU (5870)

      p_PartionBounds[1].c_MinCoord.y has 1,

      all other y values are correct

       

      if I uncomment the line 

      //  p_Semaphors[0] += 1.0f;

      p_PartionBounds[1].c_MinCoord.y has the correct value

       

      Am I doing something obviouly wrong?

       

      PS the semaphore code seems to be ok

        • Help with locks and global memory
          tonyo_au

          A followup.

          I have now run this code on a nVidia card and it executes as expected. I am starting to think this is either an error in the compiler or the 5780

            • Help with locks and global memory
              LeeHowes

              You're leaving me a little short of information. Is there one work item per group here? You're only adding 1 each time rather than the size of the group. Only the first work item in the group should enter that block, I think, so you only expect one work item in each group to hit the critical section - so you don't have any races within a work group. If your groups only had one work item that would make perfect sense anyway.

              Other than that assuming your semaphore code is right it should be ok... without seeing your semaphore code that's hard to say, of course. How do you know it seems to be ok? What kind of test have you done?

              The other thing I can think of straight off is that the compiler or hardware might be reordering reads and writes a little because they don't know your semaphore calls define a critical section it doesn't know it can't hoist code out of that block (and the += 1 is not dependent on the semaphore reads and writes and there is no control flow block there so there is no obvious dependency)? Try adding global fences before and after both semaphore calls (or put them inside the semaphore function). Maybe that will help.

                • Help with locks and global memory
                  tonyo_au

                  The code is just a cut down version of a reduction program that was not working.

                  I expect the count to be the number of workgroups; which it is for all but the first entry, which is only incrmented once - as if only workgroup is run. I have tried 1 - 6000+ work groups each witjh 8x8 work items. The first count is always 1.

                  I guessed it might be a re-order problem. I tried a fence after the GetSemaphor code and also moved the semaphore code inline.

                  Is there a way to disable optimising?

                  As I said, the code runs fine on a nVidia Quadro 5000; but the Quadro is very touchy about errors in my OpenCL and locks the machine often. The 5780 is a lot move forgiving so I tend to debug on it first.

                  Long term the code I am writing has to run on both AMD and nVidia which is why I am using OpenCL.

                  Thanks for you thoughts.

                  void GetSemaphor(__global int * semaphor) { int occupied = atom_cmpxchg(semaphor,0,1); while(occupied > 0) { occupied = atom_cmpxchg(semaphor,0,1); } return; } void ReleaseSemaphor(__global int * semaphor) { int prevVal = atom_xchg(semaphor, 0); }

              • Help with locks and global memory
                MicahVillmow
                tonyo_au,
                There is nothing that I see would stop the compiler from optimizing the code to move your increments/stores outside of the critical section. There is no dependence at all between these two code blocks. Try inserting memory fences to see if that forces ordering correctly.
                  • Help with locks and global memory
                    tonyo_au

                    Hi Micah,

                    I have tried the attached code and array items 0 and 3 both have  a count of; items 1  and 2 have 6300.

                    I tried using the analyser but it gave me a compile error

                    "OpenCL Compile Error: Can't find the IL for Cypress."

                    I tried googling this but can't find anthing

                    Tony

                    __kernel void ClearPartitionBounds( __global __write_only PartitionBoundsC* p_PartionBounds ) { uint groupIndex = get_global_id(0); float maxFloat = 0.0f; // as_float(0x7F7FFFFF); // Float max p_PartionBounds[groupIndex].c_MinCoord = (float4)(maxFloat,maxFloat,maxFloat,0); p_PartionBounds[groupIndex].c_MaxCoord = (float4)(0.0f,0.0f,0.f,0.0f); } __kernel void ReduceBoundsFromGBuffer( __global int *p_Semaphors, __global PartitionBoundsC* p_PartionBounds ) { uint2 groupThreadId; // SV_GroupThreadID.xy uint groupIndex ; // SV_GroupIndex groupThreadId.x = get_local_id(0); groupThreadId.y = get_local_id(1); groupIndex = get_local_size(0)*groupThreadId.y+groupThreadId.x; // Now write out the result from this pass if (groupIndex == 0) { int occupied = atom_cmpxchg(&p_Semaphors[2],0,1); while(occupied > 0) { occupied = atom_cmpxchg(&p_Semaphors[2],0,1); } mem_fence(CLK_GLOBAL_MEM_FENCE); float temp = p_PartionBounds[3].c_MinCoord.x; mem_fence(CLK_GLOBAL_MEM_FENCE); temp = temp + 1; mem_fence(CLK_GLOBAL_MEM_FENCE); p_PartionBounds[3].c_MinCoord.x = temp; p_PartionBounds[0].c_MinCoord.x = temp; p_PartionBounds[1].c_MinCoord.x += 1.0f; p_PartionBounds[2].c_MinCoord.x += 1.0f; int prevVal = atom_xchg(&p_Semaphors[2], 0); } }{ cl_int status; C3D_OpenCL_Device* device = &c_OpenCL_Device; unsigned int threads = 16; size_t localMemBuffer = 0; do { threads /= 2; localMemBuffer = threads * threads * sizeof(PartitionBoundsC) * PARTITIONS; } while (localMemBuffer*2 > device->c_LocalMemSize); assert(localMemBuffer*2 < device->c_LocalMemSize); // setup parameter values int arg = 0; clSetKernelArg(c_kernel_ReduceBoundsFromGBuffer, arg++, sizeof(cl_mem), (void *)&c_mem_Semaphor); clSetKernelArg(c_kernel_ReduceBoundsFromGBuffer, arg++, sizeof(cl_mem), (void *)&c_mem_PartitionBounds); int dispatchWidth = (800 + threads - 1) / threads; int dispatchHeight = (500 + threads - 1) / threads; // execute kernel size_t globalWorkSize[2] = {dispatchWidth*threads, dispatchHeight*threads}; size_t localWorkSize[2] = {threads, threads}; status = clEnqueueNDRangeKernel(c_OpenCL_Device.c_CommandQueue,c_kernel_ReduceBoundsFromGBuffer, 2,0, globalWorkSize,localWorkSize, 0,0,0); if (CL_CheckError(status)) { return false; } #ifdef _DEBUG // copy results from device back to host status = clEnqueueReadBuffer(c_OpenCL_Device.c_CommandQueue,c_mem_PartitionBounds,CL_TRUE,0, c_Partitions * sizeof(PartitionBoundsC), c_PartitionBound, 0,0,0); #endif clFlush(c_OpenCL_Device.c_CommandQueue); return true; }