8 Replies Latest reply on Jun 20, 2011 6:29 PM by lbin

    AMD APP Profiler changes application behaviour

    hduregger
      Getting different results with Profiler running

      Hello,

      I'll try to mention information that could be relevant.

      My application uses multiple OpenCL kernels to process data that is also shared with OpenGL. These kernels are invoked one after another. For one of the last passes over the data there are 2 different algorithms

      1. Updates the whole buffer several times (data fits into full work groups)
      2. Updates the buffer selectively, skipping work by returning early in some work items (the intention is to safe time by whole work groups skipping work)

      The kernel for these two algorithms is repeatedly invoked on the data.

      For example and for both algorithms the work size is 256x256 and the work group size is 16x16. I can dynamically switch between both algorithms by changing a kernel argument (both algorithms are implemented in the same kernel). The kernels are being profiled with OpenCL queue events inside the application and the profiling results seem plausible.

      But when I run the application with the AMD APP Profiler the 2nd algorithm behaves unpredictable and seems to give random results.

      Unfortunately due to the size of the application and required dependencies I can not provide a test case at the moment.

      Please consider the following questions

      Did somebody else witness this problem?
      Could there be some race condition inside the application that is exposed by the use of the Profiler?
      Could it be the differing work load (skipping idle work groups), or does someone use similar kernels and can't see this problem?

      Radeon 6950, Ubuntu 10.10, Catalyst 11.5, APP SDK 2.4, APP Profiler 2.2

      Maybe someone with knowledge of the internals of the Profiler can tell if a race condition could be exposed by the Profiler. I believe the algorithm is correct, but you never know.

      Regards,

      Helmut

        • AMD APP Profiler changes application behaviour
          lbin

          Hi Helmut
          Thanks for reporting this issue. Which mode did you use? Collecting performance counter mode or API trace? Can you please perform an API trace on your application and send the .atp file to gputools.support@amd.com.

            • AMD APP Profiler changes application behaviour
              hduregger

              Hey Ibin,

              I've sent the .atp.

              The problem does not appear when running the trace, only when retrieving
              the performance counters.

              Btw. everything is 64bit.

                • AMD APP Profiler changes application behaviour
                  lbin

                  We looked at your atp file, you created 2 read only buffers, can you make sure you are not also writing to the buffers in the kernel.

                  The above problem may cause the profiler in the collecting performance counter mode to change the program's behavior.

                    • AMD APP Profiler changes application behaviour
                      hduregger

                      By double checking buffer usage and even trying CL_MEM_READ_WRITE on them, I could rule out problems with the two CL_MEM_READ_ONLY buffers.

                      Next I made sure all buffers shared with OpenGL are acquired before being set as kernel arguments (unsure if this is required, when the buffer state does not change, only its contents) or used.

                      Finally found out that, regardless of which of the two algorithms are used, kernel 'kernelSolveLarge' is executed 3 times for each kernel invocation while APP Profiler is running in collecting performance counter mode. This was checked by counting up a number inside the buffer each time the kernel is run, and reading the number back in the client.

                      The code below lists the invocation of the kernel. This loop is called each frame. As you can see the two buffers are swapped each time (double-buffered).



                          // ... acquire all involved OpenGL buffers ...

                              while (true)
                              {
                                  if (isRequiringGroupStates)
                                  {
                                      groupStatesReadBuffer  = groupStates[groupStatesReadIndex]->getBufferShared();
                                      groupStatesWriteBuffer = groupStates[groupStatesWriteIndex]->getBufferShared();
                                  }

                                  cl_int updateMethod = static_cast (updateState);

                                  CL_ERROR( clSetKernelArg(kernelSolveLarge, 7, sizeof(cl_mem), &groupStatesReadBuffer ) );
                                  CL_ERROR( clSetKernelArg(kernelSolveLarge, 8, sizeof(cl_mem), &groupStatesWriteBuffer ) );
                                  CL_ERROR( clSetKernelArg(kernelSolveLarge, 9, sizeof(cl_int), &updateMethod ) );

                                  const size_t globalWorkSize[2] = { bufferSideLength,    bufferSideLength    };
                                  const size_t localWorkSize[2]  = { workGroupSideLength, workGroupSideLength };

                                  // Solve large
                                  CL_PROFILE(kernelSolveLargeProfilingId, isProfiling, queue,
                                        CL_ERROR( clEnqueueNDRangeKernel(queue, kernelSolveLarge, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, &event) )
                                  );

                                  // Swap group state buffers

                                  if (isRequiringGroupStates)
                                  {
                                      groupStatesReadIndex  = 1 - groupStatesReadIndex;
                                      groupStatesWriteIndex = 1 - groupStatesWriteIndex;
                                  }

                                  // ... abort loop code ...
                              }

                          // ... release all involved OpenGL buffers ...

                       

                      Might be a coincidence, but when I log something to the console (std::cout) outside of the loop, it prints 3 times then hangs for a moment then prints again 3 times. Maybe this number 3 is related with the above 3.
                      Also, I am using Qt and there is some weird thing going on with their per-frame updates, sometimes calling updates twice per frame, maybe this interferes in some way.

                      Hopefully this can shed some light upon this problem.

                      EDIT: I also verified that the loop abort conditions are correct, by printing a loop counter inside the while loop with each step (I can run the program in single step mode, where the loop is exited after a single step, and this only printed one line per step). So it seems to be a server side problem.

                        • AMD APP Profiler changes application behaviour
                          lbin

                          Can you please send us the cl kernel code to gputools.support@amd.com.

                          Thank you

                          • AMD APP Profiler changes application behaviour
                            lbin

                            Hi hduregger

                            We've found an issue in the profiler with regard to CL/GL interop. Can you please confirm that you've created a cl_mem object from clCreateFromGLBuffer and used it as both input and output(read from and write to) in any of your kernels. If possible, can you please send your executable to gputools.support@amd.com so that we can confirm this problem is resolved.
                            Thanks

                              • AMD APP Profiler changes application behaviour
                                hduregger

                                 

                                Originally posted by: lbin Hi hduregger Can you please confirm that you've created a cl_mem object from clCreateFromGLBuffer and used it as both input and output(read from and write to) in any of your kernels.


                                Yes that is right.

                                 

                                You probably already have a test case, still I've sent you
                                simple test case that exhibits the erratic behaviour.

                                It creates an OpenGL texture buffer for 1 float, then initializes
                                the float to 0.0f. Each frame an OpenCL kernel increments the float in the buffer by 1.0. Then the float is read back and printed to the console. More details in the README.

                                My application does more or less the same, just in a bigger context.
                                So I think if this case passes then my application should be also fine.

                                 

                                EDIT: NOTE: Please ignore the GLUT window content.