Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Journeyman III

AMD APP Profiler changes application behaviour

Getting different results with Profiler running


I'll try to mention information that could be relevant.

My application uses multiple OpenCL kernels to process data that is also shared with OpenGL. These kernels are invoked one after another. For one of the last passes over the data there are 2 different algorithms

  1. Updates the whole buffer several times (data fits into full work groups)
  2. Updates the buffer selectively, skipping work by returning early in some work items (the intention is to safe time by whole work groups skipping work)

The kernel for these two algorithms is repeatedly invoked on the data.

For example and for both algorithms the work size is 256x256 and the work group size is 16x16. I can dynamically switch between both algorithms by changing a kernel argument (both algorithms are implemented in the same kernel). The kernels are being profiled with OpenCL queue events inside the application and the profiling results seem plausible.

But when I run the application with the AMD APP Profiler the 2nd algorithm behaves unpredictable and seems to give random results.

Unfortunately due to the size of the application and required dependencies I can not provide a test case at the moment.

Please consider the following questions

Did somebody else witness this problem?
Could there be some race condition inside the application that is exposed by the use of the Profiler?
Could it be the differing work load (skipping idle work groups), or does someone use similar kernels and can't see this problem?

Radeon 6950, Ubuntu 10.10, Catalyst 11.5, APP SDK 2.4, APP Profiler 2.2

Maybe someone with knowledge of the internals of the Profiler can tell if a race condition could be exposed by the Profiler. I believe the algorithm is correct, but you never know.



8 Replies

Hi Helmut
Thanks for reporting this issue. Which mode did you use? Collecting performance counter mode or API trace? Can you please perform an API trace on your application and send the .atp file to


Hey Ibin,

I've sent the .atp.

The problem does not appear when running the trace, only when retrieving
the performance counters.

Btw. everything is 64bit.


We looked at your atp file, you created 2 read only buffers, can you make sure you are not also writing to the buffers in the kernel.

The above problem may cause the profiler in the collecting performance counter mode to change the program's behavior.


By double checking buffer usage and even trying CL_MEM_READ_WRITE on them, I could rule out problems with the two CL_MEM_READ_ONLY buffers.

Next I made sure all buffers shared with OpenGL are acquired before being set as kernel arguments (unsure if this is required, when the buffer state does not change, only its contents) or used.

Finally found out that, regardless of which of the two algorithms are used, kernel 'kernelSolveLarge' is executed 3 times for each kernel invocation while APP Profiler is running in collecting performance counter mode. This was checked by counting up a number inside the buffer each time the kernel is run, and reading the number back in the client.

The code below lists the invocation of the kernel. This loop is called each frame. As you can see the two buffers are swapped each time (double-buffered).

    // ... acquire all involved OpenGL buffers ...

        while (true)
            if (isRequiringGroupStates)
                groupStatesReadBuffer  = groupStates[groupStatesReadIndex]->getBufferShared();
                groupStatesWriteBuffer = groupStates[groupStatesWriteIndex]->getBufferShared();

            cl_int updateMethod = static_cast (updateState);

            CL_ERROR( clSetKernelArg(kernelSolveLarge, 7, sizeof(cl_mem), &groupStatesReadBuffer ) );
            CL_ERROR( clSetKernelArg(kernelSolveLarge, 8, sizeof(cl_mem), &groupStatesWriteBuffer ) );
            CL_ERROR( clSetKernelArg(kernelSolveLarge, 9, sizeof(cl_int), &updateMethod ) );

            const size_t globalWorkSize[2] = { bufferSideLength,    bufferSideLength    };
            const size_t localWorkSize[2]  = { workGroupSideLength, workGroupSideLength };

            // Solve large
            CL_PROFILE(kernelSolveLargeProfilingId, isProfiling, queue,
                  CL_ERROR( clEnqueueNDRangeKernel(queue, kernelSolveLarge, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, &event) )

            // Swap group state buffers

            if (isRequiringGroupStates)
                groupStatesReadIndex  = 1 - groupStatesReadIndex;
                groupStatesWriteIndex = 1 - groupStatesWriteIndex;

            // ... abort loop code ...

    // ... release all involved OpenGL buffers ...


Might be a coincidence, but when I log something to the console (std::cout) outside of the loop, it prints 3 times then hangs for a moment then prints again 3 times. Maybe this number 3 is related with the above 3.
Also, I am using Qt and there is some weird thing going on with their per-frame updates, sometimes calling updates twice per frame, maybe this interferes in some way.

Hopefully this can shed some light upon this problem.

EDIT: I also verified that the loop abort conditions are correct, by printing a loop counter inside the while loop with each step (I can run the program in single step mode, where the loop is exited after a single step, and this only printed one line per step). So it seems to be a server side problem.


Can you please send us the cl kernel code to

Thank you


Hi hduregger

We've found an issue in the profiler with regard to CL/GL interop. Can you please confirm that you've created a cl_mem object from clCreateFromGLBuffer and used it as both input and output(read from and write to) in any of your kernels. If possible, can you please send your executable to so that we can confirm this problem is resolved.


Originally posted by: lbin Hi hduregger Can you please confirm that you've created a cl_mem object from clCreateFromGLBuffer and used it as both input and output(read from and write to) in any of your kernels.

Yes that is right.


You probably already have a test case, still I've sent you
simple test case that exhibits the erratic behaviour.

It creates an OpenGL texture buffer for 1 float, then initializes
the float to 0.0f. Each frame an OpenCL kernel increments the float in the buffer by 1.0. Then the float is read back and printed to the console. More details in the README.

My application does more or less the same, just in a bigger context.
So I think if this case passes then my application should be also fine.


EDIT: NOTE: Please ignore the GLUT window content.


Hi hduregger

We've confirmed that we have fixed the issue you reported. Profiling the test app you sent to us generates expected results. The fix will be included in the coming release. Thank for providing feedbacks to help us build a better tool.