Journeyman III

Re: GPU kernel has low performance if CPU has worked _before_ launch

I see why you find it hard to believe. It makes little sense to me as well.

I was finishing up example code that shows the issue in its simplest form, and it turns out that the problem morphs as I keep peeling layers off it.

What I have now is a very basic sample showing that the actual kernel time does not change based on allocations, but the launch time does. This happens only on AMD, and only when running from inside VS2012 or from the game engine Unity. Run as a standalone program, it has no issues.

The actual project, however, before stripping off all the layers, shows the same behavior (a slow kernel run when memory is allocated for CPU calculations elsewhere) whether it runs as a standalone exe, from VS2012, or from Unity. The problem is that I cannot share that entire codebase. 😕

I can share both kernel and launch code in a stripped down version, if you or anyone else wants to take a look. It seems I cannot format text as code in here..???

Thanks for your time thus far. It does seem like there is an issue here, but it is very hard to reproduce in a form that I can share. That is a problem 😕


Prototype code.

public class GravityCalculatorGPU
{
    public GravityCalculatorGPU(int platformIndex, int deviceIndex)
    {
        // Placeholder: KernelSource stands for the OpenCL kernel listed below, loaded from a resource.
        string clProgramSource = KernelSource;

        ComputePlatform platform = ComputePlatform.Platforms[platformIndex];
        IList<ComputeDevice> devices = new List<ComputeDevice>();
        device = platform.Devices[deviceIndex];
        devices.Add(device); // the context needs the selected device in its device list

        ComputeContextPropertyList properties = new ComputeContextPropertyList(platform);
        context = new ComputeContext(devices, properties, null, IntPtr.Zero);
        ComputeProgram program = new ComputeProgram(context, clProgramSource);

        try
        {
            program.Build(null, null, null, IntPtr.Zero);
        }
        catch (Exception)
        {
            throw; // rethrow the build exception
        }

        bufferPosition = new ComputeBuffer<Vector3>(context, ComputeMemoryFlags.ReadOnly, 1000000);
        bufferMass = new ComputeBuffer<double>(context, ComputeMemoryFlags.ReadOnly, 1000000);
        bufferAcceleration = new ComputeBuffer<Vector3>(context, ComputeMemoryFlags.WriteOnly, 1000000);

        // Create the kernel function and the profiling-enabled command queue.
        kernel = program.CreateKernel("gravity");
        commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.Profiling);//.None);
        eventList = new ComputeEventList();
    }

    struct Vector3
    {
        double x, y, z, q; //placeholder for actual vector structure
    }

    private ComputeContext context;
    private ComputeBuffer<Vector3> bufferPosition;
    private ComputeBuffer<double> bufferMass;
    private ComputeBuffer<Vector3> bufferAcceleration;
    private ComputeKernel kernel;
    private ComputeCommandQueue commands;
    private ComputeEventList eventList;
    private ComputeDevice device;
    private System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();

    public int Acceleration()
    {
        int workGroupSize = 64;
        int globalSize = 4096*2*2;

        kernel.SetMemoryArgument(0, bufferPosition);
        kernel.SetMemoryArgument(1, bufferMass);
        kernel.SetMemoryArgument(2, bufferAcceleration);
        // Argument 3 (int firstParticle in the kernel) is not set in this excerpt.
        kernel.SetValueArgument(4, 6.67E-11);
        kernel.SetValueArgument(5, (double)1000);
        kernel.SetLocalArgument(6, sizeof(double) * 4 * workGroupSize);

        //commands.WriteToBuffer<Vector3>(system.Position, bufferPosition, true, 0, 0, system.ObjectCount, null);
        //commands.WriteToBuffer<double>(system.Mass, bufferMass, true, 0, 0, system.ObjectCount, null);

        ComputeEventList events = new ComputeEventList();
        commands.Execute(kernel, null, new long[] { globalSize }, new long[] { workGroupSize }, events);
        commands.Finish(); // wait for completion so the profiling timestamps below are valid

        //commands.ReadFromBuffer<Vector3>(bufferAcceleration, ref acceleration, true, 0, 0, system.ObjectCount, null);

        // Profiling timestamps are in nanoseconds.
        long que = events[0].EnqueueTime;
        long submit = events[0].SubmitTime - events[0].EnqueueTime;
        long start = events[0].StartTime - events[0].SubmitTime;
        long done = events[0].FinishTime - events[0].StartTime;

        Console.WriteLine(device.Name.PadRight(10).Substring(0, 10).PadRight(10) + " Sub {0} Lau {1} Run {2} All {3}", (int)(submit / 1E6), (int)(start / 1E6), (int)(done / 1E6), sw.ElapsedMilliseconds);

        return (int)(done/1000000);
    }
}



#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void gravity(__global double4* position, __global double* mass, __global double4* acceleration, int firstParticle, double G, double epsilonSquared, __local double4* pblock)
{
    int globalIndex = get_global_id(0);
    int localIndex  = get_local_id(0);
    int totalSize   = get_global_size(0);
    int localSize   = get_local_size(0);
    int numBlocks   = totalSize/localSize;

    double4 p_i = position[globalIndex];
    double4 a = (double4)(0,0,0,0);

    int numBlocksWithMass = ceil((float)firstParticle/localSize);

    for(int jb=0; jb < numBlocksWithMass; jb++) //for each block containing mass, find its contribution to a
    {
        pblock[localIndex]   = position[jb*localSize+localIndex];
        pblock[localIndex].w = mass[jb*localSize+localIndex];

        barrier(CLK_LOCAL_MEM_FENCE); //wait until every work item has fetched its part

        for(int j=0; j<localSize; j++) //for each fetched object, calculate
        {
            double4 p_j = pblock[j];
            double4 d = p_j - p_i;
            double invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z+epsilonSquared);
            double f = p_j.w*invr*invr*invr;
            a += G*f*d;
        }

        barrier(CLK_LOCAL_MEM_FENCE); //wait for all work items to finish calculating before bringing in a new batch of neighbors
    }

    acceleration[globalIndex] = a;
}
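
A side note on the block count: the kernel derives numBlocksWithMass with ceil((float)firstParticle/localSize). A float carries a 24-bit significand, so particle counts above 2^24 may not convert exactly. A minimal sketch of the same rounding-up done purely in integers, written in plain C (the expression would carry over to OpenCL C unchanged); tiles_needed is a hypothetical name used only for illustration:

```c
#include <assert.h>

/* Number of tiles of `tile_size` elements needed to cover `count` elements,
   rounded up, using integer arithmetic only. */
int tiles_needed(int count, int tile_size)
{
    assert(count >= 0 && tile_size > 0);
    /* Quotient plus a remainder check, so that nothing is added to `count`
       and the expression cannot overflow for large counts. */
    return count / tile_size + (count % tile_size != 0);
}
```

For the 64-wide work groups used here, tiles_needed(1000, 64) gives 16. Note that when the count is not a multiple of the tile size, the last tile still reads a full tile of positions and masses, so the buffers have to be at least that large.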



Re: GPU kernel has low performance if CPU has worked _before_ launch

Hello Thomas,

1. Are you running the kernel on the GPU that is used for display?

2. When you said "NVIDIA's platform did not have this problem" -- Was NVIDIA GPU the display GPU? Was it in a different machine? Was it also run from Visual studio (or) was it run independently?

My guess is that either Visual Studio or the game engine Unity (no idea what that is, though) is trying to use the GPU while the kernel is running. Maybe this is slowing you down.

But this does not explain how it is related to the CPU load... but then, you have said the problem is morphing...

Can you please answer these questions? It will help us understand the setup.


Re: GPU kernel has low performance if CPU has worked _before_ launch

Unity3D engine

Journeyman III

Re: GPU kernel has low performance if CPU has worked _before_ launch

I will reply to my own post with a bit more detail, plus a program and source showing the problem. This is the Cloo (.NET wrapper for OpenCL) demo application that simply adds two vectors together. I made slight modifications to the vector addition demo.

The zip contains source for both cloo and the demo. Inside the clooutils/bin/release there is an exe showing the issue, in one form.

Select the device to run on and select vector addition as the demo to run.

What I see is that in this demo, the kernel launch time for my AMD Radeon 7990 HD GHz edition is always very large (4.7ms)  while for my NVidia GTX 580 it is low (0.2ms).

It seems that regardless of the work the kernel does, the launch time is 4-5 ms.

The other thing I noticed is that when I time this kernel like this:


start timer;

commands.Execute(kernel, null, new long[] { count }, new long[] { 64 }, eventList);            


stop timer;

print elapsed time

I get an elapsed time much greater than the time the event reports from queue submission to kernel completion.

For NVIDIA, I may get a result like the following:


0,01936 0,297312 1,78352 2,100192

meaning the total measured time was 114 ms

Time from enqueue to submission to the device is 0,019 ms

Time from submission until the kernel starts running on the device is 0,297 ms

Time to run the kernel is 1,78 ms

Time from enqueue until the kernel completes is 2,10 ms
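
The event figures above follow from the four profiling timestamps OpenCL attaches to a command (queued, submit, start, end), which are reported in nanoseconds. As a minimal sketch in plain C with no OpenCL calls, the split could be computed as below; EventTimes, StageTimes and stage_times are hypothetical names made up for illustration, not part of Cloo or OpenCL:

```c
#include <assert.h>
#include <stdint.h>

/* The four profiling timestamps reported for one command, in nanoseconds. */
typedef struct {
    uint64_t queued;
    uint64_t submit;
    uint64_t start;
    uint64_t end;
} EventTimes;

/* Per-stage durations in milliseconds, in the order of the printed columns. */
typedef struct {
    double queue_ms;   /* queued -> submit: before the command reaches the device */
    double launch_ms;  /* submit -> start:  before the kernel starts running */
    double run_ms;     /* start  -> end:    kernel execution */
    double total_ms;   /* queued -> end */
} StageTimes;

static double ns_to_ms(uint64_t ns) { return (double)ns / 1e6; }

StageTimes stage_times(EventTimes t)
{
    /* The timestamps must be ordered, otherwise the unsigned subtraction wraps. */
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);
    StageTimes s;
    s.queue_ms  = ns_to_ms(t.submit - t.queued);
    s.launch_ms = ns_to_ms(t.start - t.submit);
    s.run_ms    = ns_to_ms(t.end - t.start);
    s.total_ms  = ns_to_ms(t.end - t.queued);
    return s;
}
```

With the NVIDIA figures quoted above (19360 ns, 297312 ns and 1783520 ns between consecutive timestamps), total_ms comes out at about 2.100192, which matches the fourth column. That column is simply the sum of the other three, while the 114 ms wall-clock figure is measured outside these timestamps.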

When using AMD I get


0,016988 4,749113 1,595111 6,361212

Again the time I measure is much greater than the time the kernel event reports, even though I call clFinish both before and after the kernel and the timing. Additionally, the launch time is 4.74 ms, compared with NVIDIA's 0.29 ms.

This problem is not exactly what I see in my actual code, where other issues are likely clouding the picture. It is, however, publicly available code which shows OpenCL timing being odd in general and launch time being very odd for AMD.

One additional thing I noticed is that if I change the buffer declarations, I can make the problem even worse.

Currently the buffers are declared as

ComputeBuffer<double> a = new ComputeBuffer<double>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrA);

If I change that to

ComputeBuffer<double> a = new ComputeBuffer<double>(context, ComputeMemoryFlags.ReadOnly, count);

The timing for nvidia is


0,018016 0,365472 32,480352 32,86384

and for amd


0,011928 19,886672 25,796296 45,694896

Now the time I measure is actually identical to the time the kernel event reports, but the launch time and running time are much greater.

I am really confused by this, so I hope someone can at least confirm that it is not some issue on my system. Additionally, it would be nice to understand why this happens.

Thanks in advance for any help I can get 🙂

Journeyman III

Re: GPU kernel has low performance if CPU has worked _before_ launch

I have made a top thread post with more information and a sample.

Both the NVIDIA and the AMD card are driving a display. Same size as well.

The tests were run identically on the different GPUs. Each drives one display, and both are in the same system, run from the same exe.

The trouble with explaining this is that as I was trying to create an example of the issue, it kept changing slightly as I peeled layers off. In the end I had a sample with the same issue of AMD being slow, but in a different way, with slow kernel launches.

In the end I modified the Cloo demo project in a way that shows quite clearly that AMD has a huge overhead for kernel launch, and that the kernel event info is not right.

I hope that can clear it up a bit, so we can avoid talking about my code and just look at the problem... or perhaps this is not a problem on other systems... which is also worth finding out.

Thanks for the interest you have taken in this issue 🙂


Re: GPU kernel has low performance if CPU has worked _before_ launch


If you are housing both GPUs in a single system, I believe one of them must be set as primary and the other as secondary.

So, is AMD primary or secondary?

Can you flip this and test? (assuming it is easy to do so)

Repro Related


For the sake of clarity, can you please upload a ZIP file and the set of steps needed to repro?

Also, please post your

1. .NET version

2. Service pack of your Windows installation, if any

3. Visual Studio version

4. (I don't understand game unity. How is it related to your app? Is it required for repro?)

5. Bitness of your system

Journeyman III

Re: GPU kernel has low performance if CPU has worked _before_ launch

The amd is primary, but it is not easily changed.

I am sorry, but I cannot share my entire code base, as this is not just a hobby project.

I have, however, uploaded a modified cloo demo which shows many of the same symptoms.

The code is built for .net 3.5, windows 8, vs2012, 64 bit.

Unity is a game engine which is used for various visualizations, since it can easily import .NET class libraries.

I do, however, believe this is no longer really relevant to the issue, since I see more or less the same in the modified Cloo demo which I uploaded.

I am sorry I cannot be more informative. I was initially expecting this to be something which someone could recognize from the description and say "oh, that problem is caused by...something".

It would be interesting to hear how the Cloo demo behaves on other systems, because if it is entirely different, then obviously it's not a coding issue but rather a system/driver related problem.

Could you try that demo perhaps? It will (on a windows system) take a few seconds to start and run it.

Thanks for your time still 🙂


Re: GPU kernel has low performance if CPU has worked _before_ launch

I somehow think AMD being primary could be the problem.

Can you remove the AMD card, run with only the NVIDIA card, and see if you can replicate the problem?

I hope that in this case NVIDIA would be primary.

If it hits on NVIDIA too, you know that your problem is actually "windows" 🙂

+ Do you mean to say "the prototype code" you had posted somewhere above is the repro case?

I hope it is compilable.


Re: GPU kernel has low performance if CPU has worked _before_ launch

Hi Thomas,

Long kernel launch times have been a common concern for both AMD and NVIDIA. You could check the kernelLaunch sample present in the APP SDK and compare the launch times of the AMD and NVIDIA cards. Also, can you please let us know whether you see this difference in launch times with just one card attached at a time, so we can tell whether it is a multi-vendor setup issue?

Also, as stated earlier, cl_event counters may not be trustworthy, and measuring launch time for a single command may take more time than usual, as commands are generally dispatched in batches. Can you run the clEnqueueNDRangeKernel function in a loop and then check the timings?

EDIT: On my machine with HD 5770 + Catalyst 13.1, Win7 64 bit, kernelLaunch gives values around 47us. When the kernel batch size is increased, this value is as low as 6us.
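
Averaging over a batch, as suggested above, could be sketched like this in plain C. It assumes the profiling timestamps of each launch have already been read back into an array in enqueue order from an in-order queue; KernelEvent, amortized_ns_per_launch and mean_launch_ns are hypothetical names for illustration, not part of OpenCL or the APP SDK sample:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Profiling timestamps (nanoseconds) for one enqueued kernel. */
typedef struct {
    uint64_t queued;
    uint64_t submit;
    uint64_t start;
    uint64_t end;
} KernelEvent;

/* Average cost per launch across a batch: the span from the first enqueue
   to the last completion, divided by the number of launches. */
double amortized_ns_per_launch(const KernelEvent *ev, size_t n)
{
    assert(ev != NULL && n > 0);
    assert(ev[n - 1].end >= ev[0].queued);
    return (double)(ev[n - 1].end - ev[0].queued) / (double)n;
}

/* Mean submit -> start gap (the "launch" figure) over the same batch. */
double mean_launch_ns(const KernelEvent *ev, size_t n)
{
    assert(ev != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(ev[i].start >= ev[i].submit);
        sum += (double)(ev[i].start - ev[i].submit);
    }
    return sum / (double)n;
}
```

If the amortized figure drops well below the single-launch figure as the batch grows, that would suggest the cost is mostly per-batch dispatch rather than per-kernel work. On an in-order queue the per-event submit-to-start gap also includes time spent waiting behind earlier kernels, so the two numbers are best read together.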

Message was edited by: Himanshu Gautam


Re: GPU kernel has low performance if CPU has worked _before_ launch

Try calling the clFlush() API after execute. I hope the Cloo wrapper's Execute API does this.

I have seen differences in timing when clFlush() is not used on the AMD platform.

Also, buffer flags can affect performance wildly. Table 4.2 of the APP programming guide specifies how AMD's runtime interprets the flags, and the section on performance of memory objects details how the runtime actually transfers data.
