
Journeyman III

GPU kernel has low performance if CPU has worked _before_ launch

It was hard to find a good single line title for my question, but the longer form is this.

I have an application which first does some heavy calculation on the CPU. Then it calls an OpenCL kernel. The two calculation jobs have nothing to do with each other; they both just have to be done. The CPU portion is slow and the GPU portion is fast.

This was initially developed using an NVIDIA GTX 580 and, naturally, it didn't affect the running time of the kernel whether the CPU had been calculating before I called the kernel or had been idle. The kernel took the same time regardless.

Now I started testing with an AMD Radeon HD 7970, and to my surprise it was running its kernel a bit slower than what I saw with the GTX 580 card.

After some investigation it turns out that the AMD card is faster than the NVIDIA card if the CPU has not done any computation before launching the kernel, but if the CPU has been working before the kernel launch, then the AMD card is slower.

More concretely:

The AMD kernel run took 9 ms if the CPU had been working and 3 ms if the CPU had been sitting idle.

The NVIDIA kernel run took 8 ms if the CPU had been working and 8 ms if the CPU had been sitting idle.

I validated these measurements using CodeXL and it shows the same timing as I measured inside my program.

I tested various "kinds" of CPU work, and even if I just keep doing the same calculation on a single variable, this slowdown happens, so it doesn't seem to be related to the amount of bytes in large buffers being moved around.


Windows 8, Intel Core 2 Duo, using Cloo

Latest drivers for both AMD and NVIDIA

I must say I am really confused as to how this makes any sense...?

24 Replies

Do you use all CPU cores? For example, LuxRender slows down quite a bit if you don't leave a single CPU core free for handling the GPU.


Normally yes, but I have tested with just a single thread.

Besides, the real puzzle is that the CPU runs its job, and when that is done I launch the kernel, which is then slow.

I even tried adding a delay of 1 second between the CPU finishing and the kernel starting. The issue remained.


I might add that now that I look at the kernel launch event, I see that the time passing before the kernel actually starts on the device remains fairly constant, but the time from kernel start to finish triples when the CPU has been busy before launching the kernel.


That's interesting.

Please post a copy of your code (as zip file) so that we can reproduce here.

Please include the following details as well.

Platform - win32 / win64 / lin32 / lin64 or some other?

Win7, Vista, or Win8; similarly for Linux, your distribution

Version of driver and APP SDK


I almost have the code to show the problem, but it turns out that the cause was not the CPU doing previous work, as I initially thought. It has to do with memory allocation.

If I allocate and free memory (a lot) before launching the kernel, then AMD has major problems and NVIDIA doesn't notice it.

I will upload a zip later, if still relevant, but right now I can show the problem easily in source.

In the following, if the line marked "try commenting this line out" is active, AMD has problems and NVIDIA does not. If it is commented out, neither has problems.

Obviously, in the actual code I am not doing something this silly, but likely something which provokes the same issue.

After the source, I show the event timing. Tahiti is AMD and GeForce is NVIDIA. The first 5 lines of each are with the allocation and the next 5 are without.

class PerformanceTestLauncher
{
    static double[] a = new double[10000000];

    static void test(GravityCalculatorGPU gcGPU)
    {
        a = new double[10000000];    // try commenting this line out
        gcGPU.Acceleration(); // this launches the kernel and prints event timing
    }

    static void Main(string[] args)
    {
        GravityCalculatorGPU gcGPU = new GravityCalculatorGPU(1, 0);
        for (int i = 0; i < 30; i++)
            test(gcGPU);
    }
}


Sub is the time to submit, Lau the time to launch, and Run the time to run. Each value is divided by 1E+6 to convert it to ms.


Tahit Sub 0 Lau 21 Run 6

Tahit Sub 0 Lau 40 Run 6

Tahit Sub 0 Lau 20 Run 8

Tahit Sub 0 Lau 21 Run 10

Tahit Sub 0 Lau 22 Run 6

Not Allocating

Tahit Sub 0 Lau 0 Run 2

Tahit Sub 0 Lau 0 Run 2

Tahit Sub 0 Lau 0 Run 2

Tahit Sub 0 Lau 0 Run 2

Tahit Sub 0 Lau 0 Run 2


GeFor Sub 7 Lau 0 Run 3

GeFor Sub 0 Lau 0 Run 3

GeFor Sub 19 Lau 0 Run 3

GeFor Sub 14 Lau 0 Run 3

GeFor Sub 12 Lau 0 Run 3

Not Allocating

GeFor Sub 0 Lau 0 Run 3

GeFor Sub 16 Lau 0 Run 3

GeFor Sub 16 Lau 0 Run 3

GeFor Sub 15 Lau 0 Run 3


I am finding it very hard to believe that the Kernel Execution time is affected by memory.

By kernel execution time, I understand from your post the raw time spent executing inside the GPU.

Does your OpenCL kernel use any zero-copy memory ?

With a VM enabled driver, these buffers may be accessed by the kernel via PCIe (depending on how you are creating your cl_mem objects - AHP or UHP or Simply no flags)

And these transfers can get (technically) slowed down by other PCIe transactions that are happening on the system.


Does your OpenCL context contain both CPU and GPU (because the APP SDK will report both CPU and GPU devices)?

OR Are you running only on a single-GPU-device context?

Is your NVIDIA device too present in the same system? OR Is it in a different system?


I see why you find it hard to believe. It makes little sense to me as well.

I was finishing up the example code showing the issue in its absolutely most simple form and it turns out that the problem seems to morph as I keep peeling layers of this issue.

What I have now is a very basic sample showing that the actual kernel time does not change based on allocations, but the launching time does. This only happens for AMD, and only when running from inside VS2012 or from the game engine Unity. Running as an external program it has no issues.

The actual project, however, before stripping off all the layers, shows the same behavior (slow kernel run when allocating memory for cpu calculations elsewhere) both when running as a standalone exe, running from vs2012 and from unity. The problem is that I cannot share that entire codebase. 😕

I can share both the kernel and the launch code in a stripped-down version, if you or anyone else wants to take a look. It seems I cannot format text as code in here.

Thanks for your time thus far. It does seem like there is an issue here, but it is very hard to reproduce in a form that I can share. That is a problem 😕


Prototype code.

public class GravityCalculatorGPU
{
    public GravityCalculatorGPU(int platformIndex, int deviceIndex)
    {
        string clProgramSource = ...; // the kernel source, loaded from a resource

        ComputePlatform platform = ComputePlatform.Platforms[platformIndex];
        IList<ComputeDevice> devices = new List<ComputeDevice>();
        device = platform.Devices[deviceIndex];
        devices.Add(device);

        ComputeContextPropertyList properties = new ComputeContextPropertyList(platform);
        context = new ComputeContext(devices, properties, null, IntPtr.Zero);
        ComputeProgram program = new ComputeProgram(context, clProgramSource);

        try
        {
            program.Build(null, null, null, IntPtr.Zero);
        }
        catch
        {
            // rethrow exception...
            throw;
        }

        bufferPosition = new ComputeBuffer<Vector3>(context, ComputeMemoryFlags.ReadOnly, 1000000);
        bufferMass = new ComputeBuffer<double>(context, ComputeMemoryFlags.ReadOnly, 1000000);
        bufferAcceleration = new ComputeBuffer<Vector3>(context, ComputeMemoryFlags.WriteOnly, 1000000);

        // Create the kernel function and set its arguments.
        kernel = program.CreateKernel("gravity");
        commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.Profiling); //.None);
        eventList = new ComputeEventList();
    }

    struct Vector3
    {
        double x, y, z, q; // placeholder for the actual vector structure
    }

    private ComputeContext context;
    private ComputeBuffer<Vector3> bufferPosition;
    private ComputeBuffer<double> bufferMass;
    private ComputeBuffer<Vector3> bufferAcceleration;
    private ComputeKernel kernel;
    private ComputeCommandQueue commands;
    private ComputeEventList eventList;
    private ComputeDevice device;
    private System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();

    public int Acceleration()
    {
        int workGroupSize = 64;
        int globalSize = 4096 * 2 * 2;

        kernel.SetMemoryArgument(0, bufferPosition);
        kernel.SetMemoryArgument(1, bufferMass);
        kernel.SetMemoryArgument(2, bufferAcceleration);
        // argument 3 (firstParticle) is set elsewhere
        kernel.SetValueArgument(4, 6.67E-11);
        kernel.SetValueArgument(5, (double)1000);
        kernel.SetLocalArgument(6, sizeof(double) * 4 * workGroupSize);

        //commands.WriteToBuffer<Vector3>(system.Position, bufferPosition, true, 0, 0, system.ObjectCount, null);
        //commands.WriteToBuffer<double>(system.Mass, bufferMass, true, 0, 0, system.ObjectCount, null);

        ComputeEventList events = new ComputeEventList();

        sw.Reset();
        sw.Start();
        commands.Execute(kernel, null, new long[] { globalSize }, new long[] { workGroupSize }, events);
        //commands.ReadFromBuffer<Vector3>(bufferAcceleration, ref acceleration, true, 0, 0, system.ObjectCount, null);
        commands.Finish();
        sw.Stop();

        long que = events[0].EnqueueTime;
        long submit = events[0].SubmitTime - events[0].EnqueueTime;
        long start = events[0].StartTime - events[0].SubmitTime;
        long done = events[0].FinishTime - events[0].StartTime;
        Console.WriteLine(device.Name.PadRight(10).Substring(0, 10).PadRight(10) + " Sub {0} Lau {1} Run {2} All {3}",
            (int)(submit / 1E6), (int)(start / 1E6), (int)(done / 1E6), sw.ElapsedMilliseconds);
        return (int)(done / 1000000);
    }
}

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void gravity(__global double4* position, __global double* mass,
                      __global double4* acceleration, int firstParticle,
                      double G, double epsilonSquared, __local double4* pblock)
{
    int globalIndex = get_global_id(0);
    int localIndex  = get_local_id(0);
    int totalSize   = get_global_size(0);
    int localSize   = get_local_size(0);
    int numBlocks   = totalSize / localSize;

    double4 p_i = position[globalIndex];
    double4 a   = (double4)(0, 0, 0, 0);

    int numBlocksWithMass = ceil((float)firstParticle / localSize);
    for (int jb = 0; jb < numBlocksWithMass; jb++) // for each block containing mass, find its contribution to a
    {
        pblock[localIndex]   = position[jb * localSize + localIndex];
        pblock[localIndex].w = mass    [jb * localSize + localIndex];

        barrier(CLK_LOCAL_MEM_FENCE); // wait until every work item has fetched its part

        for (int j = 0; j < localSize; j++) // for each fetched object, calculate
        {
            double4 p_j = pblock[j];
            double4 d = p_j - p_i;
            double invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + epsilonSquared);
            double f = p_j.w * invr * invr * invr;
            a += G * f * d;
        }

        barrier(CLK_LOCAL_MEM_FENCE); // wait for all work items to finish calculating before bringing in a new batch of neighbors
    }

    acceleration[globalIndex] = a;
}



Hello Thomas,

1. Are you running the kernel on the GPU that is used for display?

2. When you said "NVIDIA's platform did not have this problem" -- Was NVIDIA GPU the display GPU? Was it in a different machine? Was it also run from Visual studio (or) was it run independently?

My guess is that either Visual Studio or the game engine Unity (no idea what this is, though) is trying to use the GPU while the kernel is running... Maybe this is slowing you down.

But this does not explain how it is related to the CPU load... but then, you have said the problem is morphing...

Can you please answer these questions? It will help me understand the setup.


Unity3D engine


I have made a top thread post with more information and a sample.

Both the NVIDIA and the AMD card are driving a display, of the same size as well.

The tests were run identically on the different GPUs. Each runs one display, and both are in the same system, being run from the same exe.

The trouble with explaining this is that, as I was trying to create an example of the issue, it kept changing slightly as I peeled layers off. In the end I had a sample with the same issue of AMD being slow, but in a different way, with slow-launching kernels.

In the end I have modified the cloo demo project in a way that shows quite clearly that amd has huge overhead for kernel launch, and kernel event info not being right.

I hope that can clear it up a bit, so we can avoid talking about my code and just look at the problem... or perhaps this is not a problem on other systems... which is also worth finding out.

Thanks for the interest you have taken in this issue 🙂



If you are housing both GPUs in a single system, I believe you should have made 1 as primary and other as secondary.

So, Is AMD primary or secondary?

Can you flip this and test? (assuming it is easy to do so)

Repro Related


For sake of clarity, Can you please upload a ZIP file and a set of steps that you need to do to repro?

Also, Please publish your

1. .NET version

2. Service pack of your window installation if any

3. Visual Studio version

4. (I don't understand the game engine Unity. How is it related to your app? Is it required for the repro?)

5. Bitness of your system


The AMD card is primary, but that is not easily changed.

I am sorry, but I cannot share my entire code base, as this is not just a hobby project.

I have, however, uploaded a modified Cloo demo which shows many of the same symptoms.

The code is built for .NET 3.5, Windows 8, VS2012, 64 bit.

Unity is a game engine which is used for various visualizations, since it can easily import .NET class libraries.

I do, however, believe this is no longer really relevant for the issue, since I see more or less the same in the modified Cloo demo which I uploaded.

I am sorry I cannot be more informative. I was initially expecting this to be something which someone could recognize from the description and say "oh, that problem is caused by...something".

It would be interesting to hear how the cloo demo behaves on other systems, because if it is entirely different, then obviously its not a coding issue but rather a system/driver related problem.

Could you try that demo perhaps? It will (on a windows system) take a few seconds to start and run it.

Thanks for your time still 🙂


I somehow think AMD being primary could be the problem.

Can you just remove the AMD card and run only with NVIDIA, to see if you can replicate the problem?

I hope that in this case NVIDIA would be primary.

If it hits on NVIDIA too, you know that your problem is actually "windows" 🙂

+ Do you mean to say "the prototype code" you had posted somewhere above is the repro case?

I hope it is compilable.


Event timers may not be very trustworthy. It is recommended to use system timers (like gettimeofday() etc.).

Anyway, are you using the newly allocated buffer (a) somewhere in your kernels? How much data do you need to transfer before kernel execution? What flags have you used to create those buffers? Are you using the Map/Unmap API or enqueueWrite/ReadBuffer?

The recommended way to check execution time is to call clFinish() before the kernel starts. Then start the timer, launch the kernel, call clFinish(), and stop the timer. I hope you are doing it this way.


I think your post clears up some of the questions below. Still, I just want a firm confirmation. Please bear with me.

1. How are you profiling time?

2. Do CPU and GPU share common buffers?

3. Can you tell more about the CPU operation that you are performing? Is it performing IO (or) is it memory intensive?

4. Have you flushed and finished all GPU operations before starting the kernel?

5. Does your command queue have profiling enabled? (CL_PROFILING_ENABLE)

6. Does your kernel argument use memory buffers that are Use-Host-Ptred?


Thanks for taking an interest in this problem 🙂

1. How are you profiling time?

Initially using .NET stopwatches, then using CodeXL, and lastly using an event in the EnqueueNDRangeKernel call.

2. Do CPU and GPU share common buffers?

No. They are 100% independent. The same thing even happens if the CPU simply loops with variable += sin(variable).

3. Can you tell more about the CPU operation that you are performing? Is it performing IO (or) is it memory intensive?

Not at all.

4. Have you flushed and finished all GPU operations before starting the kernel?

I actually call Finish at the end of the method that launches the kernel, before returning from that method.

5. Does your command queue has Profiling enabled? (CL_PROFILING_ENABLE)

Yes, now it does. Initially it did not.

6. Does your kernel argument use memory buffers that are Use-Host-Ptred?


I am working on cutting out a block of code which reproduces this problem, without including too much.


I will reply to my own post with a bit more details, and a program and source showing the problem. This is the cloo (.net wrapper of openCL) demo application where it simply adds together two vectors. I made slight modifications to the vector addition demo.

The zip contains source for both cloo and the demo. Inside the clooutils/bin/release there is an exe showing the issue, in one form.

Select the device to run on and select vector addition as the demo to run.

What I see is that in this demo, the kernel launch time for my AMD Radeon 7990 HD GHz edition is always very large (4.7ms)  while for my NVidia GTX 580 it is low (0.2ms).

It seems that regardless of the work the kernel does, the launch time is 4-5 ms.

The other thing I note is that when I time this kernel like


start timer;

commands.Execute(kernel, null, new long[] { count }, new long[] { 64 }, eventList);            


stop timer;

print elapsed time

I get an elapsed time much greater than the time the event reports from submission to the queue through kernel completion.

I may have a result like the following for nvidia


0,01936 0,297312 1,78352 2,100192

meaning the total measured wall time was 114 ms,

the time before it is queued on the device is 0.019 ms,

the time before it starts running on the device is 0.297 ms,

the time to run the kernel is 1.78 ms,

and the time from submitting to the queue until the kernel completes is 2.10 ms.

When using AMD I get


0,016988 4,749113 1,595111 6,361212

Again, the time I measure is much greater than the time the kernel event reports, even though I use clFinish both before and after the kernel and the timing. Additionally, the launch time is now 4.74 ms as opposed to NVIDIA's 0.29 ms.

This problem is not exactly what I see in my actual code, but other issues likely confuse the picture there. This is, however, publicly available code which shows OpenCL timing being odd in general and the launch time being very odd for AMD.

One additional thing I noted is that if I change the buffer declarations, I can make the problem even worse.

Currently the buffers are declared as

ComputeBuffer<double> a = new ComputeBuffer<double>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrA);

If I change that to

ComputeBuffer<double> a = new ComputeBuffer<double>(context, ComputeMemoryFlags.ReadOnly, count);

The timing for NVIDIA is


0,018016 0,365472 32,480352 32,86384

and for AMD


0,011928 19,886672 25,796296 45,694896

Now the time I measure actually matches the time the kernel event reports, but the launch time and running time are much greater.

I am really confused by this, so I hope someone can at least confirm that it is not some issue specific to my system; additionally, it would be nice to understand why this happens.

Thanks in advance for any help I can get 🙂


Hi Thomas,

Long kernel launch times have been a common concern for both AMD and NVIDIA. You could check the kernelLaunch sample present in the APP SDK and compare the launch times of the AMD and NVIDIA cards. Also, can you please let us know whether you see this difference in launch times with just one card attached at a time, so we can tell whether it is a multi-vendor setup issue?

Also, as stated earlier, cl_event counters may not be trustworthy, and measuring the launch time of a single command may take longer than usual, as commands are generally dispatched in batches. Can you run EnqueueNDRangeKernel in a loop and then check the timings?

EDIT: On my machine with an HD 5770 + Catalyst 13.1, Win7 64-bit, kernelLaunch gives values around 47 us. When the kernel batch size is increased, this value is as low as 6 us.



Try the clFlush() API after Execute. I hope the Cloo wrapper's Execute API does this.

I have seen differences in timing on the AMD platform if clFlush() is not used.

Also, buffer flags can affect performance wildly. Table 4.2 of the APP programming guide specifies how AMD's runtime interprets the flags, and the section on performance of memory objects details how the runtime actually transfers data.


Thanks for the hints.

Flushing did not change anything; the timing remains the same. As I described previously, I have tried different buffer flags, but the current ones seem to be the fastest, since the problem only worsens when I try other (valid) flags.

I was originally asked for a zip to reproduce the problem, and eventually I uploaded the modified Cloo demo project which does that, but I assume no one has yet had the time to try and run it? I have tried just about everything here now, and it would be really interesting to learn whether others see the same behaviour with the same code.


Hi Thomas,

I will try your code soon.

But can you share your results of the kernelLaunch sample? Please try both AMD and NV cards, making them primary (one after the other).

Kernel launch has been slow, but not as slow as 4 ms. I will update you once I am able to run your code.



I ran the modified Cloo zip file and do find the setup time to be quite large. Here are a few possible reasons for it:

1. The kernel is run just once. GPU warm-up takes some time, and the first kernel execution is much slower than the average kernel execution time. You must run the kernel several times to measure kernel launch time. Check the KernelLaunch sample for that.

2. As we are not sure what operation is happening exactly at what time, some of the buffer transfers to the GPU via PCIe may also be increasing the launch time.

3. Cloo might have some overhead of its own.

PFA a simple OpenCL code to get the timing information of kernel execution. This gives a launch time of 80 microseconds for the same data size as yours (10000000 uints).

1. The kernel is run just once. GPU warm-up takes some time. The first kernel execution is also much slower than average kernel execution time. you must run the kernel several times to measure kernel launch time. Check KernelLaunch Sample for that.

Thanks for your reply. You have a point about the kernel needing to warm up, and that could be an issue in the Cloo example. It was just that the NVIDIA card behaved very differently.

You are, however, perfectly right that kernel warm-up is the issue in my actual program. I did not realize it because I actually call the kernel many, many times, and it was always slower if the CPU had also been working, so I concluded that this was not the issue... but it turns out it is (or so it seems now).

My test was looping over:

CPU calculates the problem

GPU calculates the same problem

compare results

When I removed the CPU calculation, the GPU ran fast. With the CPU, the GPU ran slow. But only for my AMD card.

By accident I have now discovered that if I replace "CPU calculates the problem" with "Thread.Sleep(1100)", the issue is exactly the same, but if I sleep less than 1000 ms, the GPU is fast again.

So in conclusion: if the kernel is run again more than 1 s after the last run, it starts cold. I did not know that, and I have not seen it mentioned anywhere. I will close my current question and create a new one about this issue.

Thanks for all the response I have received from especially you here 🙂


Thomas: "By accident I now discovered that if I replaced "cpu calculate problem" with "Thread.Sleep(1100)", then the issue was exactly the same, but if I slept less than 1000ms, then the GPU was fast again"

Hmmm. That's interesting and quite possibly true. Thanks for your support. I look forward to the new questions you plan to post

EDIT: It will help me and others if you mark the reply that answers your issue as "Correct Answer" or "Helpful Answer".
