OpenCL

tgreen
Journeyman III

GPU kernel has low performance if CPU has worked _before_ launch

It was hard to find a good single line title for my question, but the longer form is this.

I have an application which first does some heavy calculation on the CPU. Then it calls an OpenCL kernel. The two calculation jobs have nothing to do with each other. They just both have to be done. The CPU portion is slow and the GPU is fast.

This was initially developed using an Nvidia GTX 580 and, naturally, it didn't affect the running time of the kernel whether the CPU had been calculating before I called the kernel or had been idle. The kernel took the same time regardless.

Now I started testing with an AMD Radeon HD 7970 and, to my surprise, it was running its kernel a bit slower than what I saw with the GTX 580 card.

After some investigation it turns out that the AMD card is faster than the Nvidia if the CPU had not done any computation before launching the kernel, but if the CPU had been working before kernel launch, then AMD was slower.

More concretely:

The AMD kernel run took 9 ms if the CPU had been working and 3 ms if the CPU had been sitting idle.

The Nvidia kernel run took 8 ms if the CPU had been working and 8 ms if the CPU had been sitting idle.

I validated these measurements using CodeXL and it shows the same timing as I measured inside my program.

I tested various "kinds" of CPU work, and the slowdown happens even if I just keep doing the same calculation on a single variable, so it doesn't seem to be related to the number of bytes being moved around in large buffers.

System:

Windows 8, Intel Core 2 Duo

C# .NET using Cloo

Latest drivers for AMD and Nvidia

I must say I am really confused as to how this makes any sense...?

24 Replies
nou
Exemplar

Re: GPU kernel has low performance if CPU has worked _before_ launch

Do you use all CPU cores? For example, LuxRender slows down considerably if you don't leave a single CPU core free for handling the GPU.

tgreen
Journeyman III

Re: GPU kernel has low performance if CPU has worked _before_ launch

Commonly yes, but I have tested with just a single thread.

Besides, the real puzzle is that the CPU runs its job, and only when that is done do I launch the kernel, which is then slow.

I even tried adding a delay of 1 second between the CPU finishing and the kernel starting. The issue remained.

tgreen
Journeyman III

Re: GPU kernel has low performance if CPU has worked _before_ launch

I might add that, now that I look at the kernel launch event, I see that the time that passes before the kernel is actually started on the device remains somewhat constant, but the actual time from kernel start to kernel done triples when the CPU has been busy before launching the kernel.

himanshu_gautam
Grandmaster

Re: GPU kernel has low performance if CPU has worked _before_ launch

That's interesting.

Please post a copy of your code (as zip file) so that we can reproduce here.

Please include the following details as well.

Platform - win32 / win64 / lin32 / lin64 or some other?

Windows 7, Windows Vista or Windows 8? Similarly for Linux, which distribution?

Version of driver and APP SDK

himanshu_gautam
Grandmaster

Re: GPU kernel has low performance if CPU has worked _before_ launch

I think your post answers some of the questions below. Still, I just want a firm confirmation. Please bear with me.

1. How are you profiling the time?

2. Do CPU and GPU share common buffers?

3. Can you tell more about the CPU operation that you are performing? Is it performing IO (or) is it memory intensive?

4. Have you flushed and finished all GPU operations before starting the kernel?

5. Does your command queue have profiling enabled (CL_QUEUE_PROFILING_ENABLE)?

6. Do your kernel arguments use memory buffers created with CL_MEM_USE_HOST_PTR?

tgreen
Journeyman III

Re: GPU kernel has low performance if CPU has worked _before_ launch

Thanks for taking an interest in this problem 🙂

1. How are you profiling the time?

Initially using .NET stopwatches, then using CodeXL, and lastly using an event from the enqueueNDRangeKernel call.

2. Do CPU and GPU share common buffers?

No. They are 100% independent. The same thing even happens if the CPU simply loops over variable += sin(variable).

3. Can you tell more about the CPU operation that you are performing? Is it performing IO (or) is it memory intensive?

Not at all.

4. Have you flushed and finished all GPU operations before starting the kernel?

The method that launches the kernel actually ends with a finish call before it returns.

5. Does your command queue have profiling enabled (CL_QUEUE_PROFILING_ENABLE)?

Yes, now it does. Initially it did not.

6. Do your kernel arguments use memory buffers created with CL_MEM_USE_HOST_PTR?

No

I am working on cutting out a block of code which reproduces this problem, without including too much.

tgreen
Journeyman III

Re: GPU kernel has low performance if CPU has worked _before_ launch

I almost have the code to show the problem, but it turns out that the cause was not the cpu doing previous work, as I initially thought. It has to do with memory allocation.

If I allocate and free memory (a lot) before launching the kernel, then AMD has major problems and Nvidia doesn't notice it.

I will upload a zip later, if still relevant, but right now I can show the problem easily in source.

In the following, if the line marked "try commenting this line out" is active, AMD has problems and Nvidia does not. If it is commented out, neither has problems.

Obviously, in the actual code I am not doing something silly like this, but likely something which provokes the same issue.

After the source, I show the event timing. Tahiti is AMD and GeForce is Nvidia. The first 5 lines for each are with the allocation and the next 5 are without.

class PerformanceTestLauncher
{
    static double[] a = new double[10000000];

    static void test(GravityCalculatorGPU gcGPU)
    {
        a = new double[10000000];    //try commenting this line out
        gcGPU.Acceleration(); //this launches the kernel and prints event timing
    }

    static void Main(string[] args)
    {
        GravityCalculatorGPU gcGPU = new GravityCalculatorGPU(1,0);
        for (int i = 0; i < 30; i++)
        {
            test(gcGPU);
        }
    }
}

Sub is the time to submit, Lau the time to launch and Run the time to run. Each value is divided by 1E+6 to get it in ms (the raw event timestamps are in nanoseconds).

Allocating
Tahit Sub 0 Lau 21 Run 6
Tahit Sub 0 Lau 40 Run 6
Tahit Sub 0 Lau 20 Run 8
Tahit Sub 0 Lau 21 Run 10
Tahit Sub 0 Lau 22 Run 6

Not Allocating
Tahit Sub 0 Lau 0 Run 2
Tahit Sub 0 Lau 0 Run 2
Tahit Sub 0 Lau 0 Run 2
Tahit Sub 0 Lau 0 Run 2
Tahit Sub 0 Lau 0 Run 2

Allocating
GeFor Sub 7 Lau 0 Run 3
GeFor Sub 0 Lau 0 Run 3
GeFor Sub 19 Lau 0 Run 3
GeFor Sub 14 Lau 0 Run 3
GeFor Sub 12 Lau 0 Run 3

Not Allocating
GeFor Sub 0 Lau 0 Run 3
GeFor Sub 16 Lau 0 Run 3
GeFor Sub 16 Lau 0 Run 3
GeFor Sub 15 Lau 0 Run 3
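
For anyone wanting to reproduce the Sub/Lau/Run split, here is a minimal sketch of how it could be computed. It assumes the three numbers are the gaps between the four timestamps an OpenCL event exposes when profiling is enabled (queued, submit, start, end), which are reported in nanoseconds. The struct and function names are made up for illustration, and it is plain C rather than Cloo so that it needs no OpenCL headers.

```c
#include <stdint.h>

/* The four profiling timestamps of one event, in nanoseconds.
   Struct and field names are illustrative, not part of any API. */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} event_times_ns;

/* The three intervals, in milliseconds. */
typedef struct {
    double sub_ms;  /* queued -> submit: time to submit        */
    double lau_ms;  /* submit -> start:  time to launch        */
    double run_ms;  /* start  -> end:    time spent on device  */
} event_split_ms;

static double ns_to_ms(uint64_t ns) {
    return (double)ns / 1e6;
}

/* Assumed mapping of Sub/Lau/Run onto the timestamp gaps. */
event_split_ms split_event(event_times_ns t) {
    event_split_ms s;
    s.sub_ms = ns_to_ms(t.submit - t.queued);
    s.lau_ms = ns_to_ms(t.start - t.submit);
    s.run_ms = ns_to_ms(t.end - t.start);
    return s;
}
```

With hypothetical timestamps of 1,000,000 / 1,000,000 / 22,000,000 / 28,000,000 ns, split_event gives Sub 0, Lau 21 and Run 6, which has the same shape as the first Tahiti line above.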

himanshu_gautam
Grandmaster

Re: GPU kernel has low performance if CPU has worked _before_ launch

I am finding it very hard to believe that the kernel execution time is affected by host memory allocation.

By kernel execution time, as I understand from your post, you mean the raw time spent executing on the GPU.

Does your OpenCL kernel use any zero-copy memory?

With a VM-enabled driver, these buffers may be accessed by the kernel via PCIe, depending on how you are creating your cl_mem objects: AHP (CL_MEM_ALLOC_HOST_PTR), UHP (CL_MEM_USE_HOST_PTR) or simply no flags.

And these transfers can get (technically) slowed down by other PCIe transactions that are happening on the system.

btw,

Does your OpenCL context contain both CPU and GPU (because the APP SDK will report both a CPU and a GPU device)?

Or are you running only on a single-GPU-device context?

Is your NVIDIA device also present in the same system, or is it in a different system?

himanshu_gautam
Grandmaster

Re: GPU kernel has low performance if CPU has worked _before_ launch

Event timers may not be very trustworthy. It is recommended to use system timers (like gettimeofday etc).

Anyway, are you using the newly allocated buffer (a) somewhere in your kernels? How much data do you need to transfer before kernel execution? And what flags have you used to create those buffers? Are you using the map/unmap API or enqueueWriteBuffer/enqueueReadBuffer?

The recommended way to check execution time would be to call clFinish() before kernel start. Start the timer, launch the kernel, call clFinish() and stop the timer. Hope you are doing it this way.
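
The procedure above could be sketched roughly as follows. This is plain C with stand-in callbacks rather than real OpenCL calls: in actual host code, launch would wrap clEnqueueNDRangeKernel and finish would wrap clFinish on the same queue. The names time_kernel_ms, step_fn, fake_queue, fake_launch and fake_finish are hypothetical, and the timer is the C11 wall clock, used here only as one possible system timer.

```c
#include <time.h>

/* A step takes an opaque context, e.g. a pointer to the command queue. */
typedef void (*step_fn)(void *ctx);

/* Difference between two timespec readings, in milliseconds. */
static double elapsed_ms(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) * 1e3
         + (double)(b.tv_nsec - a.tv_nsec) / 1e6;
}

/* Drain the queue, then time launch + finish with a system timer. */
double time_kernel_ms(step_fn launch, step_fn finish, void *ctx) {
    struct timespec t0, t1;

    finish(ctx);                    /* earlier work must not be billed here */
    timespec_get(&t0, TIME_UTC);    /* start timer (wall clock, C11)        */
    launch(ctx);                    /* enqueue the kernel                   */
    finish(ctx);                    /* block until the kernel is done       */
    timespec_get(&t1, TIME_UTC);    /* stop timer                           */

    return elapsed_ms(t0, t1);
}

/* Stand-ins so the sketch compiles without an OpenCL SDK.
   They only count calls; real code would call into OpenCL here. */
typedef struct { int launches; int finishes; } fake_queue;

void fake_launch(void *ctx) { ((fake_queue *)ctx)->launches++; }
void fake_finish(void *ctx) { ((fake_queue *)ctx)->finishes++; }
```

Calling time_kernel_ms(fake_launch, fake_finish, &q) on a zeroed fake_queue q performs one launch and two finishes. A measurement taken this way includes launch overhead as well as device time, so it is closest to the sum of the Lau and Run figures from the event timing.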
