ninazero
Journeyman III

OpenCL clSetKernelArg performance issue

Hello,

I'm working on a real time ray tracer with OpenCL.

I have a structure that describes the camera, with position, orientation, field of view.
Since the camera moves around I send it every frame to the GPU.

And I'm doing it like this:

Computedcamera    cm;

// some code here

clSetKernelArg(_kernel, 0, sizeof(Computedcamera), &cm);

and my kernel looks like this:

kernel void raytracer(Computedcamera const *camera /*, other arguments */) { /* ... */ }

This method gives the best performance, 48 FPS on my test scene, but it doesn't work on devices from other vendors, such as Intel or Nvidia.

If I change my kernel declaration to this (remove the *, to pass the argument by value):
kernel void raytracer(Computedcamera const camera /*, other arguments */) { /* ... */ }

It now works on my CPU (Intel), but the performance on my GPU drops to 27FPS.

So I tried to pass this argument by buffer:

initialisation:

_camera_mem = clCreateBuffer(_context, CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY, sizeof(Computedcamera), 0, &error);

each frame:

clEnqueueWriteBuffer(_queue, _camera_mem, CL_TRUE, 0, sizeof(Computedcamera), (void *)&cm, 0, 0, 0);

clSetKernelArg(_kernel, 0, sizeof(cl_mem), &_camera_mem);


This last method works on all devices (AMD GPU, Intel CPU) but the performance on my GPU is around 43FPS (5 FPS less than the first method).

I don't understand why the first method is faster than the others and why it works?!

My config:

Win10 64bits, AMD APP SDK 3.0, i7 920, R9 Nano.

8 Replies
pinform
Staff

Re: OpenCL clSetKernelArg performance issue

Welcome and thanks for posting.

I have white-listed you, so you should be able to directly post in the relevant forum. As this post is relevant to OpenCL, I am moving it to the OpenCL forum.

Happy posting.

--Prasad

nou
Exemplar

Re: OpenCL clSetKernelArg performance issue

I would try passing it as a buffer, but in the __constant address space.
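A minimal sketch of what the kernel side of that suggestion could look like, assuming the host keeps the clCreateBuffer / clEnqueueWriteBuffer / clSetKernelArg(..., sizeof(cl_mem), ...) calls from the original post. The fields of Computedcamera and the `out` argument are hypothetical, since the thread never shows the real layout or the other arguments; the host-side struct would have to match this layout (for example by using cl_float4 members).

```c
// OpenCL C kernel source (not host code).
// Hypothetical layout: the thread does not show the real Computedcamera fields.
typedef struct {
    float4 position;
    float4 forward;
    float4 up;
    float  fov;
} Computedcamera;

// The camera is a pointer into the constant address space, so the host
// must bind a cl_mem buffer object to argument 0, not a raw struct.
kernel void raytracer(constant Computedcamera *camera,
                      global float4 *out /* hypothetical output buffer */)
{
    size_t i = get_global_id(0);

    // Every work-item reads the same addresses; this uniform access
    // pattern is the usual reason to choose constant memory.
    float4 origin = camera->position;

    // Placeholder computation so the sketch is complete.
    out[i] = origin + camera->forward * camera->fov;
}
```

Whether this is faster than a __global pointer depends on the device and driver, so it is worth measuring both.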

ninazero
Journeyman III

Re: OpenCL clSetKernelArg performance issue

Hi, thanks for your replies,

I have tried both the __constant and __global address spaces while passing it as a buffer. It works well, but it is still slower than the first method, and 5 FPS matters.

dipak
Staff

Re: OpenCL clSetKernelArg performance issue

Anyway the first method is not a valid way to pass a pointer argument to a kernel because the OpenCL spec. says:

If the argument is declared to be a pointer of a built-in scalar or vector type, or a user defined structure type in the global or constant address space, the memory object specified as argument value must be a buffer object (or NULL).

So, it's better to avoid it.

tzachi_cohen
Staff

Re: OpenCL clSetKernelArg performance issue

Can you run a CodeXL application timeline trace session and test whether the kernel execution time got prolonged or is it something else?

ninazero
Journeyman III

Re: OpenCL clSetKernelArg performance issue

Hello,

I have run a CodeXL application timeline trace session and yes, the kernel execution time did increase.
I have made some changes since my previous post, like using OpenGL interop for drawing the texture computed by the kernel, and making the clEnqueueWriteBuffer call non-blocking.

From here on, I'm only comparing the broken clSetKernelArg method (the first one) with the clCreateBuffer + clEnqueueWriteBuffer method (the second one).
The kernel uses the camera data in only two lines, which always execute exactly once and do not depend on any other kernel argument.
When the camera doesn't look at the scene, I get 500 FPS with the first method and 800 FPS with the second one.

When the camera looks at the scene, it's 160 FPS with the first one and 130 FPS with the second one.
The amount of data transferred to the device, and the time the transfer takes, are constant; the only thing that changes is the kernel execution time.

maxdz8
Elite

Re: OpenCL clSetKernelArg performance issue

Some historical perspective.

Setting shader or kernel arguments has always been a very expensive operation in every API, and OpenCL is no exception.

Setting kernel parameters is really supposed to be a single-shot operation, done immediately after creation and then forgotten.

Therefore it is no surprise that using a buffer and not resetting kernel args is faster, as in your first case. It is much more surprising that this is slower when more work is done, and very disappointing to see that the behavior is inconsistent.

This is especially the case for GCN: if you look at the details, it has no real hardware constant buffer support. Constants are emulated at the driver level; the driver must figure out a layout and push the data into a buffer synthesized for you.

Try the following: at beginning of your kernel, use a block copy operation (or do the copy yourself) to pull data from global/constant to LDS and then read from there.
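A rough sketch of the manual copy into LDS (OpenCL local memory), assuming a work-group of at most two dimensions. The Computedcamera fields and the `out` argument are hypothetical placeholders, and whether this actually helps on a given device is untested here.

```c
// OpenCL C kernel source (not host code).
// Hypothetical layout: the thread does not show the real Computedcamera fields.
typedef struct {
    float4 position;
    float4 forward;
    float4 up;
    float  fov;
} Computedcamera;

kernel void raytracer(constant Computedcamera *camera,
                      global float4 *out /* hypothetical output buffer */)
{
    // One copy of the camera per work-group, stored in local memory (LDS).
    local Computedcamera cam;

    // A single work-item per group performs the copy.
    // get_local_id(1) is 0 for a 1D range, so this also works in 1D.
    if (get_local_id(0) == 0 && get_local_id(1) == 0)
        cam = *camera;

    // Make the copy visible to every work-item in the group.
    barrier(CLK_LOCAL_MEM_FENCE);

    // From here on, read the camera from local memory only.
    size_t i = get_global_id(0);
    out[i] = cam.position + cam.forward * cam.fov;  // placeholder computation
}
```

The barrier has its own cost, so this is only worth keeping if a timeline trace shows the kernel time going down.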

cgrant78
Adept III

Re: OpenCL clSetKernelArg performance issue

Could you get actual clock timings instead of using FPS as a performance metric? Just saying "my algorithm went from 1000 FPS to 100 FPS" gives no indication to anyone trying to work out why performance decreased. This was already suggested above, but FPS is not a valid performance metric, especially for a developer.
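As a first step, the FPS figures already in this thread can at least be converted to time per frame. A small sketch in plain C (the helper names fps_to_ms and added_ms are made up for illustration): 48 FPS is about 20.8 ms per frame and 43 FPS about 23.3 ms, so the "5 FPS" gap is roughly 2.4 ms per frame; 160 FPS versus 130 FPS is 6.25 ms versus about 7.69 ms, a gap of roughly 1.4 ms.

```c
/* Convert a frame rate in frames per second to milliseconds per frame. */
static double fps_to_ms(double fps)
{
    return 1000.0 / fps;
}

/* Extra milliseconds per frame when the rate changes from fps_before
 * to fps_after. A positive result means each frame got slower. */
static double added_ms(double fps_before, double fps_after)
{
    return fps_to_ms(fps_after) - fps_to_ms(fps_before);
}
```

This only restates the same measurements in a more useful unit. Frame time still mixes transfer, kernel, and drawing time, so per-kernel timings from a profiler remain the better metric.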
