
ninazero
Journeyman III

OpenCL clSetKernelArg performance issue

Hello,

I'm working on a real-time ray tracer in OpenCL.

I have a structure that describes the camera: position, orientation, and field of view.
Since the camera moves around, I send it to the GPU every frame.

And I'm doing it like this:

Computedcamera    cm;

// some code here

clSetKernelArg(_kernel, 0, sizeof(Computedcamera), &cm);

and my kernel looks like this:

kernel void raytracer(Computedcamera const *camera /* other arguments */) { /* ... */ }

This method gives the best performance, 48 FPS on my test scene, but it doesn't work on devices from other vendors, like Intel or Nvidia.

If I change my kernel declaration to this (removing the *, to pass the argument by value):
kernel void raytracer(Computedcamera const camera /* other arguments */) { /* ... */ }

It now works on my CPU (Intel), but the performance on my GPU drops to 27 FPS.

So I tried to pass this argument through a buffer:

initialisation:

_camera_mem = clCreateBuffer(_context, CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY, sizeof(Computedcamera), 0, &error);

each frame:

clEnqueueWriteBuffer(_queue, _camera_mem, CL_TRUE, 0, sizeof(Computedcamera), (void *)&cm, 0, 0, 0);

clSetKernelArg(_kernel, 0, sizeof(cl_mem), &_camera_mem);


This last method works on all devices (AMD GPU, Intel CPU), but the performance on my GPU is around 43 FPS (5 FPS less than the first method).

I don't understand why the first method is faster than the others, or why it works at all.

My config:

Windows 10 64-bit, AMD APP SDK 3.0, i7 920, R9 Nano.

8 Replies
pinform
Staff

Welcome and thanks for posting.

I have white-listed you, so you should be able to directly post in the relevant forum. As this post is relevant to OpenCL, I am moving it to the OpenCL forum.

Happy posting.

--Prasad

nou
Exemplar

I would try passing it as a buffer, but using the __constant memory space.
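
A sketch of what that kernel signature could look like (the host side stays as in the buffer method, passing the cl_mem with clSetKernelArg; Computedcamera is the poster's struct):

```c
// The camera buffer is placed in the __constant address space instead of
// __global; the host-side clCreateBuffer/clEnqueueWriteBuffer calls are
// unchanged.
kernel void raytracer(__constant Computedcamera *camera /* other arguments */)
{
    /* ... */
}
```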


Hi, thanks for your replies,

I have tried both the __constant and __global memory spaces while passing it as a buffer. It works well, but it is still slower than the first method, and 5 FPS matters.


Anyway, the first method is not a valid way to pass a pointer argument to a kernel, because the OpenCL spec says:

If the argument is declared to be a pointer of a built-in scalar or vector type, or a user defined structure type in the global or constant address space, the memory object specified as argument value must be a buffer object (or NULL).

So, it's better to avoid it.


Can you run a CodeXL application timeline trace session and check whether it is the kernel execution time that got longer, or something else?


Hello,

I have run a CodeXL application timeline trace session, and yes, the kernel execution time got longer.
I have made some changes since my previous post, like using OpenGL interop to draw the texture computed by the kernel, and making the clEnqueueWriteBuffer call non-blocking.

Now, I'm only comparing the broken clSetKernelArg method and the clCreateBuffer + clEnqueueWriteBuffer method.
The camera data is used in only two lines of the kernel, which are always executed exactly once and do not depend on any other argument passed to the kernel.
When the camera doesn't look at the scene, I get 500 FPS with the first method and 800 FPS with the second one.

When the camera looks at the scene, it's 160 FPS with the first one and 130 FPS with the second one.
The data transferred to the device is constant (as is the time taken by the transfer); the only thing that changes is the execution time of the kernel.


Some historical perspective.

Setting shader/kernel arguments has always been a very expensive operation in every API, and OpenCL is no different.

Setting kernel parameters is really supposed to be a one-shot operation done immediately after creation and then forgotten.

Therefore it is no surprise that using a buffer, and thereby avoiding re-setting the kernel argument every frame, is faster, as in the first case. It is much more surprising that it becomes slower when more work is done; it is very disappointing to see that the behavior is inconsistent.

This is especially the case for GCN: if you look at the details, it has no real hardware constant-buffer support. Constants are emulated at the driver level; the driver must figure out a layout and push the data into a buffer synthesized for you.

Try the following: at the beginning of your kernel, use a block copy operation (or do the copy yourself) to pull the data from global/constant memory into LDS, and then read it from there.
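
A sketch of that suggestion, with an assumed Computedcamera layout (the real fields were never shown in the thread): one work-item per group copies the camera into LDS, and a barrier makes it visible to the whole group before anyone reads it.

```c
// Hypothetical field layout -- the real Computedcamera was not shown.
typedef struct {
    float4 position;
    float4 direction;
    float4 up;
    float  fov;
} Computedcamera;

kernel void raytracer(global Computedcamera const *camera /* other arguments */)
{
    // One per-work-group copy of the camera in LDS (__local memory).
    local Computedcamera cam;

    // Let a single work-item of the group perform the copy...
    if (get_local_id(0) == 0 && get_local_id(1) == 0)
        cam = *camera;

    // ...and fence before the rest of the group reads it.
    barrier(CLK_LOCAL_MEM_FENCE);

    // From here on, read the camera from 'cam' instead of global memory.
}
```

Note that async_work_group_copy only handles built-in scalar/vector element types, so a user-defined struct has to be copied manually as above.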

cgrant78
Adept III

Could you get actual clock timings instead of using FPS as a performance metric? Saying "my algorithm went from 1000 FPS to 100 FPS" gives no indication to anyone trying to figure out why there is a performance decrease. This was already suggested earlier, but FPS is not a valid performance metric, especially for a developer.
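
One way to get those timings is OpenCL event profiling. A fragment-style sketch, assuming the poster's command queue was created with CL_QUEUE_PROFILING_ENABLE (`_queue` and `_kernel` are the thread's variables; the work size here is made up):

```c
#include <stdio.h>
#include <CL/cl.h>

/* ... inside the per-frame code ... */

cl_event evt;
size_t gws[2] = {1280, 720};   // hypothetical global work size

clEnqueueNDRangeKernel(_queue, _kernel, 2, NULL, gws, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);

// Profiling timestamps are in nanoseconds.
printf("kernel time: %.3f ms\n", (double)(end - start) * 1e-6);
clReleaseEvent(evt);
```

This isolates the kernel's device execution time from transfer and host overhead, which is exactly the breakdown the FPS numbers hide.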
