Welcome and thanks for posting.
I have white-listed you, so you should be able to directly post in the relevant forum. As this post is relevant to OpenCL, I am moving it to the OpenCL forum.
I would try passing it as a buffer, but using the __constant address space.
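For reference, a minimal sketch of what I mean (CameraData and its fields are placeholders, not your actual code):

```c
/* Hypothetical OpenCL C kernel: the camera data is passed as a
 * buffer object bound to a __constant pointer argument. */
typedef struct {
    float4 position;
    float4 direction;
} CameraData;

__kernel void render(__constant CameraData *cam,
                     __global   float4     *output)
{
    size_t gid = get_global_id(0);
    /* Placeholder computation reading the constant data once. */
    output[gid] = cam->position + (float)gid * cam->direction;
}
```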
Hi, thanks for your replies,
I have tried both the __constant and __global address spaces while passing it as a buffer; it works well, but it is still slower than the first method, and 5 FPS matters here.
Anyway, the first method is not a valid way to pass a pointer argument to a kernel, because the OpenCL spec says:
If the argument is declared to be a pointer of a built-in scalar or vector type, or a user defined structure type in the global or constant address space, the memory object specified as argument value must be a buffer object (or NULL).
So, it's better to avoid it.
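To illustrate the difference, here is a hedged host-side sketch (error checking omitted; CameraData is a placeholder host struct mirroring the kernel's argument type):

```c
/* The kernel's first argument is declared as a pointer in the global or
 * constant address space, so per the spec it must receive a cl_mem
 * buffer object, not the raw bytes of a host struct. */
CameraData cam = { /* ... */ };

/* Invalid: passing the struct's bytes directly for a pointer argument. */
clSetKernelArg(kernel, 0, sizeof(cam), &cam);   /* not spec-conformant */

/* Valid: wrap the data in a buffer object and pass the cl_mem handle. */
cl_int err;
cl_mem camBuf = clCreateBuffer(context,
                               CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(cam), &cam, &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &camBuf);

/* To update the data each frame without recreating the buffer: */
clEnqueueWriteBuffer(queue, camBuf, CL_FALSE, 0, sizeof(cam), &cam,
                     0, NULL, NULL);
```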
Can you run a CodeXL application timeline trace session and check whether the kernel execution time is what got longer, or whether it is something else?
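If it helps, the kernel time can also be measured directly with OpenCL profiling events. A hedged sketch using the OpenCL 1.x API (error checking omitted; variable names are placeholders):

```c
/* The queue must be created with profiling enabled. */
cl_int err;
cl_command_queue queue =
    clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL,
                       0, NULL, &evt);
clWaitForEvents(1, &evt);

/* Timestamps are in nanoseconds. */
cl_ulong start, end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
clReleaseEvent(evt);
```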
I have run a CodeXL application timeline trace session, and yes, the kernel execution time is what got longer.
I have made some changes since my previous post, like using OpenGL interop to draw the texture computed by the kernel, and making the clEnqueueWriteBuffer call non-blocking.
Now I'm comparing only the broken clSetKernelArg method and the clCreateBuffer + clEnqueueWriteBuffer method.
The camera data used in the kernel is read by 2 lines that are always executed, and only once, and it is not related to any other argument passed to the kernel.
When the camera doesn't look at the scene, I get 500 FPS with the first method and 800 FPS with the second one.
When the camera looks at the scene, it's 160 FPS with the first one and 130 FPS with the second one.
The data transferred to the device is constant (and so is the time taken by the transfer); the only thing that changes is the execution time of the kernel.
Some historical perspective.
Setting shader/kernel arguments has always been a very expensive operation in every API, and OpenCL is no different.
Setting kernel parameters is really meant to be a one-shot operation performed immediately after creation and then forgotten.
So it is no surprise that using a buffer and avoiding resetting the kernel args is faster, as in the first case. It is much more surprising that it becomes slower when more work is done; it is disappointing to see the behavior is inconsistent.
This is especially true for GCN: if you look at the details, it has no real hardware constant-buffer support. Constants are emulated at the driver level; the driver must figure out a layout and push the data into a buffer synthesized for you.
Try the following: at the beginning of your kernel, use a block copy operation (or do the copy yourself) to pull the data from global/constant memory into LDS, and then read it from there.
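A hedged sketch of that idea (CameraData is a placeholder; this is the "do the copy yourself" variant, the block-copy built-in async_work_group_copy is the alternative):

```c
/* Hypothetical sketch: stage the camera data in LDS (__local) once
 * per work-group, then read it from there. */
__kernel void render(__global const CameraData *cam,
                     __global float4 *output)
{
    __local CameraData camLocal;

    /* One work-item performs the copy, everyone else waits on it. */
    if (get_local_id(0) == 0)
        camLocal = *cam;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* All subsequent reads hit LDS instead of global/constant memory. */
    size_t gid = get_global_id(0);
    output[gid] = camLocal.position + (float)gid * camLocal.direction;
}
```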
Could you get actual clock timings instead of using FPS as a performance metric? Just saying "my algorithm went from 1000 FPS to 100 FPS" gives no indication to anyone trying to work out why there is a performance decrease. This was already suggested before: FPS is not a valid performance metric, especially for a developer.