bubu
Adept II

Suggestions for OpenCL 2

Please, add this:

 

1. Local/global atomics for FLOAT variables. This could be used by a lot of sorting algorithms, order-independent transparency, etc... I heard Crytek was demanding it also.

 

2. Enable the "register" C keyword as a hint, so we can better control which variables may be spilled to memory and which should stay in registers. This could help the compiler reduce register pressure.

 

3. Add C++ support.

 

4. Add a "virtual memory" mechanism, with flags on OpenCL buffers to indicate that their contents may be swapped out to the hard disk, the way the CPU's virtual memory is. This is needed to manage big data assets that don't fit in the GPU's (usually small) video memory. DX9, for instance, had a "managed memory pool" mechanism.

 

5. Add a kernel execution priority parameter. With this, we could run kernels without disturbing the OS's window manager, and indicate which kernels matter most when several execute concurrently.

 

6. Add some reduction macros or functions for +, -, *, /, min/max, etc., and a quick-sort (or radix sort) intrinsic:

Example:

 

 

__kernel void MyKernel ( __global float *values )
{
    const float sumOfAllValues = CL_REDUCTION_SUM ( values, 0, 256 ); // ptr, offset, nElements
    const float minOfAllValues = CL_REDUCTION_MIN ( values, 0, 256 );
    const float maxOfAllValues = CL_REDUCTION_MAX ( values, 0, 256 );

    qsort ( values, 0, sizeof(float), 256 ); // offset, sizeof each element, nElements
    ...
}
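
The CL_REDUCTION_* helpers above are a proposal, not an existing API. As a rough illustration of what such a helper might compute, here is a sequential C sketch of a pairwise (tree) reduction, the pattern usually written by hand in a kernel today. The names `tree_reduce`, `op_sum`, `op_min` and `op_max` are hypothetical, and the sketch assumes a power-of-two element count and a scratch array that may be overwritten.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical binary operators standing in for +, min and max. */
typedef float (*reduce_op)(float, float);

static float op_sum(float a, float b) { return a + b; }
static float op_min(float a, float b) { return a < b ? a : b; }
static float op_max(float a, float b) { return a > b ? a : b; }

/* Reduces scratch[0..n) in place and returns the result (left in scratch[0]).
 * n must be a power of two. Each pass of the outer loop corresponds to one
 * step that a work-group would separate with barrier(CLK_LOCAL_MEM_FENCE);
 * each iteration of the inner loop corresponds to one work-item. */
static float tree_reduce(float *scratch, size_t n, reduce_op op)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        for (size_t i = 0; i < stride; ++i)
            scratch[i] = op(scratch[i], scratch[i + stride]);
    }
    return scratch[0];
}
```

Calling `tree_reduce(values, 256, op_sum)` would play the role of the proposed `CL_REDUCTION_SUM ( values, 0, 256 )`, with the caveat that it destroys the input array.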

18 Replies
laobrasuca
Journeyman III

Originally posted by: bubu Please, add this:

 

 

 

1. Local/global atomics for FLOAT variables. This could be used by a lot of sorting algorithms, order-independent transparency, etc... I heard Crytek was demanding it also.

+1 here! Not that the other points aren't important too, but for God's sake, why are there no float atomics for the functions that have integer atomics? I've never understood why so, is that really hard to have them on float??? One very simple example of where you would need float atomics is computing the normals for the vertices of a mesh: after computing the normal per triangle (imagine one work-item per triangle), you would need to atomically add the result into the output normal array entries (previously initialized to zero) for the triangle's 3 vertices, so that another kernel can then normalize the normal of each vertex (one work-item per vertex). Since we can't use float atomics for now, we are forced to create either a temporary buffer to store the normals per triangle (and then figure out a way of finding, for each vertex, which triangles it belongs to) or a temporary integer normal buffer that holds the atomic sum of the normals, pre-quantized to integer values with a very fine quantization step. Either way you need additional memory, whereas with an atomic add on float neither would be necessary.
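
The integer workaround described above (accumulating pre-quantized normal components with an integer atomic add) might look roughly like the following. This is a host-side C11 sketch, not kernel code: `NORMAL_SCALE`, `accumulate_component` and `read_component` are hypothetical names, the 16-bit fractional step is an arbitrary assumption, and the sketch assumes the per-vertex sums stay well inside the `int32_t` range (for example, unit-length triangle normals).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed quantization step: 16 fractional bits (1/65536). */
#define NORMAL_SCALE 65536.0f

/* Quantizes one float component (round to nearest) and adds it to an
 * integer accumulator with an integer atomic add. */
static void accumulate_component(_Atomic int32_t *acc, float value)
{
    float scaled = value * NORMAL_SCALE;
    int32_t q = (int32_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
    atomic_fetch_add(acc, q);
}

/* Converts the accumulated fixed-point sum back to float. */
static float read_component(_Atomic int32_t *acc)
{
    return (float)atomic_load(acc) / NORMAL_SCALE;
}
```

Each vertex would need three such accumulators (x, y, z), which is the extra memory the post complains about.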

davibu
Journeyman III

Originally posted by: bubu Please, add this:

 

 1. Local/global atomics for FLOAT variables. This could be used by a lot of sorting algorithms, order-independent transparency, etc... I heard Crytek was demanding it also.

 

 

Float atomics can already be achieved with the atomic exchange instruction and a couple of lines of code. However, given how "float-centric" OpenCL is, it makes a lot of sense to have a native implementation.
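
For readers who have not seen the trick, here is one common way to emulate a float atomic add. Note that it uses a compare-and-swap retry loop rather than a plain exchange, so it may not be exactly what the poster had in mind, and it is written as host-side C11 rather than kernel code. The names `float_to_bits`, `bits_to_float` and `atomic_add_float` are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Bit-level reinterpretation between float and uint32_t. */
static uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Emulated atomic float add on a cell that stores the float's bit pattern.
 * Returns the value the cell held before the add. */
static float atomic_add_float(_Atomic uint32_t *target, float operand)
{
    uint32_t old_bits = atomic_load(target);
    uint32_t new_bits;
    do {
        new_bits = float_to_bits(bits_to_float(old_bits) + operand);
        /* On failure, old_bits is refreshed with the cell's current
         * contents and the sum is recomputed. */
    } while (!atomic_compare_exchange_weak(target, &old_bits, new_bits));
    return bits_to_float(old_bits);
}
```

The retry loop is the cost: under heavy contention on one address, many iterations are wasted, which is part of why a native instruction is attractive.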

I'm ready to kill for the C++ support too

I would like to see a mechanism to "translate" pointers, so we can effectively use pointers between different blocks of GPU memory. For instance, you build a list in CPU RAM and the pointers are translated to the GPU address space when it is transferred to the GPU. At the moment I always have to build some index table to express the same data structure on the GPU.
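
The index-table workaround mentioned above can be sketched as follows for the simplest case, a singly linked list: host pointers are replaced by array indices, so the flattened array means the same thing in any address space. The types `host_node` and `flat_node`, the sentinel `NIL_INDEX` and the function `flatten_list` are hypothetical names for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NIL_INDEX (-1)  /* assumed sentinel meaning "no next node" */

/* Host-side node, linked by pointer. */
struct host_node {
    float value;
    struct host_node *next;
};

/* Device-friendly node, linked by index into the same array. */
struct flat_node {
    float value;
    int32_t next;
};

/* Copies the list starting at head into out[0..capacity), storing nodes in
 * traversal order so that each node's successor is simply the next slot.
 * Returns the number of nodes written, or -1 if capacity is too small. */
static int32_t flatten_list(const struct host_node *head,
                            struct flat_node *out, int32_t capacity)
{
    int32_t count = 0;
    for (const struct host_node *n = head; n != NULL; n = n->next) {
        if (count == capacity)
            return -1;
        out[count].value = n->value;
        out[count].next = (n->next != NULL) ? count + 1 : NIL_INDEX;
        ++count;
    }
    return count;
}
```

Structures with shared or cyclic links need a pointer-to-index lookup instead of this in-order shortcut, which is the bookkeeping a built-in translation mechanism would remove.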

 

himanshu_gautam
Grandmaster

Thank you for your suggestions.

Please feel free to grow this post with relevant data.

 

laobrasuca
Journeyman III

Include pointers to pointers! That way we could pass arrays of buffers as kernel arguments. This would avoid having to run the kernel several times (once per buffer), or would reduce the number of kernel arguments (in case each buffer is passed as its own argument). It would be very useful whenever one has several (maybe hundreds of) small buffers which can't be concatenated into one big buffer, for whatever reason (like me 😕).

LeeHowes
Staff

Pointers to pointers and pointers to buffers are very different things. Where do you unpack the buffer object to send it into the kernel correctly?

Float atomics are not a CL problem, they're a hardware problem. Integer ALUs take up barely any space, you can throw those around and do atomics on data in cache easily. Floating point has higher latency, it sits in a pipeline for longer. It's much much harder to synchronise. It doesn't really help for normals anyway, does it? Or maybe it is safe to do component by component on normals as long as you normalise later.

In other words "I've never understood why so, is that really hard to have them on float???" yes. It is. Doing inefficient ones is sort of OK, at the cost of a lot of lock data in cache, but they would have to be a lot less efficient than integer ones are currently. There are also all sorts of questions about IEEE compliance of the atomic operations.

You can come close, though. You can do atomic exchanges on the data. You can treat floats as ints in many cases if you want to do atomic min, max, cmpxchg and of course the bitwise ones. You have to be slightly careful of course because min/max etc wouldn't handle denorms correctly.
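
As a sketch of the "treat floats as ints" idea for atomic max: a widely used trick (an assumption here, not something spelled out in the post) maps each float's bit pattern to an unsigned key whose integer order matches the float order, after which an integer max does the job. The code below is host-side C11, and since C11 has no atomic max it emulates one with a compare-and-swap loop; in a kernel an integer atomic max on the key would take its place. The names `float_to_ordered`, `ordered_to_float` and `atomic_max_float` are hypothetical. NaN inputs are not handled, and the sketch makes no claim about how a given device treats denormals.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Maps a float to a uint32_t key such that a < b implies key(a) < key(b):
 * negative values have all bits flipped, non-negative values get the sign
 * bit set. */
static uint32_t float_to_ordered(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

/* Inverse of float_to_ordered. */
static float ordered_to_float(uint32_t key)
{
    uint32_t u = (key & 0x80000000u) ? (key & 0x7FFFFFFFu) : ~key;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Raises the cell (which stores an ordered key) to at least `value`. */
static void atomic_max_float(_Atomic uint32_t *cell, float value)
{
    uint32_t key = float_to_ordered(value);
    uint32_t seen = atomic_load(cell);
    /* Retry only while the stored key is still smaller than ours;
     * a failed compare-exchange refreshes `seen`. */
    while (seen < key && !atomic_compare_exchange_weak(cell, &seen, key)) {
    }
}
```

An atomic min follows the same pattern with the comparison reversed.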

All your comments about pointers amuse me because I wouldn't have put pointers in the language at all, given the choice.

 

I will say that I think all of the suggestions in this thread are CL 1.x suggestions, not CL 2.x. I'd rather think a lot bigger for CL 2 than this (though I can't guarantee that anyone else would agree with me). Does anyone have any suggestions in that direction?

eduardoschardong
Journeyman III

Originally posted by: LeeHowes All your comments about pointers amuse me because I wouldn't have put pointers in the language at all, given the choice.


 

    

*beer*
*beer*


 

Originally posted by: LeeHowes I will say that I think all of the suggestions in this thread are CL 1.x suggestions, not CL 2.x. I'd rather think a lot bigger for CL 2 than this (though I can't guarantee that anyone else would agree with me). Does anyone have any suggestions in that direction?


Well... Let's start CL 2 from scratch?

What I want is more abstraction. Currently we have to write different code for each target for performance reasons. In a good language, the programmer should write what he wants to do, and the compiler should be able to translate it into machine code that runs well on the target. I'm not sure that is possible in a future extension of OpenCL, with so many low-level features as the base and defaults.

 

As for features that could be in CL 1.x, mine is global sync, or a function returning how many threads I can spawn if I want all of them running at the same time.

 

EDIT: I almost forgot: Libraries.

 

LeeHowes
Staff

I think CL isn't the right place for that kind of abstraction. CL is a pretty print of the intermediate language of the device; you can't expect abstraction, because you'd never be able to tune for optimality. What would be nice would be people developing high-level languages on top of OpenCL (and I don't mean pragmas for C++), but that can come with time.

Global sync and spawning to fill are ok - you have to be careful because a lot of reasons for doing that end up occupying the device for too long.

laobrasuca
Journeyman III

Originally posted by: LeeHowes Pointers to pointers and pointers to buffers are very different things. Where do you unpack the buffer object to send it into the kernel correctly?


thx for the reply.

since i've already made a thread on this, it would be simpler if you take a look at it: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=139919&enterthread=y

it's kind of like having several buffers (each one being an array of, for example, floats) and wanting to use them all at once in the same kernel. If pointers to pointers and pointers to buffers are very different things, then yes, i'm talking about pointers to buffers. But it seems that buffers are treated as pointers in the __kernel declaration (maybe i'm wrong); that's why i declare the pointer to buffers as a double pointer (**array_of_buffer), and that's why the compiler crashes with this pointer-to-pointer error. If you have any suggestion to make pointers to buffers usable in the kernel, i would be really grateful.

Float atomics are not a CL problem, they're a hardware problem. Integer ALUs take up barely any space, you can throw those around and do atomics on data in cache easily. Floating point has higher latency, it sits in a pipeline for longer. It's much much harder to synchronise. It doesn't really help for normals anyway, does it? Or maybe it is safe to do component by component on normals as long as you normalise later.

 

In other words "I've never understood why so, is that really hard to have them on float???" yes. It is. Doing inefficient ones is sort of OK, at the cost of a lot of lock data in cache, but they would have to be a lot less efficient than integer ones are currently. There are also all sorts of questions about IEEE compliance of the atomic operations.

 

You can come close, though. You can do atomic exchanges on the data. You can treat floats as ints in many cases if you want to do atomic min, max, cmpxchg and of course the bitwise ones. You have to be slightly careful of course because min/max etc wouldn't handle denorms correctly.

 

thx for giving an aperçu of the difficulty. as for the normals, i would tend to use atomics to average the triangle normals for a given vertex at the same time i compute them. it would save me from spending additional memory on a structure that holds the triangle numbers for each vertex (and, as a consequence, from computing the normal of a triangle 3 times), or on an additional buffer to store temporary normals, since we already have the list of vertex indices that make up each triangle. But as you say, it may be better to compute the same normal several times than to use atomics: you do more calculations, but at least they run in parallel, while atomics do not.

All your comments about pointers amuse me because I wouldn't have put pointers in the language at all, given the choice.


oh, glad to amuse you (and eduardo), it's always a pleasure. But pointers are very useful for, for example, acquiring or releasing multiple GL buffers with one command call, or releasing several events with one command call, and things like that. I can't see C programming without pointers at all. C++ containers are cool, but slower, not to mention java/c# stuff. Would you prefer opencl/opengl in a higher-level language? I think the API as it is now is kind of well done, even if some things may be missing. Everyone who codes for performance (fast computing and strict memory usage) does so in C or C++ (avoiding STL containers as much as possible), if not in Fortran (yey). Garbage collector to control the VRAM memory? No, thx. Or maybe i'm misunderstanding you and we're not talking about the same thing.

Illusio
Journeyman III

Originally posted by: laobrasuca Garbage collector to control the VRAM memory? No, thx.

Actually, I'd kind of like that, if I could get a "unified" 64-bit virtual address space between the GPU and the host process that would basically treat the entire VRAM as a cache.

In such a regime a garbage collector could make pretty good sense given an OpenCL aware high level language. It's not like there would be a major downside to it anyway, with caching being handled by hardware through actual usage patterns and possibly programmed preloads.
