Archives Discussions

bubu · ‎09-04-2010

Please, add this:

1. Local/global atomics for FLOAT variables. This could be used by a lot of sorting algorithms, order-independent transparency, etc... I heard Crytek was demanding it also.

2. Enable the "register" C keyword hint so we can effectively control better which variables can be swapped to memory and which ones should stay as registers. This can help the compiler to reduce the register pressure in a better way.

3. Add C++ support.

4. Add a "virtual memory" mechanism and flags for OpenCL's buffers to indicate its contents must be flushed to the hard disk like the CPU's virtual memory is swapped. This is needed to manage big data assets that don't fit in the (usually low-quantity) GPU's video memory. DX9 used a "managed memory pool" mechanism for instance.

5. Add a kernel execution priority parameter. With this, we could execute kernels without disturbing the OS's window manager and to indicate which ones are more important for concurrent kernel execution.

6. Add some reduction macros or functions for +, -, *, /, min/max, etc... And add a quick-sort ( or radix sort ) intrinsic:

Example:

__kernel void MyKernel ( __global float *values ) { const float sumOfAllValues = CL_REDUCTION_SUM ( values, 0, 256 ); //ptr, offset, nElements const float minOfAllValues = CL_REDUCTION_MIN ( values, 0, 256 ); const float maxOfAllValues = CL_REDUCTION_MAX ( values, 0, 256 ); qsort ( values, 0, sizeof(float), 256 ); //offset, sizeof each element, nElements ... }

laobrasuca · ‎09-05-2010

Originally posted by: bubu Please, add this:

1. Local/global atomics for FLOAT variables. This could be used by a lot of sorting algorithms, order-independent transparency, etc... I heard Crytek was demanding it also.

+1 here! Not that the other points are not important too, but for God sakes why there's no float atomic for the functions where there are atomic for integers??? I've never understood why so, is that really hard to have them on float??? One simple, very simple example on where you would need atomic for floats is when you need to compute the normals for the vertices of a mesh: after computing the normals per triangle (imagine one work-item per triangle), one would need to sum up the results atomically to the normal output array of the 3 vertices of the triangle (previously initialized to zero) so that in another kernel one could normalize the normal for each vector (one work-item per vector). Since we can't use atomics for now, we are forced to create either a temporary buffer to store the normals per triangle (them figure out an way of finding for each vector which triangles it belongs to) or a temporary integer normal buffer where one would store the results of the atomic sum of the normals pre-quantized to integer values with a very thin quantization step. Either way you need to use additional memory, while if we would have atomic sum to float none of these would be necessary.

davibu · ‎09-05-2010

Originally posted by: bubu Please, add this:

1. Local/global atomics for FLOAT variables. This could be used by a lot of sorting algorithms, order-independent transparency, etc... I heard Crytek was demanding it also.

Float atomics can already be achieved with the atomic exchange instruction and a couple of line of code. However, given how much OpenCL is "float-centric", it makes a lot of sense to have a native implementation.

I'm ready to kill for the C++ support too

I would like to see a mechanism to "translate" pointers so we can effectively use pointers between different block of GPU memory. For instance you, build a list on the CPU ram and pointers are translate to the GPU address space when transferred to the GPU. At the moment I have always to build some index table to express the same data structure with the GPU.

himanshu_gautam · ‎09-05-2010

Thank you for your suggestions.

Please feel free to grow this post with relevent data.

laobrasuca · ‎09-22-2010

include pointers to pointers! As such we can use arrays of buffers as arguments to the kernel. This would avoid one to run kernel several times (one for each buffer) or decrease the number of arguments of the kernel (in case one put each buffer as an argument to the kernel). This would be very useful whenever one have several (maybe hundreds) of small buffers which cant be concatenated in one big buffer (for whatever reason) (like me 😕 )

LeeHowes · ‎09-22-2010

Pointers to pointers and pointers to buffers are very different things. Where do you unpack the buffer object to send it into the kernel correctly?

Float atomics are not a CL problem, they're a hardware problem. Integer ALUs take up barely any space, you can throw those around and do atomics on data in cache easily. Floating point has higher latency, it sits in a pipeline for longer. It's much much harder to synchronise. It doesn't really help for normals anyway, does it? Or maybe it is safe to do component by component on normals as long as you normalise later.

In other words "I've never understood why so, is that really hard to have them on float???" yes. It is. To do inefficient ones is sortof ok at the cost of a lot of lock data in cache, they would have to be a lot less efficient than integer ones are currently. There are also all sorts of questions about IEEE compliance of the atomic operations.

You can come close, though. You can do atomic exchanges on the data. You can treat floats as ints in many cases if you want to do atomic min, max, cmpxchg and of course the bitwise ones. You have to be slightly careful of course because min/max etc wouldn't handle denorms correctly.

All your comments about pointers amuse me because I wouldn't have put pointers in the language at all, given the choice.

I will say that I think all of the suggestions in this thread are CL 1.x suggestions, not CL 2.x. I'd rather think a lot bigger for CL 2 than this (though I can't guarantee that anyone else would agree with me). Does anyone have any suggestions in that direction?

eduardoschardong · ‎09-22-2010

Originally posted by: LeeHowesAll your comments about pointers amuse me because I wouldn't have put pointers in the language at all, given the choice.

*beer*

*beer*

Originally posted by: LeeHowesI will say that I think all of the suggestions in this thread are CL 1.x suggestions, not CL 2.x. I'd rather think a lot bigger for CL 2 than this (though I can't guarantee that anyone else would agree with me). Does anyone have any suggestions in that direction?

Well... Let's start CL 2 from scratch?

What I want is more abstraction, currently we have to write one code for each target for performance reasons, in a good language the programmer should write what he wants to do and the compiler should be able to translate it on machine code that runs well on target, I'm not sure it is possible on a future extension of OpenCL with so many low level features being the base and defaults.

For the list of features that could be in CL 1.x mine is global sync, or a functional returning how many threads I can spawn if I want to have all running at same time.

EDIT: I almost forgot: Libraries.

LeeHowes · ‎09-23-2010

I think CL isn't the right place for that kind of abstraction. CL is a pretty print of the intermediate language of the device, you can't expect abstraction because you'd never be able to tune for optimality. What would be nice would be people developing high level languages on top of OpenCL (and I don't mean pragmas for C++) but that can come with time.

Global sync and spawning to fill are ok - you have to be careful because a lot of reasons for doing that end up occupying the device for too long.

eduardoschardong · ‎09-23-2010

Originally posted by: LeeHowes I think CL isn't the right place for that kind of abstraction. CL is a pretty print of the intermediate language of the device, you can't expect abstraction because you'd never be able to tune for optimality. What would be nice would be people developing high level languages on top of OpenCL (and I don't mean pragmas for C++) but that can come with time.

I don't agree with the part of OpenCL as an IL, if I'm writing a tool to make those abstractions and the generated CL code must be specific for each GPU vendor I would rather use each vendor-specific IL (CAL-IL and PTX), the performance will be better and the assembly-like syntax is easier to be used by automated tools.

Originally posted by: LeeHowes

Global sync and spawning to fill are ok - you have to be careful because a lot of reasons for doing that end up occupying the device for too long.

"Spawning to fill"? Since it already got a name my I assume it's on the roadmap?

Originally posted by: laobrasuca I cant see C programing without pointers at all. C++ containers are cool, but slower, not to mention java/c# stuff. Would you prefer opencl/opengl on a higher level language? I think the API as it is now is kind well done, even if stuff is maybe missing. Everyone who codes for performance (fast computing and strict memory usage) does so in C or C++ (avoiding STL containers as much as possible), if not in Fortran (yey).

Comparing to java/c# is not remotely fair, those two are JITed, the compiler doesn't have the same time to analyze the code, neither native C with C++ containers, the latter is writing on top of pointers, let's stick with C against C without pointers.

Pointers may hurt performance more than help it, a pointer just indicate a place in memory that may be anywhere, data may have any layout, many pointers may reference the same place, it may be read or write the compiler can't make many assumptions about it because it may be hard to predict before run-time, a (let's call, a buffer) buffer is something easier, the compiler can make some assumptions about where it start and where it ends, if it's readonly or not (this case allows for more aggressive caching and even explicit caching by software), how it's aligned and even if the data is actually spread over many places in memory. This last one is more interesting, like on Cypress memory access (actually, any GPU), when accessing memory the optimal stride is between 16 and 32 bytes, when accessing the LDS it's between 4 and 8 bytes, the second exposes a very common problem, __local float4 arrays used to always bank conflicts dropping LDS speed by half, after some driver releases it nows will store each component in a different place, now, imagine it's a pointer and the user does (float*)(void*)x where x is float4... Yes, it will work with current compiler because it will add several instructions, but this optimization could be simpler if there were no pointers, like in DC11, and now move this problem from the scope-limited LDS to global memory with structures larger than 32 bytes...

ps: (datatype1*)(void*)(datatype2*)x have a undefined behavior on C, and there is reason for it even thought may apps rely on this behavior, on CL this have a defined behavior

LeeHowes · ‎09-23-2010

If CL looks like C isn't a low level what we're going to expect? C doesn't magically vectorise for cell, SSE etc. How should CL be efficiently mapped to varying vector sizes? If you program naively, then it will of course, but then you end up with barriers everywhere and inefficient code. People need to program to the vector size with CL, I don't think that can be avoided while it is still low level. It still saves you knowing the assembly languages and instruction scheduling in the way C does.

Interesting observation about the name But no, I just got the name off the top of my head. It has come up in discussions, though.

douglas125 · ‎09-24-2010

Hi;

I'd like to point a few things I'd like to see in OpenCL 2.x:

1. Multiple devices automatic scalability. For example, let's say I have a very large image to filter and 3 GPUs. It'd be really good to be able to clEnqueueNDRangeKernel(device1+device2+device3) and have OpenCL distribute the load and manage memory accordingly;

2. A native way to transfer data in a compressed form (since bandwidth is almost always an issue).

3. Data types with greater precision than double. For instance, a data type "quadruple" or "quaddouble" that takes up the same structure as a float4, for example, and could operate on 128-bit precision. This would be really useful for computing residues in scientific software. The structure could be something like 4 floats which, when summed, give out the desired result, for host-device communication.

I have no idea on how implement any of these but they'd be Major contributions. Any hardware people out there to point how feasible these suggestions are?

bubu · ‎09-30-2010

Another suggestion:

- Pls, add a flag to see if the OpenCL GPU device has a watchdog attached or not. Currently it's a pain to execute kernels that require a long execution time because the Windows XP's watchdog pops resetting the graphics drivers and aborting your program's execution.

If you add this flag for the clGetDeviceInfo we could simply ignore the primary adapter's device and use other GPU present in the system.

Thanks.

aj_guillon · ‎10-03-2010

I would suggest:

- atomic float operations (especially float addition)

- access to what the compiler did (how did it pad the data structures?) This could greatly help to ensure that data is transferred properly between device/host, without having to spend a lot of time reverse-engineering what the compiler has done to your data on host / device sides.

- Access to the type of kernel arguments, so that a higher level libray can ensure that arguments are passed in correct order at runtime. In particular, think of adding something like reflection here, so that libraries that wrap OpenCL can ensure that kernel arguments are passed in safely and correctly.

Meteorhead · ‎10-03-2010

1: Let me join the people who are ready to kill for C++ support in kernels.

I am trying to create GPU accel to an enormous physics framework, which being a scientific app, it's mixed fortran, C and C++. Thus the code is full of externs, but what makes it the hardest, classes, function pointers, etc. I am only porting a managable part of the code, to prove GPU accel is not unreachable in a large, serial applications like this.

2: Function pointers alone are useful enough.

3: Libraries inside kernels.

Libraries can be achieved even now, if I append .cl codes one after the other and compile after, but mathematical libraries (for instance) used this way would result in excess register usage.

laobrasuca · ‎09-23-2010

Originally posted by: LeeHowes Pointers to pointers and pointers to buffers are very different things. Where do you unpack the buffer object to send it into the kernel correctly?

thx for the reply.

since i've already made a thread on this, it would be simpler if you take a look on it: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=139919&enterthread=y

it's kind having several buffers (each one being an array of, for example, floats) and wiling to use them at once in one same kernel. If pointers to pointers and pointers to buffers are very different things, yes i'm talking about pointer to buffers. But it seems that buffers are thread as pointers in the __kernel declaration (maybe i'm wrong), that's why i declare the pointer to buffers as a double pointer (**array_of_buffer), and that's why the compiler crashes with this pointer-to-pointer error. If you have any suggestion to make pointers-to-buffers useful in the kernel, i would be really grateful.

Float atomics are not a CL problem, they're a hardware problem. Integer ALUs take up barely any space, you can throw those around and do atomics on data in cache easily. Floating point has higher latency, it sits in a pipeline for longer. It's much much harder to synchronise. It doesn't really help for normals anyway, does it? Or maybe it is safe to do component by component on normals as long as you normalise later.

In other words "I've never understood why so, is that really hard to have them on float???" yes. It is. To do inefficient ones is sortof ok at the cost of a lot of lock data in cache, they would have to be a lot less efficient than integer ones are currently. There are also all sorts of questions about IEEE compliance of the atomic operations.

You can come close, though. You can do atomic exchanges on the data. You can treat floats as ints in many cases if you want to do atomic min, max, cmpxchg and of course the bitwise ones. You have to be slightly careful of course because min/max etc wouldn't handle denorms correctly.

thx for given an aperçu about the difficulty. as for the normals, i would tend to use atomics to average the normals of the triangles for a given vertex at the same time i compute them. it would avoid me to spend additional memory on creating a structure which holds the triangle number for each vertex (and as a consequence, avoid compute the normal for a triangle 3 times) (or additional buffer store temporary normals), while we already have the list of indices of vertex to construct each triangles. But as you say, it maybe would be better to compute the same normal several times than using atomic: you do more calculations but at least they are in parallel, while atomics are not.

All your comments about pointers amuse me because I wouldn't have put pointers in the language at all, given the choice.

oh, glad to amuse you (and eduardo), it's always a pleasure But, pointers are very useful on, for example, acquiring or release multiple GL buffers with one command call, or releasing several events with one command call, and stuff like these. I cant see C programing without pointers at all. C++ containers are cool, but slower, not to mention java/c# stuff. Would you prefer opencl/opengl on a higher level language? I think the API as it is now is kind well done, even if stuff is maybe missing. Everyone who codes for performance (fast computing and strict memory usage) does so in C or C++ (avoiding STL containers as much as possible), if not in Fortran (yey). Garbage collector to control the VRAM memory? No, thx. Or, maybe i'm miss understanding you and not talking about the same thing.

Illusio · ‎09-23-2010

Originally posted by: laobrasucaGarbage collector to control the VRAM memory? No, thx.

Actually, I'd kind of like that. If I could get a "unified" 64 bit virtual address space between the GPU and the host process that would basically treat the entire VRAM as a cache.

In such a regime a garbage collector could make pretty good sense given an OpenCL aware high level language. It's not like there would be a major downside to it anyway, with caching being handled by hardware through actual usage patterns and possibly programmed preloads.

LeeHowes · ‎09-23-2010

it's kind having several buffers (each one being an array of, for example, floats) and wiling to use them at once in one same kernel. If pointers to pointers and pointers to buffers are very different things, yes i'm talking about pointer to buffers. But it seems that buffers are thread as pointers in the __kernel declaration (maybe i'm wrong)

You're right and you're wrong. It is true that by the time the buffer reaches the kernel it's a standard pointer. The problem is that to achieve that state the runtime has to unpack the buffer object to extract the pointer, decide if the buffer is on the device or not and perform appropriate copies to ensure that it is if necessary.

If you hide that buffer inside an array the runtime has to be able to analyse the array to know that it contains buffers. Obviously you might want to generalise this further into structs of buffers or whatever.

In a programming language with reflection this isn't much of a problem. You can go and look at your data structures and analyse them properly. C, unfortunately, is not a very sophisticated language. It's very hard to do this sort of thing well - to get the API to work we might need a sequence of API calls to allow the user to describe the structure. Or we might limit it to a special "array of buffers" type, but then there would have to be a good design decision about why an array of buffers is necessary but a structure of buffers isn't.

You're not the only person who wants it, though, and I'll go as far as to say it has been discussed and may even happen at some point.

Would you prefer opencl/opengl on a higher level language?

Well I'm not really thinking at the API side, I'm thinking more in-kernel. It's a complicated question, though. A high level language might be the right palce as long as it keeps enough information so that the CL compiler doesn't have to do stupid things like assume aliased pointers because it doesn't know any better.

I read C++ support as meaning in-kernel as well. I'd like to see that too.

I should add: Obviously comments in this thread are personal opinions. They may or may not relate in any way to any plans AMD or Khronos has.

sir_um · ‎09-23-2010

Originally posted by: bubu

Please, add this:
3. Add C++ support.

Originally posted by: davibu

I'm ready to kill for the C++ support too

This is kind of vague, but if you are referring to the ability to use OpenCL in C++ applications, there are C++ bindings. You can download them from here:

http://www.khronos.org/registry/cl/

you want cl.hpp.

Here is the Spec doc for the C++ bindings:

http://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.1.pdf

-Chris

tak0xff · ‎10-04-2010

Originally posted by: bubu Please, add this:

6. Add some reduction macros or functions for +, -, *, /, min/max, etc... And add a quick-sort ( or radix sort ) intrinsic:

I think it is difficult to provide flexible reduction functions.

But it is easy to provide simple reduction functions in a group.

These are not only to simplify coding but to promote some optimization.

Here is an example.

The function group_reduction_sum calculates sum of arguments passed to each threads in a group.

Optimized version can save shared memory usage.

#define SIZE 1024 #define WAVESIZE 64 __kernel void test(__global uint *i,__global uint *o) { uint id=get_global_id(0); uint grid=get_group_id(0); uint mydata=i[id]; uint sum; //sum=group_reduction_sum(mydata); if (0) { // normal version __local uint buf[SIZE]; uint lid=get_local_id(0); buf[lid]=mydata; barrier(CLK_LOCAL_MEM_FENCE); uint stride=SIZE>>1; for (;lid<stride;stride>>=1) { buf[lid]+=buf[lid+stride]; barrier(CLK_LOCAL_MEM_FENCE); } sum=buf[0]; } { // optimized version for evergreen __local uint buf[SIZE/WAVESIZE]; __local uint buf2[WAVESIZE<<1];/*shared all groups in the compute unit*/ uint lid=get_local_id(0); uint wid=lid & (WAVESIZE-1); uint odd_even=lid&WAVESIZE; uint stride=WAVESIZE>>1; mem_fence(CLK_LOCAL_MEM_FENCE|CLK_GLOBAL_MEM_FENCE); // { should be within 1 ALU closure buf2[wid+odd_even]=mydata; for (;stride>0;stride>>=1) { buf2[wid+odd_even]+=buf2[(wid^stride)+odd_even]; } sum=buf2[odd_even]; // } mem_fence(CLK_LOCAL_MEM_FENCE); if (wid==0) { buf[lid/WAVESIZE]=sum; } barrier(CLK_LOCAL_MEM_FENCE); for (stride=SIZE/WAVESIZE/2;lid<stride;stride>>=1) { buf[lid]+=buf[lid+stride]; barrier(CLK_LOCAL_MEM_FENCE); } sum=buf[0]; } if (id & (SIZE-1) == 0) { o[grid]=sum; } }

Archives Discussions

Suggestions for OpenCL 2