
Journeyman III

Suggestions for OpenCL 2

Originally posted by: bubu

Please, add this:

 3. Add C++ support.

Originally posted by: davibu

I'm ready to kill for the C++ support too

This is kind of vague, but if you are referring to the ability to use OpenCL in C++ applications, there are C++ bindings. You can download them from here:

You want cl.hpp.

Here is the Spec doc for the C++ bindings:



Suggestions for OpenCL 2


It's kind of like having several buffers (each one being an array of, for example, floats) and wanting to use them all at once in the same kernel. If pointers to pointers and pointers to buffers are very different things, then yes, I'm talking about pointers to buffers. But it seems that buffers are treated as pointers in the __kernel declaration (maybe I'm wrong).

You're right and you're wrong. It is true that by the time the buffer reaches the kernel it's a standard pointer. The problem is that to achieve that state the runtime has to unpack the buffer object to extract the pointer, decide if the buffer is on the device or not and perform appropriate copies to ensure that it is if necessary.

If you hide that buffer inside an array the runtime has to be able to analyse the array to know that it contains buffers. Obviously you might want to generalise this further into structs of buffers or whatever.

In a programming language with reflection this isn't much of a problem. You can go and look at your data structures and analyse them properly. C, unfortunately, is not a very sophisticated language. It's very hard to do this sort of thing well - to get the API to work we might need a sequence of API calls to allow the user to describe the structure. Or we might limit it to a special "array of buffers" type, but then there would have to be a good design decision about why an array of buffers is necessary but a structure of buffers isn't.

You're not the only person who wants it, though, and I'll go as far as to say it has been discussed and may even happen at some point.


Would you prefer opencl/opengl on a higher level language?

Well, I'm not really thinking about the API side, I'm thinking more in-kernel. It's a complicated question, though. A high level language might be the right place as long as it keeps enough information that the CL compiler doesn't have to do stupid things like assume aliased pointers because it doesn't know any better.

I read C++ support as meaning in-kernel as well. I'd like to see that too.


I should add: Obviously comments in this thread are personal opinions. They may or may not relate in any way to any plans AMD or Khronos has.

Journeyman III

Suggestions for OpenCL 2

Originally posted by: LeeHowes I think CL isn't the right place for that kind of abstraction. CL is a pretty-print of the device's intermediate language; you can't expect abstraction because you'd never be able to tune for optimality. What would be nice would be people developing high level languages on top of OpenCL (and I don't mean pragmas for C++), but that can come with time.


I don't agree with the view of OpenCL as an IL. If I'm writing a tool to generate those abstractions and the generated CL code must be specific to each GPU vendor, I would rather use each vendor-specific IL (CAL IL and PTX): the performance will be better, and the assembly-like syntax is easier for automated tools to consume.


Originally posted by: LeeHowes 

Global sync and spawning to fill are ok - you have to be careful because a lot of reasons for doing that end up occupying the device for too long.


"Spawning to fill"? Since it already has a name, may I assume it's on the roadmap?



Originally posted by: laobrasuca I can't see C programming without pointers at all. C++ containers are cool, but slower, not to mention the java/c# stuff. Would you prefer opencl/opengl on a higher level language? I think the API as it is now is kind of well done, even if some stuff is maybe missing. Everyone who codes for performance (fast computing and strict memory usage) does so in C or C++ (avoiding STL containers as much as possible), if not in Fortran (yey).

Comparing to Java/C# is not remotely fair: those two are JITed, so the compiler doesn't have the same time to analyze the code. Nor is native C with C++ containers a fair comparison, since the containers are built on top of pointers. Let's stick with C against C without pointers.

Pointers may hurt performance more than they help. A pointer just indicates a place in memory that may be anywhere: the data may have any layout, many pointers may reference the same place, and it may be read or written. The compiler can't make many assumptions about it because aliasing is hard to predict before run time. A buffer is easier: the compiler can make assumptions about where it starts and where it ends, whether it's read-only (which allows more aggressive caching, and even explicit caching in software), how it's aligned, and even whether the data is actually spread over many places in memory.

That last one is the most interesting. On Cypress (actually, on any GPU), the optimal stride when accessing global memory is between 16 and 32 bytes, while for the LDS it's between 4 and 8 bytes. The LDS case exposes a very common problem: __local float4 arrays used to always cause bank conflicts, dropping LDS speed by half. After some driver releases, the compiler now stores each component in a different place. Now imagine that array is accessed through a pointer, and the user does (float*)(void*)x where x is a float4*. Yes, it will work with the current compiler, because the compiler adds several instructions to make it work, but this optimization would be simpler if there were no pointers, as in DX11 compute. And now move this problem from the scope-limited LDS to global memory, with structures larger than 32 bytes...

P.S.: (datatype1*)(void*)(datatype2*)x has undefined behavior in C, and there is a reason for that, even though many apps rely on it. In CL this has defined behavior.


Suggestions for OpenCL 2

If CL looks like C, isn't low level what we're going to expect? C doesn't magically vectorise for Cell, SSE, etc. How should CL be efficiently mapped to varying vector sizes? If you program naively it will of course still run, but then you end up with barriers everywhere and inefficient code. People need to program to the vector size with CL; I don't think that can be avoided while it is still low level. It still saves you knowing the assembly languages and instruction scheduling, in the way C does.

Interesting observation about the name. But no, I just made the name up off the top of my head. It has come up in discussions, though.

Adept I

Suggestions for OpenCL 2



I'd like to point a few things I'd like to see in OpenCL 2.x:


1. Automatic multi-device scalability. For example, let's say I have a very large image to filter and 3 GPUs. It'd be really good to be able to clEnqueueNDRangeKernel(device1+device2+device3) and have OpenCL distribute the load and manage memory accordingly;

2. A native way to transfer data in a compressed form (since bandwidth is almost always an issue).

3. Data types with greater precision than double. For instance, a data type "quadruple" or "quaddouble" that takes up the same storage as a float4 and operates at 128-bit precision. This would be really useful for computing residues in scientific software. For host-device communication, the structure could be something like 4 floats which, when summed, give the desired result.

I have no idea how to implement any of these, but they'd be major contributions. Any hardware people out there who can point out how feasible these suggestions are?

Adept II

Suggestions for OpenCL 2

Another suggestion:


- Please, add a flag to see whether an OpenCL GPU device has a watchdog attached. Currently it's a pain to execute kernels that require a long execution time, because Windows XP's watchdog pops up, resetting the graphics driver and aborting your program's execution.


If you add this flag to clGetDeviceInfo, we could simply ignore the primary adapter's device and use another GPU present in the system.



Adept I

Suggestions for OpenCL 2

I would suggest:

- atomic float operations (especially float addition)

- access to what the compiler did (how did it pad the data structures?)  This could greatly help to ensure that data is transferred properly between device/host, without having to spend a lot of time reverse-engineering what the compiler has done to your data on host / device sides.

- Access to the types of kernel arguments, so that a higher level library can ensure that arguments are passed in the correct order at runtime.  In particular, think of adding something like reflection here, so that libraries that wrap OpenCL can ensure that kernel arguments are passed safely and correctly.


Suggestions for OpenCL 2

1: Let me join the people who are ready to kill for C++ support in kernels.

I am trying to add GPU acceleration to an enormous physics framework which, being a scientific app, is mixed Fortran, C and C++. The code is full of externs, but what makes it hardest are the classes, function pointers, etc. I am only porting a manageable part of the code, to prove GPU acceleration is not unreachable in a large, serial application like this.

2: Function pointers alone are useful enough.

3: Libraries inside kernels.

Libraries can be achieved even now if I append .cl sources one after the other and compile the result, but mathematical libraries (for instance) used this way result in excess register usage.

Journeyman III

Suggestions for OpenCL 2

Originally posted by: bubu Please, add this:

6. Add some reduction macros or functions for +, -, *, /, min/max, etc. And add a quick-sort (or radix sort) intrinsic.




I think it is difficult to provide flexible reduction functions.

But it is easy to provide simple reduction functions within a group.

These would not only simplify coding but also enable some optimizations.

Here is an example.

The function group_reduction_sum calculates the sum of the arguments passed by each thread in a group.

The optimized version can save shared memory.

#define SIZE 1024
#define WAVESIZE 64

__kernel void test(__global uint *i, __global uint *o)
{
    uint id = get_global_id(0);
    uint grid = get_group_id(0);
    uint mydata = i[id];
    uint sum;
    //sum = group_reduction_sum(mydata);
    if (0) { // normal version
        __local uint buf[SIZE];
        uint lid = get_local_id(0);
        buf[lid] = mydata;
        barrier(CLK_LOCAL_MEM_FENCE);
        uint stride = SIZE >> 1;
        for (; lid < stride; stride >>= 1) {
            buf[lid] += buf[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        sum = buf[0];
    }
    { // optimized version for Evergreen
        __local uint buf[SIZE / WAVESIZE];
        __local uint buf2[WAVESIZE << 1]; /* shared by all groups on the compute unit */
        uint lid = get_local_id(0);
        uint wid = lid & (WAVESIZE - 1);
        uint odd_even = lid & WAVESIZE;
        uint stride = WAVESIZE >> 1;
        mem_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
        // { should be within 1 ALU clause
        buf2[wid + odd_even] = mydata;
        for (; stride > 0; stride >>= 1) {
            buf2[wid + odd_even] += buf2[(wid ^ stride) + odd_even];
        }
        sum = buf2[odd_even];
        // }
        mem_fence(CLK_LOCAL_MEM_FENCE);
        if (wid == 0) {
            buf[lid / WAVESIZE] = sum;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        for (stride = SIZE / WAVESIZE / 2; lid < stride; stride >>= 1) {
            buf[lid] += buf[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        sum = buf[0];
    }
    if ((id & (SIZE - 1)) == 0) { /* parentheses needed: == binds tighter than & */
        o[grid] = sum;
    }
}