bubu, they've already worked out suggestion no. 1
1. For me, improve OpenGL/OpenCL interoperability. As of now, a lot of delay is introduced at the create/acquire buffers step, especially when manipulating big buffers (> 256MB). Could we imagine OpenGL/OpenCL interoperability working like GLSL, meaning no need to acquire/release buffers all the time? It would be nice to have a render mode where, instead of calling vertex/geometry shaders, we would call an OpenCL kernel, with proper inputs and outputs specific to rendering. At this date OpenCL is more flexible than shaders (?), so maybe someday OpenGL/OpenCL interoperability will disappear and be replaced by improved shaders? It seems to me it's more of a conceptual matter than a technical difficulty.
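For context, the per-frame round-trip being complained about looks roughly like this on the host side (a sketch only; error checking is omitted, and `queue`, `kernel`, `globalSize` and the shared `clBuffer` are assumed to be set up elsewhere):

```c
// Per frame: hand the GL buffer over to CL, run the kernel, hand it back.
// For large buffers (> 256MB) each acquire/release round-trip can stall badly.
glFinish();                                    // ensure GL is done with the buffer
clEnqueueAcquireGLObjects(queue, 1, &clBuffer, 0, NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &clBuffer);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clEnqueueReleaseGLObjects(queue, 1, &clBuffer, 0, NULL, NULL);
clFinish(queue);                               // ensure CL is done before GL draws
```

The suggestion above is essentially that this whole dance should be implicit, the way GLSL shaders share GL state with no explicit handover.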
2. Pointers to buffers (or something similar). I want to be able to process hundreds or thousands of buffers with a single kernel call. This is maybe not an SDK problem, but more generally an OpenCL spec problem. If you can fight for it on the OpenCL committee at Khronos, I will send you flowers.
3. Get rid of the high kernel launch latency on GPU devices. Unfortunately, the 2.3 version has not completely addressed this point; I hope things can be improved. I think it is something related to the GPU hardware, as the CPU does not suffer from this problem. Maybe something related to power management?
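For anyone wanting to quantify this, one rough way is to compare the QUEUED and START timestamps of a kernel event (a sketch; assumes `queue` was created with CL_QUEUE_PROFILING_ENABLE and `kernel`/`globalSize` exist):

```c
// Approximate launch latency: time between enqueue and actual start of execution.
cl_event evt;
cl_ulong queued, start;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED,
                        sizeof queued, &queued, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof start, &start, NULL);
printf("launch latency: %lu ns\n", (unsigned long)(start - queued));
clReleaseEvent(evt);
```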
I agree with the above, especially regarding the CL/GL interop improvement, high-level libraries (FFT, BLAS, etc.) and full fp64 support.
I'd like to add the following:
- Possibility to use GPUs without attached monitors under Windows - I'm still unable to use the GPU unless I plug in a monitor or a dummy plug with a Radeon 4970; is there any update I'm not aware of?
- Image processing toolkit - a set of optimized functions to import popular formats like JPS, TIF, etc., and videos;
One more thing based on a forum post:
- GPU reboot - some way to reset the GPU when the CL code freezes, just like Windows does with the TdrDelay registry settings. I have in mind a command to be issued by the application when the user presses some combination of keys.
I completely agree with everyone's suggestions, but I think that freeing fglrx from the X server should be a top priority.
For example, how will the APP SDK run on Fusion - will AMD force every vendor to use the X server? Now, with distributions (Ubuntu) switching to other display servers (Wayland), the X server may stop being the standard.
Support for the standard cl_khr_fp64 should be preferred to the vendor-specific cl_amd_fp64.
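In kernel source today, portable double-precision code typically has to check both names; preferring the Khronos extension first would look like this (OpenCL C kernel fragment, for illustration):

```c
// Prefer the standard extension; fall back to the AMD vendor extension.
#if defined(cl_khr_fp64)
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#elif defined(cl_amd_fp64)
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#else
#error "double precision not supported on this device"
#endif

__kernel void scale(__global double *x, double a) {
    size_t i = get_global_id(0);
    x[i] *= a;
}
```

If the SDK advertised cl_khr_fp64 directly, the vendor branch (and the divergence between vendors) would disappear.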
Overlapping computation with memory transfers is not only a must for the SDK but an excellent feature for the standard.
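The usual way to attempt this today is double buffering across two command queues, with events ordering the upload against the compute (a sketch only; `ctx`, `dev`, `kernel`, `buf`, `host`, and the chunk sizes are assumed, and the events needed to guard buffer reuse on the compute side are omitted for brevity):

```c
// Double-buffered pipeline: while chunk i computes on qExec,
// chunk i+1 uploads on qCopy. Events order the two queues.
cl_command_queue qCopy = clCreateCommandQueue(ctx, dev, 0, NULL);
cl_command_queue qExec = clCreateCommandQueue(ctx, dev, 0, NULL);
cl_event upload[2];

for (int i = 0; i < nChunks; i++) {
    int b = i % 2;                              /* ping-pong buffer index */
    clEnqueueWriteBuffer(qCopy, buf[b], CL_FALSE, 0, chunkBytes,
                         host + (size_t)i * chunkBytes, 0, NULL, &upload[b]);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[b]);
    clEnqueueNDRangeKernel(qExec, kernel, 1, NULL, &chunkItems, NULL,
                           1, &upload[b], NULL);
}
clFinish(qExec);
```

Whether the transfer actually overlaps the compute is up to the runtime and hardware, which is exactly why first-class support in the SDK (and standard) is being requested.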
I'd like a compiler that doesn't prioritise speed of compilation over quality. Perhaps some kind of option?
What I see repeatedly is that the compiler produces absurd GPR allocations. Some of these are clearly cases where no global GPR allocation tuning is performed. The compiler "just gives up" because the code is "too long".
At some point you have to get out of the JIT mindset. Some of us compile a kernel once and then run it for hours. Or days.
5. Make a visual debugger like NVIDIA's Nsight (although I would make it a portable standalone app instead). I personally think that printf is not enough for effective debugging. I'm interested in placing breakpoints, inspecting variables and looking at the call stack. It would also be interesting to implement a "run to cursor" to skip several lines or to execute previous lines of code. Another problem with printf is that we use the CPU to debug, but we can get slightly different results from the GPU.
That and global sync and I'm good.
Originally posted by: Meteorhead
Free fglrx from the clutches of Xserver.
Proper 5970/6990 support, functioning as two independent devices
Originally posted by: ibird
Reduce latencies on kernel launch and at synchronization points.
Originally posted by: bubu
0. Make your LLVM JIT compiler open source.
14. Remove the environment variable limitations.
Originally posted by: douglas125
- Possibility to use GPUs without attached monitors
1a) Persistent registers for long numerical solvers which must transfer control to the CPU from time to time.
1b) Device can trigger memory transfers to host, and events that the host can act on ... while the device remains in the solver loop.
2) Headless X-less support in linux drivers
3) Multi-GPU support in OpenCL ... until then I'm stuck with CAL/IL
4) Multi-GPU interleaving host<->device transfers with compute
5) GDS, double its size each generation.
6) Fusion device that can keep up with a 6970, but with a unified memory space with the CPU.
Now that is an amazing effort from the developer community waiting to use OpenCL for serious purposes. It is great to see such extensive and illustrative feedback. Thank you all for your time and effort.
Command and control (C2)
At the moment OpenCL offers only the queue mechanism for C2.
Please, add the pipe mechanism (at least).
We have used both pipes and queues for command and control between processes and threads in multiprocessing/multi-core programming on the CPU side. It has proved efficient - a queue to put/get data and a pipe to send/receive command-and-control messages.
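The CPU-side pattern described above can be sketched in plain POSIX C (the function name `run_c2_demo` and the command codes are ours, purely for illustration): a pipe carries small control messages from a controller process to a worker, and a second pipe carries acknowledgements back.

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative command codes for a control channel between a controller
 * process and a worker process. */
enum cmd { CMD_RUN = 1, CMD_STOP = 2 };

/* Fork a worker; the controller sends commands down one pipe, and the
 * worker acknowledges each command by echoing it back on a second pipe.
 * Returns 0 on success, -1 on any failure. */
int run_c2_demo(void) {
    int to_worker[2], from_worker[2];
    if (pipe(to_worker) != 0 || pipe(from_worker) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {                           /* worker process */
        close(to_worker[1]);
        close(from_worker[0]);
        int cmd;
        while (read(to_worker[0], &cmd, sizeof cmd) == (ssize_t)sizeof cmd) {
            write(from_worker[1], &cmd, sizeof cmd);   /* acknowledge */
            if (cmd == CMD_STOP)
                break;
        }
        _exit(0);
    }

    /* controller process */
    close(to_worker[0]);
    close(from_worker[1]);

    int cmds[] = { CMD_RUN, CMD_STOP };
    for (int i = 0; i < 2; i++) {
        int ack = 0;
        write(to_worker[1], &cmds[i], sizeof cmds[i]);
        if (read(from_worker[0], &ack, sizeof ack) != (ssize_t)sizeof ack
            || ack != cmds[i])
            return -1;
    }
    close(to_worker[1]);
    close(from_worker[0]);
    waitpid(pid, NULL, 0);
    return 0;
}
```

An OpenCL analogue of this would let the host push control messages to a long-running kernel without tearing down and relaunching it.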
At the moment the OpenCL approach is geared more toward one CPU and one GPU.
Please, add more support for multiprocessing/multi-core programming.
We have found that while multiprocessing/threading improves multi-core usage, it also improves whole-system performance and frees up resources for more tasks - in the OpenCL case it would boost CPU-side performance.