Free fglrx from the clutches of Xserver. (It might be driver related)
At the moment the Linux driver is very intimately tied to the X server, but HPC applications would welcome being able to load the driver without a graphical interface running in the background. I am only mentioning this because I would like to integrate OpenCL into a system which is SLC based (Scientific Linux CERN), and this OS is a MUST. It is a minimalistic Red Hat distro with many useful libraries included for scientific use, but the GUI is extremely unstable, and the monolithic grid infrastructure features worker nodes with GUI-less SLC. It would simplify things if drivers could be loaded the way they are for NV cards. This would also free VRAM, since the desktop wouldn't occupy memory, plus desktop rendering wouldn't hinder the default driver. ATM the default adapter (utilized for desktop rendering) has to be AMD, otherwise fglrx fails to detect AMD GPUs.
Proper 5970/6990 support, functioning as two independent devices.
It is clear that the differing strategies of monolithic die vs. sweet spot predestine AMD to create dual-GPU solutions if it wishes to compete in the GPGPU market and the high-end segment of gaming graphics. I really like the fact that AMD does not differentiate GPUs artificially by crippling GPU capabilities in order to sell the expensive versions for HPC computing (as GTX and Tesla differ in this sense, most significantly in DP performance). However, given that in each generation the flagship on the AMD front is a dual-GPU solution, it would be nice to have proper software support for these pieces of hardware.
If one has two single-GPU cards connected via CrossFire, it can be disabled through software and both cards can be utilized properly. In my mind this can only be impossible in the case of 5970 cards if something was hardwired into the HW that prevents disabling CF in software. Although the 6990 is not out yet, I very much hope AMD did not design it the same way the 5970 was done in this matter. I do not know how much extra performance can be squeezed out by hardwiring things, but whatever is done differently from a regular CF connector is not worth all the aggravation that comes with creating software support. Gaming performance of these cards is largely determined by game engine design and not by hardware aids. I very much hope the 6990 will be free of the 5970's illnesses, but if not, 7xxx MUST start right off with full support (IMHO) if AMD wishes to participate in the HPC market.
full DP support with cl_khr_fp64.
expose GDS counters, so we can have very fast append buffers and other goodies from fast global atomics.
and make this thread sticky on top.
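For anyone wondering what an append buffer built on a fast global counter looks like, here is a minimal CPU-side sketch in Python. It only illustrates the pattern and is not AMD's API: `AtomicCounter`, `fetch_inc` and `append_if` are made-up names, and a lock stands in for the hardware atomic. Each work-item that has something to output bumps the counter and writes into the slot it got back.

```python
import threading


class AtomicCounter:
    """Stand-in for a hardware counter; the lock emulates atomicity."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_inc(self):
        # Return the old value and increment, like an atomic increment would.
        with self._lock:
            old = self._value
            self._value += 1
            return old

    @property
    def value(self):
        return self._value


def append_if(values, predicate, num_workers=4):
    """Append every value that satisfies predicate to a shared output buffer."""
    counter = AtomicCounter()
    out = [None] * len(values)  # worst case: every item gets appended

    def work_item(start):
        # Strided loop: one Python thread stands in for many work-items.
        for i in range(start, len(values), num_workers):
            if predicate(values[i]):
                slot = counter.fetch_inc()  # reserve a unique output slot
                out[slot] = values[i]

    threads = [threading.Thread(target=work_item, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out[:counter.value]


data = list(range(20))
evens = append_if(data, lambda x: x % 2 == 0)
```

The order of the appended items is not deterministic, which is the usual trade-off with append buffers; only the set of items is guaranteed.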
experimental C++ features as extension (templates and classes most importantly)
These features can be reused when OCL standard shifts from C99 to C++. It would be nice if AMD could take the lead in software support, even if by this little. We could start working ahead too.
A list of dependencies containing kernel and library versions instead of just 3 major distributions in their current version.
This would ease deployment in many cluster environments where only custom and seldom updated distributions are available.
Reduce latencies on kernel launch and synchronization points.
Pls excuse my English... and here are 17 suggestions for you from our team. Some are ATI-specific but others are pretty generic, I hope you don't mind. I have more, but I really don't want to make the post too large and difficult to read...
0. Make your LLVM JIT compiler open source. People could then detect bugs or suggest optimizations more easily. Also, please tell Khronos to make the ICD trampoline's source code public.
1. Get rid of the console window on clBuildProgram. Popping up a console window simply does not look very professional... although it's just an aesthetics problem really.
2. Solve the N/A in the SKA tool for large kernels. Without this, I simply cannot optimize properly. Our kernels ( we're using more than 3k lines ) always emit N/As. Very frustrating.
3. Modify the SKA tool to save precompiled kernels as Intel's OIC tool does. This would reduce kernel loading times, give better control over optimizations/register pressure/compatibility across drivers, and protect your source code.
This is very important because if you ask developers of some commercial apps why they use CUDA, they may say "We simply don't feel comfortable shipping our kernels' source code with the app".
I personally like CUDA because kernel precompilation makes my code much more robust and better optimized across drivers ( imagine: the FW 190 CL JIT compiler outputs 16 regs... but FW 260 outputs 18 regs, ruining my occupancy computations! That happens a lot btw )
4. Create a static code analyzer with two primary targets: to detect bugs, buffer overruns, etc. ( like cppcheck does with C++ code ) and to detect performance problems ( for instance, emitting a warning when the code is not coalescing its global memory accesses correctly ).
5. Make a visual debugger like NVIDIA's Nsight ( although I would make it a portable standalone app instead ). I personally think that printf is not enough for debugging effectively. I'm interested in placing breakpoints, inspecting variables and looking at the call stack. It would also be interesting to implement a "run to cursor" to skip several lines or to execute previous lines of code. Another problem with printf is that we use the CPU to debug, but we can get slightly different results from the GPU.
6. Allow us to use OpenCL from a Windows Service without having to log in, as NVIDIA's Tesla drivers do. This is critical for computing clusters: the server guys simply don't like to log in to the 10000 nodes... Seriously, we need a way in Windows to use OpenCL without having to be logged in as a user.
For Linux, pls allow us to call OpenCL without having X Windows running ( although we could lose the GL interop ).
7. Improve compatibility with Unix/Linux distros. I have the impression you prioritize Windows/Mac, but those OSes' presence in the HPC world is very small. Just see the official Top500 Supercomputer 2010 list...
( Windozeds, pls don't enter there unless you've a strong heart! )
On the other hand, a BSD or Solaris port wouldn't hurt either, because there are a lot of databases and server apps that could experiment with OpenCL...
8. Contact the developers of the most popular GPGPU apps/libs and offer them complete support and hardware. We need some practical examples / success stories / programs using GPGPU. Seriously, you need to be much more aggressive. Much more. That should include not only commercial projects but also popular free/open source ones. Supporting just a bunch of popular apps/libs will translate directly into increased sales for you.
Did I say "GPGPU app store"? Nope, but sending a few emails asking "What do you need in order to use OpenCL and to improve your performance and compatibility with ATI cards/AMD CPUs for your app XXXX?" to a bunch of key programmers can be very simple and effective.
9. Invest in training/learning. Improve the docs. Write books like NVIDIA's GPU Gems series and hold worldwide conferences like NVIDIA's GTC 2010. Btw, I also think you should include some of these PDF docs in the SDK directly instead of linking to your SDK docs web page.
Also, add new examples ( which are always welcome ). For instance, I would like to see a simple and clear radix sort of integer key/value pairs.
Another interesting one: sort a transparent triangle soup and show it using GL/DX interop ( something like you did in your DX OIT demo, but much simplified ).
Another one: a simple triangle rasterizer.
10. We need some extensions or changes for the next CL spec: mipmapping, cube maps, images in structs/arrays of images of different size, C++/template support, function overloading and recursion, function pointers, dynamic memory allocation ( yes, calloc/malloc/realloc inside kernels ), explicit control of the memory caches ( I want to specify manually what/when should be cached and what/when not ), concurrent kernels/hyperthreading, device fission for GPUs, etc...
11. We need more **official** higher-level libraries like BLAS, FFT, reductions, qsort, STL containers, MD5/SHA1, etc... Making these libs open source would be fantastic, so people could improve them and find bugs.
Why did I say "official"? Well... of course these libs could be developed by a 3rd party... but it would be fantastic to get them directly from you, optimized and tested for your specific platform, ready to use.
12. We need a way to control the GPU's priority and to abort time-consuming kernels. Let me give an example: imagine the user can only use one GPU, the one attached to the monitor. Imagine the GPU task is very intensive, so the OS's GUI is going to lag a lot...
Almost all the artists I know want to continue using their computer while the render is computed in the background. If a kernel takes all the GPU resources, they won't be able to continue working until the CL task is done.
Perhaps you could add an extension to pass a value from 0 to 100 to a clXEnqueueNDRange() function to control the kernel execution's priority.
On the other hand, for a very time-consuming task that cannot be done progressively or in multiple passes, we'll need a reasonable way to abort it with the watchdog disabled.
The ideal would be to make the GPU support preemptive (or similar) multitasking and to add to Catalyst's CCC a task manager like the one Windows uses... but, I know, this can be very hard to implement.
13. Please, support dual-GPU cards better. You should properly support your flagship GPUs, omg!
14. Remove the environment variable limitations. Let the user allocate all the available memory if the app requires it ( the user knows what he's doing; an automatic limit does not ). Also, increase the maximum 1D image size to CUDA's levels ( 1D texture == 2^27 elements max ), allowing us to use "jumbo" 1D-linear textures.
15. Implement clBuildProgram's optimization options: -cl-fast-relaxed-math, -cl-mad-enable, -cl-no-signed-zeros, etc... which ( I think ) are currently ignored. Also make sure your compiler is properly optimizing variables flagged const/restrict/register.
16. CL wrappers. Wrapper writers would also need the CL functions "typedefed" as you did with GL. I think Pyrit's programmers have been asking for this for ages...
On the other hand, I would like to see some of the SDK examples written in languages other than C/C++. For example, in Java, C#, VB, Python, Objective-C, etc...
17. Memory virtualization. One of the problems with GPGPU is the relatively small amount of RAM that the cards have. Perhaps a solution is to add a flag to clCreateBuffer/Image to specify that we want to use memory virtualization. That way we could use buffers much larger than the card can store in VRAM. You could page over the PCI bus or do something similar to what the x86 architecture does.
As the programmer will mark the buffer manually with that flag, he will be aware of the performance problems that can occur due to the PCI-memory swapping... but hey! If he wants to use it, it's because he accepts trading some speed for using more memory than the card has.
And, now that I mention the word "virtualization" ( although in a completely different context ), you could also talk with Parallels, Xen, VMware, etc... to define an optimal interface for using OpenCL across OSs.
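To make suggestion 17 a bit more concrete, here is a toy Python model of a buffer that is bigger than "VRAM" and gets paged in on demand. Everything in it is an assumption for illustration: `PagedBuffer`, the page size and the LRU eviction policy are made up, and no such flag exists in OpenCL today. The transfer counter is the interesting part, since it shows what the swapping would cost.

```python
from collections import OrderedDict


class PagedBuffer:
    """Toy model: full buffer lives on the host, a few pages are 'resident'."""

    def __init__(self, host_data, page_size, resident_pages):
        self.host = list(host_data)     # backing store in host RAM
        self.page_size = page_size
        self.capacity = resident_pages  # number of pages that fit in 'VRAM'
        self.resident = OrderedDict()   # page index -> page contents
        self.transfers = 0              # page transfers over the bus

    def _load(self, page):
        if page in self.resident:
            self.resident.move_to_end(page)  # mark as most recently used
            return self.resident[page]
        if len(self.resident) >= self.capacity:
            # Evict the least recently used page and write it back.
            # (A real implementation would only write back dirty pages.)
            victim, contents = self.resident.popitem(last=False)
            start = victim * self.page_size
            self.host[start:start + len(contents)] = contents
            self.transfers += 1
        start = page * self.page_size
        self.resident[page] = self.host[start:start + self.page_size]
        self.transfers += 1
        return self.resident[page]

    def read(self, index):
        page, offset = divmod(index, self.page_size)
        return self._load(page)[offset]

    def write(self, index, value):
        page, offset = divmod(index, self.page_size)
        self._load(page)[offset] = value


# 1000 elements, but only 2 pages of 100 elements fit on the 'device'.
buf = PagedBuffer(range(1000), page_size=100, resident_pages=2)
total = sum(buf.read(i) for i in range(1000))
sequential_transfers = buf.transfers
```

A sequential sweep touches each page once, so the cost stays modest; a random access pattern over the same buffer would thrash, which is exactly the speed-for-capacity trade the flag would ask the programmer to accept.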
Woah, seems I just wrote a book Zzzzzzzzzzz
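Since suggestion 9 asks for a simple and clear radix sort of integer key/value pairs, here is a CPU reference sketch in Python of what such an SDK sample would compute. It is not a GPU implementation; `radix_sort_pairs` is just an illustrative name, and unsigned 32-bit keys with 8-bit digits are assumptions. The three steps per pass (histogram, exclusive prefix sum, stable scatter) are the ones a kernel version would parallelize.

```python
def radix_sort_pairs(keys, values, key_bits=32, radix_bits=8):
    """Stable LSD radix sort of (key, value) pairs with unsigned integer keys."""
    buckets = 1 << radix_bits
    mask = buckets - 1
    keys, values = list(keys), list(values)

    for shift in range(0, key_bits, radix_bits):
        # 1) Histogram of the current digit.
        counts = [0] * buckets
        for k in keys:
            counts[(k >> shift) & mask] += 1

        # 2) Exclusive prefix sum: first output slot for each digit.
        offsets = [0] * buckets
        running = 0
        for d in range(buckets):
            offsets[d] = running
            running += counts[d]

        # 3) Stable scatter of keys together with their values.
        out_keys = [0] * len(keys)
        out_values = [None] * len(keys)
        for k, v in zip(keys, values):
            d = (k >> shift) & mask
            out_keys[offsets[d]] = k
            out_values[offsets[d]] = v
            offsets[d] += 1
        keys, values = out_keys, out_values

    return keys, values


sorted_keys, sorted_values = radix_sort_pairs(
    [170, 45, 75, 90, 802, 24, 2, 66, 45],
    ["a", "b", "c", "d", "e", "f", "g", "h", "i"],
)
```

Because every pass is stable, the two pairs with key 45 keep their original relative order ("b" before "i").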
Top ten suggestions win a Zacate APU MB as prize, right, Himanshu?
In addition to GDS, expose GWS via a simple function call. Atomics can do the trick, but surely compiler gurus can create a better solution than I can.
+1 on the break away from X.....PLEASE
This would make it infinitely easier to use AMD GPUs on clusters, also removing the need for users to own the X server and such stuff. I understand that less money is made from GPU clusters than from graphics, but it would be really nice...
Ideally an AMD roll for Rocks clusters would be awesome
Another thing would be more accurate event handling....
bubu, they've already sorted out suggestion no. 1
1. For me, improve OpenGL/OpenCL interoperability. As of now, a lot of delay is introduced in the create/acquire buffers step, especially when manipulating big buffers (> 256MB). Could we imagine OpenGL/OpenCL interoperability working like GLSL, meaning no need to acquire/release buffers all the time? It would be nice to have a render mode where, instead of calling vertex/geometry shaders, we would call an OpenCL kernel, with proper inputs and outputs specific to rendering. At this date, OpenCL is more flexible than shaders (?); maybe someday OpenGL/OpenCL interoperability will disappear and be replaced by improved shaders? It seems to me it's more of a conceptual matter than a technical difficulty.
2. Pointers to buffers (or something similar). I want to be able to process hundreds or thousands of buffers with a single kernel call. This is maybe not an SDK problem, but more generally an OpenCL spec problem. If you can fight for it on the OpenCL committee with Khronos I would send you flowers
3. Get rid of the high kernel launch latency on GPU devices. Unfortunately, the 2.3 version has not completely addressed this point. Hope things can get improved. I think it is something related to the GPU hardware, as CPU devices do not suffer from this problem. Maybe something related to power management?
I agree with the above especially regarding the CL/GL interop improvement, high-level libraries (FFT, BLAS, etc) and full fp64 support.
I'd like to add the following:
- Possibility to use GPUs without attached monitors on Windows - I'm still unable to use the GPU unless I plug in a monitor or a dummy plug with a Radeon 4970; is there any update I'm not aware of?
- Image processing toolkit - Set of optimized functions to import popular formats like JPS, TIF etc and videos;
One more thing based on a forum post:
- GPU reboot - Some way to reboot the GPU when the CL code freezes, just like Windows does with its timeout detection (the TdrDelay registry setting). I have in mind a command to be issued by the application when the user presses some combination of keys.
Overlapping transfer/computation, so we can transfer one buffer while computing from another.
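A back-of-the-envelope Python sketch of why this matters, assuming a simple double-buffered pipeline where the copy engine and the compute units can run at the same time, every chunk costs the same, and only uploads are counted. The function names and the millisecond figures are made up for illustration.

```python
def serial_time(n_chunks, t_transfer, t_compute):
    """Each chunk is transferred, then computed, with nothing overlapping."""
    return n_chunks * (t_transfer + t_compute)


def overlapped_time(n_chunks, t_transfer, t_compute):
    """Chunk i+1 is transferred while chunk i is being computed."""
    if n_chunks == 0:
        return 0
    # The first transfer and the last compute cannot be hidden; in between,
    # each step costs whichever of the two stages is slower.
    return t_transfer + (n_chunks - 1) * max(t_transfer, t_compute) + t_compute


# Hypothetical numbers: 8 chunks, 3 ms to upload each, 5 ms to process each.
no_overlap = serial_time(8, 3, 5)
with_overlap = overlapped_time(8, 3, 5)
```

With these numbers the run drops from 64 ms to 43 ms, i.e. almost all of the transfer time is hidden behind the compute.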
I completely agree with everyone's suggestions, but I think that freeing fglrx from the X server should be a top priority.
For example, how will the APP SDK run on Fusion? Will AMD force every vendor to use the X server? Now that distributions (Ubuntu) are switching to other display servers (Wayland), the X server may stop being the standard.
Support for the standard cl_khr_fp64 should be preferred to the vendor cl_amd_fp64.
Overlapping computation with memory transfers is not only a must for the SDK but would be an excellent feature for the standard.
I'd like a compiler that doesn't prioritise speed of compilation over quality. Perhaps some kind of option?
What I see repeatedly is that the compiler produces absurd GPR allocations. Some of these are clearly cases where no global GPR allocation tuning is performed. The compiler "just gives up" because the code is "too long".
At some point you have to get out of the JIT mindset. Some of us compile a kernel once and then run the kernel for hours. Or days.
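To show why a couple of extra registers is not a cosmetic issue (the 16 vs 18 regs case mentioned earlier in the thread), here is a rough occupancy calculation in Python. The register file size, the thread limit and the work-group size are hypothetical round numbers, not the specs of any particular GPU, and the model ignores allocation granularity and local memory limits.

```python
def resident_threads(regs_per_thread, group_size, regfile_size, max_threads):
    """Threads that can be resident on one compute unit at the same time."""
    groups_by_regs = regfile_size // (regs_per_thread * group_size)
    groups_by_threads = max_threads // group_size
    return min(groups_by_regs, groups_by_threads) * group_size


def occupancy(regs_per_thread, group_size, regfile_size, max_threads):
    """Fraction of the compute unit's thread slots that are actually used."""
    resident = resident_threads(regs_per_thread, group_size,
                                regfile_size, max_threads)
    return resident / max_threads


# Hypothetical compute unit: 16384 registers, at most 1024 resident threads,
# kernels launched with work-groups of 256 threads.
occ_16_regs = occupancy(16, 256, regfile_size=16384, max_threads=1024)
occ_18_regs = occupancy(18, 256, regfile_size=16384, max_threads=1024)
```

Under these assumptions, going from 16 to 18 registers per thread costs a whole work-group per compute unit (4 groups down to 3), so occupancy falls from 100% to 75%. That is why some of us would gladly wait longer for a compiler that tries harder on register allocation.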