Hi everybody,
I am preparing a list of top feature requests for AMD's OpenCL implementation. I will be looking through as many old forum topics as I can, but there can always be fresh input from you, so I have created this thread.
It would be nice if you also mentioned some key advantages of each feature. Obviously we cannot guarantee that every request will be fulfilled in SDK 2.4, but the important requests will be added to the SDK roadmap and most probably implemented at whatever point AMD considers appropriate as per timelines and priorities.
I hope you will grow it fiercely.
Edit: Made post sticky.
Originally posted by: laobrasuca Originally posted by: nou include output of CLInfo from all supported GPUs.
and correct the "bug" in the current CLInfo implementation, where, if the OpenCL 1.1 macro is defined, all OpenCL 1.0 platforms will crash because some info is not available. It should first verify the OpenCL version of each platform (when you have more than one) instead of assuming that all platforms have the same OpenCL version.
This has been fixed already. CLInfo will not crash on an OpenCL 1.0 platform; instead it will exit with an error message.
This has been fixed already. CLInfo will not crash on an OpenCL 1.0 platform; instead it will exit with an error message.
Yes, you are right, I used the wrong word, my bad. It will not crash; it will lead to an error message indeed. What I was trying to say is that this error can be avoided if you don't use the CL_VERSION_1_1 macro as the ultimate reference for the OpenCL version of all platforms. A runtime if/else on CL_PLATFORM_VERSION could avoid this, instead of the compile-time "#ifdef CL_VERSION_1_1".
Originally posted by: Starglider Originally posted by: Meteorhead I will not be the heretic to copy-paste the feature-list of the new CUDA SDK 4.0, but let me post a link for those who are really curious.
The direct GPU->GPU memcopy, without having to go through host memory, is awesome. However, this feature would be useless in OpenCL without reliable, performant multi-GPU support first! This is yet more motivation to switch back to CUDA, as the app I am working on would benefit significantly from GPU->GPU DMA.
It should actually be possible to do PCI->PCI DMA, so that one could integrate AMD GPUs with InfiniBand hardware like http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=116&menu_section=34#tab-one for doing zero-copy RDMA: http://www.google.com/search?q=infiniband+zero+copy+rdma
Thanks Lucas and xyke for the suggestions.
Originally posted by: Meteorhead
Share GPUs across multiple threads
Use all GPUs in the system concurrently from a single host thread
Aren't these features already supported by OpenCL?
mrbpix,
This is already being discussed among AMD developers. Thanks for reporting.
Aren't these features already supported by OpenCL?
Yes, that gives a certain meaning to the word "easily". It's certainly nice to see areas where OpenCL has pushed CUDA development rather than the other way around.
Add a watchdog enabled/disabled flag to the clGetDeviceInfo().
Originally posted by: LeeHowes
Aren't these features already supported by OpenCL?
Yes, that gives a certain meaning to the word "easily". It's certainly nice to see areas where OpenCL has pushed CUDA development rather than the other way around.
By "easily" I meant that it does not go against OCL specs (apart from C++ features). I do not know jack about making drivers threadsafe for using multiple gpu from single host thread, so I can only imagine how hard achieving such a feat is, just the same way I do not know how hard it is to redesign the linux driver.
I only know these things are not impossible, they are highly anticipated and NV has made them work. (Although many people anticipate proper OCL 1.1 support from driver side for NV cards)
No, I mean it's really in the spec. 1.1 guarantees thread safety on almost all of the API calls; 1.0 did not, though the AMD implementation has always been implemented that way. Both multiple threads for one GPU and multiple GPUs for one thread have always worked in the AMD OpenCL implementation and are now specified to work officially in OpenCL. Of course, that ignores limitations of the AMD implementation in working well with multiple GPUs, which I'm aware people have complained about, but I'm not aware of any issues with thread safety of the API.
I'm not sure why NVIDIA has been lagging a little in releasing official OpenCL 1.1 support; I'm sure it works well, though. I'm sure they will get there in the end, as you say.
And also certain OpenCL kernel optimizations like:
In order of importance:
- Reduce kernel launch times
- Remove X server requirement
- Make global atomics not force the complete path for every global memory access in the kernel
Originally posted by: SiegeLord
- Remove X server requirement
+1
Another feature I would really like to have:
clGetDeviceInfo should return the GPU name if invoked with CL_DEVICE_NAME instead of the chipset name.
The chipset name makes it impossible for end users to determine whether the application is using the device it's supposed to be using. In addition, the current implementation is buggy anyhow, e.g. returning Cypress on the 5970, where I think it should be Hemlock. This makes it difficult for the developer to estimate the device performance (especially as MAX_CLOCK also doesn't report correct values).
Hello,
this is not quite directly software related, but might be useful to a number of people if it were made possible:
some kind of web-based service allowing the testing of OpenCL code on AMD hardware.
As an example, I have easy access to reasonable Nvidia cards, as they come in all of my laboratory's workstations. Having developed a simulation code in OpenCL, I would like to be able to test it on Radeon/FireStream cards, but am not prepared to go out and buy a card without having an idea of the performance my code could reach.
Sorry if this is slightly off topic!
Regards,
Olivier
"Increase buffer size limits (so that we won't have to rely on experimental environment variables like GPU_INITIAL_HEAP_SIZE and GPU_MAX_HEAP_SIZE)"
+1
I think I already had highly varying performance values between application runs because of this.
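For anyone hitting this today, the interim workaround being referred to looks like the snippet below. These variables are experimental and undocumented, their exact semantics may change between SDK releases, and the values shown (in MB) are purely illustrative:

```shell
# Experimental AMD APP SDK environment variables (values are in MB).
# Undocumented and subject to change; must be set in the environment
# before the OpenCL application starts.
export GPU_INITIAL_HEAP_SIZE=512
export GPU_MAX_HEAP_SIZE=1024
echo "$GPU_INITIAL_HEAP_SIZE $GPU_MAX_HEAP_SIZE"
```

Raising the built-in buffer size limits would make this per-process environment fiddling unnecessary.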
edit:
oh yea,
Especially as it's already there for Windows: zero-copy on Linux! That should really speed up my halo exchange, finally making my code work with more than one computer.
dukelato,
Thanks for your suggestion.
Here it is: http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=149708&forumid=9
Have Fun
Adding my vote for the following:
cl_khr_fp64 on GPU
Zero-copy allocations for APUs on Linux
Removal of display connection requirement
Also, overlapping computation and data transfer by specifying them in two different queues would be great.
Discussion in this thread: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=149671
Adding one more point:
It is my understanding that out-of-order queues are not currently supported? Support for out-of-order queues would be great.
Adding yet another point: I would also like to see smaller compile times when building OpenCL kernels. If you are building lots of large kernels, the compile times can be a non-trivial overhead.
edit: Compared to OpenCL, CAL compilation is really fast. I guess that is to be expected, but it does reduce the applicability of OpenCL, as you can no longer use lots of tiny kernels when the compilation overhead itself becomes large.
Pretty good ideas!
It is useful for AMD fans!
okay, I'm totally going to break this lurking streak and contribute something useful for once
Multi-card support.
I have used 8 Nvidia GPUs (4x GTX 295) on my system! But I can only use 1 GPU on this system, even though I have plugged in two 5970s.
This makes me give up on AMD GPUs.
qilux,
Can you give details about your problem? Many people have been able to use all 4 GPU devices from two 5970s, although officially only two are supported at present.
himanshu.gautam,
Why the restriction in Windows of not being able to use more than 4 GPUs, while in Linux more than 4 GPUs, up to 8 or more, can be used?
Is AMD promoting Linux and dumping Windows?
Have you heard about Bitcoin?
Bitcoin miners use OpenCL to write programs, and because AMD APP blocks more than 4 GPUs in Windows, Windows users can't use more than 4 graphics cards. I myself have 3 5870s and 4 6970s, totalling 7 graphics cards. Windows detects all the cards without a problem, but once I start mining, the miner crashes.
The miners I tried are at the following links, with source and binaries.
http://forum.bitcoin.org/index.php?topic=1334.0
http://forum.bitcoin.org/index.php?topic=6458.0
Both Poclbm and Phoenix were written with OpenCL, and the Windows binaries of both miners can't be used with more than 4 GPUs in a PC.
Actually, both programs crash instantly after starting if more than 4 graphics cards are enabled in Device Manager.
The only way to run the software is to disable graphics cards in Device Manager.
If you look into this matter, it will be really helpful.
Also, whom should I contact to make the Windows version of OpenCL support more than 4 GPUs?
Thank you.
dishwara
dishwara,
This place is appropriate enough. You can also file a help ticket, but you will just meet me there again 🙂
So, as per your observations, you are able to run the bitcoin miners when using 4 or fewer GPUs, but they crash instantly when you try to run with more. Am I correct?
It also appears you are able to run the application with more than 4 GPUs on Linux, and the issue is only on Windows.
Please provide the details of your system: CPU, GPU(s), SDK, driver, OS.
Motherboard: MSI Big Bang Marshal. CPU: Intel Core i5-2400, 3.1 GHz, 2nd generation. OS: Windows 7 Ultimate 64-bit. Drivers: Catalyst 11.3, 11.4, 11.5, and even 11.6, with APP SDK 2.3 to 2.4.
4x MSI R6970 graphics cards.
2x ASUS EAH5870 graphics cards.
1x Sapphire HD 5870.
I connected all 7 cards with extender cables, like many others: slots 1, 2, 4, 5, 6 and 8 with 1x extenders, and slots 3 and 7 with 16x extenders.
The same setup works fine in Linux, but not in Windows.
The miners work even at 1x, and they don't need much memory or bandwidth.
More than 5000 Bitcoin miners use AMD/ATI cards only, as they give better integer performance than Nvidia's CUDA.
If you look at this site, it will help a lot.
http://forum.bitcoin.org/index.php?topic=19038
Thank you,
Dishwara
Any workaround for Windows to work with 8 GPUs in OpenCL?
I also confirm the demand for support of more than 8 GPUs in Linux, or at the very least 8 GPUs under Windows as well.
I don't know if it is overly hard to implement, but I doubt you would need to change a huge pile of source code for that.
Support for more than 4 GPUs in Windows is needed.
It seems that only in Linux are more than 4 GPUs supported for OpenCL programming with AMD APP. Windows also needs support for more than 4 GPUs, at least 8, for OpenCL programming using AMD APP.
Originally posted by: dishwara Support for more than 4 GPUs in Windows is needed.
It seems that only in Linux are more than 4 GPUs supported for OpenCL programming with AMD APP. Windows also needs support for more than 4 GPUs, at least 8, for OpenCL programming using AMD APP.
Let me correct this request: support an arbitrary number of GPUs using AMD APP. This is the 21st century; let's not hardcode the number of devices supported by the API or drivers.
Here is my list:
1. Global variables. It's a pain to pass parameters down from the kernel function to every function that needs them.
2. Function Pointers.
3. Better support for complex data structures. Unified Address Space.
4. Templates.
5. Something similar to CUDA warp Vote functions. Useful for doing custom scheduling.
6. Population count instruction. Can be used to implement a blazingly fast warp-wise prefix sum for binary digits. Algorithms like radix sort would benefit greatly. Also useful for doing custom scheduling.
7. Multisampled texture support for OpenCL. Useful for Compute based deferred rendering. This can already be done in DirectCompute.
Debdatta Basu.
Originally posted by: debdatta.basu Here is my list:
4. Templates.
5. Something similar to CUDA warp Vote functions. Useful for doing custom scheduling.
I too vote for templates and warp voting functions. Currently I have to implement things like __all() and __any() via __local arrays and parallel reduction.
What about saying something about simultaneous writes to the same memory position,
like the CUDA specification does: "When a warp makes multiple writes to the same memory location, at least one of the writes will succeed. Which one that is, is undefined."
It has never occurred to me that I would want to rely on something like that. And I would discourage anyone from using methods like this. Putting statements such as:
if(get_local_id(0) == 0) {...}
is surely slower than letting all of the threads write the same variable, knowing that one of them will succeed. However, code that "accidentally" works on one GPU is not portable code and most likely does not work on other vendors' cards, not to mention CPUs.
OpenCL is there to create a general layer for doing parallel computation. Going around the specification is all right for short-term hacking, but should (and can!) be avoided at all times.
Expose the BFI_INT instruction
The existing BFI_INT instruction is very powerful and speeds up some GPGPU tools by 10-30%. However, developers have to use nasty hacks to make use of it, because it is not exposed at all in CAL or OpenCL! This is a blatant omission that could very easily be fixed by AMD. See how I hacked BFI_INT support into my whitepixel app: http://blog.zorinaq.com/?e=43