Hi everybody,
I am preparing a list of top feature requests for AMD's OpenCL implementation. I will be looking through as many old forum topics as I can, but there can always be fresh input from you, so I have created this thread.
It would be nice if you also mentioned some key advantages of each feature. Obviously we cannot guarantee that every request will be fulfilled in SDK 2.4, but the important requests will be added to the SDK roadmap and most probably implemented at whatever point AMD considers appropriate as per timelines and priorities.
I hope you will grow it fiercely.
Edit: Made post sticky.
Originally posted by: laobrasuca Originally posted by: nou include output of CLInfo from all supported GPUs.
and correct the "bug" in the current CLInfo implementation, where, if the OpenCL 1.1 macro is defined, all OpenCL 1.0 platforms will crash because some info is not available. It should first verify the OpenCL version of each platform (when you have more than one) instead of assuming that all platforms have the same OpenCL version.
This has been fixed already. CLInfo will not crash on an OpenCL 1.0 platform; instead it will exit with an error message.
This has been fixed already. CLInfo will not crash on an OpenCL 1.0 platform; instead it will exit with an error message.
Yes, you are right, I used the wrong word, my bad. It will not crash; it will lead to an error message indeed. What I was trying to say is that this error can be avoided if you don't use the CL_VERSION_1_1 macro as the ultimate reference for the OpenCL version of all platforms. A runtime if/else on CL_PLATFORM_VERSION could avoid this, instead of the compile-time "#ifdef CL_VERSION_1_1".
Originally posted by: Starglider Originally posted by: Meteorhead I will not be the heretic to copy-paste the feature-list of the new CUDA SDK 4.0, but let me post a link for those who are really curious.
The direct GPU->GPU memcopy, without having to go through host memory, is awesome. However, this feature would be useless in OpenCL without reliable, performant multi-GPU support first! This is yet more motivation to switch back to CUDA, as the app I am working on would benefit significantly from GPU->GPU DMA.
It should actually be possible to do PCI->PCI DMA, so that one could integrate AMD GPUs with InfiniBand hardware like http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=116&menu_section=34#tab-one for doing zero-copy RDMA: http://www.google.com/search?q=infiniband+zero+copy+rdma
Thanks Lucas and xyke for the suggestions.
Originally posted by: Meteorhead
Share GPUs across multiple threads
Use all GPUs in the system concurrently from a single host thread
Aren't these features already supported by OpenCL?
mrbpix,
This is already being discussed among AMD developers. Thanks for reporting.
Aren't these features already supported by OpenCL?
Yes, that gives a certain meaning to the word "easily". It's certainly nice to see areas where OpenCL has pushed CUDA development rather than the other way around.
Add a watchdog enabled/disabled flag to the clGetDeviceInfo().
Originally posted by: LeeHowes
Aren't these features already supported by OpenCL?
Yes, that gives a certain meaning to the word "easily". It's certainly nice to see areas where OpenCL has pushed CUDA development rather than the other way around.
By "easily" I meant that it does not go against OCL specs (apart from C++ features). I do not know jack about making drivers threadsafe for using multiple gpu from single host thread, so I can only imagine how hard achieving such a feat is, just the same way I do not know how hard it is to redesign the linux driver.
I only know these things are not impossible, they are highly anticipated and NV has made them work. (Although many people anticipate proper OCL 1.1 support from driver side for NV cards)
No, I mean it's really in the spec. 1.1 guarantees thread safety on almost all of the API calls; 1.0 did not, though the AMD implementation has always been implemented that way. Both multiple threads for one GPU and multiple GPUs for one thread have always worked in the AMD OpenCL implementation and are now specified to work officially in OpenCL. Of course, that ignores limitations of the AMD implementation in working well with multiple GPUs, which I'm aware people have complained about, but I'm not aware of any issues with thread safety of the API.
I'm not sure why NVIDIA has been lagging a little in releasing official OpenCL 1.1 support; I'm sure it works well, though. I'm sure they will get there in the end, as you say.
And also certain OpenCL kernel optimizations like:
In order of importance:
- Reduce kernel launch times
- Remove X server requirement
- Make global atomics not force the complete path for every global memory access in the kernel
Originally posted by: SiegeLord
- Remove X server requirement
+1
Another feature I would really like to have:
clGetDeviceInfo should return the GPU name if invoked with CL_DEVICE_NAME instead of the chipset name.
The chipset name makes it impossible for end users to determine whether the application is using the device it's supposed to be using. In addition, the current implementation is buggy anyhow, e.g. returning Cypress on the 5970, where I think it should be Hemlock. This makes it difficult for the developer to estimate the device performance (especially as MAX_CLOCK also doesn't report correct values).
Hello,
this is not quite directly software related, but might be useful to a number of people if it were made possible:
some kind of web-based service allowing the testing of OpenCL code on AMD hardware.
As an example, I have easy access to reasonable Nvidia cards, as they come in all of my laboratory's workstations. Having developed a simulation code in OpenCL, I would like to be able to test it on Radeon/FireStream cards, but am not prepared to go out and buy a card without having an idea of the performance my code could reach.
Sorry if this is slightly off topic!
Regards,
Olivier
"Increase buffer size limits (so that we won't have to rely on experimental environment variables like GPU_INITIAL_HEAP_SIZE and GPU_MAX_HEAP_SIZE)"
+1
I think I already had highly varying performance values between application runs because of this.
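For anyone hitting this today, the interim workaround being referred to looks like the snippet below. These variables are experimental and undocumented, their exact semantics may change between SDK releases, and the values shown (in MB) are purely illustrative:

```shell
# Experimental AMD APP SDK environment variables (values are in MB).
# Undocumented and subject to change; must be set in the environment
# before the OpenCL application starts.
export GPU_INITIAL_HEAP_SIZE=512
export GPU_MAX_HEAP_SIZE=1024
echo "$GPU_INITIAL_HEAP_SIZE $GPU_MAX_HEAP_SIZE"
```

Raising the built-in buffer size limits would make this per-process environment fiddling unnecessary.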
edit:
oh yea,
Especially as it's already there for Windows: zero-copy on Linux! That should really speed up my halo exchange, finally making my code work with more than one computer.
dukelato,
Thanks for your suggestion.
Here it is: http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=149708&forumid=9
Have Fun
Adding my vote for the following:
cl_khr_fp64 on GPU
Zero-copy allocations for APUs on Linux
Removal of display connection requirement
Also, overlapping computation and data transfer by specifying them in two different queues would be great.
Discussion in this thread: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=149671
Adding one more point:
It is my understanding that out-of-order queues are not currently supported? Support for out-of-order queues would be great.
Adding yet another point: I would also like to see smaller compile times when building OpenCL kernels. If you are building lots of large kernels, the compile times can be a non-trivial overhead.
edit: Compared to OpenCL, CAL compilation is really fast. I guess that is to be expected, but it does reduce the applicability of OpenCL, as you can no longer use lots of tiny kernels when the compilation overhead itself becomes large.
Pretty good ideas!
It is useful for AMD fans!
okay, I'm totally going to break this lurking streak and contribute something useful for once
Multi-card support.
I have used 8 Nvidia GPUs (4x GTX 295) on my system! But I can only use 1 GPU on this system, even though I have plugged in two 5970s.
This makes me give up on AMD GPUs.
qilux,
Can you give details about your problem? Many people have been able to use all 4 GPU devices from two 5970s, although officially only two are supported at present.
himanshu.gautam,
Why the restriction in Windows of not being able to use more than 4 GPUs, while in Linux more than 4 GPUs, up to 8 or more, can be used?
Is AMD promoting Linux and dumping Windows?
Have you heard about Bitcoin?
Bitcoin miners use OpenCL to write programs, and because AMD APP blocks more than 4 GPUs in Windows, Windows users can't use more than 4 graphics cards. I myself have 3 5870s and 4 6970s, totalling 7 graphics cards. Windows detects all the cards without a problem, but once I start mining, the miner crashes.
The miners I tried are at the following links, with source and binaries.
http://forum.bitcoin.org/index.php?topic=1334.0
http://forum.bitcoin.org/index.php?topic=6458.0
Both Poclbm and Phoenix were written with OpenCL, and the Windows binaries of both miners can't be used with more than 4 GPUs in a PC.
Actually, both programs crash instantly after starting if more than 4 graphics cards are enabled in Device Manager.
The only way to run the software is to disable graphics cards in Device Manager.
If you look into this matter, it will be really helpful.
Also, whom should I contact to make the Windows version of OpenCL support more than 4 GPUs?
Thank you.
dishwara
dishwara,
This place is appropriate enough. You can also file a help ticket, but you will just meet me there again 🙂
So, as per your observations, you are able to run the bitcoin miners when using 4 or fewer GPUs, but they crash instantly when you try to run with more. Am I correct?
It also appears you are able to run the application with more than 4 GPUs on Linux, and the issue is only on Windows.
Please provide the details of your system: CPU, GPU(s), SDK, driver, OS.
Motherboard: MSI Big Bang Marshal. CPU: Intel Core i5-2400, 3.1 GHz, 2nd generation. OS: Windows 7 Ultimate 64-bit. Drivers: Catalyst 11.3, 11.4, 11.5, and even 11.6, with APP SDK 2.3 to 2.4.
4x MSI R6970 graphics cards.
2x ASUS EAH5870 graphics cards.
1x Sapphire HD 5870.
I connected all 7 cards with extender cables, like many others: slots 1, 2, 4, 5, 6 and 8 with 1x extenders, and slots 3 and 7 with 16x extenders.
The same setup works fine in Linux, but not in Windows.
The miners work even at 1x, and they don't need much memory or bandwidth.
More than 5000 Bitcoin miners use AMD/ATI cards only, as they give better integer performance than Nvidia's CUDA.
If you look at this site, it will help a lot.
http://forum.bitcoin.org/index.php?topic=19038
Thank you,
Dishwara
Any workaround for Windows to work with 8 GPUs in OpenCL?
I also confirm the demand for support of more than 8 GPUs in Linux, or at the very least 8 GPUs under Windows as well.
I don't know if it is overly hard to implement, but I doubt you would need to change a huge pile of source code for that.
Support for more than 4 GPUs in Windows is needed.
It seems that only in Linux are more than 4 GPUs supported for OpenCL programming with AMD APP. Windows also needs support for more than 4 GPUs, at least 8, for OpenCL programming using AMD APP.
Originally posted by: dishwara Support for more than 4 GPUs in Windows is needed.
It seems that only in Linux are more than 4 GPUs supported for OpenCL programming with AMD APP. Windows also needs support for more than 4 GPUs, at least 8, for OpenCL programming using AMD APP.
Let me correct this request: support an arbitrary number of GPUs using AMD APP. This is the 21st century; let's not hardcode the number of devices supported by the API or drivers.
Here is my list:
1. Global variables. It's a pain to pass parameters down from the kernel function to every function that needs them.
2. Function Pointers.
3. Better support for complex data structures. Unified Address Space.
4. Templates.
5. Something similar to CUDA warp Vote functions. Useful for doing custom scheduling.
6. Population count instruction. Can be used to implement a blazingly fast warp-wise prefix sum for binary digits. Algorithms like radix sort would benefit greatly. Also useful for doing custom scheduling.
7. Multisampled texture support for OpenCL. Useful for Compute based deferred rendering. This can already be done in DirectCompute.
Debdatta Basu.
Originally posted by: debdatta.basu Here is my list:
4. Templates.
5. Something similar to CUDA warp Vote functions. Useful for doing custom scheduling.
I too vote for templates and warp voting functions. Currently I have to implement things like __all() and __any() via __local arrays and parallel reduction.
What about saying something about simultaneous writes to the same memory position,
like the CUDA specification does: "When a warp makes multiple writes to the same memory location, at least one of the writes will succeed. Which one that is, is undefined."
It has never occurred to me that I would want to rely on something like that. And I would discourage anyone from using methods like this. Putting statements such as:
if(get_local_id(0) == 0) {...}
is surely slower than letting all of the threads write the same variable, knowing that one of them will succeed. However, code that "accidentally" works on one GPU is not portable code and most likely does not work on other vendors' cards, not to mention CPUs.
OpenCL is there to create a general layer for doing parallel computation. Going around the specification is all right for short-term hacking, but should (and can!) be avoided at all times.
Expose the BFI_INT instruction
The existing BFI_INT instruction is very powerful and speeds up some GPGPU tools by 10-30%. However, developers have to use nasty hacks to make use of it, because it is not exposed at all in CAL or OpenCL! This is a blatant omission that could very easily be fixed by AMD. See how I hacked BFI_INT support into my whitepixel app: http://blog.zorinaq.com/?e=43