himanshu_gautam · ‎02-02-2011

Suggest Feature you want in AMD APP

Hi EveryBody,

I am preparing a list of top feature requests for AMD's OpenCL implementation. I will be looking through as many old forum topics as I can, but there can always be fresh input from you, so I have created this thread.

It would be nice if you also mention some key advantages of each feature. Obviously we cannot guarantee that every request will be fulfilled in SDK 2.4. But the important requests will be added to the SDK roadmap and most probably implemented at a time AMD considers appropriate given timelines and priorities.

I hope you will grow it fiercely.

Edit: Made post sticky.

Meteorhead · ‎02-02-2011

Free fglrx from the clutches of Xserver. (It might be driver related)

At the moment the Linux driver is very tightly coupled to the X server, but HPC applications would welcome it if the driver could be loaded without a graphical interface having to run in the background. I am only mentioning this because I would like to integrate OpenCL into a system which is SLC based (Scientific Linux CERN), and this OS is a MUST. It is a minimalistic Red Hat distro with many useful libraries included for scientific use, but the GUI is extremely unstable, and the monolithic grid infrastructure features worker nodes with GUI-less SLC. It would simplify things if drivers could be loaded the way they are for NV cards. This would also free VRAM, since the desktop wouldn't occupy memory, and desktop rendering wouldn't hinder the default driver. ATM the default adapter (used for desktop rendering) has to be AMD, otherwise fglrx fails to detect AMD GPUs.

Meteorhead · ‎02-02-2011

Proper 5970/6990 support functioning as two independent devices.

It is clear that the differing strategies of monolithic die vs. sweet spot predestine AMD to create dual-GPU solutions if it wishes to compete in the GPGPU market and the high-end section of gaming graphics. I really like the fact that AMD does not differentiate GPUs artificially by crippling GPU capabilities in order to sell the expensive versions for HPC computing (the way GTX and Tesla differ, most significantly in DP performance). However, given that in each generation the flagship on the AMD front is a dual-GPU solution, it would be nice to create proper software support for these pieces of hardware.

If one has two single-GPU cards connected via CrossFire, it can be disabled through software and both cards can be utilized properly. In my mind this can only be impossible in the case of 5970 cards if something was hardwired into the HW that prevents this software disabling of CF. Although the 6990 is not out yet, I very much hope AMD did not design it the same way the 5970 was done in this respect. I do not know how much extra performance can be squeezed out by hardwiring things, but whatever is done that differs from a regular CF connector is not worth all the aggravation that comes when creating software support. Gaming performance of these cards is largely determined by game engine design and not by hardware assistance. I very much hope the 6990 will be free of the 5970's illnesses, but if not, 7xxx MUST start right off with full support (IMHO) if AMD wishes to participate in the HPC market.

nou · ‎02-02-2011

documentation for cl_amd_popcnt

full DP support with cl_khr_fp64.

expose GDS counters. so we can have very fast append buffers and other goodies from fast global atomics.

and make this thread sticky on top.

Meteorhead · ‎02-02-2011

experimental C++ features as extension (templates and classes most importantly)

These features can be reused when OCL standard shifts from C99 to C++. It would be nice if AMD could take the lead in software support, even if by this little. We could start working ahead too.

nomac · ‎02-02-2011

A list of dependencies containing kernel and library versions instead of just 3 major distributions in their current version.

This would ease deployment in many cluster environments where only custom and seldom-updated distributions are available.

ibird · ‎02-02-2011

Reduce latencies on kernel launching and synchronization points.

bubu · ‎02-02-2011

Pls, excuse my English ... and there are 17 suggestions for you from our team... Some are ATI-specific but others are pretty generic, I hope you don't mind. I have more but I really don't want to make the post too large and difficult to read...

0. Make your LLVM JIT compiler open source. People could then detect bugs or suggest optimizations more easily. Also, please tell Khronos to make the ICD trampoline's source code public.

1. Get rid of the console window on clBuildProgram. Popping up a console window simply does not look very professional.... although it's really just an aesthetic problem.

2. Solve the N/A in the SKA tool for large kernels. Without this, I simply cannot optimize properly. Our kernels ( we're using more than 3k lines ) always emit N/As. Very frustrating.

3. Modify the SKA tool to save precompiled kernels as Intel's OIC tool does. This is good for reducing kernel loading times, for better control over optimizations/register pressure/compatibility across drivers, and for protecting your source code.

This is very important because if you ask some commercial apps' developers why they use CUDA they may say "We simply don't feel comfortable giving our kernel's source code with the app".

I personally like CUDA because kernel precompilation makes my code much more resistant and optimized across drivers ( imagine: FW 190 CL JIT compiler outputs 16 regs... but FW260 outputs 18 regs ruining my occupancy computations! That happens a lot btw )

4. Create a static code analyzer with two primary targets: to detect bugs/buffer overruns,etc (like cppcheck does with C++ code) and to detect performance problems ( for instance, emitting a warning message in case the code is not coalescing correctly the global memory ).

5. Make a visual debugger like NVIDIA's Nsight ( although I would make it a portable standalone app instead ). I personally think that printf is not enough for debugging effectively. I'm interested in placing breakpoints, inspecting variables and looking at the call stack. It would also be interesting to implement a "run to cursor" to skip several lines or to execute previous lines of code. Another problem with printf is that we use the CPU to debug but can get slightly different results from the GPU.

6. Allow us to use OpenCL from a Windows Service without having to log in, as NVIDIA Tesla's drivers do. This is critical for computing clusters: the server guys simply don't like to log in to the 10000 nodes ... Seriously, we need a way in Windows to use OpenCL without having to be logged in as a user.

For Linux, pls allow us to call OpenCL without having XWindows running ( although we could lose the GL interop ).

7. Improve compatibility with Unix/Linux distros. I have the impression you prioritize Windows/Mac, but those OSes' presence in the HPC world is very small. Just see the official Top500 Supercomputer 2010 list...

http://www.top500.org/charts/list/36/osfam

( Windozeds, pls don't enter there unless you've a strong heart! )

On the other hand, a BSD or Solaris port wouldn't hurt either, because there are a lot of databases and server apps that could experiment with OpenCL ...

8. Contact the most popular GPGPU apps/libs developers and offer them complete support and hardware. We need some practical examples / success stories / programs using GPGPU. Seriously, you need to be much more aggressive. Much more. That should include not only commercial projects but also popular free/open source ones. Just supporting a bunch of popular apps/libs will become a direct sales increase for you.

Did I say "GPGPU app store"?. Nope, but to send a few emails asking "What you need to use OpenCL and to improve your performance and compatibility with ATI cards/AMD Cpus for your app XXXX?" to a bunch of key programmers can be very simple and effective.

9. Invest in training/learning. Improve the docs. Write books like NVIDIA's GPU Gems series and make world conferences like NVIDIA's GTC 2010. Btw, I also think you should include some of these PDF docs in the SDK directly instead of linking to your SDK docs's web page.

Also, add new examples ( which are always welcome ). For instance, I would like to see a simple and clear radix sort of integer key/value pairs.

Another interesting one: sort a transparent triangle soup and show it using GL/DX interop ( something like you did in your DX OIT demo, but much simplified ).

Other one: a simple triangle rasterizer.

10. We need some extensions or changes for the next CL spec: mipmapping, cube maps, images in structs/arrays of images of different size, C++/template support, function overloading and recursion, function pointers, dynamic memory allocation(yes, calloc/malloc/realloc inside kernels), explicit control of the memory caches(I want to specify manually what/when should be cached and what/when not), concurrent kernels/hyperthreading, device fission for GPUs, etc...

11. We need more **official** higher-level libraries like BLAS, FFT, reductions, qsort, STL containers, MD5/SHA1, etc... Making these libs open source will be fantastic so the people could improve them and find bugs.

Why I said "official". Well... of course these libs could be developed by a 3rd party... but will be fantastic to get them directly from you, optimized and tested for your specific platform, ready to use.

12. We need a way to control the GPU's priority and to abort time-consuming kernels. Let me put an example: imagine the user can only use one GPU, the one attached to the monitor. Imagine the GPU task is very intensive so the OS's GUI gonna lag a lot...

Almost all the artists I know want to continue using their computer while the render is computed in the background. If a kernel takes all the GPU resources they won't be able to continue working until the CL task is done.

Perhaps you could add an extension to pass a value from 0 to 100 to a clXEnqueueNDRange() function to control the kernel execution's priority.

On the other hand, for a very time-consuming task that could not be done progressively or using multiple passes, we'll need a reasonable way to abort it having the watchdog disabled.

The ideal could be to make the GPU to support preemptive (or similar) multitasking and to add to the Catalyst's CCC a task-manager like the one Windows uses... but, I know, this can be very hard to implement.

13. Please, support dual-GPU cards better. You should support properly your flagship GPUs omg!

14. Remove the environment variable limitations. Let the user allocate all the available memory if the app requires it ( the user knows what he's doing; an automatic limit does not ). Also, increase the maximum 1D image size to CUDA's levels ( 1D texture==2^27 elements max ) allowing us to use "jumbo" 1D-linear textures.

15. Implement clBuildProgram's optimization options: -cl-fast-relaxed-math, -cl-mad-enable, -cl-no-signed-zeros, etc... which ( I think ) are currently ignored. Also make sure your compiler is properly optimizing variables flagged const/restrict/register.

16. CL Wrappers. Wrapper writers would need also the CL functions "typedefed" as you did with GL. I think Pyrit's programmers were asking for this ages and ages...

*clock*

On the other hand, I would like see some of the SDK examples written in other languages different than C/C++. For example, in Java, C#, VB, python, Objective-C, etc...

17. Memory virtualization. One of the problems with GPGPU is the relatively small amount of RAM that the cards have. Perhaps a solution is to add a flag to clCreateBuffer/Image to specify that we want to use memory virtualization. That way we could use buffers much larger than the card can store in VRAM. You could page over the PCI bus, or do something similar to what the x86 architecture does.

As the programmer will mark the buffer manually with that flag, he will be aware of the performance problems that can occur due to the PCI-memory swapping.... but hey! if he wanted to use that, it's because he accepts trading some speed for using more memory than the card has.

And, now that I mention the word "virtualization" ( although in a completely different context ), you could also talk with Parallels, Xen, VMWare, etc... to define an optimal interface to use OpenCL across OSs.

Woah, seems I just wrote a book Zzzzzzzzzzz

Top ten suggestions win a Zacate APU MB as prize, right, Himanshu?

*present*

Meteorhead · ‎02-02-2011

Addition to GDS, GWS via a simple function call. Atomics can do the trick, but surely compiler gurus can create a better solution than me.

perhaad · ‎02-03-2011

+1 on the break away from X.....PLEASE

This would allow for infinitely easier usage of AMD GPUs on clusters by also removing the users owning the X server and such stuff. I understand that less money is made from GPU clusters than graphics but it would be really nice...

Ideally an AMD roll for Rocks clusters would be awesome

Another thing would be more accurate event handling....

laobrasuca · ‎02-03-2011

bubu, they've already worked the suggestion no 1 out

1. For me, improve OpenGL/OpenCL interoperability. As of now, a lot of delay is introduced in the create/acquire buffers step, especially when manipulating big buffers (> 256MB). Could we imagine OpenGL/OpenCL interoperability working like GLSL, meaning no need to acquire/release buffers all the time? It would be nice to have a render mode where, instead of calling vertex/geometry shaders, we would call an OpenCL kernel, with proper inputs and outputs specific to rendering. At this date, OpenCL is more flexible than shaders (?); maybe someday OpenGL/OpenCL interoperability will disappear and be replaced by improved shaders? It seems to me it's more of a conceptual matter than a technical difficulty.

2. Pointers to buffers (or similar thing). I want to be able to treat hundreds or thousands of buffers with a single kernel call. This is maybe not an SDK problem, but more generally an OpenCL spec problem. If you can fight for it on the OpenCL committee with Khronos I would send you flowers

3. Get rid of the high kernel launch latency on GPU devices. Unfortunately, version 2.3 has not completely addressed this point. I hope things can be improved. I think it is something related to the GPU hardware, as the CPU does not suffer from this problem. Maybe something related to power management?

leo

douglas125 · ‎02-03-2011

I agree with the above especially regarding the CL/GL interop improvement, high-level libraries (FFT, BLAS, etc) and full fp64 support.

I'd like to add the following:

- Possibility to use GPUs without attached monitors using Windows - I'm still unable to use the GPU unless I plug a monitor or a dummy plug with a radeon 4970, is there any update I'm not aware of?

- Image processing toolkit - Set of optimized functions to import popular formats like JPS, TIF etc and videos;

EDIT

One more thing based on a forum post:

- GPU reboot - Some way to reboot the GPU when the CL code freezes, just like Windows does with the TdrDelay registry setting. I have in mind a command to be issued by the application when the user presses some combination of keys.

nou · ‎02-03-2011

overlapping transfer/computation, so we can transfer one buffer and compute from another.

timchist · ‎11-22-2012

Hi nou.

> overlaping transfer/computation. so we can transfer one buffer and compute from other.

do you know whether this has already been implemented?

Thank you

subaruwrc · ‎05-08-2014

The lack of overlapping compute/full-duplex streaming has been a deal-breaker for us. We regularly deal with data/matrices that are too big to fit onto a single GPU. Unfortunately, attempting to 'stripe' and stream the data onto one or more GPUs is slower than simply waiting for a multi-threaded version executing on CPUs.

Another major problem has been the inability to request more than a small percentage of the total available VRAM. For example, our 6GB FirePro cards are limited to effectively ~2GB, which renders them essentially useless, especially for a $5K GPU.

The Aparapi project also has multiple issues with OpenCL that need addressing.

rick_weber · ‎06-25-2011

Originally posted by: laobrasuca bubu, they've already worked the suggestion no 1 out

2. Pointers to buffers (or similar thing). I want to be able to treat hundreds or thousands of buffers with a single kernel call. This is maybe not an SDK problem, but more generally an OpenCL spec problem. If you can fight for it on the OpenCL committee with Khronos I would send you flowers

You can already do this. Just allocate a massive buffer called heap, create your own malloc function that runs in kernels using global atomics, and do everything as integer offsets into this buffer. You can then store the integer offsets wherever and index into the heap array in later kernels.

Pointers are just integer offsets from address zero 😉
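A minimal sketch of the heap idea described above, written as plain C11 so it can be tried on a CPU. In a kernel the counter would live in global memory and be advanced with a global atomic such as atomic_add. The names heap_t, heap_init, heap_alloc, HEAP_WORDS and HEAP_NULL are hypothetical, and this simple version never frees anything.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define HEAP_WORDS 1024          /* hypothetical heap size, counted in floats */
#define HEAP_NULL  (-1)          /* returned when the request does not fit */

typedef struct {
    atomic_int next;             /* index of the first unused element */
    float      data[HEAP_WORDS]; /* the one big buffer ("heap") */
} heap_t;

/* Reset the allocator so that the whole heap is free. */
static void heap_init(heap_t *h) {
    assert(h != NULL);
    atomic_init(&h->next, 0);
}

/* Reserve `count` consecutive elements and return the offset of the first
 * one, or HEAP_NULL if they do not fit. The atomic fetch-and-add hands every
 * caller a distinct range even when many threads allocate at once. A failed
 * request still advances the counter, so this sketch assumes failures are
 * rare and the counter never wraps around. */
static int heap_alloc(heap_t *h, int count) {
    assert(h != NULL && count > 0);
    int offset = atomic_fetch_add(&h->next, count);
    if (offset > HEAP_WORDS - count) {
        return HEAP_NULL;
    }
    return offset;
}
```

The returned integer is what gets stored and passed around instead of a pointer; a later kernel reads the data as data[offset + i].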

laobrasuca · ‎06-25-2011

Originally posted by: rick.weber

You can already do this. Just allocate a massive buffer called heap, create your own malloc function that runs in kernels using global atomics, and do everything as integer offsets into this buffer. You can then store the integer offsets wherever and index into the heap array in later kernels.

Pointers are just integer offsets from address zero 😉

The point is that I need to have several different buffers; they are created/erased at will, and one of the reasons for that is to control the amount of VRAM used. So what you describe is not a solution for me. I really do need more flexibility in this sense. Is "OpenCL doesn't support pointer-to-pointer as an input parameter" more of a conceptual thing? I hope support will be added sometime soon.

rick_weber · ‎06-25-2011

It's not a conceptual thing, but a real limitation. I know AMD GPU hardware can remap the pointers passed to a kernel by a cl_mem object between kernel calls. You can verify this by taking the pointer in a kernel, casting it to an int and saving it, then casting back to a pointer in a later kernel. You will not get the correct results. I still don't quite see why my solution won't work other than it's ugly; to make a pointer of pointers, just malloc an array of ints, each of which indexes into the heap.
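The "pointer of pointers" suggestion can be sketched roughly as follows. Plain C stands in for kernel code here, and the names load_element, store_element, heap and table are made up for illustration. Each logical buffer is identified by the offset of its first element in one shared heap, and a table of such offsets plays the role of an array of pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Read element `elem` of logical buffer `buf`.
 * table[buf] holds the offset (in elements, not bytes) at which that
 * buffer starts inside the shared heap. */
static float load_element(const float *heap, const int *table,
                          int buf, int elem) {
    assert(heap != NULL && table != NULL);
    return heap[table[buf] + elem];
}

/* Write `value` to element `elem` of logical buffer `buf`. */
static void store_element(float *heap, const int *table,
                          int buf, int elem, float value) {
    assert(heap != NULL && table != NULL);
    heap[table[buf] + elem] = value;
}
```

Because the offsets are relative to the start of the heap, they stay valid even if the runtime places the heap at a different device address between kernel launches.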

laobrasuca · ‎06-26-2011

Originally posted by: rick.weber It's not a conceptual thing, but a real limitation. I know AMD GPU hardware can remap the pointers passed to a kernel by a cl_mem object between kernel calls. You can verify this by taking the pointer in a kernel, casting it to an int and saving it, then casting back to a pointer in a later kernel. You will not get the correct results.

so, can I create a bunch of cl_mem buffers (say, to store float values), create an int array on the host side with these addresses, pass it as a kernel argument and, inside the kernel, access the data using something like:

__kernel void test(__global int *clbuffer_addresses)
{
    // treat the stored int as an address and read element 9 of buffer 1
    float buffer_1_element_9 = *(__global float *)(clbuffer_addresses[1] + 9);

    // or, I don't know, with the offset given in bytes
    float buffer_1_element_9_bytes = *(__global float *)(clbuffer_addresses[1] + 9 * sizeof(float));
}

since clbuffer_addresses[1] represents the address of a buffer in the VRAM heap and 9 or 9*sizeof(float) the offset in this heap? Or things are more complicated than that? Does clbuffer_addresses[1] represent an address in the VRAM heap anyways? Or it's a pre-address that will be translated (at some point) to the actual address?

Originally posted by: rick.weber I still don't quite see why my solution won't work other than it's ugly; to make a pointer of pointers, just malloc an array of ints, each of which indexes into the heap.

From what I understood, your solution would require allocating a big, fixed space in VRAM, at least the maximum amount of space I would require. I'm not saying that this would not work, I'm just saying that it would consume more VRAM than I need in some circumstances. Sometimes I need hundreds of buffers and sometimes only a few, and that changes at runtime; I cannot predetermine the number of buffers required, as it depends on different parameters which change as the user interacts. Controlling VRAM usage is my priority, which is why I can't keep unused space in VRAM.

fpaboim · ‎02-12-2011

5. Make a visual debugger like NVIDIA's Nsight ( although I would make it a portable standalone app instead ). I personally think that printf is not enough for debugging effectively. I'm interested in placing breakpoints, inspecting variables and looking at the call stack. It would also be interesting to implement a "run to cursor" to skip several lines or to execute previous lines of code. Another problem with printf is that we use the CPU to debug but can get slightly different results from the GPU.

That and global sync and I'm good

morganritchie · ‎07-02-2011

Thanks for this informative piece! cheers..;)

fcorreia · ‎02-09-2011

Hi everyone,

I completely agree with everyone's suggestions, but I think that freeing fglrx from the Xserver should be a top priority.

For example, how will the APP SDK run on Fusion? Will AMD force every vendor to use Xserver? Now, with distributions (Ubuntu) switching to other display servers (Wayland), Xserver may stop being the standard.

Support for the standard cl_khr_fp64 should be preferred to the vendor cl_amd_fp64.

Overlapping computation with memory transfer is not only a must for the SDK but would also be an excellent feature for the standard.

Jawed · ‎02-11-2011

I'd like a compiler that doesn't prioritise speed of compilation over quality. Perhaps some kind of option?

What I see repeatedly is that the compiler produces absurd GPR allocations. Some of these are clearly cases where no global GPR allocation tuning is performed. The compiler "just gives up" because the code is "too long".

At some point you have to get out of the JIT mind set. Some of us compile a kernel once and then run the kernel for hours. Or days.

sarobi · ‎11-25-2014

Meteorhead wrote:

Free fglrx from the clutches of Xserver. (It might be driver related)

At the moment the Linux driver is very tightly coupled to the X server, but HPC applications would welcome it if the driver could be loaded without a graphical interface having to run in the background. I am only mentioning this because I would like to integrate OpenCL into a system which is SLC based (Scientific Linux CERN), and this OS is a MUST. It is a minimalistic Red Hat distro with many useful libraries included for scientific use, but the GUI is extremely unstable, and the monolithic grid infrastructure features worker nodes with GUI-less SLC. It would simplify things if drivers could be loaded the way they are for NV cards. This would also free VRAM, since the desktop wouldn't occupy memory, and desktop rendering wouldn't hinder the default driver. ATM the default adapter (used for desktop rendering) has to be AMD, otherwise fglrx fails to detect AMD GPUs.

I second that, it's on the top of my wish list.

d_a_a_ · ‎02-18-2011

Originally posted by: Meteorhead

Free fglrx from the clutches of Xserver.

Proper 5970/6990 support functioning as two independent devices

+1

Originally posted by: ibird

Reduce latencies on kernel launching and synchronization points.

+1

Originally posted by: bubu

0. Make your LLVM JIT compiler open source.

14. Remove the environment variable limitations.

+1

Originally posted by: douglas125

- Possibility to use GPUs without attached monitors

+1

Particularly, I would add the following:

- Fix the current bugs.

- It would be nice if the AMD developer team could release incremental bug-fix versions of the APP SDK (x.y.z?) as well as make alpha/beta/RC releases of upcoming SDKs available. In my opinion the three-month release schedule is too long for an immature and fast-evolving technology.

- What about releasing the whole APP SDK under a Free Software license and then opening its development? I think that the entire development process could be greatly accelerated by taking advantage of the power of the community.

emuller · ‎02-18-2011

1a) Persistent registers for long numerical solvers which must transfer control to the CPU from time to time.

--or--

1b) Device can trigger memory transfers to host, and events that the host can act on ... while the device remains in the solver loop.

2) Headless X-less support in linux drivers

3) Multi-GPU support in OpenCL ... until then I'm stuck with CAL/IL

4) Multi-GPU interleaving host<->device transfers with compute

5) GDS, double its size each generation.

6) Fusion device that can keep up with a 6970, but with a unified memory space with the CPU.

himanshu_gautam · ‎02-21-2011

Whoa

Now that is an amazing effort from the developer community waiting to use OpenCL for serious purposes. It is great to see such extensive and illustrative feedback. Thank you all for your time and effort.

settle · ‎06-24-2011

Originally posted by: d.a.a.
Originally posted by: Meteorhead

Free fglrx from the clutches of Xserver. Proper 5970/6990 support functioning as two independent devices
+1

-1

Proper 5970/6990 support functioning as one unified device, similar to how my Dell T5500 workstation with dual-socket Xeon E5630s shows up as a single device of type CL_DEVICE_TYPE_CPU

+2

settle · ‎06-27-2011

Originally posted by: settle
Originally posted by: d.a.a.
Originally posted by: Meteorhead

Free fglrx from the clutches of Xserver. Proper 5970/6990 support functioning as two independent devices
+1

-1

Proper 5970/6990 support functioning as one unified device, similar to how my Dell T5500 workstation with dual-socket Xeon E5630s shows up as a single device of type CL_DEVICE_TYPE_CPU

+2

Also, if and when AMD makes Opteron APUs, please make a multi-socket system visible as a single CPU device and a single GPU device. If you can do it for the CPUs then it seems not too far fetched to request this for the GPUs as well. Thanks!

island · ‎06-28-2011

Newbie here.

Assembler support for 5xxx/6xxx.

I've developed signal processing applications for another manufacturer's GPU chips. About half of the code, all of the important kernels, were written in assembler.

I've just dropped by to investigate AMD GPUs and see whether I can do something similar, but I am disappointed to learn that I can't get closer than IL (though still faintly hoping I've misunderstood). Without proper assembler support, I wouldn't even bother to try to use these processors.

dragonxi4amd · ‎02-23-2011

Command and control (C2)

At the moment OpenCL offers only the queue mechanism for C2.

Please, add the pipe mechanism (at least).

We have used both pipe and queue for command and control between processes and threads in multiprocessing/multi-core programming in CPU side. It has proved to be efficient - queue to put/get data and pipe to send/receive command and control messages.

At the moment the OpenCL approach is more for one CPU and one GPU.

Please, add more support for multiprocessing/multi-core programming.

We have found that while multiprocessing/threading improves multi-core usage it also improves the whole system performance and gives resources for more tasks - in OpenCL case it would boost CPU side performance.

himanshu_gautam · ‎02-23-2011

dragonxi4amd,

I guess this needs to go through Khronos first, so this post would be more beneficial on the Khronos forums.

laobrasuca · ‎02-23-2011

I have one more suggestion. It concerns the profiling plug-in for MS Visual Studio, more precisely the stream session list toolbar. It would be really nice if we could delete ONLY ONE of the listed sessions instead of ALL! Or it would be nice if we could select several sessions while holding the Ctrl key and erase all selected sessions. A second thing: I'd love to be able to change the name of a session. Session1, 2, 3... is not really easy to follow when you have lots of sessions listed. It would be nice to have something like a right-click on the session name that brings up a menu with rename, delete, save, ... By the way, a save option where we could group several sessions (using Ctrl, for example), with all of them saved in the same Excel-like file on different sheets, would be a nice touch too.

... not really a big deal, but would be very handy

Meteorhead · ‎03-01-2011

I will not be the heretic to copy-paste the feature-list of the new CUDA SDK 4.0, but let me post a link for those who are really curious.

CUDA 4.0 RC SDK

Some have been mentioned already, or things very similar, but anyhow, let me list the most useful ones that could be 'easily' implemented in OpenCL:

Share GPUs across multiple threads

Use all GPUs in the system concurrently from a single host thread

C++ new/delete and support for virtual functions

Thrust library of templated performance primitives such as sort, reduce, etc.

Unified Virtual Addressing

I do not know if Unified Virtual Addressing means access to another device's memory or allowing the creation of memory objects larger than VRAM, but I'll definitely find that out.

I know some features cannot be implemented because of limitations of the OpenCL C99 language, but most, I believe, are not impossible. These new functionalities alone are great leaps toward better usability, but all together... I'm sure we would all welcome similar great leaps for APP SDK 2.4.

I was surprised to see GWS was not amidst the new features, although Fermi is stated to be GWS-capable and it is an eagerly anticipated feature on the green side of the force also.

My intention with the post was not advertising, just to show that the pressure is great, and serious features are required to keep pace with the competitors (Intel with AVX instruction set support for CPU compilation).

Starglider · ‎03-01-2011

Originally posted by: Meteorhead I will not be the heretic to copy-paste the feature-list of the new CUDA SDK 4.0, but let me post a link for those who are really curious.

The direct GPU->GPU memcopy, without having to go through host memory is awesome. However this feature would be useless in OpenCL without having reliable, performant multi-GPU support first! This is yet more motivation to switch back to CUDA as the app I am working on would benefit significantly from GPU->GPU DMA.

barnescj · ‎03-01-2011

Is there any information (or rumors?) on what is currently planned for v2.4? Also, what is the intended release date?

Thanks

Chris

Meteorhead · ‎03-01-2011

Opening a topic for this single post is unnecessary, and forum moderators should feel free to delete it once read (and, hopefully, conveyed to the higher-ups it is aimed at).

This topic has shown really nicely how eagerly new features are anticipated and how badly people want to use OpenCL in the most varied fields of computing (from HPC research through video processing to game development). Current SDK developers do not have the resources to implement all these features in a reasonable amount of time, nor do they have the time to cope with even half of them. At the pace at which features are implemented at the moment (which is very impressive, I never would've thought OpenCL would develop this fast), it would take at least 2 years until all of this is introduced into the API.

Not to mention that some functionalities also require serious driver alterations. Anyhow, I would suggest reallocating (or hiring) developers to be able to satisfy the great interest in using AMD HW for serious stuff. I believe the main drawback of OCL compared to CUDA is robustness. OCL is not stable enough to safely develop enterprise software that is truly cross-vendor and cross-platform.

Anyhow, if you find this reply is too far away from the intention of this topic, feel free to moderate it out. (But conveying the idea would be nice)

Cheers,

Máté

nou · ‎03-02-2011

include output of CLInfo from all supported GPUs.

laobrasuca · ‎03-02-2011

Originally posted by: nou include output of CLInfo from all supported GPUs.

and correct the "bug" in the current CLInfo implementation, where, if the OpenCL 1.1 macro is defined, all OpenCL 1.0 platforms will crash because some info is not available. It should first verify the OpenCL version of each platform (when you have more than one) instead of assuming that all platforms have the same OpenCL version.
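The check described above could look roughly like this. The OpenCL specification defines the CL_PLATFORM_VERSION string as "OpenCL <major>.<minor> <platform-specific information>", so the version can be parsed per platform before any 1.1-only query is issued. The helper names parse_cl_version and platform_supports are hypothetical, and in a real tool the string would come from clGetPlatformInfo rather than a literal.

```c
#include <assert.h>
#include <stdio.h>

/* Parse a string of the form "OpenCL <major>.<minor> ...".
 * Returns 1 and fills *major / *minor on success, 0 otherwise. */
static int parse_cl_version(const char *s, int *major, int *minor) {
    assert(s != NULL && major != NULL && minor != NULL);
    return sscanf(s, "OpenCL %d.%d", major, minor) == 2;
}

/* Return 1 if the platform reporting `version_string` implements at least
 * OpenCL req_major.req_minor, 0 if it does not or the string is malformed. */
static int platform_supports(const char *version_string,
                             int req_major, int req_minor) {
    int major = 0, minor = 0;
    if (!parse_cl_version(version_string, &major, &minor)) {
        return 0;
    }
    return major > req_major ||
           (major == req_major && minor >= req_minor);
}
```

A tool like CLInfo would then call platform_supports(version, 1, 1) for each platform and skip the 1.1-only queries wherever it returns 0, instead of relying on a compile-time macro.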