
himanshu_gautam
Grandmaster

Suggest Feature you want in AMD APP

Hi Everybody,

I am preparing a list of top feature requests for AMD's OpenCL implementation. I will be looking through as many old forum topics as I can, but there can always be fresh input from you, so I have created this thread.

It would be nice if you also mention some key advantages of the feature you request. Obviously we cannot guarantee that every request will be fulfilled in SDK 2.4, but the important requests will be added to the SDK roadmap and most probably implemented at a point in time AMD considers appropriate, as per its timelines and priorities.

I hope you will grow this thread fiercely.

Edit: Made post sticky.

0 Likes
167 Replies

Another suggestion would be a sort function for vector types, e.g. int4 a_sort = sort_int4(a), with a_sort.x <= a_sort.y <= a_sort.z <= a_sort.w. It would be useful for kernels where only a small number of values (up to 16) needs to be sorted. A low-level sort on vector types would be much faster than running a regular C sort algorithm over an array, especially in cases where the array cannot be kept in the private memory space for lack of room.
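
For illustration only: such a built-in could already be approximated by hand with a small compare-exchange network over min()/max() in OpenCL C; the request is essentially for a built-in that maps this to dedicated instructions. A minimal sketch (sort_int4 is the hypothetical name from the post, not an existing function):

int4 sort_int4(int4 v)
{
    int4 lo, hi;
    // stage 1: compare-exchange (x,y) and (z,w)
    lo = min(v, v.yxwz);  hi = max(v, v.yxwz);
    v = (int4)(lo.x, hi.y, lo.z, hi.w);
    // stage 2: compare-exchange (x,z) and (y,w)
    lo = min(v, v.zwxy);  hi = max(v, v.zwxy);
    v = (int4)(lo.x, lo.y, hi.z, hi.w);
    // stage 3: compare-exchange (y,z)
    lo = min(v, v.xzyw);  hi = max(v, v.xzyw);
    v = (int4)(v.x, lo.y, hi.z, v.w);
    return v;  // now v.x <= v.y <= v.z <= v.w
}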

 

And a question: what is the main reason why it is not allowed to access vector type values with indices, like int16 a; a[12] = 5;? Is this related to performance? Would the new GCN architecture (and consequently the way the compiler and runtime behave, i.e., hardware run-time instruction scheduling vs. software compile-time scheduling) make it viable?

0 Likes
k1942t
Journeyman III

New ISA-level instructions using the bank conflict detection hardware.

p: a uint pointer into LDS or GDS.

port_id: 0-31

 

 

bool lock_bankport(uint *p, int port)
{
    uint tid = get_id_in_wavefront();      // proposed new built-in
    __local uint bank_is_used = 0;
    bool ret = false;
    for (int i = 0; i < 4; i++) {
        if (16*i <= tid && tid <= 16*i + 15) {
            if (((bank_is_used >> port) & 1) == 0
                && is_first_tid_in_port(port)) // proposed built-in: returns true if tid is the
            {                                  // first thread id of the same port, using the
                                               // bank conflict detection hardware
                ret = true;
                atomic_or(&bank_is_used, 1 << port);
            }
        }
    }
    __barrier();
    __local uint bank_is_used_old;
    if (tid == 0) {
        bank_is_used_old = atomic_or(p, bank_is_used); // capture the previous lock bits
    }
    __barrier();
    if ((bank_is_used_old >> port) & 1) {
        ret = false;   // the port was already locked by another wavefront
    }
    return ret;        // the original "return false" here was evidently a typo
}

void unlock_bankport(uint *p, int port)
{
    uint tid = get_id_in_wavefront();
    __local uint bank_is_used = 0;
    atomic_or(&bank_is_used, 1 << port);
    __barrier();
    if (tid == 0) {
        atomic_and(p, ~bank_is_used);
    }
}

Usage:

__GDS uint sync_obj = 0;   // __GDS is the proposed storage qualifier

__kernel void test(double *dest, uint *index, double *value)
{
    uint gid = get_global_id(0);
    uint dest_index = index[gid];
    uint port = dest_index & 31;
    while (1) {
        if (lock_bankport(&sync_obj, port)) {
            dest[dest_index] += value[gid];
            unlock_bankport(&sync_obj, port);
            break;     // exit once the update has succeeded
        }
    }
}

0 Likes

laobrasuca,
a vector is a native data type, like an integer; it is not an array of scalar types, so it is not indexable. This is not related to performance but to the nature of the data type. Asking to index into a vector is no different from asking to index into a 32-bit integer: it makes no sense from a hardware perspective or a language perspective. If you want indexable data, use arrays.
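
For illustration, the distinction looks like this in OpenCL C:

int16 v;
v.sc = 5;      // component 12, selected at compile time via its fixed name (.s0 .. .sf)

int a[16];
int i = 12;
a[i] = 5;      // an array accepts a runtime index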
0 Likes
wateenellende
Journeyman III

1) Better support for C++ containers and algorithms. Nvidia has a project for this for their products, forgot the name. Things like parallel transform() and sort() would be nice.

2) Contribute to existing projects instead of adding your own libraries.

For the most part, the main problem I've encountered is that there are 20 libraries to choose from whenever you try to do a task, be it BLAS or FFT or something similar, and each forces you to commit to it, so it's hard to switch.

For the FFT, for example, it would be nicer if support for AMD stream processors were contributed to the FFTW project. (Yes, there are legal issues there...)

Likewise, why not contribute to the Eigen project instead of providing another BLAS library... http://eigen.tuxfamily.org/

 

0 Likes
notzed
Challenger

Add an option to the CPU driver to perform run-time memory address range validation.  Unlike in C, this should be `easy' to add, and accurate, because of the tight memory model.

Boy would that save some debugging time ...

 

0 Likes
arsdmthe
Journeyman III

The needed features are open drivers, so that OpenCL can be enabled on every OS (why not an OpenCL OS?).

Also, since AMD makes money with hardware and not software, why not make some mini-clusters with 4/8/12 Fusion chips for interesting OpenCL coding?

The price can stay inexpensive if AMD sells them itself!

0 Likes

Hey Guys,

Could you add BigInteger support to the APP? I, and no doubt more than a few other people, would love to be able to use arbitrarily-sized integers in OpenCL. I do not care that it may be a vendor-specific extension.

Here's what I am talking about, if you're curious: http://msdn.microsoft.com/en-us/library/system.numerics.biginteger.aspx

More than a few languages (Java, C#, etc.) have added support for BigInteger, and I have a strong feeling that I could make one of my programs much faster with this kind of support.

All the standard operations that programmers typically use (+, -, *, /, %, <<, >>, !, ^, &, |, etc.), and anything else you can think of, such as integer power and XOR.

Thanks,

-R

 

 

0 Likes

rossryan,

I am not sure how helpful BigInteger would be in OpenCL computing. The immutability property is a big performance-hitting feature.

notzed,

Can you explain your idea a bit more?

0 Likes

I'm not sure a variable-size struct is something that should be implemented.

You'd better rework your code to avoid using it.

0 Likes

himanshu.gautam,

It would be extremely helpful in genetic analysis (my personal interest), cryptography, and a number of other special applications that could really be sped up by throwing as many video cards into a server as possible (as opposed to leasing time on a grid).

0 Likes

Originally posted by: rossryan himanshu.gautam,

 

It would be extremely helpful in genetic analysis (my personal interest),

nice

cryptography, and a number of other special applications that could really be sped up by throwing as many video cards into a server as possible (as opposed to leasing time on a grid).

 

Well, forgive me, but I don't understand.

Crypto needs fixed algorithms to produce a correct result. Or do you want to break SSL live, without knowing the algorithm used, so that with your genetic knowledge you would infer the correct algorithm from the size of the strings used?

As for cluster work, forgive me again, but how is bigint related?

Maybe I'm stupid, but from what I know you will not gain speed, given the extra checking that bigint usage implies. So can you explain more? I dislike feeling stupid.

thanks

0 Likes

Me again!

About the no-X request: can we expect this for next year, or must we live with all the noise from the X-related things (CPU, GPU)?

No X is more work than bigint, but it would make a lot of people very happy to buy AMD.

0 Likes

arsdmthe,

For cryptography,

The implementation of asymmetrical cryptographic schemes often requires the use of numbers that are many times larger than the integer data types that are supported natively by the compiler. In this article, we give an introduction to the implementation of arithmetic operations involving large integers. We do not attempt to give a full coverage of this topic since it is both complex and lengthy. For a more detailed treatment, the reader is referred to the listed references.

http://www.codeproject.com/KB/cs/biginteger.aspx
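
For context, the building block such articles describe is schoolbook arithmetic over fixed-size machine words ("limbs"). A minimal sketch of multi-word addition in OpenCL C, where LIMBS and bigint_add are illustrative names only, not an existing API:

#define LIMBS 8  // e.g. 256-bit integers as 8 x 32-bit limbs, least significant first

// r = a + b; returns the carry out of the top limb
uint bigint_add(uint r[LIMBS], const uint a[LIMBS], const uint b[LIMBS])
{
    uint carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint s = a[i] + b[i];
        uint c1 = (uint)(s < a[i]);   // wrap-around means a carry was produced
        r[i] = s + carry;
        uint c2 = (uint)(r[i] < s);   // carry from adding the incoming carry bit
        carry = c1 | c2;              // at most one of the two can be set
    }
    return carry;
}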

 

As for genetic programming, you're dealing with a lot of data that does not "chunk" into Int64s nicely. For instance, if I am performing something trivial, like datamining a large database for an algorithm to describe the data contained within (modeling), the resulting model, based simply off of genetic programming, could contain 30 or 40 Int64s with all sorts of complex logic, where a single BigInt might suffice. It is ultimately aimed at reducing complexity, and at not having the mathematicians scratching their heads as they try to untangle the mess into a simple algorithm that can be published. No one wants to wade through 300 variables when 12 might suffice.

Now, in the case of genetic analysis of something like the human genome, chromosomes are quite large (when represented as data), which again can be split into many chunks; enough chunks that maintaining the relationships between chunks necessitates building a large system just for that singular purpose. The problems in genetic programming and genomic analysis can be similar.

So, the algorithms in the above cases may be 'fixed' for a given run, but the data certainly is not. As such, an "any-sized" BigInteger means less complexity and fewer headaches than "5 quintillion max" Int64s.

As for the grid comment: OpenCL can, with well-written software and enough video cards, provide somewhat comparable performance for a much smaller price. If I can order up a dozen servers with 5 6990s in each of them (somewhat unlikely, as these cards take up two slots and server motherboards aren't big on x16 slots, so perhaps 3 per server) and have my work done for a small fraction of the cost of renting a grid... well, think about it... and I can reuse my servers later for other projects, with only the cost of electricity and eventual upgrades to consider.

I'd like to go into more detail, but proprietary knowledge is proprietary knowledge. 

*edit: Spelling and clarification. New glasses, and the prescription is already off...

0 Likes

Originally posted by: rossryan arsdmthe,

 

For cryptography,



 

thanks

As for genetic programming, you're dealing with a lot of data that does not "chunk" into Int64s nicely. For instance, if I am performing something trivial, like datamining a large database for an algorithm to describe the data contained within (modeling), the resulting model, based simply off of genetic programming, could contain 30 or 40 Int64s with all sorts of complex logic, where a single BigInt might suffice. It is ultimately aimed at reducing complexity, and at not having the mathematicians scratching their heads as they try to untangle the mess into a simple algorithm that can be published. No one wants to wade through 300 variables when 12 might suffice.


Have you never tried to map your data to something other than ints?

GPUs are strong at image processing, so why not work with data mapped to images?

As for the grid comment: OpenCL can, with well-written software and enough video cards, provide somewhat comparable performance for a much smaller price. If I can order up a dozen servers with 5 6990s in each of them (somewhat unlikely, as these cards take up two slots and server motherboards aren't big on x16 slots, so perhaps 3 per server) and have my work done for a small fraction of the cost of renting a grid... well, think about it...


Well, AMD makes nicer chips than GPUs for doing OpenCL 😉

I suggested making small mini-clusters with 4/8/12 Fusion chips to make good use of OpenCL!

If AMD makes some, you will have no problem with your OpenCL grid 😉

but proprietary knowledge is proprietary knowledge.


I strongly disagree: knowledge should be free and open!



0 Likes
thesmileman
Journeyman III

I really, really need support for 16-bit single-channel signed and unsigned ints (GL_LUMINANCE16UI) in interop. I do not know why this isn't part of the spec, or at least mentioned as a suggestion, as we have been using 16-bit unsigned textures since the moment they were released. 16-bit single-channel values are a staple in a number of scientific tasks. Currently I have to rewrite our OpenGL code to use 4 channels for everything, with the performance cost you would expect.

NOTE: Please don't read the next comment as "company X does this, why don't you"; that is NOT the point of mentioning it, as explained below.

I don't like to mention other vendors when requesting a feature, but I wanted to use this as a reference point for a significant performance differential. I am noticing a significant performance boost from using GL_LUMINANCE16UI rather than GL_RGBA16UI. I will also note that the other vendor doesn't say they support this; I just happened to stumble upon it when I forgot to pad my OpenGL code into RGBA buffers.

Also, yes, I realise you can stack four single-channel buffers in an RGBA buffer; however, that is inconvenient and often not what we need to do.
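
For reference, the padded workaround described above looks roughly like this; a sketch only, assuming a GL-sharing OpenCL context already exists (clCreateFromGLTexture2D is the OpenCL 1.1 interop entry point):

// OpenGL side: single-channel 16-bit data padded out to a 4-channel integer texture
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA16UI, width, height, 0,
             GL_RGBA_INTEGER, GL_UNSIGNED_SHORT, padded_data);

// OpenCL side: share the texture (requires cl_khr_gl_sharing)
cl_int err;
cl_mem img = clCreateFromGLTexture2D(context, CL_MEM_READ_ONLY,
                                     GL_TEXTURE_2D, 0, tex, &err);

The request is for GL_LUMINANCE16UI to be accepted in place of GL_RGBA16UI here, without the padding.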

Also, I have a performance need for 16-bit integer atomics. I have found a number of algorithmic improvements to optimize our code using large arrays of 16-bit integer values. We can pad those values into integer arrays and the code works just fine on the GPU; however, we have to transfer this data to and from the CPU, so doubling its size creates additional overhead that we really don't want. To be clear, the performance benefit I am looking for here is in reducing the data transfer to and from the CPU; I am not expecting the actual atomic_inc calls to be faster.

Thanks,

Jim

0 Likes

Improved Register Reuse

I have some bandwidth-limited kernels that require larger data structures and therefore live quite on the edge with regard to register requirements. Currently the number of registers used increases with increasing kernel length, even for kernels of the form "res += do(...); res += do(...)", where the second invocation should be able to completely recycle the registers of the first. The 11.12 driver is actually a regression in this respect, increasing the scratch registers of my main kernel by 4 and thereby reducing that kernel's performance by more than 5%.
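
To make the pattern concrete, a hypothetical reduced shape of such a kernel (do_dir and the arithmetic are stand-ins, not the actual code, which is linked further down the thread):

float do_dir(__global const float *in, uint gid, uint dir)
{
    return in[gid] * (float)(dir + 1);   // stand-in for the real per-direction work
}

__kernel void accumulate(__global const float *in, __global float *out)
{
    uint gid = get_global_id(0);
    float res = 0.0f;
    res += do_dir(in, gid, 0);   // first invocation
    res += do_dir(in, gid, 1);   // nothing from the first call is live here, so its
                                 // registers should be fully recycled
    out[gid] = res;
}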

0 Likes

Dear Himanshu Gautam! Thank you for your answer, but my suggestion doesn't matter now. Thank you again, and excuse me for unnecessarily wasting your time here.

0 Likes

Marix,

A small testcase showing this behaviour would be very helpful.

szabi_h,

Can you please explain the suggestion? It seems related to user-defined data structures, but I am not able to understand it completely.

0 Likes

1. You did a good job implementing static C++ and templates... but we also need virtual methods. Without polymorphism it's hard to kill the C code-style 😄

2. Improve the SKA tool. I get tons of N/A, and the kernel precompilation is not intuitive (just COPY the excellent Intel CL IOC tool!).

3. Add a static-analysis tool so we can detect flaws in our kernels' code in a more visual way than profiling.

4. We're still waiting for an AMD equivalent of NVIDIA's Nsight visual debugger. Is the new gDEBugger ready? What I basically need is to inspect variables and place breakpoints inside VS2010, all visually.

5. We desperately need AMD equivalents of NVIDIA's CUDA cudpp and thrust libraries, with official support (not 3rd party; make them OFFICIAL): custom key-sort, reductions, parallel scan, random generators, etc., all optimized for your platform.

 

0 Likes

Originally posted by: himanshu.gautam Marix,

 

  A small testcase showing this behaviour would be very helpful.



Seems the forum somehow lost my lengthy post pointing to my testcase, so this time it's just a link: https://github.com/theMarix/Register-Optimization/blob/master/SoA.cl

There is some description of the problem in the file. Basically, what you can see is that with each additional direction I include in the kernel, register usage goes up.

Register Usage on AMD HD 5870 with Catalyst 11.11

0 Likes

Maybe the compiler does not optimize register usage once you restrict the work group size, since it then has more registers to use.

Also, these are the usages with SDK 2.6 and Catalyst 11.12:

dslash_eoprec_Cypress.isa                           MaxScratchRegsNeeded = 24   SQ_PGM_RESOURCES:NUM_GPRS = 62
dslash_eoprec_lim_group_size_Cypress.isa            MaxScratchRegsNeeded = 0    SQ_PGM_RESOURCES:NUM_GPRS = 72
dslash_eoprec_simplified_Cypress.isa                MaxScratchRegsNeeded = 0    SQ_PGM_RESOURCES:NUM_GPRS = 57
dslash_eoprec_simplified_loop_Cypress.isa           MaxScratchRegsNeeded = 281  SQ_PGM_RESOURCES:NUM_GPRS = 62
dslash_eoprec_simplified_loop_noret_Cypress.isa     MaxScratchRegsNeeded = 193  SQ_PGM_RESOURCES:NUM_GPRS = 62
dslash_eoprec_simplified_loop_nounroll_Cypress.isa  MaxScratchRegsNeeded = 281  SQ_PGM_RESOURCES:NUM_GPRS = 62
dslash_eoprec_simplified_loop_unrolled_Cypress.isa  MaxScratchRegsNeeded = 0    SQ_PGM_RESOURCES:NUM_GPRS = 62
dslash_eoprec_unified_Cypress.isa                   MaxScratchRegsNeeded = 0    SQ_PGM_RESOURCES:NUM_GPRS = 57
dslash_eoprec_unified_2dirs_Cypress.isa             MaxScratchRegsNeeded = 0    SQ_PGM_RESOURCES:NUM_GPRS = 53
dslash_eoprec_unified_3dirs_Cypress.isa             MaxScratchRegsNeeded = 0    SQ_PGM_RESOURCES:NUM_GPRS = 55
dslash_eoprec_1dir_Cypress.isa                      MaxScratchRegsNeeded = 0    SQ_PGM_RESOURCES:NUM_GPRS = 59
dslash_eoprec_2dirs_Cypress.isa                     MaxScratchRegsNeeded = 4    SQ_PGM_RESOURCES:NUM_GPRS = 62
dslash_eoprec_3dirs_Cypress.isa                     MaxScratchRegsNeeded = 9    SQ_PGM_RESOURCES:NUM_GPRS = 62

0 Likes

This is correct. I only included the limited-work-group-size variant for reference.

My issue is with the 3-direction and 4-direction variants using more registers than the 2-direction ones, as each direction is its own function that differs from the others only by some subtractions becoming additions and vice versa. Therefore the third direction should be able to reuse all the registers of the first.

Good to see that the unified kernels stay without scratch registers on 11.12. So I can finally upgrade my workstation.

0 Likes
realhet
Miniboss

I'd love to see CAL-OpenGL interoperability even on win32, or at least CAL+DX9 on win32.

0 Likes
yurtesen
Miniboss

I would like to be able to run OpenCL programs without running X at all, by directly accessing the GPU. Of course, I should still be able to run X+OpenCL as well, and with multiple cards I should be able to run card0 as X+OpenCL and card1 as OpenCL only...

0 Likes
allenwaye
Journeyman III

#include "BigIntegerLibrary.hh"

I have found a big-integer library in C++, and I have used it in my AMD Bolt app on CPUs; however, it doesn't work on the GPU, because OpenCL does not support BigInteger or custom user types.

I was trying to use the GPU to run the Lucas-Lehmer test for Mersenne prime numbers.
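
For context, the Lucas-Lehmer test itself is tiny: M_p = 2^p - 1 is prime iff s_(p-2) = 0, where s_0 = 4 and s_(k+1) = s_k^2 - 2 (mod M_p). A sketch in plain OpenCL C that works only for p <= 31, since s*s must fit in 64 bits; removing exactly that limit is what a big-integer type would buy:

// Returns 1 if 2^p - 1 is prime; valid for odd prime p with 3 <= p <= 31
int lucas_lehmer(uint p)
{
    ulong m = ((ulong)1 << p) - 1;     // the Mersenne number M_p
    ulong s = 4;
    for (uint i = 0; i < p - 2; i++)
        s = (s * s + m - 2) % m;       // subtract 2 mod m without underflow
    return s == 0;
}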

0 Likes
tugrul_512bit
Adept III

1) Is an SSE/AVX-to-OpenCL converter possible, at least for GCN compute units? Maybe a Cinebench kernel/context could be changed into an OpenCL one dynamically, so it can use the GPU automatically in a dynamic-parallelism manner (an option in Catalyst that forces SSE code into a GPU OpenCL thread so the CPU cores can do other jobs at the same time; PCIe 2.0 bandwidth could help on top of main-memory bandwidth this way?).

2) Could you add a converter that inspects an (OpenGL/DirectX) context for some milliseconds, watching the render/draw commands, and then converts it into an OpenCL rendering context? Could OpenCL rendering scale better than an OpenGL/DirectX version as the number of cores in a system increases? Is this worth trying? Maybe not future-proof? Insanely hard? Thanks.

0 Likes
michaeltesch
Adept I

Native support for Fedora!  The guy who was maintaining the Catalyst packages for Fedora for the last few years (he was doing it without ATI hardware, even!) has stopped maintaining them as of Fedora 20.

This is a really bad (and lame) situation for scientific users. Fedora is way ahead of Ubuntu wrt many packages that I use for research, so it is my preferred computing platform. But with the Catalyst package no longer being maintained, either I have to dig up an NVidia card somewhere or do my HPC on a different platform.

It can't be that hard to learn how to make .rpm's, can it?

0 Likes


michaeltesch wrote:

Native support for Fedora!  The guy who was maintaining the Catalyst packages for Fedora for the last few years (he was doing it without ATI hardware, even!) has stopped maintaining them as of Fedora 20.

This is a really bad (and lame) situation for scientific users. Fedora is way ahead of Ubuntu wrt many packages that I use for research, so it is my preferred computing platform. But with the Catalyst package no longer being maintained, either I have to dig up an NVidia card somewhere or do my HPC on a different platform.

It can't be that hard to learn how to make .rpm's, can it?


You can just install the AMD Catalyst drivers manually if you need to. There is no need for RPMs; just download from AMD's site and install. It is Fedora's responsibility whether or not they want to make driver RPMs.

That said, Fedora is an experimental OS and should not be used for production installations anyway. I used Fedora for several years, and every update constantly broke one or two things (and different things every time). Graphics drivers especially, for both AMD and Nvidia, got broken regularly at some point. We had machines becoming crippled too often by Fedora's random updating schemes. We then switched to Ubuntu, and now things are working much more smoothly.

There is a line between having the newest stuff lying around uselessly and having reasonably new stuff with a still-stable system. I prefer a somewhat stable system that works, instead of being part of an unstable experiment.

0 Likes
christianbusch
Journeyman III

Please implement a way to inspect LDS while debugging. My current work-around (a.k.a. hack) is reading the values of the LDS into variables. This is neither beautiful nor strictly correct, as I introduce new operations. Not being forced to go this way for debugging would be a big plus for me.

0 Likes
jross
Adept I

1) Open source your free software that doesn't generate any corporate revenue.  Provide a mechanism for developers to contribute.

2) See #1.

3) Provide a mechanism to access the cool features in the recently launched Kaveri APU.  Apparently it has a cool hardware feature called "HUMA" that has been talked about for years, that other companies have tried to emulate in hardware, and that could enable an entire new set of applications for GPU acceleration.  Unfortunately, the software guys didn't get the memo.  Also, there's something called a "TrueAudio DSP" that is allegedly programmable.  The software guys missed that memo too, and there's no way to program it through OpenCL.  But first, see #1 and #2.

0 Likes
ferdysan
Journeyman III

Hi. My girlfriend bought an ASUS PC with a 4-core AMD APU and an HD 7600 graphics card, but she uses it heavily (one might say mercilessly) for video conversion. She and I searched for a very long time for software that uses the power of the HD 7600's GPU, but everywhere we found software that uses CUDA exclusively. We found only a handful of programs that use AMD, and which version of your SDK applies depends on the version and model of the graphics card. So, could you create something that is universally compatible, or something that can emulate CUDA software? You could say this is a diabolical suggestion, but if you can't beat your enemy, become his ally; who knows what you might gain? Thanks.

0 Likes


ferdysan wrote:

Hi. My girlfriend bought an ASUS PC with a 4-core AMD APU and an HD 7600 graphics card, but she uses it heavily (one might say mercilessly) for video conversion. She and I searched for a very long time for software that uses the power of the HD 7600's GPU, but everywhere we found software that uses CUDA exclusively. We found only a handful of programs that use AMD, and which version of your SDK applies depends on the version and model of the graphics card. So, could you create something that is universally compatible, or something that can emulate CUDA software? You could say this is a diabolical suggestion, but if you can't beat your enemy, become his ally; who knows what you might gain? Thanks.


It would be much better for Nvidia to make CUDA programs compatible with OpenCL devices, not the other way around. CUDA only supports Nvidia hardware and is a dying technology. All the other companies (from Intel and the FPGA manufacturers to ARM, etc.) support OpenCL, and even Nvidia devices support OpenCL. Many developers are switching from CUDA to OpenCL nowadays. So...

0 Likes

CUDA is dying, but all too slowly. Hoping they get their act together on the OpenCL front however.

0 Likes


Meteorhead wrote:

CUDA is dying, but all too slowly. Hoping they get their act together on the OpenCL front however.


CUDA won't die, just like Fortran won't die.  Some organizations refuse to rewrite code, or the original author is gone and nobody else can change it.  Also, let's look at the alternative...

OpenCL has already introduced and deprecated API calls between 1.0 and 1.2 (see clSetCommandQueueProperty/clEnqueueMarker/clEnqueueWaitForEvents/clEnqueueBarrier).  Khronos also gave us calls for specific cases (clEnqueueTask) instead of just using the general case (clEnqueueNDRangeKernel), while at the same time giving us general cases (clCreateImage) and deprecating the specific cases (clCreateImage2D/clCreateImage3D).  And they even gave us a standardized method for loading non-standard API calls (clGetExtensionFunctionAddressForPlatform).  That last one makes no sense.
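
To illustrate that churn, the image-creation change looks like this (a sketch; ctx, fmt, width, and height are assumed to exist):

/* OpenCL 1.0/1.1, deprecated in 1.2: */
cl_int err;
cl_mem img = clCreateImage2D(ctx, CL_MEM_READ_ONLY, &fmt,
                             width, height, 0, NULL, &err);

/* OpenCL 1.2 replacement, one general entry point: */
cl_image_desc desc = {0};
desc.image_type   = CL_MEM_OBJECT_IMAGE2D;
desc.image_width  = width;
desc.image_height = height;
cl_mem img2 = clCreateImage(ctx, CL_MEM_READ_ONLY, &fmt, &desc, NULL, &err);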

In the rush to get OpenCL 1.0 out the door in 2009 without any working implementation, Khronos also introduced calls that were missing required arguments (see clUnloadCompiler/clGetExtensionFunctionAddress).  I believe it's a mistake to launch an API without an implementation of it.  There ought to be an implementation of the proposed OpenCL 2.0 before Khronos commits to it.

The OpenCL C kernel language is also built specifically for the GPUs that existed in 2009.  The memory organization has also not been thought out well enough to separate access or visibility of data from the memory location or hierarchy.  What about a memory architecture like the Adapteva Epiphany multi-core, which has "local" scratchpad memory that can be accessed globally by the other cores?  What about an architecture like the AMD Kaveri, where a pointer is a real pointer?  Why do we need to pass around opaque buffer objects instead of pointers?

The kernel language also broke C where it never needed to break the language.  It broke it by introducing non-standard initialization (e.g. float4 foo = (float4)(1.0f, 2.0f, 3.0f, 4.0f);) and vector swizzling (e.g. float4 foo, bar; foo = bar.yzwx;).  This then required implementations to modify a compiler in order to meet the standard for these two stupid non-standard changes to the language.  Then Khronos effectively chose LLVM and Clang (academic projects out of the U. of Illinois) as the winning compiler tools by introducing SPIR.  This copyleft-incompatible license effectively guaranteed that no device-specific compiler code would become open source (the way the device architectures GCC supports are), and that the industry would be dependent upon hardware manufacturers for software updates and bug fixes.

The Khronos ICD is also really dumb.  They should have just used dlopen, LoadLibrary, etc.  Instead, they created a system that effectively locks out third-party/open-source OpenCL implementations unless they've paid the Khronos membership fee to get access to the OpenCL call table.  Almost nothing about OpenCL is actually "open".  The standard is controlled by a small group of people and organizations, and this call for community input might be the first time it has happened.

Instead of introducing a low-level API that gave the bare minimum needed to access devices, like a driver, Khronos gave us middleware that doesn't actually do a great job of being middleware.  The 2.0 standard is incredibly bloated.  The "OpenCL 2.0 Quick Reference Card" is something like 12 pages, and the entire specification is something like 700 pages across a few documents.  Compare this to any other programming API that wasn't designed by committee.  OpenCL has also introduced a third memory type (pipes), which looks like Brook+ streams, in addition to buffers and (GPU-specific) images (texture memory).  All this complexity limits its use to more specialized programmers.  By comparison, all the excess in OpenCL makes CUDA appear well thought out and efficient.  Nvidia knows it, and CUDA won't be going anywhere soon.

0 Likes

CUDA being kept alive by legacy code... that's a real argument for choosing CUDA over alternatives.

While I agree with most of your arguments, namely that the evolution of the standard is not always reasonable and that OpenCL 2.0 is overwhelming, some things need correction. Kaveri-like architectures will be able to do just what you said, namely through SVM allocation (clSVMAlloc). The pointer returned by that function can be set as a kernel argument and it will preserve its actual value in the kernel call.
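
A minimal sketch of that flow with the OpenCL 2.0 API, assuming a device with fine-grained buffer SVM (ctx, queue, kernel, and n are assumed to exist):

float *p = (float *)clSVMAlloc(ctx,
        CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
        n * sizeof(float), 0);
for (size_t i = 0; i < n; i++)
    p[i] = 0.0f;                          // host writes through the same pointer
clSetKernelArgSVMPointer(kernel, 0, p);   // the kernel sees the same address
/* ... enqueue the kernel, clFinish(queue) ... */
clSVMFree(ctx, p);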

Khronos not building on an existing dll-loading mechanism is about keeping OpenCL self-consistent. It is the same reason Qt does not use CMake: CMake is perfectly fine for building Qt apps, and could be used to build Qt itself, yet instead they use qmake. Why? So that they don't depend on a third party to get what they want.

The reason why CUDA will disappear is the same reason why Glide (made by 3dfx) disappeared: it was too specific. Now that the competition has caught up in HW capability (or more like surpassed NV), and alternative APIs are starting to become equally potent AND portable at the same time, CUDA is becoming less and less appealing. It is true that CUDA has its merits (C++ alone is a good starting point), but they consistently have a pricing policy that is outrageous, to say the least. Serious computing centers are thinking about buying ****loads of GTX cards instead of Teslas, because, true, the GTXes don't have ECC and DP is roughly twice as slow; so practically, if you buy twice as many GTX cards as you would Teslas and run everything twice, you get more computing power for the same amount of time and money. This is ridiculous.

NV also made the mistake of starting to support OpenCL and then giving up along the way. This sends a really bad message: "Buy NV, start using it, and then, if you are lucky, in two years' time it will still be supported." Either do or do not, but never 'sort of' do.

So yes, OpenCL has its flaws, C++ AMP has its flaws, AMD's OpenCL implementation definitely has its flaws, and most of the aforementioned even have MAJOR flaws, but still... I see more future in them than in CUDA. History does repeat itself. Anyone sensible starting a new project does not invest in CUDA, but in something that stands on multiple pillars, not just one company.

0 Likes

I think it is much more difficult to support so many different types of hardware, and perhaps that is why Nvidia is able to produce a more consistent feature set. Maybe OpenCL is not 100% open either, but CUDA is even more closed and proprietary. If, for example, AMD were to stop producing GPGPU products, I could just keep on using another company's products; if I used CUDA and Nvidia went bust, I would be screwed. In the long run it is always bad for anyone to choose monopoly over competition, even if one way or another it looks like a wise decision at the time. Also, I do not like that Nvidia is not supporting OpenCL, just to force people to use CUDA and avoid competition. I think this is very anti-competitive, and hopefully people will realize it and make the right choice.

0 Likes
szabi_h
Journeyman III

I would like to have functions that return the device (and platform) index of a specified device in multi-device environments:

GetAPUDeviceIndex() function result: the device index of the APU.

GetAPUPlatformIndex() function result: the platform index of the APU.

GetDefaultGPUDeviceIndex() function result: the index of the device that is connected to the monitor (generally the graphics system uses this device by default, and sometimes it is not index 0).

GetDefaultGPUPlatformIndex() function result: the index of the platform that has a device connected to the monitor (generally the graphics system uses this device by default, and sometimes it is not index 0).
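
For what it's worth, the APU case can be approximated with the existing API, using CL_DEVICE_HOST_UNIFIED_MEMORY (OpenCL 1.1) as a heuristic for "this GPU shares memory with the host"; there is, however, no portable query for which device drives the monitor. A sketch (find_apu_device_index is an illustrative name):

#include <CL/cl.h>

/* Returns the index of the first GPU on the platform that shares memory
   with the host (heuristically, the APU), or -1 if none is found. */
int find_apu_device_index(cl_platform_id platform)
{
    cl_device_id devices[16];
    cl_uint n = 0;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 16, devices, &n) != CL_SUCCESS)
        return -1;
    for (cl_uint i = 0; i < n; i++) {
        cl_bool unified = CL_FALSE;
        clGetDeviceInfo(devices[i], CL_DEVICE_HOST_UNIFIED_MEMORY,
                        sizeof(unified), &unified, NULL);
        if (unified)
            return (int)i;
    }
    return -1;
}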

0 Likes
metrix
Journeyman III

Documentation, easier installation:  I have been trying to get the Aparapi examples working on Linux and have run into one issue after another trying to get them working.

HSA GPGPU examples that can be used in the real world; examples where HSA could speed up large swaths of software with a few code changes, so the world can see what GPGPU can do.  Would arrays and/or hashes in Ruby/Python be a good place to start? Sorting in SQLite?  I would like to see examples that could be implemented further up the software stack, to help the common programmer without them needing to know/understand OpenCL.

0 Likes
aazmp
Journeyman III

We need older OpenCL compilers to remain available.

Maybe via an option to clBuildProgram(), or maybe as standalone compiler applications. But there should be a wide choice of which one to use (for example, at this moment there hasn't been a good compiler for over a year).

Also, inline ISA would be nice.

0 Likes

Inline assembly, to get full speed on the GCN machines.