himanshu_gautam · ‎02-02-2011

Suggest Feature you want in AMD APP

Hi everybody,

I am preparing a list of the top feature requests for AMD's OpenCL implementation. I will look through as many old forum topics as I can, but there can always be fresh input from you, so I have created this thread.

It would be nice if you also mentioned some key advantages of each feature. Obviously we cannot guarantee that every request will be fulfilled in SDK 2.4, but the important requests will be added to the SDK roadmap and most probably implemented at whatever point AMD considers appropriate given its timelines and priorities.

I hope you will grow it fiercely.

Edit: Made post sticky.

jross · ‎04-22-2011

Some of us don't mind long kernel compile times if the runtime takes more than a couple minutes. We would rather have optimized binaries. Some kind of -O2 or -O3 compiler option would be appreciated.

Edit: @MicahVillmow below, Awesome! I'll check it out.

MicahVillmow · ‎04-22-2011

jross,
I believe the current release accepts -O0 through -O3, but only -O3 is actually supported; -O1 and -O2 currently map to -O3. This will change in future releases.

Meteorhead · ‎04-22-2011

Yes, we really don't mind long compile times. In fact, people have reported (in SDK 2.2, so a long time ago) that if a kernel surpassed a certain length, the compiler seemed to give up optimizing GPR usage and started to use scratch registers excessively.

Since some applications run for 2-8 days, I really wouldn't mind even if it compiled for 1 minute, if that can produce a 5% speedup. Having more compile optimization levels is really useful (you might consider creating an "über" optimization level that completely disregards compile time).

Jawed · ‎04-23-2011

In my opinion, having worked with IL programming as well as OpenCL, some of the compilation problems/inefficiencies we encounter are purely in the compilation from IL to binary.

Do any of the compilation options apply to the IL->binary compiler? If not, they need to.

s58000 · ‎04-23-2011

concurrent kernel execution would be a huge improvement in my opinion.

MicahVillmow · ‎04-25-2011

Jawed,
We are taking that aspect into account when supporting these features. The IL->binary compiler was originally designed to only support one mode, graphics centric -O3, so it requires a redesign which is why it has not been exposed before.

bubu · ‎04-25-2011

About compile times: I understand that CL real-time compilation must be fast because the user should not have to wait too long. However, I think we could use a tool to precompile kernels into binary form. Look at Intel's OpenCL Offline Compiler tool, for example.

A very good thing about a kernel precompilation tool is that it could spend several days compiling if that is what it takes to get the most optimized result possible. Days, weeks, etc. That way you could test every possible register combination to find the optimal result, something like unbiased rendering.

Meteorhead · ‎05-25-2011

Pseudo double-precision as vendor-specific extension for GPUs that do not support DP. Testing (and running) 64-bit kernels would make life easier by a lot.

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=150943&enterthread=y

spectral · ‎06-08-2011

Contribution from AMD to libraries like clpp :

http://code.google.com/p/clpp/

😄

MicahVillmow · ‎06-25-2011

raghu,
No, this chip will never be supported. It is based on the RV620 chipset which does not have the required features to support OpenCL.

Atmapuri · ‎06-29-2011

Hi!

To reduce buffer overhead on Fusion and CPU implementations of OpenCL (which use common shared host memory), it would make sense to allow the programmer to bypass clCreateBuffer completely and pass a HOST_PTR directly to clSetKernelArg. Currently there is an overhead of several milliseconds associated with buffer handling (even for minute sizes). clSetKernelArg has an overhead of 1us. Are there special reasons why this should not be possible?

It is my understanding that buffers were introduced to handle split-memory hardware configurations. But when memory is shared, the buffer part of the OpenCL API (and its associated overhead) serves no purpose.

Thanks!
Atmapuri

Meteorhead · ‎06-29-2011

clCreateBuffer need only be executed once per application run; after that it is only a question of copying and mapping. A few milliseconds is acceptable in my opinion. clSetKernelArg at 1us is as good as it can get, as it has to reinterpret a pointer. Even if it is on the same physical memory, kernels use different addresses (which might even be virtual), and thus such calls have to go through the API (I believe for at least a dozen other reasons too). 1us is about the time of 1 clock tick of a CPU, and that's about as fast as it can get.

Atmapuri · ‎06-29-2011

Originally posted by: Meteorhead clCreateBuffer need only be executed once per application run; after that it is only a question of copying and mapping. A few milliseconds is acceptable in my opinion.

It is the copying and mapping that takes a few ms. The size of that setup cost determines which algorithms can be accelerated with OpenCL. If you only think in terms of data the size of HD images, then that is indeed fine.

Originally posted by: Meteorhead clSetKernelArg at 1us is as good as it can get, as it has to reinterpret a pointer. Even if it is on the same physical memory, kernels use different addresses (which might even be virtual), and thus such calls have to go through the API (I believe for at least a dozen other reasons too). 1us is about the time of 1 clock tick of a CPU, and that's about as fast as it can get.

If the memory is the same physical memory, it is the same memory within the same address space. There is no other way to put it 🙂

If some address is virtual, that affects only GPU devices which don't have common memory.

Thanks!
Atmapuri

LucasCampos · ‎07-03-2011

I'd like to see a built-in random number generator with a few distributions, such as Gaussian and uniform.

rougered · ‎07-16-2011

Hi,

The thing I miss the most is BLAS and LAPACK written in OpenCL. Of course performance should be optimal on ATI cards... but they should be portable to other platforms.

Also, it would be nice if the clpp project got more support, since I believe it has potential and right now it is very slow on ATI cards.

thank you

Riccardo

dragonxi4amd · ‎07-18-2011

Hi,

We already have Python libraries to support devices for the whole alphabet, i.e. APU, BPU, CPU, DPU, ..., and would welcome AMD APP support for all those devices.

~Ronnie

jholewinski · ‎07-25-2011

So, AMD APP support for GPUs on Linux without a running X server has been brought up in this thread multiple times. Could we get an update from AMD on whether or not this is being worked on, or even considered by the driver team?

We do a lot of research on heterogeneous computing, especially GPU acceleration of scientific code. So far, we have primarily used NVIDIA/Intel hardware, but would like to also explore AMD hardware, especially the Fusion architectures and high-end Radeon cards. However, our experience so far with AMD hardware has been less than ideal, since we need to substantially modify the software installations on our compute machines to accommodate a full X server installation, along with appropriate permissions and handling of the DISPLAY environment variable depending on the type of application we are running.

Simply put, requiring a running X server on headless compute nodes not only needlessly complicates the software installations, but requires us to carefully balance GPU usage against system RAM usage from the X server.

diapolo · ‎08-06-2011

I would like to be able to use OpenCL via Remote Desktop on Windows ... I dislike that shitty VNC stuff :-/. And it would be nice to group multiple GPUs into one OpenCL device!

Dia

nou · ‎08-06-2011

Using OpenCL over RDP is possible with the latest AMD APP 2.5.

diapolo · ‎08-09-2011

Originally posted by: nou Using OpenCL over RDP is possible with the latest AMD APP 2.5.

Great news, thanks for the info 😉 I uninstalled TightVNC and OpenCL works via RDP!

Dia

FelixLegions · ‎08-10-2011

I'd like to have an optimized version of your FFT algorithm.

Especially for complex 32K FFTs, single precision. 1D. Batches of 12.

My card: 6970

3dmashup · ‎08-12-2011

An OpenCL Kernel Arguments Reflection API,

We need an API call to get back the Kernel arguments metadata

Position, Name, Type, Address space prefix, ...

genaganna · ‎08-12-2011

Originally posted by: 3dmashup An OpenCL Kernel Arguments Reflection API,

We need an API call to get back the Kernel arguments metadata

Position, Name, Type, Address space prefix, ...

3dmashup,

Could you give us some ideas of importance and usage of this?

tweenk · ‎08-12-2011

Originally posted by: genaganna
Originally posted by: 3dmashup An OpenCL Kernel Arguments Reflection API,

We need an API call to get back the Kernel arguments metadata

Position, Name, Type, Address space prefix, ...

3dmashup,

Could you give us some ideas of importance and usage of this?

I'm not the original poster, but I also see how this could be useful. For instance, this could be used to implement named parameter support, default arguments, and runtime type checking for clSetKernelArg. It could also help with implementing high-quality bindings for interpreted languages such as Python. Right now, keeping the kernel argument list and the clSetKernelArg calls in host code in sync is rather time-consuming and tedious. If there were a way to enumerate and inspect the arguments of compiled kernels, parts of this task could be automated or at least made less error-prone. I realize this would require extensions to the OpenCL specification.

The most obvious place to implement this is clGetKernelInfo. The function could accept cl_kernel_info values such as CL_KERNEL_ARG_TYPE_n, CL_KERNEL_ARG_VALUE_n, CL_KERNEL_ARG_NAME_n, CL_KERNEL_ARG_ADDRESS_SPACE_n and CL_KERNEL_ARG_IMAGE_ACCESS_n, where n is a number from 0 to the maximum argument count, which I believe is something like 2048. If this number of defines / enum constants is prohibitive, the constants could instead be specified as a macro like CL_KERNEL_ARG_TYPE(n).

While we are at clGetKernelInfo, it would be useful if it also exported information such as the required work group attribute (e.g. CL_KERNEL_REQD_WORK_GROUP_SIZE).
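To make the named-parameter and type-checking idea concrete, here is a minimal host-side sketch in plain C. Everything in it is hypothetical: the arg_info struct, find_arg and arg_size_matches are illustration names, and the table is filled in by hand as a stand-in for whatever a reflection query would return. It makes no OpenCL calls.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-argument metadata, standing in for what a
   reflection query might return. Filled in by hand in this sketch. */
struct arg_info {
    const char *name;       /* argument name as written in the kernel source */
    const char *type_name;  /* e.g. "float*" or "uint" */
    size_t      size;       /* size in bytes the setter should be given */
};

/* Named-parameter lookup: returns the argument position, or -1 if
   no argument with that name exists. */
int find_arg(const struct arg_info *args, int count, const char *name)
{
    for (int i = 0; i < count; ++i)
        if (strcmp(args[i].name, name) == 0)
            return i;
    return -1;
}

/* Runtime check a binding layer could run before forwarding to the
   real setter: the index must be in range and the size must match. */
int arg_size_matches(const struct arg_info *args, int count,
                     int index, size_t size)
{
    return index >= 0 && index < count && args[index].size == size;
}
```

A binding layer could call find_arg to resolve a name to a position, reject the call if arg_size_matches fails, and only then forward to clSetKernelArg.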

3dmashup · ‎08-12-2011

Larger image2D max height and width. AMD APP SDK 2.5 supports a maximum of 8K for an Image2D height or width.

We have some very wide but very short (1px high) images.

It would be great to increase this limit to 32K or 65K.

genaganna · ‎08-12-2011

Originally posted by: 3dmashup Larger image2D max height and width. AMD APP SDK 2.5 supports a maximum of 8K for an Image2D height or width.

We have some very wide but very short (1px high) images.

It would be great to increase this limit to 32K or 65K.

Is it not possible for you to use buffers instead of images?

mosix0 · ‎08-13-2011

Support fork():

It is currently impossible for both a parent process and its child process to use OpenCL. This is because once the parent has used any OpenCL function (even just querying the available platforms and devices without creating a context), the SDK leaves state in memory and in open files that breaks the child's attempt to use the SDK.

A function such as "clForked()" can solve this issue by resetting the SDK's state.

Also, or alternately, a "clRelinquish()" call by the parent can release all its OpenCL resources, especially closing device-driver file-descriptors.

For example:

clGetPlatformIDs(...);
clGetPlatformInfo(...);
clGetDeviceIDs(...);
/* record number of available GPUs */

if (!fork())
{
    clForked();
    if (two_or_more_gpus_are_available)
        execl("application_parallel_version", ...);
    else
        execl("application_single_gpu_version", ...);
}
else
    clRelinquish();

/* Parent does other things while child runs OpenCL application */
wait(0);
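Until something like clForked() exists, one possible workaround is to keep the parent free of OpenCL state by running the device query in a short-lived child and sending the result back over a pipe. The sketch below is only an illustration of that pattern, under the assumption that the SDK's state stays confined to the process that made the calls. probe_gpu_count() is a hypothetical stand-in that returns a fixed number; in a real program it would wrap the clGetPlatformIDs / clGetDeviceIDs calls.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the real device query. It returns a fixed
   value here so the sketch runs without an OpenCL installation. */
static int probe_gpu_count(void)
{
    return 2;
}

/* Run the probe in a throwaway child so the calling process never
   touches the SDK. Returns the count, or -1 on any failure. */
int gpu_count_via_child(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: query, report through the pipe, exit immediately. */
        close(fds[0]);
        int n = probe_gpu_count();
        ssize_t w = write(fds[1], &n, sizeof n);
        _exit(w == (ssize_t)sizeof n ? 0 : 1);
    }

    /* Parent: read the result and reap the child. */
    close(fds[1]);
    int n = -1;
    ssize_t r = read(fds[0], &n, sizeof n);
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (r != (ssize_t)sizeof n || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return n;
}
```

The parent can then fork the real worker and exec the parallel or single-GPU application based on the returned count, without having called into the SDK itself.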

maximmoroz · ‎08-18-2011

Well, we definitely need the OpenCL implementation to use DMA to overlap memory transfers made by Map/Unmap/Read/Write with kernel execution (using either out-of-order queues or multiple queues). And it should do so for generic buffers. Is there anything that prevents copying data from any host memory to any device memory simultaneously with kernel execution? Does the hardware need the data to be copied through pinned memory? Then do it under the hood, inside the drivers.

spectral · ‎08-25-2011

Hi,

Imagine I have a library of functions (up to 300 functions).

On top of this I have 4-5 methods that I generate dynamically at run time. These methods use some of the 300 functions.

The problem is that each time I want to run the application I have to recompile the 300 functions and the 5 dynamic functions.

What I would like is to compile the 300 functions once and create a "binary" (there is no kernel in it). Later, when I compile the 5 functions, I would tell the compiler to use this library too!

It is a kind of dynamic linking with binaries!

More info :

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=153507&enterthread=y

corry · ‎09-01-2011

My guess is this has been suggested in one of the other 6 pages of this thread, but please, let us do inline IL the way NVIDIA allows inline PTX. I can already see the compiler is not optimizing my C-like rotations (shifts, ands, ors) to bitalign. It would be nice if I could specify it myself.
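For readers unfamiliar with the idiom, a C-like rotation built from shifts and an OR typically looks like the sketch below. It is plain host C, and the name rotl32 is only for illustration. OpenCL C also provides a built-in rotate() function; whether either form ends up as a single bitalign instruction is up to the compiler.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits using only shifts and an OR.
   n is reduced modulo 32, and the (32 - n) & 31 form avoids an
   undefined shift by 32 when n is 0. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}
```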

Add BFI_INT to IL.

Undeprecate CAL.

Make the OpenCL compiler *ALWAYS NO MATTER WHAT* spit out some error message. These compilations that just "Failed To Build" are 100% unacceptable.

Fix *ALL* of the documentation.

Not really an APP thing, but categorize forums, or at least posts based on architecture being posted for. There is so much apparent misinformation for CAL because of the mixture of architectures. I think 99.9% of it is valid, in the context of the architecture the thread is referencing. It becomes very difficult to figure out texture vs vertex vs global vs local caches, max buffer sizes, etc. All this could be cleaned up, and made quite understandable, and quite usable by simply categorizing the forums a little more.

MicahVillmow · ‎09-01-2011

corry,
To address your points.
- BFI_INT will be in one of the upcoming catalyst releases.
- CAL will not be undeprecated.
- Please supply test cases where the error message is insufficient so we can correct the issue.
- Again, please supply examples where the documentation is insufficient so we know where to improve.

corry · ‎09-01-2011

Also, the topic title, guess it didn't get added in, don't forget inline IL. Please, please please, give us inline IL. 90/10 rule still applies on GPUs, and I'd really like to be able to optimize that 10 area, and let the compiler handle the grunt work. I don't care if I have to pragma enable it, or if Khronos throws a fit about it, calls it nonstandard, etc...I really don't! Anyone knows dropping to asm means you're not going to be able to run it on any other platform. In my case, there will only be one platform! I don't need to be guarded against writing platform specific stuff, I know what I'm doing!

I'll PM a response for the rest... Yeah, I can be standoffish, and I feel the response about documentation and the compiler warrants it; however, I don't want to steer other people away from here. The hardware is incredible, the APU integration has me drooling like crazy, and in general I think AMD is moving in a very positive direction... just a few quirks that really need some ironing out.

spectral · ‎09-05-2011

In CUDA we can access buffers in any function without passing them as parameters. In OpenCL we have to pass them as parameters! Direct access is useful when you have a lot of variables and buffers to reach.

For now I use a temporary structure (that contains all the pointers) and pass it to all the functions. At least on the CPU, such a buffer should be directly accessible!

It is also a kind of optimization, and makes for easier code.

void myMethod()
{
    myBuffer[gid].val = 15;
}

Maybe we should have a kind of '__static' keyword ?!

himanshu_gautam · ‎09-05-2011

Thanks for the suggestion.

BTW, how will static solve the above problem?

spectral · ‎09-13-2011

Here is an example. Tell me if anything is unclear.

void myMethod()
{
    if (gid < _memSize)
        _memData[gid].val = 15;
}

__kernel void myKernel(__static __global float* _memData, __static uint _memSize)
{
    ....
}

MicahVillmow · ‎09-13-2011

debdatta_basu,
For #6 please see the cl_amd_popcnt extension.

debdatta_basu · ‎09-13-2011

Dear Micah,

I am aware of that extension. However, I wanted it in the core spec, or at least as a khr extension, as NVIDIA doesn't have anything similar for OpenCL yet.

Regards,

Debdatta Basu.

corry · ‎09-15-2011

Can we get a byte order reversal instruction? We have bitalign and bytealign to let us do bit/byte rotations, so it shouldn't be hard to add a byte order reversal. In OpenCL this could be exposed much like it is in MSVC for x86 processors (though I guess it would have to be part of an AMD-specific extension). From the MSDN page:

unsigned short _byteswap_ushort (
   unsigned short val
);
unsigned long _byteswap_ulong (
   unsigned long val
);
unsigned __int64 _byteswap_uint64 (
   unsigned __int64 val
);

Heck with that, could you give us those OpenCL instructions, but implement them on the GPU like the x86 SSSE3 instruction pshufb? I can think of a lot of situations where pshufb has come in handy! Would absolutely love to have it on the GPU!

tweenk · ‎09-20-2011

Originally posted by: corry Can we get a byte order reversal instruction?

There is no need to add another function that uses this instruction, just emit it when compiling code like this:

unsigned int swapped = as_uint(as_char4(input).wzyx);

Note that it crashed for me on SDK 2.3 and R700 series card when I tried to use it to byteswap floats. I haven't re-tested since then as I converted this to bitwise operations on uints.
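The "bitwise operations on uints" route mentioned above could look roughly like the following host-side C sketch. The function names are illustrative and not taken from the post. For floats, the sketch copies the bits into an integer first and keeps the swapped result as an integer, on the assumption that float is a 32-bit IEEE-754 type.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of one 32-bit value with shifts and masks. */
uint32_t bswap32(uint32_t v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}

/* Byte-swap the bit pattern of a float. The result is returned as an
   integer because the swapped pattern is generally not a meaningful
   float. Assumes float is 32 bits wide. */
uint32_t float_bits_swapped(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* well-defined way to read the bits */
    return bswap32(u);
}
```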

corry · ‎09-21-2011

Originally posted by: tweenk
Originally posted by: corry Can we get a byte order reversal instruction?

There is no need to add another function that uses this instruction, just emit it when compiling code like this:
unsigned int swapped = as_uint(as_char4(input).wzyx);
Note that it crashed for me on SDK 2.3 and R700 series card when I tried to use it to byteswap floats. I haven't re-tested since then as I converted this to bitwise operations on uints.

That might work for single-dword byte reversal, but since the register components are 32 bits wide and the entire register is 128 bits wide, there would be a massive increase in byte order reversal performance from swapping 128 bits at a time, like SSE does with pshufb. With 32-bit components, a move swizzle just reverses dword order, which is useless for byte swapping.

In SSE, this is trivial with this

movdqa xmm1, XMMBSWAPVAL;
movdqa xmm0, [rsp+myBufferOffset]
pshufb xmm0, xmm1
movdqa [rsp+myReversedBufferOffset], xmm0
.DATA
align 16
XMMBSWAPVAL:
    DD 00010203h
    DD 04050607h
    DD 08090a0bh
    DD 0c0d0e0fh

Or with intrinsics I suppose

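#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */

__m128i Source, Dest;
/* _mm_set_epi32 takes the highest dword first, so the lowest dword is 0x00010203 */
__m128i bSwapVal = _mm_set_epi32(0x0c0d0e0f, 0x08090a0b, 0x04050607, 0x00010203);
Dest = _mm_shuffle_epi8(Source, bSwapVal);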

Of course, that would normally be in a loop over some largish data, and that's for packed 32-bit integers, but you get the idea: 128 bits at a time, and capable of handling arbitrarily sized inputs (2, 4, 8, or heck, even 16-byte integers).

It would be nice for when you use the SIMD as a SIMD, as in when I have all registers full of 32-bit data and want to byte-reverse each individual 32-bit component. Byte swapping a buffer can account for up to 20% of the time spent in some algorithms, just because you happen to be receiving in network byte order (the way you're supposed to do it).
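To spell out what the requested operation does, here is a scalar C emulation of a 16-byte shuffle with pshufb-style semantics, together with the mask that byte-reverses each 32-bit component. The names shuffle_bytes16 and BSWAP32x4_MASK are made up for this sketch. It only illustrates the semantics and is not how a GPU instruction would be implemented.

```c
#include <stdint.h>

/* Scalar emulation of a 16-byte shuffle with pshufb semantics:
   if the high bit of a mask byte is set, the output byte is zero;
   otherwise the low 4 bits of the mask byte select the source byte.
   dst must not overlap src. */
void shuffle_bytes16(const uint8_t src[16], const uint8_t mask[16],
                     uint8_t dst[16])
{
    for (int i = 0; i < 16; ++i)
        dst[i] = (mask[i] & 0x80u) ? 0u : src[mask[i] & 0x0Fu];
}

/* Mask that reverses the bytes inside each of the four 32-bit
   components. These are the same bytes, in memory order, that the
   XMMBSWAPVAL table above holds on a little-endian machine. */
const uint8_t BSWAP32x4_MASK[16] = {
     3,  2,  1,  0,
     7,  6,  5,  4,
    11, 10,  9,  8,
    15, 14, 13, 12
};
```

Other masks give other element sizes: reversing all 16 indices swaps a single 128-bit value, and swapping adjacent pairs handles 16-bit elements, which is the flexibility the post is asking for.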