Archives Discussions

afo · ‎08-18-2010

New release of SDK is even worse than 2.1...

Dear people:

I returned to an OpenCL project that I was working on, and checking AMD's web site I saw a new SDK (2.2). I installed it with the recommended driver (Catalyst 10.7 Update Driver for OpenCL™ 1.1 Support) and I found these issues:

1) I cannot install the kernel profiler; the installer simply hangs (maybe it is my fault, so I will not blame AMD for it, but with 2.1 it installed flawlessly).

2) The "enable all" pragma extension no longer works: <#pragma OPENCL EXTENSION all : enable> generates an unrecognized OpenCL extension error. I need to enable extensions one by one to get my kernel compiled. This happens in the Stream Kernel Analyzer and with clBuildProgram.

3) I work with a HD5970 with 2GB; with SDK 2.1 I had 256MB of RAM to work with, with SDK 2.2 I have 128MB. I checked this by building clinfo.exe for SDK 2.1 and 2.2. Is there a good reason to release a new SDK with half the memory available compared to the previous one?

Has anybody seen the same or similar things, or am I the victim of a conspiracy? Thanks in advance for any insight about this.

best regards,

Alfonso

bpurnomo · ‎08-18-2010

Regarding 1:

ATI Stream Profiler 1.4 may take a while to install now (~10 minutes on our systems here---the second install is much faster though). It performs some initial setups and registers the plugin with Visual Studio during that time. We are looking for ways to speed up this process. Thank you for the feedback.

d_a_a_ · ‎08-18-2010

Originally posted by: afo Dear people:

I returned to an OpenCL project that I was working on, and checking AMD's web site I saw a new SDK (2.2). I installed it with the recommended driver (Catalyst 10.7 Update Driver for OpenCL™ 1.1 Support) and I found these issues:

3) I work with a HD5970 with 2GB; with SDK 2.1 I had 256MB of RAM to work with, with SDK 2.2 I have 128MB. I checked this by building clinfo.exe for SDK 2.1 and 2.2. Is there a good reason to release a new SDK with half the memory available compared to the previous one?

Try this:

The ATI Stream SDK v2.2 currently defaults to exposing 50% of the physical GPU memory to OpenCL™ applications. Certain developers may require accessing additional physical memory in order to optimize their applications when running on the GPU.

For developers who wish to experiment with increasing the amount of physical memory that is accessible to their OpenCL™ applications, the default 50% setting can be changed by setting the environment variable GPU_MAX_HEAP_SIZE to the percentage of total GPU memory that should be exposed.

For example, if you wanted to set the exposed GPU physical memory size to 75%, you need to set the GPU_MAX_HEAP_SIZE environment variable to 75.

GPU_MAX_HEAP_SIZE must be set to an integer value between 0 and 100, inclusive.

http://developer.amd.com/support/KnowledgeBase/Lists/KnowledgeBase/DispForm.aspx?ID=123
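To make the rule quoted above concrete, here is a minimal C sketch of how a tool might validate a GPU_MAX_HEAP_SIZE value and turn it into an exposed size. The helper names `parse_heap_percent` and `exposed_heap_mb` are made up for illustration, and rejecting out-of-range values is an assumption: the quoted text only says the value must be an integer between 0 and 100, not what the runtime does otherwise.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a GPU_MAX_HEAP_SIZE-style value (hypothetical helper).
 * Returns the percentage in [0, 100], or -1 if the text is not a
 * plain integer in that range. */
int parse_heap_percent(const char *text)
{
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0')
        return -1;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 0 || value > 100)
        return -1;

    return (int)value;
}

/* Memory exposed to OpenCL, in MB, for a given physical size and
 * percentage (hypothetical helper, integer arithmetic). */
unsigned long exposed_heap_mb(unsigned long physical_mb, int percent)
{
    return physical_mb * (unsigned long)percent / 100UL;
}
```

For a board that reports 1024 MB, `exposed_heap_mb(1024, 75)` gives 768, and an old MB-style setting such as `512` fails the range check. In a real program the text would typically come from `getenv("GPU_MAX_HEAP_SIZE")`.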

bubu · ‎08-19-2010

For me, ATI's implementation of OpenCL simply is not production-ready:

1. Each time I try to install the SDK, the ATI Stream Profiler's dreadful installation breaks Visual Studio in such a way that I must reinstall VS. I won't even mention that it takes FOREVER to install ( and, apparently, to execute too ).

The previous version, which installed well, cannot profile a DLL. That's a big problem because most of the GPGPU programs that currently exist are DLL plugins ( Photoshop, 3dsmax, Maya, Premiere, Nuke, etc... )

And now that I'm talking about the installer... do you REALLY need the user to click over 40 buttons to install? Use MERGE modules ! Put a single license at the start, select the components you want to install, click ONE button and wait... that's how it should be done.

2. That 128MB memory limitation is ridiculous. I know about the environment variable workaround but I don't think that's the solution.

No other OpenCL SDK, like the Cell, the NVIDIA or the KFerry ones, exposes that.

As I said many times, that limit causes more problems than it solves ( see my 3dsmax + PS example in other posts ).

If I bought a 2GB card I WANT to use 2GB ( minus the framebuffer and the OS's resources ), not 128MB. Ridiculous, as I said.

3. The SKA just crashes each time I try to analyze a medium-complex kernel or puts N/A on all.

4. Application deployment is a NIGHTMARE. Without OpenCL.dll embedded in the Catalyst drivers, or a much better planned .MSI packaging, NO app can be deployed correctly. An example:

Imagine I write a DLL plugin for Photoshop that occupies 50KB.

I must add to the .exe installer the TWO versions of the SDK 2.2 ( Windows7/Vista + WinXP, in x86/x64 flavours too ). So, my deployment will use more than 200MB ... for a 50KB plugin.... C'mon !

On the other hand, your (inflated) MSI package does not expose a silent mode and is not modular either ( for instance, an end user doesn't need the .h/libs, just the runtime )

Do you know how much is the NVIDIA-version deployment? Z.E.R.O, as yours should be.

5. The console app window appears almost each time you call a function. Really annoying.

6. Why does the CPU device not support images? Is it really so difficult to add a bilinear/mipmap filter? That's no more than 100 lines of C++ code.

Considering that AMD makes CPUs, I really cannot understand why the CPU is not a first-class OpenCL citizen.

7. No DMA nor real asynchronous support. CUDA supports all of it, even concurrent kernels on Fermi. This is critical to perform GPU<->host transfers fast.

8. I won't even discuss the general speed ( compared with NVIDIA's OpenCL or CUDA ) or the multiple bugs in your LLVM-clang JIT compiler ( which are not really your fault but the LLVM team's ).

9. While the NVIDIA implementation works with any G80 or above card ( with only a speed penalty ), ATI's 4XXX cards have tons of problems with OpenCL ( btw, they are the most used for development according to AMD's own dev portal poll ).

10. The lack of a real debugger ( like NVIDIA's NSight ) is a severe problem, especially if you develop very complex kernels. The printf extension is not supported on GPU devices either.

11. Are you actually contacting and supporting companies offering a GPGPU solution? I ask this because almost every OpenCL video I've seen on the Internet says "My program XXXXX does not run yet on ATI. It works only with NVIDIA's cards atm".

Incredible, considering this is a 2.2 version and not a Beta.

Sorry if I sound too harsh but I'm really getting tired of posting problems while each new SDK you release does not solve even one of the problems we mention ( for example, the console window ).

fpaboim · ‎08-19-2010

I can say that i'm also particularly disappointed. I've had a 5870 on my desktop for a year and I just bought a laptop with a 5870 mobile (which could just have been named appropriately, but just blame the marketing dept...) to be used for scientific gpu development and had high expectations for the current sdk release, especially considering the long release cycles used by amd with respect to drivers and the sdk compared to nvidia. While the ati has better hardware, it seems nvidia has much better software as a whole to support development and takes better advantage of what it has. I'm sure ATI's developers are extremely competent, and the ones we are in contact here are very thoughtful and helpful but it seems the company is constrained in manpower, $$, time.. well something, since after such extensive development time they can't push out a release that is stable and fixes bugs which are known for some time. Sorry, had to do that, just venting... hope it works out in the next release, i'm bought in ATI now anyway...

ryta1203 · ‎08-19-2010

I think AMD simply doesn't have the resources, or has decided not to devote those resources to GPGPU.

I agree that the hardware is there but the software, compared to Nvidia, is junk.

It's confusing that a card with almost 3x the performance of its competitor can't really compete with said competitor.

Though I will say that Nvidia has the jump because they started their hardware for GPGPU way before AMD/ATI did. That's why OpenCL works with G80 (also that OpenCL is very much like CUDA doesn't hurt).

Though I will say that I'm disappointed that AMD/ATI seems to be stopping real OpenCL support for the 48xx series (though I can only imagine this is due to hardware limitations of the 48xx not fitting in with the OpenCL specs).

afo · ‎08-19-2010

Dear people:

first of all, thanks a lot to all for the feedback, I don't feel so lonely in this journey.

Now my feedback:

1) As bpurnomo said, kernel profiler takes 10 minutes to install/uninstall. I think that this should be commented in the release notes.

2) As I said, #pragma OPENCL EXTENSION all : enable no longer works, but #pragma OPENCL EXTENSION all : disable is recognized by the compiler. Maybe this is documented in someplace, but I didn't find it.

3) I tried GPU_BYTE_ADDRESSABLE_STORE=1; GPU_MAX_ALLOC_SIZE=512 and GPU_MAX_HEAP_SIZE=50 / GPU_MAX_HEAP_SIZE=512, but I still can't allocate a buffer larger than 128MB with SDK 2.2 or larger than 256MB with SDK 2.1. clCreateBuffer returns an invalid buffer size error, and clinfo says Max memory allocation: 268435456 (256MB) for SDK 2.1 and 134217728 (128MB) for SDK 2.2. Maybe in SDK 2.2 one can use 50% of the board's memory, but it seems that this can't be done in chunks greater than 128MB. Can somebody confirm that?

Again, thanks for the insights.

best regards,

Alfonso

afo · ‎08-19-2010

Hi all,

After some experimentation I have this new (for me) information:

1) I have an XFX HD5970 Black Edition board. It says that it has 2GB of RAM, but Catalyst shows only 1024MB in the hardware information window. Is there a way to see the total memory, or is this memory split between the two graphics processors (1GB each)?

2) with GPU_MAX_HEAP_SIZE one can set a percentage of that ram in clinfo's global memory size.

3) I did a full uninstall of SDK2.2+cat10.7b; reinstalled SDK2.1+cat10.4; full uninstall of SDK2.1+cat10.4; full install of SDK2.2+cat10.7b and now max memory allocation is 241434624 (230MB) no matter how I play with the environment variables. By the way: there is a bug in WinXP-32 where, if you rename or delete an environment variable, it still shows up when you call SET from a command prompt. You have to restart the system to make the environment variable go away.

best regards,

Alfonso

bubu · ‎08-19-2010

That environment variable is just a very bad idea:

1. Imagine you create an installer that set the GPU heap to 100%.

Now imagine other program sets it to 10%...

Now imagine a user performs a Windows repair ( which might reset the env vars btw )

Now imagine the typical inexperienced user who loves to tweak things starts to touch the env vars...

Now let me remind you that env vars can be defined PER USER. One user might have 30 OpenCL apps so, if he wants to run a lot of OpenCL apps at the same time, he must lower the max heap size. On the other hand, another user might want to execute only one... so your installer simply cannot set the env var to a fixed size because it varies too much from one user to another.

As I said, the very idea of limiting the amount of VRAM that can be used is a design problem. If the user has a 2GB card just let HIM decide what apps to run, it's as simple as that. That's what other OpenCL SDKs do and that policy works perfectly. The program/user KNOWS how to deal with the amount of memory needed, the implementation itself does NOT.

Please, remove that env var mechanism and let us allocate the maximum VRAM available on the card. If the user tries to open another memory-intensive app then that app's clCreateBuffer will simply fail, putting an out-of-VRAM message on the screen and forcing the user to shut down some OpenCL apps if he wants to run the app. As simple as that.

On the other hand, apps like Photoshop use resource pools so the user can control the maximum quantity of memory to use. That's the way to do things, BY APP, not by a shoddy env var....

And now a question.... DX10/11 and OpenGL supports VRAM virtualisation ( memory swapping / managed textures ).... Perhaps you should implement a similar system and to define a flag for the developers to control the behavior in the clCreateBuffer call.

But that memory limitation is just a small problem compared with all the problems I mentioned. You really should start fixing those small things that make your SDK completely unusable.

afo · ‎08-19-2010

Another update:

You need 2 environment variables: GPU_INITIAL_HEAP_SIZE and GPU_MAX_HEAP_SIZE; you can set them from 1 to 100 (% of the RAM reported by Catalyst). I set both of them to the same value (100). With these values global memory size is 1GB and max memory allocation is 256MB.

By the way: "max memory allocation" is always 1/4 of "global memory size", so if you take the default value (50%) you will have 512MB of global memory and 128MB of max allocation memory. I think that this should be in the developer release notes.
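The pattern described in this post can be written down as a short C sketch. It only encodes what was observed here (global memory is a percentage of the memory reported by Catalyst, and max allocation is a quarter of that); it is not a documented formula, and the names `heap_limits` and `expected_limits` are hypothetical.

```c
/* Observed sizes, in MB (hypothetical type for illustration). */
typedef struct {
    unsigned long global_mb;     /* clinfo "global memory size" */
    unsigned long max_alloc_mb;  /* clinfo "max memory allocation" */
} heap_limits;

/* Apply the empirical rule from the post: global = reported * pct / 100,
 * max allocation = global / 4. */
heap_limits expected_limits(unsigned long reported_mb, unsigned long heap_percent)
{
    heap_limits limits;

    limits.global_mb = reported_mb * heap_percent / 100UL;
    limits.max_alloc_mb = limits.global_mb / 4UL;
    return limits;
}
```

With 1024 MB reported, 100% gives 1024 MB global and 256 MB per allocation, and the default 50% gives 512 MB and 128 MB, which are the figures quoted above.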

hope this helps other people,

Alfonso

malcolm3141 · ‎08-19-2010

The ATI guys don't seem to want to explain this properly, so I'll have a go. The limit to the memory size is necessary only because of the way in which OpenCL is implemented.

In order to be a fully compliant OpenCL implementation, the global memory space has to be a single address space (contiguous and linear). This allows pointers to global memory to be stored in global memory and be valid across multiple kernel invocations.

ATI at a low level (drivers) do not seem to deal with pointers to the GPU memory, and instead seem to have some sort of handle system to blocks of memory. (as a side note this seems to also affect resource sharing between APIs, and may explain why AMD requires shared textures and nVidia explicitly non-shared textures). I don't know enough about this lowest level to comment on why they did this, but I am sure there are valid reasons.

Anyway, the upshot of this is that the OpenCL runtime has to allocate a contiguous block of memory on the GPU to use as the heap for OpenCL programs. Once allocated, this block of memory is no longer available for any other API (such as DX, OpenGL or even your normal display).

So, raising the OpenCL heap size may cause havoc with other apps requiring GPU resources, and so is probably only advisable in compute-only situations such as HPC.

On another note, with all this negativity regarding SDK v2.2, I would like to balance that with mention of my appreciation for some of the improvements that I have found useful in this release: Such as the improvements to the IL shader compiler (better register usage, 24bit multiplies, more efficient LDS read and write codegen). Improvements to OpenCL codegen on Evergreen GPUs (imin, imax, mul24, mad24...). And the detailed optimisation document.

Malcolm

genaganna · ‎08-20-2010

In the upcoming release, full memory reporting will be supported.

nou · ‎08-20-2010

so no longer GPU_MAX_HEAP_SIZE? great.

but what about memory virtualisation like in OpenGL or DX? is it planned for the future?

d_a_a_ · ‎08-20-2010

In order to be a fully compliant OpenCL implementation, the global memory space has to be a single address space (contiguous and linear). This allows pointers to global memory to be stored in global memory and be valid across multiple kernel invocations.

ATI at a low level (drivers) do not seem to deal with pointers to the GPU memory, and instead seem to have some sort of handle system to blocks of memory.

Thank you for your explanation. But, couldn't the fact of not having single address space (contiguous and linear) be a hardware limitation of the current ATI GPUs?

laobrasuca · ‎08-21-2010

i join you malco, there's something definitely not quite right. I have an application which uses opengl/cl interoperability, whose performance is not ok at all with ati hardware/drivers for a very precise reason: the memory management. First of all, what i do is modify the indices and positions of the mesh with opencl. generally speaking, i have this:

1- AcquireGLBuffers();

2- ModifyIndices();

3- ModifyPositions();

4- ReleaseGLBuffers();

where AcquireGLBuffers() is broken down into:

1.1- clCreateFromGLBuffer();

1.2- clEnqueueAcquireGLObjects();

both for the opengl buffer objects containing the indices and the positions. the Modify functions (2- and 3-) are broken down into:

2/3.1- CreateCLMemoryBuffersToRunKernel();

2/3.2- SetKernelArguments();

2/3.3- RunKernel();

where CreateCLMemoryBuffersToRunKernel() is broken down into:

2/3.1.1- clCreateBuffer();

2/3.1.2- clEnqueueWriteBuffer();

SetKernelArguments() contains only the clSetKernelArg() calls and RunKernel() the clEnqueueNDRangeKernel() call. Finally, ReleaseGLBuffers() combines both the clEnqueueReleaseGLObjects() and clReleaseMemObject() functions.

for debugging/profiling reasons, i check kernel run time and the overall run time for the entire procedure, where:

- kernel run time represents the sum of all enqueue methods, i.e., clEnqueueWriteBuffer() and clEnqueueNDRangeKernel();

- overall run time represents the elapsed time for the entire process, which includes the kernel run time and all the non profilable opencl commands like clCreateFromGLBuffer()/clCreateBuffer()/clSetKernelArg()/clReleaseMemObject(). i also include the profilable clEnqueueAcquireGLObjects()/clEnqueueReleaseGLObjects() commands into the overall run time.

well, in my head the overall time should be roughly the same as the kernel time, since these acquire and create buffer functions should not take time. indeed, for small meshes (with up to a few million triangles) both overall and kernel time are nearly identical. however, to my surprise, for bigger meshes, e.g., a 17 million triangle model i have, things went awfully bad.

first of all, i had errors when using clCreateFromGLBuffer() for the position buffer after acquiring the indices buffer, but no errors if the indices buffer was not previously acquired. this is when i first realized that the use of max_gpu_mem_stuff was required (i have an hd 5770 with 1GB of VRAM). thinking about it, it is pretty obvious, since 17MT means ~200MB of indices data and ~100MB of position data.

but the bad thing about all this is that even if it runs, and i have the mesh displayed, the overall run time is far worse than the kernel run time, like 15ms for kernel time and a miserable 1.7s for the overall time. breaking down the overall time, the clCreateFromGLBuffer() for the position buffer takes ~500ms by itself (it is called after the indices buffer is acquired).

in the meantime i could lay my hands on an old geforce G80 card (with 1.0 opencl drivers) and tested the very same code and guess what, these ~500ms just disappeared (less than 1ms) and the overall time is almost identical to the kernel time. after this, i went back to the 5770 and noticed that if i don't use more than 256MB of memory, for example, if i modify only the indices, or only the positions, the overall time is almost identical to the kernel time.
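The buffer sizes quoted above can be checked with a little arithmetic, sketched here in C. The assumptions are not stated in the post and are only plausible guesses: 32-bit indices, three 32-bit float coordinates per vertex, and a vertex count of about half the triangle count (a common ratio for closed triangle meshes). The function names are hypothetical.

```c
#include <stdint.h>

#define BYTES_PER_INDEX 4u  /* assumption: 32-bit unsigned indices */
#define BYTES_PER_COORD 4u  /* assumption: 32-bit float coordinates */

/* Three indices per triangle. */
uint64_t index_buffer_bytes(uint64_t triangles)
{
    return triangles * 3u * BYTES_PER_INDEX;
}

/* Three coordinates (x, y, z) per vertex. */
uint64_t position_buffer_bytes(uint64_t vertices)
{
    return vertices * 3u * BYTES_PER_COORD;
}

/* Returns 1 if both buffers together need more than limit_bytes. */
int exceeds_limit(uint64_t triangles, uint64_t vertices, uint64_t limit_bytes)
{
    return index_buffer_bytes(triangles) + position_buffer_bytes(vertices)
           > limit_bytes;
}
```

Under these assumptions, 17 million triangles need 204,000,000 bytes of indices and 8.5 million vertices need 102,000,000 bytes of positions. Each buffer alone fits under 256 MB (268,435,456 bytes), but the two together do not, which is consistent with the slowdown appearing only when both buffers are in use.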

well, after all this bla-bla-bla what i want to say is that even if max_gpu_mem_stuff allows us to use more than 256MB in VRAM, it does not fix the slow buffer memory management process. i am pretty surprised with all this, because first of all, with ati opengl drivers i can use nearly as much as 1GB of memory to create object buffers with no gotchas, and second of all, since the opencl context uses the opengl context, i was expecting to be able to use the same memory space/layout.

anyways, i have no idea how much work it would take to change all this, but something must be done, and i'm pretty confident that you guys will bring us good news in september

nou · ‎08-22-2010

you have a flaw in your code. call clCreateFromGLBuffer() just once. after that you only need to call clEnqueueAcquireGLObjects() and clEnqueueReleaseGLObjects().

look into SimpleGL example.

laobrasuca · ‎08-22-2010

maybe i didn't explain it very clearly, but i do call clCreateFromGLBuffer() just once, then i call clEnqueueAcquireGLObjects(), process the data and at the end i call clEnqueueReleaseGLObjects() and clReleaseMemObject(). but i need to do it for both the index and position opengl buffer objects. as for the SimpleGL example, i did take it as a reference; btw, i do things in exactly the same order that it does.

but, anyway, since the very same code works just fine on nvidia cards, it undoubtedly implies that ati drivers are not 100% ok regarding memory management. and this is what i want to echo here. you can do the AcquireGLBuffers() test by yourself and you will see.

laobrasuca · ‎08-22-2010

i wonder if the new hd6000 series cards will have hardware even closer to the openCL specs. I mean, since the hd4000 series does not support a number of openCL features coz it was designed before openCL, and since the hd5000 was designed at least in part to better accommodate the openCL programming model but has driver issues, i wonder if these driver issues are not a reflection of the hd5000 hardware, which is not designed to support openCL 100%. Maybe the openCL model is fairly far from the GL or DX models, which are more important to support than GPGPU models since these are gaming cards. Really duno, just wondering ^^,

on the other side, since the G80, nvidia's cards have been designed to support CUDA, which undoubtedly is the base model for openCL. that's maybe why the nvidia cards/drivers are less prone to flaws when it comes to executing openCL code, due to their maturity.

moozoo · ‎08-23-2010

Re: HD5970

My understanding is that each GPU has 1GB of memory which is not shared.

I think AMD have huge potential to displace Nvidia from the GPGPU market.

All they need to do is not cripple DP performance, fully support CPU OpenCL and deliver a solid implementation of OpenCL that does not require an SDK install.

philips · ‎08-19-2010

The thing about the buffer size limitation is interesting. I read about that workaround, but I thought it referred only to the full memory size you can access with more than one buffer.

Have you tried how much memory you can access with one buffer?

If I have a 1GB card, can I somehow access more than 256 MB in one buffer?

Illusio · ‎08-19-2010

Originally posted by: philips

If I have a 1GB card, can I somehow access more than 256 MB in one buffer?

Yeah. I have no issues with that using these environment variables.

GPU_BYTE_ADDRESSABLE_STORE=1
GPU_MAX_ALLOC_SIZE=512
GPU_MAX_HEAP_SIZE=512

philips · ‎08-19-2010

Thank you.

Now I can access 256 MB of 1024. That's already a big improvement.

But somehow I still can't get more than that even though I've set your variables, Illusio

nou · ‎08-19-2010

GPU_MAX_HEAP_SIZE was changed from MB to %. so set it to 50 and you should get 512MB

Illusio · ‎08-19-2010

Strange. Works fine for me. Did you try with max alloc size=512 or did you ramp it all the way up to 1024? I guess it's possible that there's a max of .5 times physical ram. /shrug For the record, I've been allocating 400ish MB. It fails when I go above 512 even if I ramp the environment variables up.

I'm using a 5870 with 1024MB ram anyway.

Edit: Never mind me. My memory test app had the CPU device hardcoded. Also, there are posts on this forum stating that the MAX_ALLOC_SIZE option was discontinued at one point. It fails like on your end once I use the GPU. So it looks like what you've got now is the best you can get at present. It's been like a year since I messed with those variables.

MicahVillmow · ‎08-23-2010

afo,
"all : enable" is disallowed by section 9.1 of the OpenCL spec.

afo · ‎08-23-2010

That is correct... My fault, I apologize.

best regards,

Alfonso

rick_weber · ‎08-23-2010

Why did the cl_khr_fp64 extension change to cl_amd_fp64 in v2.2? Also, const sampler_t (the correct way to declare a sampler as per Section 6.11.13.1) worked in v2.1 if I recall, but now you have to use __constant sampler_t. Why this change? These tidbits make portability between Nvidia and AMD OpenCL more painful, as you have to pass the platform vendor as a #define and then switch on that.

Other than that, I haven't had any problems and am glad to finally see double precision FMA.

timchist · ‎08-29-2010

Originally posted by: MicahVillmow afo, "all : enable" is disallowed by section 9.1 of the OpenCL spec.

That's right. However, what should I do if I have a kernel that has two implementations, one using some extension and one without it, and uses #ifdef cl_khr_.... to tell the compiler which one to use, depending on the extensions supported by the current implementation? For example, I have code that compiles under NVIDIA OpenCL and on RV770 and uses different approaches depending on whether cl_khr_byte_addressable_store is supported.

Some way is needed to tell the compiler to enable certain extensions only if they are supported. I cannot include '#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable' in the code, since it won't compile on RV770. But if I don't, the extension won't be used when compiling under the NVIDIA implementation either, which is not what I want.

How can this problem be solved?

rick_weber · ‎08-29-2010

I solved this by doing the following:

Query the OpenCL device to get its platform vendor.

Make this string all uppercase and change commas, periods and spaces into underscores.

When compiling, pass -D and the platform string.

Now in your kernel, you can do:

#ifdef NVIDIA_CORPORATION
#pragma OPENCL EXTENSION blah_khr : enable
#else
#ifdef ADVANCED_MICRO_DEVICES__INC_
#pragma OPENCL EXTENSION blah_amd : enable
#endif
#endif
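The host-side half of these steps could look roughly like the C sketch below. It assumes the vendor strings are "NVIDIA Corporation" and "Advanced Micro Devices, Inc.", which is what the macro names in the post suggest; in a real program the string would come from clGetPlatformInfo with CL_PLATFORM_VENDOR. The function name `vendor_to_define` is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Build a "-D<VENDOR>" build option from a platform vendor string:
 * uppercase everything and turn commas, periods and spaces into
 * underscores. Returns 0 on success, -1 if `out` is too small. */
int vendor_to_define(const char *vendor, char *out, size_t out_size)
{
    size_t i;
    size_t n = strlen(vendor);

    if (n + 3 > out_size)  /* "-D" + name + terminating NUL */
        return -1;

    out[0] = '-';
    out[1] = 'D';
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)vendor[i];

        if (c == ',' || c == '.' || c == ' ')
            out[i + 2] = '_';
        else
            out[i + 2] = (char)toupper(c);
    }
    out[n + 2] = '\0';
    return 0;
}
```

The resulting string is what would be passed as the options argument of clBuildProgram, so that the kernel source can test the macro with #ifdef.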

timchist · ‎08-29-2010

Originally posted by: rick.weber

Now in your kernel, you can do:

#ifdef NVIDIA_CORPORATION
#pragma OPENCL EXTENSION blah_khr : enable
#else
#ifdef ADVANCED_MICRO_DEVICES__INC_
#pragma OPENCL EXTENSION blah_amd : enable
#endif
#endif

Thanks, Rick, but this is only a partial solution. I try to use the Stream SDK offline compilation, so I get binaries for all the AMD devices at one call to clBuildProgram, so I can provide -D only once. AFAIK, some AMD devices support cl_khr_byte_addressable_store, while some others do not. So it would be nice if cl_khr_byte_addressable_store directive was defined when compiling for the devices with this feature, and was not defined for the others.

rick_weber · ‎08-30-2010

What I generally do is create a cl_program for each device. That way, each device can receive its own definitions.
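One way to pick per-device definitions is to look at each device's extension list before building its program. The C sketch below assumes the space-separated string returned by clGetDeviceInfo with CL_DEVICE_EXTENSIONS has already been fetched; `has_extension`, `byte_store_option` and the `USE_BYTE_STORE` macro are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if `name` appears as a whole token in the space-separated
 * extension list, 0 otherwise. A plain strstr() is not enough, because
 * one extension name can be a prefix of another. */
int has_extension(const char *ext_list, const char *name)
{
    size_t len = strlen(name);
    const char *p = ext_list;

    if (len == 0)
        return 0;

    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == ext_list) || (p[-1] == ' ');
        int ends_token = (p[len] == '\0') || (p[len] == ' ');

        if (starts_token && ends_token)
            return 1;
        p += 1;  /* keep searching after this partial match */
    }
    return 0;
}

/* Choose the build option for one device (hypothetical macro name). */
const char *byte_store_option(const char *ext_list)
{
    return has_extension(ext_list, "cl_khr_byte_addressable_store")
               ? "-DUSE_BYTE_STORE=1"
               : "-DUSE_BYTE_STORE=0";
}
```

Each device's cl_program would then be built with its own option string, and the kernel source can branch on the macro.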

malcolm3141 · ‎08-30-2010

See OpenCL spec section 9.1...

If a compiler supports a given extension, then there will be a define with the same name as the extension.

I haven't checked the AMD compiler to see how it behaves in this respect, but if this isn't so, then I'd suggest submitting a bug report.

Malcolm

genaganna · ‎08-30-2010

You can write some thing like this.

#ifdef cl_amd_printf

#pragma OPENCL EXTENSION cl_amd_printf : enable

#endif

timchist · ‎08-30-2010

Originally posted by: genaganna You can write some thing like this.

#ifdef cl_amd_printf
#pragma OPENCL EXTENSION cl_amd_printf : enable
#endif

That's cool. So cl_xxx are defined even before the extension is enabled by corresponding #pragma?

genaganna · ‎08-30-2010

Originally posted by: timchist
Originally posted by: genaganna You can write some thing like this.

#ifdef cl_amd_printf

#pragma OPENCL EXTENSION cl_amd_printf : enable

#endif

That's cool. So cl_xxx are defined even before the extension is enabled by corresponding #pragma?

Yes, but you still have to enable the extension if you want to use it.

timchist · ‎08-30-2010

Yes, but you still have to enable the extension if you want to use it.

Ok, thanks genaganna, that makes sense. I will try it.

nou · ‎08-30-2010

older nvidia implementations didn't define cl_khr_XXX. i don't know the current state.

MicahVillmow · ‎08-23-2010

rick.weber,
These changes were made to fall in line with both Khronos and OpenCL requirements. We cannot accept the cl_khr_fp64 extension until we pass conformance, which we currently are in the process of doing. In order to allow people to use double precision in OpenCL before we achieve conformance, we have a vendor extension cl_amd_fp64.

Thanks for the second one, we will look into getting that fixed.
