LeeHowes
Staff

Implications of the deprecation of CAL

I understand the concerns about dropping CAL. However, Micah is pointing out that IL can be used through the OpenCL interface. What does CAL give beyond this? How is CUDA any lower-level than OpenCL? CUDA doesn't give you access to the ISA any more than what Micah is describing does - it gives you PTX, which is similar to IL.

Meteorhead
Challenger

Implications of the deprecation of CAL

Could someone explain the hierarchy of all the three-letter abbreviations besides CAL (ELF, ISA, IL, PTX)? CAL is a language similar to OpenCL, somewhat lower-level, with more access to global shared memory and the like. IL is "Intermediate Language", but I do not know where it sits relative to ISA (not to mention PTX). If someone could sum up these abbreviations in 6-10 sentences, I would really appreciate it.

rick_weber
Adept II

Implications of the deprecation of CAL

Both IL and PTX are generic pseudo-assembly languages that can be quickly compiled to device-specific ISA (Instruction Set Architecture: the actual assembly language for a specific device) when kernels are executed. The advantage of this is that you can distribute the IL with your application and it will work regardless of which video card the user actually has. The idea is similar to the bytecode used by just-in-time compilers.

ELF is an executable format, typically used in Linux.

As for the hierarchy:
On ATI:

OpenCL --compiles to-> IL --compiles to-> ISA --links to-> ELF executable

On NVIDIA:

OpenCL --compiles to-> PTX --compiles to-> ISA --links to-> executable format of some kind

CAL itself is not a language but a frontend API for running IL kernels. The argument for its deprecation is that, if you know what you're doing, you can shove whatever IL you want into a binary ELF file, call clCreateProgramWithBinary(), and it will run.

sgratton
Adept I

Implications of the deprecation of CAL

 

Hi everybody,

 

Micah, it's great to hear that ISA access is on the horizon!  I do hope it works out.

 

Lee, except for CUDA 4.0 now allowing inline PTX, I agree that CUDA and OpenCL are at the same level.  However, remember that IL was the only way to go for a long time, and that on AMD platforms there are many restrictions in OpenCL that do not affect CAL.  For example, with CAL I can actually have a matrix on the GPU greater than 256 MB (still "small") rather than having to mess around segmenting it (though I still have to mess around to access it from the CPU), and I can use the whole memory of the card (less 256 MB).  I can use my older cards (3870, 4870) with IL should I choose to, and IL, though limited, works properly (unlike DP math on the 4800 or 6900 series in OpenCL, for example).  CUDA (and very probably OpenCL on Nvidia) doesn't have such problems.

 

My feeling is that for GPU computing to be really worthwhile, one has to be able to get almost optimal performance from a given card within a short while of its appearing, say a month (this is one of CUDA's strengths).  The only way I can see this being possible with AMD is by using ISA, or something like it.  Unfortunately, OpenCL and IL aren't an option, for at least two reasons:

 

1/  At present, the best AMD can do to help people when there is a problem/regression is to update the Catalyst driver, but that takes months, after which your new GPU is getting old, or your old GPU is getting very old (consider for example an issue with burst writing I reported on 4800 cards around Christmas).

 

2/ There is no control over compiler optimization (something that has been asked for for years).  For example, my IL matrix-multiplication-like kernels have been messed up, using far too many registers and getting into a big tangle reading data in inefficiently.  I thought this was partially due to the VLIW5 nature of the older cards, so when I heard about the newer VLIW4 cards I thought I'd try again.  However, that doesn't seem to have made much difference; the ISA generated from IL is still a mess.

 

On to ways of using IL in the future: why should one want to be poking about, in an unsupported manner, with IL images inside an OpenCL binary?  That doesn't sound good.

Similarly, why should one be forced to fiddle with ISA images in a CAL binary?  I've only been doing this as a very last resort, trying to get somewhere close to the potential performance and so justify using GPUs at all.  This shouldn't be the only option left to people.

 

Personally, I think that AMD (and perhaps Nvidia, though as I mentioned their CUDA seems to do much better by itself) needs to provide a simplified "quasi-assembly" language that doesn't (necessarily) get optimized, supports the vector nature of registers, but sorts out all of the tedious stuff (code layout etc.), giving "what you see is what you get", and of course sorts out all the ancillary data hidden in the calprograminfo note.  For example, I'd like to be able to write things like:

...

LOAD(1) R1,R0.x

VECMUL R2,R1,R0.x     

VECADD R4,R3,R2

MUL R5.x,R4.y,R2.z

...

and have these expand into the appropriate VLIW clause with all the decorations.  With about 5 instructions one could do most of linear algebra almost optimally.  The way I could have done this myself would have been via a preprocessor feeding into calAssembleObject(), but the latter doesn't work...

 

I'm sorry if this seems a bit pessimistic, but in the end, despite their potential, neither my 3870 nor my 4870 got close to doing a useful calculation for me in my work; I am still hoping the 6950 doesn't fare similarly.  Perhaps there is a subtle difference between "stream computing" and the new "accelerated parallel processing", the latter being more flexible and needing a higher-level language.  However, I don't mind spending some time (though not as much as I have spent so far!) to get a simple kernel such as matrix multiplication or Cholesky factorization to work really well on the hardware I own.  It'd be nice to be able to do so.

What do you experts at AMD think?  Do you see what I am getting at?  Certainly at least some other users have expressed a similar view. 

Best wishes,

Steven.

  

adm271828
Journeyman III

Implications of the deprecation of CAL

Originally posted by: sgratton  

 

What do you experts at AMD think?  Do you see what I am getting at?  Certainly at least some other users have expressed a similar view.  

 

Hi Steven,

You are not alone, see here: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=149806&enterthread=y

I think it's a pity that AMD is currently unable to make a clear statement about its software vision, especially regarding how to get maximum performance from the hardware. I foresaw the deprecation of CAL+IL, even though at the time AMD was saying it would continue to support IL (not such a long time ago...). But I never imagined they would deprecate it without a working replacement providing low-level access to the hardware... which seems to be the case.

I appreciate Micah's effort to provide workarounds, but directly putting IL code into the ELF looks like a joke for any non-toy project. By the way: should we put the IL in text form into the .amdil section, or in token form in the second embedded ELF file that appears in the .text section for every kernel?

Best regards,

Antoine

Jawed
Adept II

Implications of the deprecation of CAL

Originally posted by: sgratton For example, I can actually have a matrix on the GPU greater than 256 MB (still "small") rather than having to mess around segmenting it (though I still have to mess around to access it from the CPU), and can use the whole memory of the card (less 256 MB).


My OpenCL BLAS library has no problem with image support, using image2D buffers that are >256MB. I can create a single buffer containing many matrices, and multiply arbitrary matrices from within the buffer and write the results to the same buffer. (OpenCL spec says that this isn't supported but it works.)

Or I can take a 512MB matrix and square it and put the result in a new 512MB buffer. All on a 1GB card.

Basically this works the same as my CAL version did.

My feeling is that for GPU computing to be really worthwhile one has to be able to get almost optimum performance from a given card within a short while of its appearing, say a month (this is one of CUDA's strengths).


That's a pipe dream.

The only way I can see this as possible with AMD is by using ISA, or something like it.


ISA is extremely complex. It's essentially impossible to write by hand and has to be generated. e.g. the execution pipeline has to be modelled by the compiler to obey register timing rules (VEC_012 etc.).

OpenCL or IL unfortunately aren't an option, for at least two reasons:   

1/  At present, the best AMD can do to help people when there is a problem/regression is to update the Catalyst driver, but that takes months, after which your new GPU is getting old, or your old GPU is getting very old (consider for example an issue with burst writing I reported on 4800 cards around Christmas).

 

2/ There is no control over compiler optimization (something that has been asked for for years).  For example, my IL matrix-multiplication-like kernels have been messed up, using far too many registers and getting into a big tangle reading data in inefficiently.  I thought this was partially due to the VLIW5 nature of the older cards, so when I heard about the newer VLIW4 cards I thought I'd try again.  However, that doesn't seem to have made much difference; the ISA generated from IL is still a mess.



I agree, these things are a perennial problem. I don't see any solution. My IL matrix-matrix code used to run at 1.75 TFLOPs (this is true multiplication, not one that relies upon A being transposed first, and supports arbitrary matrix sizes) but later Catalyst versions reduced this to ~1.4 TFLOPs.

The OpenCL version of my algorithm simply doesn't work as the compilation is erroneous (wasted tens of hours getting to the bottom of that - I haven't tested with SDK 2.4 yet). So I have to use a naive algorithm whose performance, incidentally, has decreased by a couple of hundred GFLOPs with SDK 2.4 and Catalyst 11.4.

On to ways of using IL in the future: why should one want to be poking about, in an unsupported manner, with IL images inside an OpenCL binary?  That doesn't sound good.


Agreed. Plus the run-time interface of CAL is lost, which has a certain precision to it.

Similarly, why should one be forced to fiddle with ISA images in a CAL binary?  I've only been doing this as a very last resort, trying to get somewhere close to the potential performance and so justify using GPUs at all.  This shouldn't be the only option left to people.


Having worked on things other than BLAS, I think what this boils down to is that most other people's applications are rarely going to get 1% of the optimisation effort that goes into making BLAS stuff work well.

It's worth noting it took a few years for matrix-matrix multiplication performance to get where it's supposed to be on NVidia - early attempts were laughably pitiful. So the idea that it's "easy" to just write an optimal matrix-matrix multiplication on AMD with some close-to-the-metal code is similarly laughable. My original "optimal" IL algorithm has ~8:1 ALU:TEX and the 1.75 TFLOPs it achieved was still a few hundred GFLOPs short of what the hardware is capable of, due to poor compilation.

AMD's OpenCL compilers are still trying to achieve correctness. At the same time the complexity of optimisation (GPR count versus instruction count versus VLIW woe) is causing absolute performance to vary. This effectively undermines any attempt at hyper-optimisation.

I suspect AMD's motive for removing CAL/IL support is a simplification of effort for the new chips as they arrive. The quality of the Cayman technical documentation is very poor, clearly indicating a rushed job.

MicahVillmow
Staff

Implications of the deprecation of CAL

adm271828,
Put it in text form in the .amdil section and use the compiler options to strip out the other sections. Emitting your own OpenCL binary is actually very useful when targeting OpenCL from a non-OpenCL language. Think about a Fortran-to-OpenCL compiler: if you can emit your own binary, you don't have to do source-to-source translation and can generate IL directly, which gives you more options for optimization control.

Jawed,
If you have issues with the Cayman docs, please let us know. I work with the documentation writer all the time to get issues fixed. Also, if you know of any performance regressions that we can test, I'd be interested in hearing about them. We try not to regress on performance, but unless we have a test binary, we can't guarantee it.
clamport
Journeyman III

Implications of the deprecation of CAL

Where might I find examples of this?

Thanks!

sgratton
Adept I

Implications of the deprecation of CAL

 

Hi everybody,

 

Thanks for your comments.  Thanks, Antoine, for the link to the thread you mention; there is lots of interesting, relevant stuff there.  Jawed, you are of course right that it would be very tedious to have to deal with VEC_012 etc. by hand; this is something I too would like to have sorted out automatically!  However, control flow and TEX instructions are all pretty straightforward.  Something like unoptimized IL, respecting the ordering of TEX and ALU instructions, would be a good start.  I also agree that achieving optimal performance within a month is not likely to happen; rather, I feel that the available tools should not stand in the way of this goal.

 

Perhaps I am out of date with OpenCL, having until recently been using 4800-class hardware:  does one still have to adjust GPU_MAX_HEAP_SIZE to use all the memory on the card?  And what about the CL_DEVICE_MAX_MEM_ALLOC_SIZE limit?
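For context, GPU_MAX_HEAP_SIZE was an environment variable read by AMD's OpenCL runtime rather than an API setting; a sketch of how it was typically set (the percentage interpretation is as commonly described for drivers of that era and is not guaranteed to match any particular Catalyst/SDK version):

```shell
# Driver-dependent knob for AMD's OpenCL stack of this era (semantics vary by
# Catalyst/SDK version; treat the percentage interpretation as unverified):
export GPU_MAX_HEAP_SIZE=100   # portion of card memory exposed to OpenCL, in percent

# The per-allocation ceiling is a device query, not an environment variable:
# clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, ...)
```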

 

Best wishes,

Steven.

 

Jawed
Adept II

Implications of the deprecation of CAL

Image buffers give you access to essentially all the memory, though a single buffer can't be as large as the card's memory. I've never used those environment variables.

The memory allocation size restriction problems people have are with normal buffers as opposed to image buffers.

Micah, some examples of errors in Revision 1.0 (February 2011) of the HD 6900 Series Instruction Set Architecture manual:

  1. Section 2.3 erroneously includes "local data share" as a type of clause, a feature that is restricted to R7xx GPUs
  2. Table 2.7 says that an ALU clause contains 5 ALU_WORDs
  3. Section 4.3, second sentence, repeats the error relating to 5 ALU instructions
  4. Section 4.4 doesn't present the algorithm it says it is presenting.