LeeHowes
Staff

Implications of the deprecation of CAL

I understand the concerns about dropping CAL. However, Micah is pointing out that IL can be used through the OpenCL interface. What does CAL give beyond this? How is CUDA any lower-level than OpenCL? CUDA doesn't give you access to the ISA any more than what Micah is describing does - it gives you PTX, which is similar to IL.

Meteorhead
Challenger

Implications of the deprecation of CAL

Could someone explain the hierarchy of all the three-letter abbreviations besides CAL (ELF, ISA, IL, PTX)? CAL is a language similar to OpenCL, somewhat lower-level, with more access to global shared memory and the like. IL is "Intermediate Language", but I do not know where it sits relative to ISA (not to mention PTX). If someone could sum up these abbreviations in 6-10 sentences, I would really appreciate it.

rick_weber
Adept II

Implications of the deprecation of CAL

Both IL and PTX are generic pseudo-assembly languages that can be quickly compiled to device-specific ISA (Instruction Set Architecture: the actual assembly language for a specific device) when kernels are executed. The advantage of this is that you can distribute the IL with your application and it will work regardless of which video card the user actually has. The idea is similar to the bytecode used by just-in-time compilers.

ELF is an executable format, typically used in Linux.

As for the hierarchy:
On ATI:

OpenCL --compiles to-> IL --compiles to-> ISA --links to-> ELF executable

On NVIDIA:

OpenCL --compiles to-> PTX --compiles to-> ISA --links to-> executable format of some kind

CAL itself is not a language but a frontend API for running IL kernels. The argument for its deprecation is that, if you know what you're doing, you can shove whatever IL you want into a binary ELF file, call clCreateProgramWithBinary(), and it will run.

sgratton
Adept I

Implications of the deprecation of CAL

 

Hi everybody,

 

Micah, it's great to hear that ISA access is on the horizon!  I do hope it works out.

 

Lee, except for CUDA 4.0 now allowing inline PTX, I agree that CUDA and OpenCL are at the same level.  However, remember that IL was the only way to go for a long time, and that on AMD platforms there are many restrictions in OpenCL that do not affect CAL.  For example, with CAL I can actually have a matrix on the GPU greater than 256 MB (still "small") rather than having to mess around segmenting it (though I still have to mess around to access it from the CPU), and I can use the whole memory of the card (less 256 MB).  I can use my older cards (3870, 4870) with IL should I choose to, and IL, though limited, works properly (unlike DP math on the 4800 or 6900 series in OpenCL, for example).  CUDA (and very probably OpenCL on Nvidia) doesn't have such problems.

 

My feeling is that for GPU computing to be really worthwhile, one has to be able to get almost optimal performance from a given card within a short while of its appearing, say a month (this is one of CUDA's strengths).  The only way I can see this being possible with AMD is by using ISA, or something like it.  Unfortunately, OpenCL and IL aren't an option, for at least two reasons:

 

1/  At present, the best AMD can do to help people when there is a problem/regression is to update the Catalyst driver, but that takes months, after which your new GPU is getting old, or your old GPU is getting very old (consider for example an issue with burst writing I reported on 4800 cards around Christmas).

 

2/ There is no control over compiler optimization (something that has been asked for for years).  For example, my IL matrix-multiplication-like kernels have been messed up, using far too many registers and getting into a big tangle reading data in inefficiently.  I thought this was partially due to the VLIW5 nature of the older cards, so when I heard about the newer VLIW4 cards I thought I'd try again.  However, that doesn't seem to have made much difference; the ISA generated from IL is still a mess.

 

On to ways of using IL in the future: why should one want to be poking about, in an unsupported manner, with IL images inside an OpenCL binary?  That doesn't sound good.

Similarly, why should one be forced to fiddle with ISA images in a CAL binary?  I've only been doing this as a very last resort, trying to get somewhere close to the potential performance and so justify using GPUs at all.  This shouldn't be the only option left to people.

 

Personally, I think that AMD (and perhaps Nvidia, though as I mentioned their CUDA seems to do much better by itself) needs to provide a simplified "quasi-assembly" language that doesn't (necessarily) get optimized, supports the vector nature of registers, but sorts out all of the tedious stuff (code layout etc.), giving "what you see is what you get", and of course sorts out all the ancillary data hidden in the calprograminfo note.  For example, I'd like to be able to write things like:

...

LOAD(1) R1,R0.x

VECMUL R2,R1,R0.x     

VECADD R4,R3,R2

MUL R5.x,R4.y,R2.z

...

and have these expand into the appropriate VLIW clause with all the decorations.  With about 5 instructions one could do most of linear algebra almost optimally.  The way I could have done this myself would have been via a preprocessor feeding into calAssembleObject(), but the latter doesn't work...

 

I'm sorry if this seems a bit pessimistic, but in the end, despite their potential, neither my 3870 nor my 4870 got close to doing a useful calculation for me in my work; I am still hoping the 6950 doesn't fare similarly.  Perhaps there is a subtle difference between "stream computing" and the new "accelerated parallel processing", the latter being more flexible and needing a higher-level language.  However, I don't mind spending some time (though not as much as I have spent so far!) to get a simple kernel such as matrix multiplication or Cholesky factorization to work really well on the hardware I own.  It'd be nice to be able to do so.

What do you experts at AMD think?  Do you see what I am getting at?  Certainly at least some other users have expressed a similar view. 

Best wishes,

Steven.

  

adm271828
Journeyman III

Implications of the deprecation of CAL

Originally posted by: sgratton  

 

What do you experts at AMD think?  Do you see what I am getting at?  Certainly at least some other users have expressed a similar view.  

 

Hi Steven,

You are not alone, see here: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=149806&enterthread=y

I think it's a pity that AMD is currently unable to make a clear statement about its software vision, especially regarding how to get maximum performance from the hardware. I foresaw the deprecation of CAL+IL, even though at the time AMD was saying it would continue to support IL (not such a long time ago...). But I never imagined they would deprecate it without a working replacement providing low-level access to the hardware... which seems to be the case.

I appreciate Micah's effort to provide workarounds, but directly putting IL code into the ELF looks like a joke for any non-toy project. By the way: should we put the IL in text form into the .amdil section, or in token form in the second embedded ELF file that appears in the .text section for every kernel?

Best regards,

Antoine

Jawed
Adept II

Implications of the deprecation of CAL

Originally posted by: sgratton For example, I can actually have a matrix on the GPU greater than 256 MB (still "small") rather than having to mess around segmenting it (though I still have to mess around to access it from the CPU), and can use the whole memory of the card (less 256 MB).


My OpenCL BLAS library has no problem with image support, using image2D buffers that are >256MB. I can create a single buffer containing many matrices, and multiply arbitrary matrices from within the buffer and write the results to the same buffer. (OpenCL spec says that this isn't supported but it works.)

Or I can take a 512MB matrix and square it and put the result in a new 512MB buffer. All on a 1GB card.

Basically this works the same as my CAL version did.

My feeling is that for GPU computing to be really worthwhile one has to be able to get almost optimum performance from a given card within a short while of its appearing, say a month (this is one of CUDA's strengths).


That's a pipe dream.

The only way I can see this as possible with AMD is by using ISA, or something like it.


ISA is extremely complex. It's essentially impossible to write by hand and has to be generated. e.g. the execution pipeline has to be modelled by the compiler to obey register timing rules (VEC_012 etc.).

OpenCL or IL unfortunately aren't an option, for at least two reasons:   

1/  At present, the best AMD can do to help people when there is a problem/regression is to update the Catalyst driver, but that takes months, after which your new GPU is getting old, or your old GPU is getting very old (consider for example an issue with burst writing I reported on 4800 cards around Christmas).

 

2/ There is no control over compiler optimization (something that has been asked for for years).  For example, my IL matrix-multiplication-like kernels have been messed up, using far too many registers and getting into a big tangle reading data in inefficiently.  I thought this was partially due to the VLIW5 nature of the older cards, so when I heard about the newer VLIW4 cards I thought I'd try again.  However, that doesn't seem to have made much difference; the ISA generated from IL is still a mess.



I agree, these things are a perennial problem. I don't see any solution. My IL matrix-matrix code used to run at 1.75 TFLOPs (this is true multiplication, not one that relies upon A being transposed first, and supports arbitrary matrix sizes) but later Catalyst versions reduced this to ~1.4 TFLOPs.

The OpenCL version of my algorithm simply doesn't work as the compilation is erroneous (wasted tens of hours getting to the bottom of that - I haven't tested with SDK 2.4 yet). So I have to use a naive algorithm whose performance, incidentally, has decreased by a couple of hundred GFLOPs with SDK 2.4 and Catalyst 11.4.

On to ways of using IL in the future: why should one want to be poking about, in an unsupported manner, with IL images inside an OpenCL binary?  That doesn't sound good.


Agreed. Plus the run-time interface of CAL is lost, which has a certain precision to it.

Similarly, why should one be forced to fiddle with ISA images in a CAL binary?  I've only been doing this as a very last resort, trying to get somewhere close to the potential performance and so justify using GPUs at all.  This shouldn't be the only option left to people.


Having worked on things other than BLAS, I think what this boils down to is that most other people's applications are rarely going to get 1% of the optimisation effort that goes into making BLAS stuff work well.

It's worth noting it took a few years for matrix-matrix multiplication performance to get where it's supposed to be on NVidia - early attempts were laughably pitiful. So the idea that it's "easy" to just write an optimal matrix-matrix multiplication on AMD with some close-to-the-metal code is similarly laughable. My original "optimal" IL algorithm has ~8:1 ALU:TEX and the 1.75 TFLOPs it achieved was still a few hundred GFLOPs short of what the hardware is capable of, due to poor compilation.

AMD's OpenCL compilers are still trying to achieve correctness. At the same time the complexity of optimisation (GPR count versus instruction count versus VLIW woe) is causing absolute performance to vary. This effectively undermines any attempt at hyper-optimisation.

I suspect AMD's motive for removing CAL/IL support is a simplification of effort for the new chips as they arrive. The quality of the Cayman technical documentation is very poor, clearly indicating a rushed job.

MicahVillmow
Staff

Implications of the deprecation of CAL

adm271828,
Put it in text form in the .amdil section and use the compiler options to strip out the other sections. Emitting your own OpenCL binary is actually very useful when targeting OpenCL from a non-OpenCL language. Think about a Fortran-to-OpenCL compiler: if you can emit your own binary, you don't have to do source-to-source translation and can generate IL directly, which gives you more options for optimization control.

Jawed,
If you have issues with the Cayman docs, please let us know. I work with the documentation writer all the time to get issues fixed. Also, if you know of any performance regressions that we can test, I'd be interested in hearing about them. We try not to regress on performance, but unless we have a test binary, we can't guarantee it.
clamport
Journeyman III

Implications of the deprecation of CAL

Where might I find examples of this?

Thanks!

sgratton
Adept I

Implications of the deprecation of CAL

 

Hi everybody,

 

Thanks for your comments.  Thanks, Antoine, for the link to the thread you mention; there is lots of interesting, relevant stuff there.  Jawed, you are of course right that it would be very tedious to have to deal with VEC_012 etc. by hand; this is something I too would like to have sorted out automatically!  However, control flow and TEX instructions are all pretty straightforward.  Something like unoptimized IL, respecting the ordering of TEX and ALU instructions, would be a good start.  I also agree that achieving optimal performance within a month is not likely to happen; rather, I feel that the available tools should not stand in the way of this goal.

 

Perhaps I am out of date with OpenCL, having until recently been using 4800-class hardware:  does one still have to adjust GPU_MAX_HEAP_SIZE to use all the memory on the card?  And what about the CL_DEVICE_MAX_MEM_ALLOC_SIZE limit?
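For context, GPU_MAX_HEAP_SIZE was an environment variable read by AMD's OpenCL runtime rather than an API setting; a sketch of how it was typically set (the percentage interpretation is as commonly described for drivers of that era and is not guaranteed to match any particular Catalyst/SDK version):

```shell
# Driver-dependent knob for AMD's OpenCL stack of this era (semantics vary by
# Catalyst/SDK version; treat the percentage interpretation as unverified):
export GPU_MAX_HEAP_SIZE=100   # portion of card memory exposed to OpenCL, in percent

# The per-allocation ceiling is a device query, not an environment variable:
# clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, ...)
```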

 

Best wishes,

Steven.

 

Jawed
Adept II

Implications of the deprecation of CAL

Image buffers give you access to essentially all the memory, though a single buffer can't be as large as the card's memory. I've never used those environment variables.

The memory allocation size restriction problems people have are with normal buffers as opposed to image buffers.

Micah, some examples of errors in Revision 1.0 (February 2011) of the HD 6900 Series Instruction Set Architecture manual:

  1. Section 2.3 erroneously includes "local data share" as a type of clause, a feature that is restricted to R7xx GPUs
  2. Table 2.7 says that an ALU clause contains 5 ALU_WORDs
  3. Section 4.3, second sentence, repeats the error relating to 5 ALU instructions
  4. Section 4.4 doesn't present the algorithm it says it is presenting.