CAL/IL is a more mature development API, but it is also a lot lower level and harder to both use and debug. CAL and IL go hand in hand: there is currently no way to program in IL except through CAL, and CAL currently accepts either ISA or IL.
Brook+ is a high level development model that abstracts away much more of the hardware/driver interface and handles much of the setup code for you in the Brook+ runtime/compiler.
That being said, CAL programs take longer and are more difficult to develop, but CAL is currently the more mature stack. Brook+ has undergone some major work in the past few months and is constantly improving.
CAL also has the advantage of getting new hardware features implemented first and can be used to get the maximum performance out of the hardware, but again at the cost of implementation and debugging time.
I'm sure others can chime in with more details.
Just to comment on my own experience:
I started with Brook several months ago. I found the C-like syntax particularly attractive, along with how quickly a GPU application can be prototyped.
However, the start was difficult, mostly (at that time) because of incomplete documentation and several major bugs in the compiler, the run-time and the drivers...
Having said that, I did enjoy developing some basic routines (i.e. I built up a sort of toolbox for later use). This went on until the October/November time frame, when I decided to put development on hold and wait for the new December SDK release, since I couldn't stand dealing any longer with pending bugs, blue screens, and the like.
When the current release of the SDK showed up, I found it more stable than the previous ones... but unfortunately some show-stopping (still unexplained) bugs prevented me from completing my development.
At this point I hesitated, checked alternatives (i.e. CUDA), and then decided to go deeper into the Brook environment and, moreover, to dig into CAL, since my understanding was that it would give me, if not more control, at least a better understanding of what is going on under the hood.
My decision was a good one, I don't regret it!
Since then I have been programming in CAL, except for the kernels, for which I use the Brook compiler.
Of course, a lot of the preparation work that Brook does automatically (stream declarations, binding, compilation and so on) you now have to do yourself... so I ended up writing some quick tools to ease the task of coding in CAL, especially for repetitive clerical tasks.
So I'm basically reinventing part of the work already done for Brook... But that's OK, at least I know what is going on...
By doing so, I was not only able to continue my code development, but I also gained a very substantial increase in performance.
I still have plenty of questions about how to make better use of the CAL runtime environment, and I do hope our friends at AMD will continue to provide a better environment with each release.
- Yes: Brook is a good tool for quick prototyping
- Moving to CAL while still using Brook for kernel compilation is, as of today, the approach that appears the most satisfying
- Digging into IL is in my opinion not worth it, since you would spend a lot of time getting into the nitty-gritty of the kernel code, and I'm quite sure our AMD friends are doing their best to optimize their GPU compilers.
That briefly sums up my own experience, and I'd be interested in hearing from other users.
Regards from Paris
Your English is better than mine, and I'm from the United States. I'm currently working on a chemistry application on the GPU, comparing performance across various platforms (Brook+ and CAL are two of them) and looking at what kinds of optimizations are needed to improve performance. For this particular application, I have found the naive CAL kernel to be faster than the most optimized Brook+ kernel (this measures kernel execution only; data transfers are not included). I wholly agree with everything you said about understanding the underlying details of the system. For example, in CAL with a FireStream 9170, the maximum size of any buffer dimension is 8192, which is a hardware limitation. Brook+ allows streams with larger dimensions, which means Brook+ must generate code that maps a larger stream onto a 2D stream or possibly onto multiple streams, and that mapping likely incurs overhead.
In short, to perform optimizations on any platform, you need a robust understanding of that platform, and CAL provides a way to experiment and better understand the platform.
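To make the stream-mapping point concrete, here is a minimal sketch of the kind of index arithmetic that folding a long 1D stream into a 2D buffer requires. It is written in Python purely for illustration; it is not Brook+ or CAL code, and I have not checked what Brook+ actually generates. The function names (`shape_for`, `to_2d`, `to_1d`) are hypothetical, and the only figure taken from the discussion is the 8192 per-dimension limit.

```python
MAX_DIM = 8192  # per-dimension buffer limit quoted for the FireStream 9170


def shape_for(n, max_dim=MAX_DIM):
    """Return (width, height) of a 2D buffer able to hold n elements."""
    if n <= 0:
        raise ValueError("n must be positive")
    width = min(n, max_dim)
    height = -(-n // width)  # ceiling division
    if height > max_dim:
        # Would need several buffers; that case is not handled in this sketch.
        raise ValueError("does not fit in a single 2D buffer")
    return width, height


def to_2d(i, width):
    """Map a linear element index to (x, y) coordinates in the 2D buffer."""
    return i % width, i // width


def to_1d(x, y, width):
    """Inverse mapping: (x, y) coordinates back to the linear index."""
    return y * width + x


if __name__ == "__main__":
    w, h = shape_for(20000)
    print(w, h)             # 8192 3
    print(to_2d(19999, w))  # (3615, 2)
```

Every access to the logical 1D stream pays for one modulo and one division (or the equivalent), which is the sort of hidden cost you only notice once you know the hardware limit.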
Since you said that you are getting started, I would highly suggest Brook+. It is a higher level language than CAL and CUDA, and one of the easiest ways to pick up the GPGPU ideas.
Later, when you are beyond the getting started stage, you can try what jean-claude mentioned -- use CAL, with Brook+ kernels to generate IL.
Thank you for the fast and detailed responses. I'm sure I'll be back with more questions!
Brook+ is easier to use and takes less development time than coding kernels in IL. If you are coding a complex application with many kernels that require large amounts of computation, IL is going to take a long time to develop and be very hard to debug (it's essentially assembly, though not really). Even in Brook+, large problems with multiple kernels are not easy, especially when you are trying to get optimum performance.
1. udeepta, I'd be interested to know what makes Brook+ a higher level language than CUDA?
2. jean-claude, you have seen significant performance increases just from using CAL without tweaking the IL kernels generated by Brook+?
Well, getting into CAL has given me (somewhat) more visibility into what is going on under the hood.
Since then I've enjoyed better performance, mostly because I've been able to closely monitor memory exchanges between CPU and GPU, as well as task assignment. Depending on the task, I would put the gain at 10 to 20x.
So the improvement is partly due to better using the CPU...
On the other hand, going to CAL was the only way for me to escape this damn bug related to multi-loop kernels in Brook (I'm still waiting for a fix for it, BTW...)
Now, with respect to IL tweaking: I've been using the code generated by Brook as is. More precisely, from time to time I have a look at what Brook generates and try to modify my Brook C-level source in order to "help" the compiler produce better code.
Several times I've been horrified by what the Brook compiler generates, so I think there is still a lot of room for improvement for our AMD friends on this.
Nevertheless, I have neither the time nor the intention to dig into the intricacies of IL. I really think it could be a waste of time (and moreover I'm not interested in working at register level on the GPU). So let's cross our fingers and expect incremental improvements from the next SDK releases.
If you are using Brook+ for kernels and you get a 10-20x speedup just from using CAL for memory transfers, then it seems that CPU->GPU/GPU->CPU transfers are a large part of your code, no?
For me, they are a very small part so I'm not sure that using CAL (without tweaking the kernels) would impart much of a performance increase.
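As a rough sanity check on that reasoning, here is a small Amdahl's-law sketch in Python. It assumes, as a simplification, that the whole gain comes from faster transfers (jean-claude also credits better use of the CPU, which this model ignores). The function names and the 5% figure are hypothetical; only the 10-20x range comes from the thread.

```python
def overall_speedup(transfer_fraction, transfer_speedup):
    """Amdahl's law: only the transfer share of the runtime gets faster."""
    if not 0.0 <= transfer_fraction <= 1.0:
        raise ValueError("transfer_fraction must be between 0 and 1")
    if transfer_speedup <= 0:
        raise ValueError("transfer_speedup must be positive")
    remaining = (1.0 - transfer_fraction) + transfer_fraction / transfer_speedup
    return 1.0 / remaining


def min_transfer_fraction(target_speedup):
    """Smallest transfer share that allows target_speedup even if
    transfers became free (the limit of overall_speedup)."""
    if target_speedup < 1.0:
        raise ValueError("target_speedup must be at least 1")
    return 1.0 - 1.0 / target_speedup


if __name__ == "__main__":
    for s in (10, 20):
        print(s, round(min_transfer_fraction(s), 3))
    # Hypothetical case: transfers are 5% of the runtime and become 100x faster.
    print(round(overall_speedup(0.05, 100.0), 3))
```

Under this model, a 10x overall gain needs transfers to account for at least 90% of the original runtime, and a 20x gain at least 95%. If transfers are only a small share, say 5%, even eliminating them entirely buys about 5%.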
Also, a question for you: which Brook+ multi-loop kernel bug do you mean?
I am currently iterating over my kernels 1000+ times and it seems that "output".error() fixes the leaks/whatever. Have you noticed something else?
CPU-GPU exchanges and synchronisation are definitely key to getting the most out of GPU co-computing.
My application is video processing, so I'm grabbing a frame every 1/25th of a second (European PAL, yes...). Some pre- and post-processing has to be performed by the CPU in any case, which is why it is vital to make sure that the exchanges are fast (even hidden through DMA) and that parallel activity of the CPU and GPU is exploited to its maximum!
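To put numbers on the 25 frames per second constraint, here is a back-of-the-envelope sketch in Python. The stage timings in the example are made up for illustration, the function names are hypothetical, and the overlapped case assumes an idealised pipeline in which CPU work, transfers and GPU work on different frames overlap perfectly, with no synchronisation cost.

```python
FRAME_RATE_HZ = 25                        # PAL frame rate
FRAME_BUDGET_MS = 1000.0 / FRAME_RATE_HZ  # time available per frame


def serial_frame_time(cpu_ms, transfer_ms, gpu_ms):
    """Per-frame time when the stages run one after another."""
    return cpu_ms + transfer_ms + gpu_ms


def overlapped_frame_time(cpu_ms, transfer_ms, gpu_ms):
    """Steady-state per-frame time when the stages overlap on different
    frames: throughput is set by the slowest stage."""
    return max(cpu_ms, transfer_ms, gpu_ms)


def meets_budget(frame_time_ms):
    """True if the pipeline keeps up with the incoming frame rate."""
    return frame_time_ms <= FRAME_BUDGET_MS


if __name__ == "__main__":
    stages = (15.0, 10.0, 25.0)  # hypothetical CPU, transfer, GPU times in ms
    print(FRAME_BUDGET_MS)
    print(serial_frame_time(*stages), meets_budget(serial_frame_time(*stages)))
    print(overlapped_frame_time(*stages), meets_budget(overlapped_frame_time(*stages)))
```

With these invented timings the serial version needs 50 ms per frame and misses the 40 ms budget, while the overlapped version is limited by the 25 ms GPU stage and keeps up. Overlap improves throughput only; the latency of a single frame is still the sum of its stages.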
With respect to the multi-loop bug, I haven't tried the "output.error" workaround, since I had already moved to CAL before that suggestion came up. In any case, I don't think this fix is satisfying, since once again minimizing CPU-GPU interactions is the name of the game. You can't simply have your CPU asking what is going on... the synchronisation scheme has to be event based!
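The event-based idea can be sketched with ordinary threads. This is a Python illustration only, not CAL or Brook+ code: a worker thread stands in for the GPU kernel, `threading.Event` stands in for whatever completion signal the runtime provides, and `run_kernel` and `launch_and_wait` are hypothetical names.

```python
import threading


def run_kernel(output, done):
    """Stand-in for a GPU kernel: produce a result, then signal completion."""
    output.append(sum(range(1000)))
    done.set()


def launch_and_wait(timeout_s=5.0):
    """Launch the stand-in kernel and sleep until it signals completion."""
    output = []
    done = threading.Event()
    worker = threading.Thread(target=run_kernel, args=(output, done))
    worker.start()
    # The CPU could do its own pre/post-processing here instead of polling.
    finished = done.wait(timeout=timeout_s)  # blocks without a busy loop
    worker.join()
    return finished, output


if __name__ == "__main__":
    print(launch_and_wait())
```

The point of the design is that the waiting side never asks "are you done yet?" in a loop; it is woken once, when the producer signals.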
Yes, it seems that you are doing quite a bit of transferring for your application, so using CAL directly is obviously saving you quite a bit of time. This is a good thing for anyone who is doing what you are doing to consider. Thanks for the input.
From what gaurav has said, and from what I have noticed, the "output".error() call does not add much overhead (obviously some, but not much at all). I saw no timing increases when adding more of these calls to my code. That said, I agree that it shouldn't be necessary; it is a bug and needs to be fixed.