I'm just getting started on GPGPU programming.
I see a lot of bug reports on these forums, particularly to do with Brook+.
I'd be interested to know people's opinions of programming in CAL (or, if I were really desperate, IL).
How does the reliability of the product compare to Brook+?
How does the development time compare to Brook+?
Are there any other advantages/disadvantages to programming in CAL over Brook+?
Cheers
Charlie.
Hi Charlie,
Just to comment on my own experience:
I started with Brook several months ago. I found the C-like syntax particularly attractive, along with how easily you can get a GPU application prototyped.
However, the start was difficult, mostly (at the time) because of incomplete documentation and several major bugs in the compiler, the runtime and the drivers...
That said, I did enjoy developing some basic routines (i.e. I established a sort of toolbox for further use). This went on until the October/November time frame, when I decided to put development on hold and wait for the new December SDK release, since I could no longer stand dealing with outstanding bugs, blue screens and other annoyances.
When the current release of the SDK showed up, I found it more stable than the previous ones... but unfortunately some show-stopping (still unexplained) bugs prevented my development from being completed.
At this point I hesitated, checked alternatives (i.e. CUDA), and then decided to go deeper into the Brook environment and, moreover, to dig into CAL, since my understanding was that it would give me, if not more control, at least a better understanding of what is going on under the hood.
My decision was a good one, I don't regret it!
Since then I've been programming in CAL, except for the kernels, for which I'm using the Brook compiler.
Of course, a lot of the preparation work that is done automatically in Brook (stream declarations, binding, compilation and so on) you now have to do yourself... so I've ended up writing some quick tools to ease the task of coding in CAL, especially for the repetitive clerical parts.
So I'm basically reinventing part of the work already done for Brook... but that's OK; at least I know what is going on...
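For the curious, the per-kernel "clerical" work I'm talking about roughly follows the outline below. This is a from-memory sketch of the CAL 1.x call sequence, not copy-paste-ready code; do check the SDK headers for the exact signatures, and the resource names (i0, o0, cb0) depend on what your kernel declares.

```
calInit()
calDeviceOpen(&device, 0)
calCtxCreate(&ctx, device)

/* compile the IL produced by the Brook compiler */
calclCompile(&obj, CAL_LANGUAGE_IL, il_source, target)
calclLink(&image, &obj, 1)
calModuleLoad(&module, ctx, image)
calModuleGetEntry(&entry, ctx, module, "main")

/* allocate and bind resources by name */
calResAllocLocal2D(&res, device, width, height, format, 0)
calCtxGetMem(&mem, ctx, res)
calModuleGetName(&name, ctx, module, "i0")
calCtxSetMem(ctx, name, mem)

/* run over the domain and wait for completion */
calCtxRunProgram(&event, ctx, entry, &domain)
while (calCtxIsEventDone(ctx, event) == CAL_RESULT_PENDING)
    ;
```

Brook does all of the above for you behind a single kernel call, which is exactly why wrapping the repetitive parts in small helper tools pays off quickly.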
In doing so, I was not only able to pursue my code development, but on top of that I gained a very substantial performance increase.
I still have plenty of questions about how to make better use of the CAL runtime environment, and I do hope our friends at AMD will continue to improve the environment with each release.
Bottom line:
- Yes: Brook is a good tool for quick prototyping
- Entering into CAL and still using Brook for kernel compilation is as of today the approach that appears the most satisfying
- Digging into IL is, in my opinion, not worth it: you would spend a lot of time on the nitty-gritty of the kernel code, and I'm quite sure our AMD friends are doing their best to optimize their GPU compilers.
That quickly summarizes my own experience, and I'd be interested in hearing from other users.
Regards from Paris
Jean-Claude
Jean Claude,
Your English is better than mine, and I'm from the United States. I'm currently working on a chemistry application on the GPU, comparing performance between various platforms (Brook+ and CAL are two of them) and looking at what kinds of optimizations one needs to perform in order to improve performance. For this particular application, I have found the naive CAL kernel to be faster than the most optimized Brook+ kernel (this is only a measure of execution; data transfers are not included). I wholly agree with everything you said about understanding the underlying details of the system. For example, in CAL with a FireStream 9170, the maximum dimension of any buffer is 8192, which is a hardware limitation. Brook+ allows larger-dimension streams, but knowing the previous fact means that Brook+ must generate code that maps a larger stream onto a 2D stream, or possibly multiple streams. Knowing this means that there is likely overhead incurred in performing this mapping.
In short, to perform optimizations on any platform, you need a robust understanding of that platform, and CAL provides a way to experiment and better understand the platform.
thesquiff,
Since you said that you are getting started, I would highly suggest Brook+. It is a higher-level language than CAL and CUDA, and one of the easiest ways to pick up GPGPU ideas.
Later, when you are beyond the getting-started stage, you can try what jean-claude mentioned: use CAL, with Brook+ compiling the kernels to IL.
Thank you for the fast and detailed responses. I'm sure I'll be back with more questions!
Hi Ryta,
Well, getting into CAL has given me somewhat more visibility into what is going on under the hood.
Since then I've enjoyed better performance, mostly because I've been able to closely monitor memory exchanges between CPU and GPU, and task assignment. Depending on the task, I would quantify the gain at 10 to 20x.
So the improvement is partly due to making better use of the CPU...
On the other hand, going to CAL was the only way for me to escape this damn bug related to multi-loop kernels in Brook (I'm still expecting a fix for it, BTW...)
Now, with respect to IL tweaking: I've been using the code generated by Brook as-is. More precisely, from time to time I have a look at what Brook generates and try to modify my Brook C-level source in order to "help" the compiler produce better code.
Several times I've been horrified by what the Brook compiler generates, so I think there is still a lot of room for improvement for our AMD friends here.
Nevertheless, I have neither the time nor the intention to dig into IL intricacies; I really think it could be a waste of time (and moreover, I'm not interested in working at the register level on the GPU). So let's cross our fingers and expect incremental improvements from the next SDK releases.
Jean-Claude
CPU-GPU exchanges and synchronisation are definitely key to getting the most out of GPU co-computing.
My application is video processing, so I'm grabbing a frame every 1/25th of a second (European PAL, yes...). Some pre- and post-processing has to be performed by the CPU in any case, which is why making sure that the exchanges are fast (even hidden through DMA) and that the parallel activity of CPU and GPU is exploited to its maximum is vital!
With respect to the multi-loop bug, I've not tried the "output.error" approach, since I had already moved to CAL before that suggestion came up. In any case, I don't think that fix is satisfying, since once again minimizing CPU-GPU interactions is the name of the game. You can't simply have your CPU keep asking what is going on... the synchronisation scheme ought to be event based!