- Low-level hardware access
- Tuning ability (assembler)
- Lower latency compared to MS D3D (my own opinion)
- Brook+ has some bugs. I haven't used it for a year, and while some things have changed since, I still prefer my own abstraction layer over the CAL library and HLSL.
- With CAL you can use the full power of the hardware, but it will cost you more of your time than CUDA.
My library is described here: http://justanotherblog565.blogspot.com/2009/01/blog-post_12.html The blog is in Russian, but I think C++ is an international language))).
1. low level kernel coding (CAL)
2. high level kernel coding (Brook+)
3. The only current non-graphics (I use that term loosely for AMD) GPGPU solution for AMD cards.
1. Brook+ and CAL seem to be buggy, Brook+ more so than CAL.
2. IL/ISA kernels are difficult to code; Brook+ is easier to write, but much harder to get good performance from.
3. AMD insists on sticking to graphics terminology, though it is slowly coming around to compute terminology.
4. The streaming programming model makes things unnecessarily difficult and has limitations other solutions don't have.
5. AMD has not yet given a clear picture of their hardware/performance model, making it difficult to optimize.
6. Lacks proper tools for optimization.
Thanks everyone. I've been working with GPGPU applications for a while now, but I haven't made it into coding anything serious yet.
I guess development without debuggers, etc. makes it a less attractive option. That, and if the high-level language has a few bugs, it becomes harder for 'casual' coders to get into, since the lower-level CAL won't be as easy to use.
I get the impression, though, that the ATI hardware is more capable. I heard on the forums that there are more double-precision units on the ATI cards, but I don't know whether that's true or not.
I still want to give Brook+ a go anyway, but at least now, thanks to you guys, I have some idea of how difficult it may be. To hijack my own thread: can I ask why it only compiles to C++ code? Could you have that as an optional flag, and just call g++ underneath (or whichever C++ compiler you're more familiar with)?
Originally posted by Blaize: I get the impression, though, that the ATI hardware is more capable. I heard on the forums that there are more double-precision units on the ATI cards, but I don't know whether that's true or not.
That's definitely true. The latest NVIDIA cards (GT200) can generate only 30 double results per clock, while an RV770 (HD48x0) can do 160 double results per clock. With adds there should even be the potential for 320 results (two ADD_64 per VLIW unit), but for some reason the current compiler does not pack two ADD_64 into one ALU clause (though it is supposed to be possible according to the documentation). This equals a peak performance of 240 GFlops for either MADD_64 or ADD_64 instructions, and 120 GFlops for pure multiplications, on an HD4870.
On my kernels I get about 150 GFlops throughput with doubles on an HD4870, whereas a GTX280 has a theoretical peak performance of only 78 GFlops. That is less than a previous-generation RV670 (HD38x0) can do.
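For anyone who wants to check those peak figures, here is a quick back-of-the-envelope sketch. The clock speeds (750 MHz core for the HD4870, ~1296 MHz shader clock for the GTX280) are my assumptions, not from the post, and a MADD is counted as 2 flops per result:

```python
# Rough sanity check of the double-precision peak numbers above.
# Assumed clocks: HD4870 core at 0.750 GHz, GTX280 shader clock at 1.296 GHz.

def peak_gflops(results_per_clock, clock_ghz, flops_per_result):
    """Peak throughput in GFlops = results/clock * clock (GHz) * flops per result."""
    return results_per_clock * clock_ghz * flops_per_result

# HD4870 (RV770): 160 double results per clock; MADD = 2 flops per result
hd4870_madd = peak_gflops(160, 0.750, 2)  # 240.0 GFlops
# Pure multiplications: 1 flop per result
hd4870_mul = peak_gflops(160, 0.750, 1)   # 120.0 GFlops
# GTX280 (GT200): 30 double results per clock at the shader clock
gtx280_madd = peak_gflops(30, 1.296, 2)   # 77.76 GFlops, i.e. the "only 78 GFlops"

print(hd4870_madd, hd4870_mul, gtx280_madd)
```

The numbers line up with the post: 240 and 120 GFlops for the HD4870, and roughly 78 GFlops for the GTX280.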
That's pretty neat. With that sort of power, I wonder why there isn't a larger developer community behind the Stream SDK.
Could you point me to where you found those double-results-per-clock figures, please? It's not that I don't believe you (I don't ), but I'm interested in looking at the other features of the 4800-series and GTX 200-series cards.