
HScottH
Journeyman III

DX9, OpenGL, CAL, etc.

What are the fundamental differences between these approaches to GPGPU?

Hi,

I've just attended a lecture by Dr. David Luebke from nVidia and learned about CUDA and their approach to GPGPU.

I asked some questions about level of support, etc., and have some follow-ons I'd like to ask here.

Clearly, the popular languages have great value in science, engineering, and academia. But it still looks like the best approach for building consumer-level GPGPU software for a market of heterogeneous GPUs is either DirectX or OpenGL. That approach can support most cards made after about 2002 (accepting, of course, the differences in performance and accuracy).
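For concreteness, the pattern I have in mind is the usual render-to-texture trick: pack the data into a floating-point texture, express the computation as a pixel/fragment shader, draw a full-screen quad into an off-screen render target, and read the result back. Here is a rough, untested sketch in C with OpenGL and GLSL (GLUT and GLEW are assumed just to get a context and the extensions; error checking omitted):

/* GPGPU via the graphics API: data lives in a float texture, the
 * "kernel" is a fragment shader, one output pixel per data element. */
#include <GL/glew.h>
#include <GL/glut.h>
#include <stdio.h>

#define W 512
#define H 512

static const char *fs_src =
    "uniform sampler2D data;                          \n"
    "void main() {                                    \n"
    "    vec4 v = texture2D(data, gl_TexCoord[0].st); \n"
    "    gl_FragColor = 2.0 * v + 1.0;                \n"
    "}                                                \n";

static GLuint make_float_tex(const float *data)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, W, H, 0,
                 GL_RGBA, GL_FLOAT, data);
    return tex;
}

int main(int argc, char **argv)
{
    static float input[W * H * 4], output[W * H * 4];
    glutInit(&argc, argv);
    glutCreateWindow("gpgpu");          /* only needed for a GL context */
    glewInit();

    GLuint in_tex  = make_float_tex(input);
    GLuint out_tex = make_float_tex(NULL);

    /* Render target: attach the output texture to an FBO. */
    GLuint fbo;
    glGenFramebuffersEXT(1, &fbo);
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                              GL_TEXTURE_2D, out_tex, 0);

    /* The computation lives in the fragment shader. */
    GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(fs, 1, &fs_src, NULL);
    glCompileShader(fs);
    GLuint prog = glCreateProgram();
    glAttachShader(prog, fs);
    glLinkProgram(prog);
    glUseProgram(prog);
    glUniform1i(glGetUniformLocation(prog, "data"), 0);
    glBindTexture(GL_TEXTURE_2D, in_tex);

    /* "Run the kernel": draw a quad that covers the viewport. */
    glViewport(0, 0, W, H);
    glMatrixMode(GL_PROJECTION); glLoadIdentity(); gluOrtho2D(0, 1, 0, 1);
    glMatrixMode(GL_MODELVIEW);  glLoadIdentity();
    glBegin(GL_QUADS);
    glTexCoord2f(0, 0); glVertex2f(0, 0);
    glTexCoord2f(1, 0); glVertex2f(1, 0);
    glTexCoord2f(1, 1); glVertex2f(1, 1);
    glTexCoord2f(0, 1); glVertex2f(0, 1);
    glEnd();

    /* Pull the result back across the bus. */
    glReadPixels(0, 0, W, H, GL_RGBA, GL_FLOAT, output);
    printf("first element: %f\n", output[0]);
    return 0;
}

The point being: everything is dressed up as rendering, which is exactly the kind of indirection I'm asking about.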

But I wonder if there are disadvantages to this approach, other than a slightly higher level of development complexity.

Does Brook or CAL interact with the hardware in ways that are unattainable from a graphics API? Are there hardware features or optimizations that I cannot access through the graphics APIs?

Any related thoughts or ideas would be appreciated 🙂

Scott

ryta1203
Journeyman III

I'm just a user but my thoughts are:

1. No need to learn a graphics API
2. Can program in direct GPU ISA (does it get any closer to the hardware than that?)
3. Brook+: as far as I know (my knowledge is limited), it sits on CAL, which sits on the GPU ISA, and it is optimized for specific hardware, which generally yields better results; the same is true of CUDA (optimized for NVIDIA hardware)

I think that if you are putting together a cluster, have little to no money, and don't care what the GPUs are (mixing and matching cheap graphics cards/chips), then OpenGL/DX might be a very good solution, and maybe even RapidMind (I don't know much about it). However, graphics cards are fairly cheap nowadays (thanks in large part to AMD's new cards), and for ~$200 you can get a theoretical peak of ~1 TFLOPS; that's pretty good for ~$200.
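For what it's worth, here is the back-of-the-envelope math behind that ~1 TFLOPS figure, using the launch specs as I remember them (treat the numbers as approximate):

/* Theoretical peak = stream processors x engine clock x 2 flops per
 * clock (one multiply-add per ALU). */
#include <stdio.h>

int main(void)
{
    double hd4850 = 800 * 625e6 * 2 / 1e12;  /* HD 4850: 800 SPs @ 625 MHz -> ~1.0 TFLOPS */
    double hd4870 = 800 * 750e6 * 2 / 1e12;  /* HD 4870: 800 SPs @ 750 MHz -> ~1.2 TFLOPS */
    printf("HD 4850 peak: %.2f TFLOPS\n", hd4850);
    printf("HD 4870 peak: %.2f TFLOPS\n", hd4870);
    return 0;
}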

Yes, OpenGL/DX is platform independent, but there is a lot of overhead involved. If you are building a solution (such as a cluster for a lab) and you know you want to go GPGPU, why not pick a vendor and go in that direction?

The only real problem I see, currently, is that CUDA is light years ahead of FireStream in functionality, simplicity, documentation, and support.

Someone correct me if I misspoke, and feel free to disagree.
HScottH
Journeyman III

Thank you for your thoughtful response.

To be clear, my question is really whether or not the custom languages offer better "maximum" performance because of how they are able to interact with the card. Assuming I'm a Ninja DX and HLSL coder, could I get the same FLOPS out of the card as someone using CAL?

My goal is to understand the trade-offs between the approaches, and not so much to build a single system or cluster. I'd like to be able to build consumer-level, off-the-shelf software that makes use of the GPUs installed in my customers' computers. I know this is done today (such as games, physics, FAH, etc.), but I wonder if there is a loss of performance using the more "generic" interface.

Originally posted by: HScottH

Assuming I'm a Ninja DX and HLSL coder, could I get the same FLOPS out of the card as someone using CAL?

Well, as I stated in my first response, how can you get more optimized ("maximum performance") than working right at the ISA? I just don't think you will be able to get that kind of tweakability from DX and a shader language.



HScottH,
Brook+ provides double and scatter support which, to my knowledge, is not possible from a graphics API with pixel shaders. I'm not sure about geometry or vertex shaders, but why spend the time learning them when it is easier to just use pixel shaders? Brook+ does not have backends for OpenGL/DirectX; however, the original Brook works on DirectX and OpenGL, so it runs on more graphics cards than CUDA or CAL supports. Therefore, if you use the core Brook+ language without any of the AMD extensions that we implemented, you can write code that works on many cards.
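To make the scatter point concrete, here is the distinction in plain C terms (an illustrative sketch, not Brook+ syntax). A pixel shader effectively gives you only the gather pattern, because a fragment can read from arbitrary texture addresses but can only write to its own output location:

/* Gather: the output index is fixed, the read address is computed.
 * This is what a pixel shader does naturally. */
void gather(float *out, const float *in, const int *idx, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[idx[i]];
}

/* Scatter: the write address is computed. Brook+/CAL expose this
 * directly; with pixel shaders alone it has to be emulated (extra
 * passes, or rendering points from a vertex shader). */
void scatter(float *out, const float *in, const int *idx, int n)
{
    for (int i = 0; i < n; i++)
        out[idx[i]] = in[i];
}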

I mentioned double and scatter support, but there are many other things you get when programming in Brook+/CAL that you don't get in OpenGL/DirectX. For example, more resources are dedicated to your program: since CAL doesn't use vertex or geometry shaders, resources don't need to be allocated for them. You also get low-level access to the graphics card via CAL, where you can write screaming-fast kernels. The fastest matrix multiplication kernel that I currently know of is written in IL using CAL and achieves 210+ GFLOPS of kernel performance on an HD 3870. I don't think you can get that through a graphics interface and HLSL, or even with CUDA on an equivalent card.
Brook+/CAL also give you explicit control over memory and how it is allocated and used, which can be a boon or a bane, depending on how it is used.

CAL also provides a utility library for AMD HLSL, which is HLSL with extensions for our graphics card features such as double and scatter, so you can use your HLSL knowledge and the CAL layer without having to use IL or Brook+.

So we do provide ways of writing code that works on graphics cards from many generations, but if you use the extensions we provide, the code will not run on older cards.

We try to provide all the layers of the software stack, from the high level down to the ISA level, depending on the needs of the application and the programmer. At every level, most of the graphics aspects are hidden from the programmer to provide an easier-to-program interface.




HScottH
Journeyman III

Thank you, Micah. You answered my question thoroughly.

One more question though: Is there a way to transfer data from a CAL app to a DX app without having to go through the PCI-e bus?

I mean, if I'm doing a physics simulation, 95% of my GPU work will be model math. But when that's done, I'll have a buffer that I need to use for rendering. Can I do that without using CAL to copy down to RAM, then pick it up with DirectX and push it back up?

Thank you very much.

Scott

Scott,
Although the current SDK does not have CAL/DX interoperability, it is something we want to make available to users, as we believe it is a useful feature. From my understanding of the current situation, there needs to be a copy across the PCI-e bus, but a future SDK release should provide interoperability that removes this step.
HScottH
Journeyman III

Thank you Micah and Ryta.

@ Micah: I've achieved > 200 GFLOPS through DX in some experiments I did (HD 3870 card). I was disappointed that the performance didn't match what the Shader Analyzer predicted, but after researching a bit I have concluded that my results were pretty good. I'd like to try the CAL interface, but unfortunately I'm running Vista 64 and it seems the SDK hasn't made it that far yet. Are there any alternatives for me?

I understand the loss of double support; no graphics API yet supports this. But--sorry for the naivety here--what is Scatter?

@ryta: I presume that when you speak of 1 TFLOPS for $200, you're speaking about the new 4850? I've been struggling over whether to buy that or the 4870. I wonder if you know: is the extra bandwidth between GDDR5 and GDDR3 really that important?

In general: I've been an ATI fan and loyalist since the Rage Pro days, and I enjoyed the talk from nVidia today. I asked about FP64 support (which ATI/AMD has had since the launch of the 3xxx series, but which they will only gain with the release of the GeForce 10), and about PCI-E 2.0 (which, again, they won't have until the GeForce 10). They are touting around 1 TFLOPS, but it's on their ~$600 cards, whereas ATI [theoretically] gets that job done for < $300.

I'm dying to see what synergy comes from the merger of the #2 CPU manufacturer (I believe Intel is clearly #1) and the #1 GPU manufacturer (ATI), especially now that the clock-speed ceiling has so changed the industry.


Scott,

As far as I am aware, GDDR5 offers roughly twice the bandwidth of GDDR3. As for whether the bandwidth is "important", well, that's really for you to decide. I would say yes.
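A back-of-the-envelope sketch, assuming the launch memory specs I remember (993 MHz GDDR3 on the 4850, 900 MHz quad-pumped GDDR5 on the 4870, 256-bit bus on both), which would put it closer to ~1.8x:

/* Memory bandwidth = (bus width / 8) bytes x effective data rate per pin. */
#include <stdio.h>

int main(void)
{
    double bus_bytes = 256 / 8;                 /* both cards use a 256-bit bus */
    double hd4850 = bus_bytes * 1.986e9 / 1e9;  /* GDDR3, ~993 MHz x2 -> ~63.6 GB/s */
    double hd4870 = bus_bytes * 3.6e9 / 1e9;    /* GDDR5,  900 MHz x4 -> ~115.2 GB/s */
    printf("HD 4850: %.1f GB/s, HD 4870: %.1f GB/s (%.1fx)\n",
           hd4850, hd4870, hd4870 / hd4850);
    return 0;
}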

Yes, I was speaking about the 4850. I believe that the 4870's theoretical peak performance is ~1.2 TFLOPS. Someone please correct me if I am wrong.

We are waiting for the 4870x2. I have a 4850 in my personal computer.

This question might get more interesting responses (opinions) over on gpgpu.org forums.