9 Replies. Latest reply on Aug 21, 2009 4:52 PM by ryta1203

    High Level Explanation

    ryta1203

      First, this is not a critique, this is a question.

      Why is AMD's GPGPU solution so much harder to explain than NVIDIA's GPGPU solution?

      The CUDA Programming Guide is straightforward and easy to read. It gives a high-level overview of what's happening but at the same time offers good optimization options without a lot of "it depends".

      I'm curious: what is so different about AMD's GPGPU solution that they can't do the same?

        • High Level Explanation
          ryta1203

          It just seems that documentation is really what's holding AMD's GPGPU solution back.

          The hardware is there, that's for sure, but if no one knows how to code for it in a relatively decent way, then it's no use and no good.

          Reading the documentation, it feels like the person who wrote it assumes we already know things that we don't.

          The lds_transpose docs are meant to be read alongside the code. Why can't we just get some "generic" documentation on how to use compute shader mode?

            • High Level Explanation
              riza.guntur

              What if... we build one?

              Or maybe we could start a blog with the best practices we already know.

                • High Level Explanation
                  ryta1203

                  My problem with that is that I want documentation for the things I don't know; if I already know something, it's probably already well documented.

                    • High Level Explanation
                      riza.guntur

                      By the way, ryta1203, how big a speedup have you gained from the optimizations you've done so far? I'm curious whether the difference is really big enough to make you dig this deep for information.

                        • High Level Explanation
                          ryta1203

                          I have learned A TON of stuff (that wasn't in the docs) just through this forum, both from AMD employees, such as Micah and the rest, and from fellow non-AMD posters.

                          One of my apps was some LBM (lattice Boltzmann) code. I coded it in Brook+, and my naive approach garnered me a ~11x speedup. I kept optimizing it, and over a few weeks I was at ~26x, so actually quite a difference from my naive approach.

                          My naive approach was non-vectorized code. I saw good speedup simply from vectorization, but I also saw good speedup from other things as well, including splitting kernels, reducing branches where possible, combining kernels, using pinned memory, reducing register pressure, reducing transfers, doing more work per fetch, reordering code (to better utilize the compiler's VLIW optimizations), and probably most importantly: PUTTING THE PROBLEM INTO THE STREAMING MODEL.
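
                          Just to give a rough idea of what I mean by vectorizing (a simplified sketch, not my actual LBM kernels; the kernel names, omega and the BGK-style relaxation are placeholders I'm using for illustration): the scalar kernel works on one float per stream element, while the vectorized one packs four values into a float4 and does the same math per element, which keeps more of the ALUs busy per fetch.

                              // Scalar kernel: one value per stream element (my naive code looked like this)
                              kernel void relax_scalar(float f<>, float feq<>, float omega, out float fout<>)
                              {
                                  fout = f - omega * (f - feq);
                              }

                              // Vectorized kernel: four values packed into each element, same arithmetic,
                              // with omega broadcast to a float4 on the host side
                              kernel void relax_vec4(float4 f<>, float4 feq<>, float4 omega, out float4 fout<>)
                              {
                                  fout = f - omega * (f - feq);
                              }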

                          The last was really the most important, I think (along with vectorizing, of course). Rather than trying to force the CPU code onto the GPU (which I see people here try to do quite often), I rearranged the problem to fit the GPU. This was very different from the approach one might take in CUDA, IMO, and did take me some time to wrap my head around (as do most things).
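
                          To make the "rearrange the problem to fit the GPU" idea a bit more concrete (again only a sketch with made-up names, and I'm going from memory on the exact indexof syntax, so check the Stream Computing User Guide): on the CPU you tend to loop over cells and push (scatter) values to their neighbors, while in the streaming model each kernel invocation owns exactly one output element and pulls (gathers) what it needs, with the loop over elements left implicit.

                              // CPU habit: explicit loop, scatter writes to the neighbor
                              //   for (int i = 1; i < n; i++)
                              //       f_new[i] = f_old[i - 1];

                              // Streaming model: one output element per invocation, gather reads,
                              // no explicit loop (boundary handling omitted); f_old[] is a gather
                              // stream and f_new<> is the output stream
                              kernel void stream_pull(float f_old[], out float f_new<>)
                              {
                                  float i = indexof(f_new).x;
                                  f_new = f_old[i - 1.0f];
                              }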

                          So, yes, I have seen quite a bit of improvement. I also have other motivations that are not quite so app-specific.