9 Replies. Latest reply on Aug 21, 2009 4:52 PM by ryta1203

    High Level Explanation

    ryta1203

      First, this is not a critique, this is a question.

      Why is AMD's GPGPU solution so much harder to explain than NVIDIA's GPGPU solution?

      The CUDA Programming Guide is straightforward and easy to read. It gives a high-level overview of what's happening but at the same time offers good optimization options without a lot of "it depends".

      I'm curious: what is so different about AMD's GPGPU solution that they can't do the same?

        • High Level Explanation
          ryta1203

          It just seems that documentation is really what's holding AMD's GPGPU solution back.

          The hardware is there, that's for sure, but if no one knows how to code for it in a relatively decent way, then it's no use and no good.

          Reading the documentation, it feels like the person who wrote it assumes we already know things that we don't.

          The lds_transpose docs are meant to be read alongside the code. Why can't we just get some "generic" documentation on how to use compute shader mode?

            • High Level Explanation
              riza.guntur

              What if... we build one?

              Or maybe we could start a blog with the best practices we already know.

                • High Level Explanation
                  ryta1203

                  My problem with that is that I want documentation for the things I don't know; if I already know something, it's probably already well documented.

                    • High Level Explanation
                      riza.guntur

                      By the way, ryta1203, how big a speedup have you gained from the optimizations you've done so far? I'm curious whether the difference is really big enough to make you dig this deep for information.

                        • High Level Explanation
                          ryta1203

                          I have learned A TON of stuff (that wasn't in the docs) just through this forum, both from AMD employees, such as Micah and the rest, and from fellow non-AMD posters.

                          One of my apps was some LBM (lattice Boltzmann) code. I coded it in Brook+, and my naive approach garnered me a ~11x speedup. I kept optimizing it, and over a few weeks I was at ~26x, so actually quite a difference from my naive approach.

                          My naive approach was non-vectorized code. I saw good speedup simply from vectorization, but I also saw good speedup from other things as well, including splitting kernels, reducing branches where possible, combining kernels, using pinned memory, reducing register pressure, reducing transfers, doing more work per fetch, reordering code (to better utilize the compiler's VLIW optimizations), and probably most importantly: PUTTING THE PROBLEM INTO THE STREAMING MODEL.
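
                          Just to give a rough idea of what I mean by vectorizing (a simplified sketch, not my actual LBM kernels; the kernel names, omega and the BGK-style relaxation are placeholders I'm using for illustration): the scalar kernel works on one float per stream element, while the vectorized one packs four values into a float4 and does the same math per element, which keeps more of the ALUs busy per fetch.

                              // Scalar kernel: one value per stream element (my naive code looked like this)
                              kernel void relax_scalar(float f<>, float feq<>, float omega, out float fout<>)
                              {
                                  fout = f - omega * (f - feq);
                              }

                              // Vectorized kernel: four values packed into each element, same arithmetic,
                              // with omega broadcast to a float4 on the host side
                              kernel void relax_vec4(float4 f<>, float4 feq<>, float4 omega, out float4 fout<>)
                              {
                                  fout = f - omega * (f - feq);
                              }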

                          The last was really the most important, I think (along with vectorizing, of course). Rather than trying to force the CPU code onto the GPU (which I see people here try to do quite often), I rearranged the problem to fit the GPU. This was very different from the approach one might take in CUDA, IMO, and did take me some time to wrap my head around (as do most things).
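
                          To make the "rearrange the problem to fit the GPU" idea a bit more concrete (again only a sketch with made-up names, and I'm going from memory on the exact indexof syntax, so check the Stream Computing User Guide): on the CPU you tend to loop over cells and push (scatter) values to their neighbors, while in the streaming model each kernel invocation owns exactly one output element and pulls (gathers) what it needs, with the loop over elements left implicit.

                              // CPU habit: explicit loop, scatter writes to the neighbor
                              //   for (int i = 1; i < n; i++)
                              //       f_new[i] = f_old[i - 1];

                              // Streaming model: one output element per invocation, gather reads,
                              // no explicit loop (boundary handling omitted); f_old[] is a gather
                              // stream and f_new<> is the output stream
                              kernel void stream_pull(float f_old[], out float f_new<>)
                              {
                                  float i = indexof(f_new).x;
                                  f_new = f_old[i - 1.0f];
                              }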

                          So, yes, I have seen quite a bit of improvement. I also have other motivations that are not quite so app-specific.