12 Replies Latest reply on Nov 26, 2008 11:32 PM by udeepta@amd

    What will be  changed in brook+ SDK 1.3 ?

    kos
      Will it have better IL generation ?

        • What will be  changed in brook+ SDK 1.3 ?
          MicahVillmow
          kos,
          The new IL generator in brook+ 1.3 will generate much cleaner kernels that can be dropped into a CAL program.
          • What will be  changed in brook+ SDK 1.3 ?
            ryta1203
            If Brook+ generates better kernels why would you need to "drop" it into CAL? Why can't you just use it with Brook+?

            Is AMD still interested in producing a USABLE high level GPGPU solution?

            Will Brook+ 1.3 allow for local arrays?
            • What will be  changed in brook+ SDK 1.3 ?
              MicahVillmow
              Ryta,
              You can still use it in brook, but a generic high level solution like Brook+ will never be able to outperform an optimized low level implementation tailored to a specific problem. See slides below from a presentation we gave at PLDI '08. With the previous few iterations of Brook+ it was very hard to take a brook IL kernel and then use it in a CAL program. This release should improve on this for the people who want to do so. This gives a higher level way to test and verify correctness of an application before getting close to the metal and optimizing it. Optimization can also be done at the Brook+ level, but it is more difficult because you lose some direct control over your code when going through the Brook+ compiler.

              The slides show matmult from naive to optimized IL starting at slide page 39. The performance difference between an optimized brook+ and equivalent IL w/ CAL is fairly significant.
              http://www.research.ibm.com/pe...-tutorial-program.htm
              • What will be  changed in brook+ SDK 1.3 ?
                ryta1203
                Micah,

                Yes, I understand what you are saying, but there's no reason for me to use CAL and take months to code my multi-kernel solution when I can take 1-2 days and do it in a high level language, especially when the GPU (CUDA) solutions take only a minute or two to run (depending on problem size).

                I can understand how this speedup would be useful for VERY long runtimes; I'm just saying that the development/performance ratio is sometimes not that good for CAL (in my case it's not really that good). Kernels in CAL take MUCH longer to debug than kernels in Brook+, which adds even more time to development.

                So am I to assume from your statement that AMD is not interested in producing a usable high level GPGPU solution? And will Brook+ 1.3 support local arrays?

                What I would really be interested to see is a comparison between CUDA and CAL for speed and development time on similar-cost GPUs. All these groups keep putting CPU versions out there (and not really very multi-core ones at that) and compare the GPU solutions to that. I honestly think this is old hat. Does anyone not realize the benefit of a MatrixMul on GPU vs. CPU by now?? Why keep rehashing the same old same old?

                EDIT: In the scientific community not everyone cares about squeezing out every last drop of performance, most just care about getting results in a timely fashion.

                If it takes an extra 3 months to code and my program went down from 10 minutes in a Brook+ solution to 5 minutes in a CAL solution, is that really worth it?

                I totally understand that for VERY long runtimes using CAL makes great sense.
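To put rough numbers on the trade-off above, here is a minimal C sketch. It assumes the extra 3 months are counted as 90 calendar days and uses the 10-minute and 5-minute run times from the example. The function name `break_even_runs` is hypothetical and only illustrates the arithmetic.

```c
/* Number of runs after which the time saved per run repays the extra
 * development time. All arguments are in minutes.
 * Returns -1.0 if the low-level version is not faster (it never pays off). */
static double break_even_runs(double extra_dev_minutes,
                              double high_level_run_minutes,
                              double low_level_run_minutes)
{
    double saved_per_run = high_level_run_minutes - low_level_run_minutes;
    if (saved_per_run <= 0.0) {
        return -1.0;
    }
    return extra_dev_minutes / saved_per_run;
}
```

Under these assumptions, `break_even_runs(90.0 * 24.0 * 60.0, 10.0, 5.0)` gives 25,920 runs before the port pays for itself. Counting only working hours instead of calendar time would lower that figure, but not the order of magnitude.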
                • What will be  changed in brook+ SDK 1.3 ?
                  MicahVillmow
                  Ryta,
                  AMD is very interested in further improving our high level GPGPU solutions and would gladly welcome feedback on them; however, it must come with the understanding that not all feedback or features requested make it into the next release. Local Array support in Brook+ has been requested for a future release, but it did not make it into Brook+ 1.3. If you do hear about anyone doing comparisons between our SDK and the competition, I'd be interested in knowing about it as we can use it to further improve our SDK.

                  As for matmult, it usually is one of the first things shown mainly because it is such a well understood problem and can easily be compared to other platforms and understood by people who do not have experience with the newer platforms.
                    • What will be  changed in brook+ SDK 1.3 ?
                      sgratton

                      Hi everybody,

                      On the CAL/cuda front, I've basically been attempting to port my cuda Cholesky factorization program to AMD cards with CAL/IL, hoping for a significant performance improvement, especially in double precision. As it turns out it has taken significantly longer to get working and so far the performance is not competitive: my cuda routine does 100 GFLOP/s SP and 40 GFLOP/s DP on a GTX260, whereas the best IL one so far does 30 GFLOP/s SP and I haven't even tried DP on a 1GB 4870. The algorithm requires a large (but not prohibitive) amount of memory access, read and write, and I think the key problem is that AMD have not released sufficient information to date on how to access memory effectively and reuse it 100% effectively (via the caches), both within and across wavefronts, making an informed implementation impossible. Nvidia cards with their straightforward uncached global memory and register-quick shared memory seem to "just work" much better in this regard. My overall impression so far is that the AMD setup is "fussier" and so it is much harder to achieve really good performance. (Take for example global buffers: you can only have one, from a kernel you can't in general read from it efficiently, and from linux 64-bit at least, it has to be at least 256MB smaller than the size of memory on the card and you can't easily access it from the cpu if it's greater than 255MB. With cuda I can simply declare and use any number of arrays, and a single one can be up to 50MB of the total memory of the card.) This is not to say I haven't had problems with Nvidia -- I have noticed significant (factor of 2) slowdowns at certain matrix sizes related (I think) to the -undocumented- way in which memory is split between the multiple 64-bit channels of their cards, and also seen a significant (50%) improvement by pre-"blocking" the matrix in global memory -- perhaps related to -undocumented- paging effects.
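The pre-"blocking" mentioned above is not spelled out in the post. One common reading is to reorder a row-major matrix so that each tile sits contiguously in memory before it is uploaded. The following host-side C sketch shows that reordering under stated assumptions: the matrix is square (n by n), n is a multiple of the tile edge b, and the name `to_blocked` is hypothetical. It is not the poster's code.

```c
#include <stddef.h>

/* Copy a row-major n x n matrix into tile-major order: tiles of b x b
 * elements are stored one after another (tile rows first), and each tile
 * is itself row-major. Assumes n % b == 0 and that dst holds n*n floats. */
static void to_blocked(const float *src, float *dst, size_t n, size_t b)
{
    size_t tiles = n / b; /* tiles per matrix row/column */
    for (size_t bi = 0; bi < tiles; ++bi) {
        for (size_t bj = 0; bj < tiles; ++bj) {
            for (size_t i = 0; i < b; ++i) {
                for (size_t j = 0; j < b; ++j) {
                    size_t tile_index = bi * tiles + bj;
                    dst[(tile_index * b + i) * b + j] =
                        src[(bi * b + i) * n + (bj * b + j)];
                }
            }
        }
    }
}
```

With this layout, all elements of one tile are adjacent in memory, so a kernel working on a tile touches a single contiguous range. Whether that is the effect behind the improvement reported above is a guess, since the underlying behaviour is described as undocumented.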

                      In fact I think that both companies need to provide much more optimization information to make using gpu's worthwhile even as part of a supercomputer. To beat a modern multiway multicore cpu node (especially if you can exploit a vendor-supplied library...) the gpu's have to be doing close to optimal, and the latter is very hard to achieve at present for a non-"pure streaming" application.

                      From some of Micah's comments in other threads though I am hopeful for significant documentation and compiler improvements shortly so plan to try the CAL/IL option again then!

                      Best,
                      Steven.

                    • What will be  changed in brook+ SDK 1.3 ?
                      ryta1203
                      Micah,

                      I'm sorry to hear there is no local array support; again, this makes the next release useless to me. I will let you know if I see any Brook+/CUDA comparisons, but I doubt it will happen anytime soon since Brook+ is not all that usable. I would not consider a CAL/CUDA comparison very good since they are different beasts altogether.

                      If anyone knows of a practical workaround for the absence of local array support in Brook+ please let me know.
                      • What will be  changed in brook+ SDK 1.3 ?
                        MicahVillmow
                        kos,
                        One of the things that is improved is mapping of constant buffers.
                        Using bitonic_sort as an example:
                        bitonic(float input[], out float output<>, float stageWidth, float offset, float twoOffset)

                        In the current codegen, it is fairly difficult to tell which constant buffer is which, but in the new codegen it explicitly states which buffer is used for program constants and which one is used by the runtime for address manipulation.
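For readers unfamiliar with the kernel's parameters, here is a CPU-side C sketch of what one pass with `stageWidth`, `offset` and `twoOffset` would plausibly compute in a gather-style bitonic sort. Only the parameter names come from the signature above. The body, the host loop, the integer types (the kernel itself takes floats) and the names `bitonic_pass` and `bitonic_sort` are assumptions for illustration, not the SDK sample's code.

```c
#include <stddef.h>
#include <string.h>

/* One pass: every output element i gathers itself and its partner from the
 * input and keeps either the smaller or the larger of the two. */
static void bitonic_pass(const float *input, float *output, size_t n,
                         size_t stageWidth, size_t offset, size_t twoOffset)
{
    for (size_t i = 0; i < n; ++i) {
        /* Is i the lower element of its compare pair? */
        int lower = (i % twoOffset) < offset;
        /* Blocks of stageWidth elements alternate ascending/descending. */
        int ascending = ((i / stageWidth) % 2) == 0;
        size_t partner = lower ? i + offset : i - offset;
        float a = input[i];
        float b = input[partner];
        float lo = a < b ? a : b;
        float hi = a < b ? b : a;
        output[i] = (lower == ascending) ? lo : hi;
    }
}

/* Host loop: n must be a power of two; scratch must hold n floats.
 * The two buffers are swapped after each pass, as a runtime would
 * ping-pong between two streams. */
static void bitonic_sort(float *data, float *scratch, size_t n)
{
    float *src = data;
    float *dst = scratch;
    for (size_t stageWidth = 2; stageWidth <= n; stageWidth *= 2) {
        for (size_t offset = stageWidth / 2; offset >= 1; offset /= 2) {
            bitonic_pass(src, dst, n, stageWidth, offset, 2 * offset);
            float *tmp = src;
            src = dst;
            dst = tmp;
        }
    }
    if (src != data) {
        memcpy(data, src, n * sizeof *data); /* result ended up in scratch */
    }
}
```

In this reading, the three scalar arguments are the "program constants" that change on every pass, which is why it helps when the generated IL states clearly which constant buffer carries them.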
                        • What will be  changed in brook+ SDK 1.3 ?
                          MicahVillmow
                          sgratton,
                          I don't think the doc will make it into the upcoming release. I'm still working on it.