14 Replies Latest reply on May 10, 2010 3:14 PM by godsic

    ATI STREAM AND .NET

    godsic
      about ATI STREAM counterpart for .NET JIT compiler

      Is AMD planning to add an ATI Stream counterpart to the Microsoft .NET JIT compiler, built on AMD CAL?

      Did AMD ever internally discuss this possibility?

      My personal opinion is that this approach would simplify the move to the FUSION concept (since for the high-level .NET part there would be no difference between any FPUs), speed up current HPC .NET applications, and simplify the programmer's life.

        • ATI STREAM AND .NET
          godsic

          Just to clarify.

          Right now the MS .NET JIT compiler can only produce native code for the CPU; however, the JIT approach could effectively be used to generate native code for any FPU such as a CPU, GPU, DSP, etc. Therefore, .NET could be a good alternative to OpenCL on the Windows platform, since DirectCompute is more game oriented.

          Thus, is it possible to add an ATI CAL compiler to the MS .NET JIT compiler?

          Could ATI/AMD introduce a low-level management class for MS .NET built on top of ATI CAL, and/or add ATI CAL versions of common .NET classes such as Array or Tasks?

          That way, even existing applications would gain a speedup without any development cost, while for more advanced tasks one could use the low-level wrapper.

           

           

            • ATI STREAM AND .NET
              genaganna

               

              Originally posted by: godsic Just to clarify.

               

              Right now the MS .NET JIT compiler can only produce native code for the CPU; however, the JIT approach could effectively be used to generate native code for any FPU such as a CPU, GPU, DSP, etc. Therefore, .NET could be a good alternative to OpenCL on the Windows platform, since DirectCompute is more game oriented.

               

              Thus, is it possible to add an ATI CAL compiler to the MS .NET JIT compiler?

               

              Could ATI/AMD introduce a low-level management class for MS .NET built on top of ATI CAL, and/or add ATI CAL versions of common .NET classes such as Array or Tasks?

               

              That way, even existing applications would gain a speedup without any development cost, while for more advanced tasks one could use the low-level wrapper.

               

              Godsic,

                        I am sure this is definitely not AMD's responsibility. The .NET compiler people can easily do this. Of course, it is a lot of work.

                       Things to be done in the .NET compiler:

                          1. Add language constructs in .NET to express parallelism (similar to OpenCL or OpenMP)

                          2. Generate equivalent AMD CAL or CUDA code

               

               

                • ATI STREAM AND .NET
                  godsic

                  As far as I know, the .NET Framework is a "black box" for the end user, and therefore only Microsoft and probably Microsoft partners can access its code. Hence I'm wondering about collaboration between AMD and Microsoft on this front.

                  Certainly, every reasonable .NET class can be overridden by the user; however, only AMD knows exactly how their hardware works and can push performance further than a user's own implementation.

                  Moreover, since the GPU/CPU would be hidden from the typical application, it would be relatively easy to support new hardware without modifying application code.

                  It is clear that both AMD and NVIDIA slightly change the architecture of their GPUs from generation to generation, which makes the programmer's life terrible, since one needs to profile the code from scratch against the properties of the new architecture.

                  Therefore, in terms of money, it is much easier to buy, say, a 4 x Opteron (Xeon) box, which can outperform an AMD/NVIDIA GPU in terms of DP performance (which is crucial for scientific applications) without any specific optimizations, even though the GPU's raw performance (the magic numbers) can be a bit higher.

                  Certainly, 4 x Opteron or 4 x Xeon is much more expensive than the raw price of, say, an HD5870; however, software development costs for GPU platforms easily exceed those for the CPU, and therefore GPU HPC ends up much more expensive than CPU HPC.

                  Therefore, I think it would be good for AMD to invest some $$$ in the real stuff rather than in open standards such as OpenCL. Moreover, the Windows platform is becoming more popular for HPC day by day, so this topic is quite relevant now.

                    • ATI STREAM AND .NET
                      hazeman

                       

                      Originally posted by: godsic

                       

                      Therefore, I think it would be good for AMD to invest some $$$ in the real stuff rather than in open standards such as OpenCL. Moreover, the Windows platform is becoming more popular for HPC day by day, so this topic is quite relevant now.

                       

                      For years now ATI hasn't been able to do a proper Linux driver. The CAL compiler is in real need of an overhaul. Creating a shabby OpenCL compiler took them a year, and it is still missing a lot of features (and has a lot of bugs).

                      And you really want them to invest $$$ into some .NET solution???

                      Besides, it simply doesn't make sense. Why use a closed and proprietary Microsoft JIT compiler when LLVM is available (and already used by ATI, NVIDIA & Apple for OpenCL)?

                      If you think that using a JIT compiler would solve the problem of programming GPUs, you are badly mistaken. GPU architecture is so different that you usually need a heavily modified version of the algorithm to achieve high performance. It's not possible for any compiler to make such optimizations. On the ATI architecture especially (5d vector ops + many limitations) it is much harder to achieve high utilization.
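
                      A minimal Python sketch of the utilization point, assuming a toy machine that issues up to five independent scalar ops per bundle. The names pack_bundles and utilization and the op labels are hypothetical; this is not ATI's compiler or ISA. It shows why a chain of dependent operations cannot fill the slots, however it is scheduled:

```python
from typing import Dict, List, Set

SLOTS = 5  # assumed bundle width, echoing the "5d vector ops" above


def pack_bundles(deps: Dict[str, Set[str]], slots: int = SLOTS) -> List[List[str]]:
    """Greedy scheduling: each bundle takes up to `slots` ops whose
    dependencies were all completed in earlier bundles."""
    done: Set[str] = set()
    remaining = list(deps)  # dicts keep insertion order
    bundles: List[List[str]] = []
    while remaining:
        ready = [op for op in remaining if deps[op] <= done][:slots]
        if not ready:
            raise ValueError("cyclic dependencies")
        bundles.append(ready)
        done.update(ready)
        remaining = [op for op in remaining if op not in done]
    return bundles


def utilization(bundles: List[List[str]], slots: int = SLOTS) -> float:
    """Fraction of issue slots that actually carry an op."""
    used = sum(len(b) for b in bundles)
    return used / (len(bundles) * slots)


# Ten ops with no dependencies between them.
independent = {f"op{i}": set() for i in range(10)}
# Ten ops where each one needs the result of the previous one.
chain = {f"op{i}": ({f"op{i - 1}"} if i else set()) for i in range(10)}

print(utilization(pack_bundles(independent)))
print(utilization(pack_bundles(chain)))
```

                      Under these assumptions the independent ops fill every slot, while the dependent chain uses one slot in five. Restructuring the algorithm, not the scheduler, is what changes that.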

                        • ATI STREAM AND .NET
                          godsic

                          Ok, I'm with you about OpenCL. I think that OpenCL is just a bottleneck between the GPU power and programmer.

                          However, I strongly disagree with you about JIT and GPU. Certainly, the SPMD GPU approach is a bit different from the CPU; however, the GPU itself is a linear device with quite predictable behaviour.

                          Additionally, the SPMD paradigm is quite common in HPC (for example MPI), and from that point of view there is no difference between the GPU and the CPU. The only difference is the hardware implementation. However, I think that memory management routines (such as automatic GPU memory mapping) can be implemented "in metal".
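
                          To make the SPMD comparison concrete, here is a minimal sketch in plain Python (no MPI or GPU runtime involved; rank and world_size are borrowed from MPI terminology, and the function names are hypothetical). Every instance runs the same program, and only its index decides which slice of the data it touches, much like an MPI rank or a GPU work-item ID:

```python
from typing import List


def spmd_partial_sum(rank: int, world_size: int, data: List[int]) -> int:
    """The single program: every instance runs this same code and uses
    its rank only to pick a strided slice of the shared input."""
    return sum(x * x for x in data[rank::world_size])


def run_spmd(world_size: int, data: List[int]) -> int:
    """Stand-in for a launcher: runs all instances one after another
    and reduces their partial results."""
    partials = [spmd_partial_sum(r, world_size, data) for r in range(world_size)]
    return sum(partials)


data = list(range(10))
print(run_spmd(4, data))
```

                          The sketch only captures the programming model. It says nothing about memory hierarchy, divergence or transfer costs, which is where the two sides of this thread disagree.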

                          Therefore there is no reason why the programmer should do this profiling work, since it is quite routine and distracting.

                          Moreover, ATI/AMD could even write a virtual GPU framework (which would be good for testing too), and their compiler could test shader performance in real time.

                          Another problem is the rendering-oriented nature of GPUs. Vendors are trying to make hybrid GPU/GPGPU devices, while they could separate these branches and focus on specific optimizations.

                           

                           

                            • ATI STREAM AND .NET
                              hazeman

                               

                              Originally posted by: godsic Ok, I'm with you about OpenCL. I think that OpenCL is just a bottleneck between the GPU power and programmer.


                              You have misunderstood my post. I only have a problem with the bad-quality OpenCL compiler made by ATI. OpenCL is like C for the CPU. Saying that C is a bottleneck is quite dumb.

                               

                              However, I strongly disagree with you about JIT and GPU. Certainly, the SPMD GPU approach is a bit different from the CPU; however, the GPU itself is a linear device with quite predictable behaviour.


                              From this sentence it's obvious that you haven't written any program for a GPU. And for sure you haven't optimized any kernel to 90% of peak GPU performance. Anyone who has had to do this knows that the GPU is quite unpredictable. There are so many undocumented quirks and limitations (especially in ATI's case) that sometimes it's a guessing game what will work best.

                               

                               

                              Additionally, the SPMD paradigm is quite common in HPC (for example MPI), and from that point of view there is no difference between the GPU and the CPU.


                              Again you are plainly wrong. Try to modify any MPI-based program for the GPU and you will see how wrong.

                              I won't comment on the rest of the post because it's based on wrong assumptions.

                                • ATI STREAM AND .NET
                                  godsic

                                  You have misunderstood my post. I only have a problem with the bad-quality OpenCL compiler made by ATI. OpenCL is like C for the CPU. Saying that C is a bottleneck is quite dumb.

                                  Hm, C is just an abstraction layer between the programmer and the GPU and likely does not cause any performance bottleneck if the compiler is fine. Another thing is the programmer: if he grew up on the CPU, HE WILL NEVER ADAPT TO THE GPU ARCHITECTURE (where in the majority of cases brute-force calculation is far better than some "beautiful" algorithm) AND WILL BLAME AMD FOR THE POOR COMPILER. Basically, the C layer causes such problems, since the programmer has a lot of opportunities to write the "wrong" kernel. Therefore I think IL or, for example, the OpenGL shading language was fine, since there is no unnecessary freedom.

                                  Additionally, if you are writing some complex software, the OpenCL API overhead (a kind of context switching) is a big bottleneck, since GPU memory is limited for REAL HPC problems and one needs to transfer data from GPU to CPU or vice versa quite frequently. Therefore, to achieve reasonable performance one needs to carefully plan the memory usage, while this is routine work and can be done in a FRAMEWORK (which is not about OpenCL).
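
                                  A minimal sketch of the kind of routine planning meant here, assuming a hypothetical framework helper plan_chunks that splits a data set larger than device memory into pieces that fit. The byte sizes in the usage lines are made-up illustration values, not measurements:

```python
from typing import List, Tuple


def plan_chunks(total_bytes: int, device_budget_bytes: int) -> List[Tuple[int, int]]:
    """Return (offset, size) pairs that cover the whole data set, where
    every piece fits into the given device memory budget."""
    if device_budget_bytes <= 0:
        raise ValueError("device budget must be positive")
    chunks: List[Tuple[int, int]] = []
    offset = 0
    while offset < total_bytes:
        size = min(device_budget_bytes, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


# Hypothetical sizes: a 10-unit data set and a 4-unit device budget.
plan = plan_chunks(10, 4)
for offset, size in plan:
    # A real framework would upload, run the kernel, and download here.
    print(f"process bytes [{offset}, {offset + size})")
```

                                  Each chunk implies one upload and one download, so the number of chunks gives a first estimate of the transfer overhead a framework would have to hide or overlap with computation.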

                                   

                                  From this sentence it's obvious that you haven't written any program for a GPU. And for sure you haven't optimized any kernel to 90% of peak GPU performance. Anyone who has had to do this knows that the GPU is quite unpredictable. There are so many undocumented quirks and limitations (especially in ATI's case) that sometimes it's a guessing game what will work best.

                                  Hm, so much misunderstanding once again. THE GPU IS AN SPMD DEVICE AND YOU CAN REFER TO ANY AMD GPU DOCUMENTATION TO FIND THIS STATEMENT. Moreover, OpenCL is virtually an MPMD-paradigm API (since the GPU is STILL SPMD). And if you do not see any similarities between MPI/OpenMP (which are SPMD by default) and OpenCL, then I assume that you have never dealt with HPC at all. It is just a matter of terms and hardware.

                                  Another thing is that the GPU IS A LINEAR DEVICE, and at least for AMD it should be predictable, since it was designed, and I'm sure that every AMD GPU architecture was implemented as a virtual machine prior to being realized in metal.

                                   

                                  Again you are plainly wrong. Try to modify any MPI-based program for the GPU and you will see how wrong.

                                  I won't comment on the rest of the post because it's based on wrong assumptions.

                                  AND... MISUNDERSTANDING again. OpenCL was proposed by APPLE as a universal API for any computing device! However, in practice one needs to write either for the CPU (which is ridiculous) or for the GPU, while a balance cannot be achieved for both. Therefore I assume that OpenCL does not hide any hardware issues (differences in architecture) between GPU and CPU from the developer, so I assume that OpenCL is wrong!

                                  P.S. AS FOR THE CPU AND GPU DIFFERENCE, I'M SURE THAT IF THE X86 PATENT WERE HELD BY AMD, THEN AN X86-CAPABLE GPU WOULD BE JUST A MATTER OF TIME AND MONEY.



                                   

                                    • ATI STREAM AND .NET
                                      hazeman

                                      First of all: have you ever written any kernel for a GPU? And optimized it for high performance? If the answer is "no", there is simply no point in further discussion.

                                       

                                      AS FOR THE CPU AND GPU DIFFERENCE, I'M SURE THAT IF THE X86 PATENT WERE HELD BY AMD, THEN AN X86-CAPABLE GPU WOULD BE JUST A MATTER OF TIME AND MONEY.


                                      Simply LOL. Have you ever heard of Larrabee?? Anyway, whether a GPU uses x86 instructions or not doesn't change the fact that it has a different architecture.

                                       

                                      THE GPU IS AN SPMD DEVICE AND YOU CAN REFER TO ANY AMD GPU DOCUMENTATION TO FIND THIS STATEMENT


                                      Yep, and a GPU can run thousands of threads. I've read those too. You should know there is a difference between PR talk & reality.

                                      IMHO you have read too many GPGPU promotional slides and you don't know what real programming on a GPU looks like.

                                       

                                        • ATI STREAM AND .NET
                                          godsic

                                          First of all: have you ever written any kernel for a GPU? And optimized it for high performance? If the answer is "no", there is simply no point in further discussion.

                                          I started using the GPU for computation 3 years ago, accelerating raytracing with GLSL (my config was based on an HD2600Pro). After weeks of optimization I sent an email to ATI with a HARDWARE wish list which included more branching units, a more rational memory layout, bypassing samplers (please note that ATI claimed that the R6xx architecture was scalar rather than vector) and a more rational UNIT/SPECIAL UNIT ratio, for example 2:1 rather than 5:1. However, ATI rejected my email.

                                          After that I applied for a PhD in a different area, and GPU programming died for me until AMD/ATI released ATI Stream with Brook and CAL. Then I practiced writing IL kernels for some image processing on my horrible HD3450 card.

                                          I'm familiar with OpenGL and GLSL, and I thoroughly explored the R6xx architecture and all those ATI optimization slides for GLSL while developing the program. In principle, the architecture of AMD GPUs has not changed a lot since then; only some useful features have been introduced which improve performance, while the workgroup structure, FPU unit design and memory layout (which is the main bottleneck) have not changed much. Therefore, for a good game developer (who is quite familiar with GLSL or HLSL) it is not a problem to write an efficient kernel, since for example for the R5xx architecture the general rule was coherence (workgroup), vectorization and a high ALU:FETCH ratio.

                                          However, from my understanding, standards such as OpenCL, CUDA or CAL are inefficient, since they are not as open as one expects them to be. Therefore it would be good to make a GPU-enabled C/C++ compiler without ANY API CALLS. Only in that case can the compiler investigate the program layout and data interference and compile some parts into proper kernels with a proper memory usage strategy. And as I said before, some obvious features can be realized "in-metal".

                                          As for MPI, it is quite close to OpenCL, since you need to manage all these memory transfers yourself and keep in mind all these MPI topologies (grid geometries). There is an obvious similarity between the PCIe bottleneck in GPGPU systems and the network connection (InfiniBand etc.) in supercomputers. However, such supercomputers have special hardware features which assist MPI, while GPGPU systems do not.

                                           

                                            • ATI STREAM AND .NET
                                              hazeman

                                               

                                              Originally posted by: godsic


                                               

                                              I started using the GPU for computation 3 years ago, accelerating raytracing with GLSL (my config was based on an HD2600Pro). After weeks of optimization I sent an email to ATI with a HARDWARE wish list which included more branching units, a more rational memory layout, bypassing samplers (please note that ATI claimed that the R6xx architecture was scalar rather than vector) and a more rational UNIT/SPECIAL UNIT ratio, for example 2:1 rather than 5:1. However, ATI rejected my email.


                                              Read about Larrabee. Some of the features you requested were removed even there. But read the whole Larrabee history and you will understand why ATI rejected your propositions.

                                               

                                              However, from my understanding, standards such as OpenCL, CUDA or CAL are inefficient, since they are not as open as one expects them to be. Therefore it would be good to make a GPU-enabled C/C++ compiler without ANY API CALLS. Only in that case can the compiler investigate the program layout and data interference and compile some parts into proper kernels with a proper memory usage strategy.


                                              I hope you understand that the API CALLS will have to be made by the compiler anyway.

                                              Ok, so you want a compiler which will hide all the dirty work from the coder and yet will produce highly efficient code?

                                              I think the idea is great. I really would like such a compiler. But reality is a harsh mistress.

                                              A few years ago Intel had the idea of Itanium. In short, they wanted to move some work from the CPU to the compiler (instruction scheduling, etc.). And what they learned is that creating a compiler for Itanium was a really huge problem.

                                              And yet you want a much more advanced compiler.

                                              I think you are making the same mistake as with your request letter to ATI. Your ideas are great in theory, but you don't understand the limitations you are facing. And those limitations make your ideas bad.

                                                • ATI STREAM AND .NET
                                                  ryta1203

                                                  Yes, a "perfect" compiler would also read your mind and write the code, so I could chill on a beach with a nice drink, lol, sorry, had to.

                                                  Sadly, Itanium (and It. II) never really caught on.

                                                  Overall though, I tend to agree with Hazeman.

                                                  The main problem, IMO, with the current ATI architecture is that it is VERY difficult for a compiler to automatically vectorize code and get correct results.

                                                  Also, the compiler does a ***** poor job of eliminating control flow to allow for better packing. This I don't understand as there are VERY easy ways to auto generate this.
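
                                                  One common way to eliminate control flow is if-conversion: compute both sides of a branch and pick the result with a select. It may or may not be the transformation meant above. A minimal Python sketch, assuming both sides are cheap and free of side effects (select is a hypothetical stand-in for a conditional-move style instruction, not an actual CAL or IL call):

```python
def branchy(x: float, threshold: float) -> float:
    """Original form: the work done depends on a branch."""
    if x > threshold:
        return x * 2.0
    else:
        return x + 1.0


def select(pred: bool, if_true: float, if_false: float) -> float:
    """Hypothetical stand-in for a conditional-move instruction:
    picks one of two already computed values."""
    return if_true if pred else if_false


def if_converted(x: float, threshold: float) -> float:
    """If-converted form: both sides are computed unconditionally, so the
    two arithmetic ops are independent and could share one bundle."""
    taken = x * 2.0
    not_taken = x + 1.0
    return select(x > threshold, taken, not_taken)


samples = [-3.0, 0.0, 0.5, 1.0, 7.25]
print(all(branchy(x, 1.0) == if_converted(x, 1.0) for x in samples))
```

                                                  The trade-off is that both sides are always executed, so this only pays off for short branches without side effects.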

                                                    • ATI STREAM AND .NET
                                                      godsic

                                                      Sorry guys, but once again I strongly disagree with you.

                                                      It is hard for a compiler to optimize a kernel simply because it does not see the whole picture.

                                                      Moreover, the OpenCL or CAL compiler knows nothing about your memory layout or other kernels.

                                                      Therefore, it can only make some general assumptions and perform some general optimizations.

                                                      However, if the whole picture were available to the compiler, it could perform more aggressive optimization (inter-procedural optimization, if you want).

                                                      Therefore it would be nice to have a JIT for a full OpenCL program (including API calls, kernels, etc.) rather than a standalone compiler for kernels only. In this case the compiler could analyze memory usage, data flow, data interference and data structures, so much more aggressive optimizations could be done. And last but not least, vendors could optimize based on the host GPU architecture rather than on some specific code.
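
                                                      One example of what a whole-program view could enable is kernel fusion. This is only an illustrative sketch in Python with hypothetical kernels scale and offset; it is an assumption about the kind of optimization meant, not something an existing OpenCL or CAL compiler is claimed to do:

```python
from typing import List


def scale(xs: List[int], k: int) -> List[int]:
    """Kernel 1: multiply every element by k."""
    return [x * k for x in xs]


def offset(xs: List[int], b: int) -> List[int]:
    """Kernel 2: add b to every element."""
    return [x + b for x in xs]


def separate(xs: List[int], k: int, b: int) -> List[int]:
    """Kernels compiled in isolation: the intermediate buffer is
    written out in full and then read back by the second kernel."""
    tmp = scale(xs, k)
    return offset(tmp, b)


def fused(xs: List[int], k: int, b: int) -> List[int]:
    """What a compiler that sees both launches could emit instead:
    one pass over the data and no intermediate buffer."""
    return [x * k + b for x in xs]


data = list(range(8))
print(separate(data, 3, 1) == fused(data, 3, 1))
```

                                                      A kernel-only compiler cannot make this rewrite because it never sees that the output of the first launch is consumed only by the second.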

                                                       

                                                       

                                                        • ATI STREAM AND .NET
                                                          hazeman

                                                           

                                                          Originally posted by: godsic Sorry guys, but once again I strongly disagree with you.

                                                           

                                                          It is hard for a compiler to optimize a kernel simply because it does not see the whole picture.



                                                          This is exactly what the Itanium project was about. As the compiler has more information available, it should be able to produce much more efficient code. And yet, after spending a lot of $$$ on it, Itanium failed.

                                                          They say that a smart person learns from the mistakes of others. So maybe you should start too.

                                                            • ATI STREAM AND .NET
                                                              godsic

                                                              Once again I disagree.

                                                              Itanium and x86 architectures are quite similar in principle, since x86 CPUs are in fact RISC processors with a "metal" CISC-to-RISC translator/compiler. Basically, the x86 pipeline expands x86 instructions into micro-operations, or they go through DirectPath without any dispatching (a simple assumption).

                                                              Therefore, modern x86 CPUs have a primitive on-die compiler with limited functions like out-of-order execution, branch prediction, register renaming, etc. However, all these features can be done at the software level without any PROBLEMS. Another question is marketing! It is far simpler to update the pipeline and release/sell new processors, whereas otherwise one could simply download a new version of the compiler and make applications faster.

                                                              On the other hand, Itanium is a brilliant RISC processor; however, in the server segment money really matters, and therefore when Itanium came into the x86 segment it was quite hard for this CPU to be competitive in terms of money, since all the software had to be redeveloped to make use of Itanium features. Moreover, Itanium systems required a full hardware update. Therefore one can easily calculate the costs of the Itanium platform for end users.

                                                              As for the AMD GPU architecture, it is brilliant even with the VLIW concept. Moreover, this concept worked fine for TransMeta (GG), for some Russian processors and in the HPC segment. For example, in Russian processors the compiler can assemble up to 30 instructions into one VLIW word and they can be executed in one CPU cycle.

                                                              In the GPGPU segment there is no clear computational standard (in contrast to x86 domination in the CPU segment). Therefore each vendor can do things their own way, as NVIDIA is doing with CUDA.

                                                              Therefore, alternative ways are quite important for AMD in the battle for $$$. Thus I propose two possible ways:

                                                              1. AMD pays $$$ to developers for results. Basically, one can research AMD GPUs in the computational segment and write a simple article about it. After that AMD pays money and posts the article on their website for public access. With this approach the developer community can work efficiently, without any re-inventing overhead.

                                                              2. If AMD cares about their architecture, they can make some kind of JIT for CAL/OpenCL and put as much effort as they can into optimizations.