15 Replies Latest reply on Oct 7, 2009 2:14 PM by godsic

    OpenMP and GPU

    godsic

      In light of GF100, it will be quite difficult for AMD to survive in the GPGPU market, because GF100 is more flexible than RV8xx and offers some very important extra features (such as an ECC memory mode). Nevertheless, it is more important for developers to spend less time porting their apps to GPGPU platforms like OpenCL, ATI CAL and CUDA. Therefore it would be nice to have a GPU-enabled version of OpenMP!

      In most cases OpenMP follows the SPMD execution paradigm, which is similar to the GPU model (stream operations are SPMD). A simple analysis of software parallelized with OpenMP will show that in 90% of cases developers use

      #pragma omp parallel for ...

      construct, which could be ported to any GPGPU platform at the compiler level with all the necessary assumptions (such as whether to target the GPU or the CPU, how to manage memory, etc.). Moreover, AMD could extend the OpenMP standard to allow developers to manually tune some GPU-related parameters.

      Therefore, developers could simply recompile existing OpenMP code and achieve more performance with the GPU! And AMD could focus on optimizations without any impact on developers, because the performance of a GPU application always depends strongly on the developer's knowledge of the GPU architecture, which is not trivial.

      OpenCL is a powerful standard, but it is closer to the classic GPU framework OpenGL than to a standard co-processor paradigm.

        • OpenMP and GPU
          Illusio

          Wouldn't the natural approach be for whoever supports OpenMP today to add an OpenCL back-end to it?

           

            • OpenMP and GPU
              godsic

              That may be inefficient. OpenMP with an ATI CAL back-end is more likely to succeed.

              The main advantages are:

              1. Only AMD GPUs need to be supported.

              2. Unnecessary API layers are removed.

                • OpenMP and GPU
                  cgorac

                  OpenMP is just one of many approaches to programming for the shared-memory paradigm, mildly successful at best - lots of people preferred to simply go with POSIX threads before, and a multitude of different solutions to this problem were/are in use then and now, with more and more awareness of GPUs in recent times: Intel TBB, RapidMind, etc., and even more of the same coming soon, like Intel Ct. So OpenMP is far from being a widely accepted API (as is the case with MPI for message-passing programming), and as people didn't seem to like it much in the past, I see no reason why one would try to push it onto the GPU now. The current situation, with a multitude of APIs for programming many-core machines (besides some of the above, there exist CUDA, DirectCompute, etc.), is certainly far from satisfactory. But probably in a couple of years a de-facto standard will emerge; again, I think the chances are very small that it will be OpenMP.

                  Edit: on the other hand, it seems these guys are already onto something similar: http://www.caps-entreprise.com/hmpp.html

                    • OpenMP and GPU
                      godsic

                      I think there is some misunderstanding.

                      1. OpenMP is widely used in HPC to parallelize code within one node, while MPI is used for inter-node parallelization (mostly communication), the so-called "mixed mode". Almost no one uses POSIX threads directly; in many cases OpenMP has a POSIX back-end. Moreover, OpenMP is an HPC standard, while POSIX isn't.

                      2. I am not proposing to replace ATI CAL with OpenMP; I'm just saying that it would be a worthwhile goal for AMD to add an ATI CAL back-end to an OpenMP framework, for example in their Open64 compiler. I think it is quite easy!

                      3. GPU programming will never be easy for typical developers, because the paradigm differs slightly from the CPU one and developers never realize this! Only game developers can effectively ( :-) ) program GPUs now, because there is little difference between ATI CAL, OpenCL and OpenGL!

                       

                        • OpenMP and GPU
                          cgorac

                          1. As for the OpenMP/MPI relationship, I think I was clear enough in my message that they are intended for completely different types of parallel programming, and that I only mentioned MPI as an example of a parallel programming approach that achieved dominance in its own domain (message-passing programming, or if you want, "inter-node parallelization"), which is not the case with OpenMP in its domain (shared-memory programming). Furthermore, maybe you're not aware of it, but people actually do use POSIX threads directly - I used them a lot in pre-OpenMP days (~10 years ago), and would use them any day now. As you said, OpenMP is usually based on POSIX threads on any POSIX platform (and in my part of the HPC universe, I never encountered anyone who thought a non-POSIX platform was worth considering for serious HPC work), so there is no performance advantage in using OpenMP; the only advantages are debatable points like OpenMP being "more readable", "easier to code", etc.

                          2. Maybe possible, but it seems many people (NVIDIA, Intel, some independent developers of wrapper-like libraries, and even ATI/AMD itself) have already put a lot of thought into designing APIs for programming GPUs (and, recently, for programming both GPUs and CPUs), and nobody thought OpenMP was worth consideration. I'm pretty sure this is because they found OpenMP is not actually used that much, and I guess also because they found it doesn't fit existing GPU programming paradigms well. There may be a point in the future, as GPUs seem to be slowly migrating toward resembling multi-core CPU systems (see Larrabee, and recent NVIDIA announcements about the Fermi architecture), when they find OpenMP worth considering, but frankly I doubt it.

                          3. Wrong (that only game developers can program GPUs) - I'm as far from being a game developer as one can be, and I have been programming GPUs for more than 2 years without any problems. This kind of statement may have held true before 2007, but then NVIDIA did a really nice job with CUDA; it was indeed the first genuinely nice API for GPU programming, from a parallel programmer's point of view. And your further statement that there is no difference between OpenGL and OpenCL is just plain silly - have you actually ever tried to write any code for a GPU, or are you saying all of this out of some kind of OpenMP devotion and wishful thinking?

                            • OpenMP and GPU
                              godsic

                              2 years ago I wrote (just for fun) an OpenGL engine implementing a shadow-casting technique, and the performance was at the level of ~2 fps for ~1M polygons (texturing + bump mapping included). At that time I used an HD2600Pro, and I spent nearly 2 weeks on optimization and profiling until AMD GPU PerfStudio showed me optimal results. Afterwards I sent an email to AMD with some suggestions about the GPU architecture that could improve GPGPU performance (branching, TMUs, large 1D samplers (see large-page addressing), the possibility to fetch from memory without samplers, etc.). After that I moved to HPC and forgot about my engine. Then I saw the OpenCL specification, and I noticed that it has A LOT in common with OpenGL.

                              Maybe you can program GPUs, but what about the performance of your applications? I guess it is far from the theoretical limit. And that is not NVIDIA's or CUDA's fault!

                                • OpenMP and GPU
                                  cgorac

                                  Admittedly, the set of applications that map well to the GPU (in the sense that it is possible to drive the GPU's parallel processing units to their peak performance) is limited at the moment, but there still exist many important HPC applications that already map well to the GPU - as for myself, I have worked on dense matrix algebra, FDM solvers, and signal/image processing, and there are many more domains whose applications people are trying to port to the GPU. You're right that GPUs were lacking in some respects compared to the CPU, but I think you should really read up a bit on recent and coming developments, like Larrabee or NVIDIA Fermi - there is lots of exciting stuff (including some of the things you mentioned above) coming: ECC memory, real L1/L2 caches, full-performance double precision support, etc. Some other things you mentioned (like branching, to some extent, and the possibility to fetch from GPU memory without samplers) are actually (if I understood properly what you mean here) already available in CUDA/OpenCL. It just seems to me that you've been working with AMD/ATI hardware/tools only; I'd say AMD is way behind where general-purpose GPU computation is concerned (with recent OpenCL developments things are improving quickly, which is why I started to track it with great interest; still, we're yet to see exactly how AMD proceeds with all this - for example, I read that size-4 vectors will still have to be used to get peak performance from AMD's implementation of OpenCL for the GPU, which is very disappointing, as the scalar programming model that NVIDIA uses is much easier to deal with). So I'd really suggest you read around about other developments - there are many exciting things going on in the field...

                                    • OpenMP and GPU
                                      godsic

                                      Since R6xx, AMD GPUs have been scalar (not vector), so at the ALU level there is no difference in performance between float and vec4. On the other hand, AMD GPUs (as well as NVIDIA's) fetch 128 bits from memory at a time, so that is the main point for optimization.

                                        • OpenMP and GPU
                                          cgorac

                                          Thanks for the clarification regarding scalar/vector operation; any pointer to some docs where this is discussed in further detail?

                                          Also: do you mean 128 bits or 128 bytes above? NVIDIA hardware is able to fetch 128 bytes in a single memory transaction...

                                            • OpenMP and GPU
                                              godsic

                                              Download the ATI Stream SDK v1.4 Beta from the AMD developer site and you will find all the information in ~/doc/.

                                              As for fetching, I mean 128 bits (16 bytes) per clock; NVIDIA probably has a 128-byte cache line? That is a different thing.

                                               

                                               

                                                • OpenMP and GPU
                                                  cgorac

                                                  Hmm, I'll have to check the 1.4 SDK docs then; I started working with the SDK 2.0 beta (because of OpenCL), and most docs there actually contain a "to be written later" notice... In any case, I'm pretty sure I noticed strong suggestions to base code around the float4 datatype in at least several places; see for example slide #15 of Mike Houston's presentation (http://www.khronos.org/developers/library/2009_siggraph_bof_opencl/OpenC_BOF_-_AMD-Siggraph_Aug09.pdf) at this year's SIGGRAPH.

                                                  As for memory transfer width, I think we're probably talking about different things: NVIDIA hardware is indeed able to fetch 128 (successive, aligned) bytes from device memory into GPU registers in a single memory transaction, but this operation has very high latency (400-600 cycles); arranging memory access patterns to realize this kind of access, as well as providing enough computation in the kernel to hide the latency, are the main parts of the strategy for efficiently utilizing NVIDIA hardware.

                                                    • OpenMP and GPU
                                                      godsic

                                                      vec4 (see vectorization) is good practice for achieving maximum performance even on a scalar architecture, because it is easier for the compiler to load-balance such operations; in some purely scalar cases this can be difficult. Try compiling some kernel (or shader) in the Stream KernelAnalyzer and have a look at the disassembly (not IL, the real ISA!). You will see that the compiler tries to rearrange the code to load all 4 scalar units plus the 1 extra unit.

                                                        • OpenMP and GPU
                                                          PGK

                                                          OpenMP v3.0 is the standard for (parallel) HPC on shared-memory systems. I also believe a (post-v3.0) working group has now been assembled to investigate support for accelerator cards. See http://www.openmp.org for further details.

                                                          Or try this, from 2007: http://www2.epcc.ed.ac.uk/msc/dissertations/dissertations-0607/1329987-27i-d07rep1.1.pdf

                                                            • OpenMP and GPU
                                                              cgorac

                                                              This thesis is nice, but where, then, is the code to take OpenMP programs and just rebuild them for the GPU?

                                                              Something being proclaimed a standard by some group doesn't mean much in itself; for real work, de-facto standards are sometimes much more important, and at the moment that is exactly the case for GPGPU programming. For example, CUDA is not a standard, and it probably never will be, but CUDA is multi-platform, supports C and Fortran already, with C++ support and more and more debugging/profiling tools coming, etc. Thus, CUDA is what people are using today for GPGPU programming. When the OpenMP working group finalizes its investigation, and when someone provides a CUDA/OpenCL/... back-end for OpenMP that makes it possible to compile OpenMP codes for GPGPU (and in such a way that the resulting executables are fast), then people will probably look at it; but at the moment, it's just wishful thinking.

                                                              • OpenMP and GPU
                                                                godsic

                                                                to PGK:

                                                                This thesis is just theory.

                                                                 TO AMD:

                                                                  I believe that AMD, as a CPU and GPU vendor, MUST (not just should) develop their own compiler (Windows and UNIX) with all the features described above. The GPU is just one side of the problem; the compiler must also do well on the CPU, and it must decide in which cases to use the GPU or the CPU.

                                                                Without software support and their own innovative technologies (not copying Intel), AMD will soon be destroyed by Intel and NVIDIA. Who wants high theoretical performance that is hard to achieve? And that path can be made easier if AMD releases its own compiler for Windows and UNIX that uses both GPU and CPU power.