5 Replies Latest reply on Jun 9, 2010 2:24 PM by koarl0815

    ViennaCL: Linear Algebra on GPUs using OpenCL

    koarl0815
      An open source library in C++

      Dear Stream SDK users,

      I am proud to announce the release of the Vienna Computing Library (ViennaCL), an open source (MIT license) scientific computing library written in C++ and based on OpenCL. It provides simple, high-level access to the vast computing resources of parallel architectures such as GPUs and focuses primarily on common linear algebra operations (BLAS levels 1 and 2) and on solving large systems of equations by means of iterative methods. At present, the following iterative solvers are implemented:
        * Conjugate Gradient (CG)
        * Stabilized BiConjugate Gradient (BiCGStab)
        * Generalized Minimum Residual (GMRES)
      An optional ILU preconditioner can be used; it is currently precomputed on the CPU and may therefore not yield an overall performance gain.
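
      To give a flavor of what the first of these solvers does, here is a minimal unpreconditioned conjugate gradient sketch in plain C++. This is an illustration only, not ViennaCL's implementation; the `Vec`/`Mat` typedefs are simple stand-ins, not the library's types:

      ```cpp
      #include <cassert>
      #include <cmath>
      #include <cstdio>
      #include <vector>

      using Vec = std::vector<double>;
      using Mat = std::vector<Vec>;

      double dot(const Vec& a, const Vec& b) {
          double s = 0.0;
          for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
          return s;
      }

      Vec matvec(const Mat& A, const Vec& x) {
          Vec y(x.size(), 0.0);
          for (std::size_t i = 0; i < A.size(); ++i)
              for (std::size_t j = 0; j < x.size(); ++j)
                  y[i] += A[i][j] * x[j];
          return y;
      }

      // Unpreconditioned conjugate gradient for a symmetric positive definite A.
      Vec cg(const Mat& A, const Vec& b, double tol = 1e-10, int max_iter = 1000) {
          Vec x(b.size(), 0.0);
          Vec r = b;            // residual r = b - A*x, with x = 0 initially
          Vec p = r;            // initial search direction
          double rr = dot(r, r);
          for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
              Vec Ap = matvec(A, p);
              double alpha = rr / dot(p, Ap);          // step length
              for (std::size_t i = 0; i < x.size(); ++i) {
                  x[i] += alpha * p[i];
                  r[i] -= alpha * Ap[i];
              }
              double rr_new = dot(r, r);
              double beta = rr_new / rr;               // direction update factor
              for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
              rr = rr_new;
          }
          return x;
      }

      int main() {
          // 2x2 SPD system: [[4,1],[1,3]] x = [1,2]  ->  x = (1/11, 7/11)
          Mat A = {{4.0, 1.0}, {1.0, 3.0}};
          Vec b = {1.0, 2.0};
          Vec x = cg(A, b);
          assert(std::fabs(x[0] - 1.0 / 11.0) < 1e-8);
          assert(std::fabs(x[1] - 7.0 / 11.0) < 1e-8);
          std::printf("x = (%f, %f)\n", x[0], x[1]);
          return 0;
      }
      ```

      The point of CG-style solvers for a GPU library is that each iteration consists only of matrix-vector products, inner products and vector updates, all of which map well onto OpenCL kernels.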

      The library interface is similar to that of the ublas library shipped with Boost. The iterative solvers can be used either on the CPU with ublas types or on the GPU with ViennaCL types. Consequently, only a few code changes are necessary in existing simulators to get the iterative solvers running on the GPU.

      At present, ViennaCL does not provide double precision with Stream SDK 2.1, because not all of the functionality defined in the OpenCL double precision extension is implemented there. Moreover, the current version of ViennaCL uses only GPUs via OpenCL, not CPUs. This will most likely change in the next revision.

      More information can be found on the project homepage located at http://viennacl.sourceforge.net/ (remark for the forum rules: we don't earn anything if you click on that link)

      If you have any questions, feel free to ask them here :-)

      Best regards,
      Karli

        • ViennaCL: Linear Algebra on GPUs using OpenCL
          cjang

          Hi, I have some questions.

          Are the CPU implementations optimized? Specifically, are the calculations blocked to fit in cache? An example of this kind of tuning is the ATLAS BLAS library. And do the CPU implementations use all of the cores? I ask because the usual wisdom is that CPUs have the advantage when arithmetic intensity is low, due to very high effective memory bandwidth.

          I also wonder what performance is like with dense matrices, specifically something like a covariance matrix (i.e., symmetric positive definite and dense).

          • ViennaCL: Linear Algebra on GPUs using OpenCL
            dominik_g

            Hi Karl,

            this looks like an interesting project. I have a few questions:

             Is there a document about the internals of ViennaCL? For example, are you trying to merge several operations into one kernel to reduce overhead, or is the OpenCL code static with a kernel call for each operation? (I only scanned the manual; maybe this is stated in there...)

            You said you're planning to support CPUs in the next release. What exactly is that gonna look like? Will you have different OpenCL kernel versions for CPUs and GPUs and then let the user decide which device to use? Or are you trying to automatically distribute the work across multiple devices?

            Cheers
            Dominik

              • ViennaCL: Linear Algebra on GPUs using OpenCL
                koarl0815

                Hi,

                thanks for your questions and interest :-)

                @cjang:
                The implementations of the solvers are generic, so the same code is used for CPUs (via e.g. ublas) and GPUs (via the built-in types of ViennaCL). Other linear algebra types can also be used after registering them with the wrapper functions (norm_2, inner_prod and the like). Hence, the level of optimization of the CPU implementations depends on the underlying library, e.g. ublas.
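
                The dispatch idea can be sketched in a few lines of plain C++. This is a hypothetical illustration of the technique, not ViennaCL's actual code: the generic algorithm calls free functions such as inner_prod() and norm_2(), and each vector type supplies its own overloads, regardless of where its data lives:

                ```cpp
                #include <cassert>
                #include <cmath>
                #include <vector>

                struct MyVec {            // stand-in for a user-defined linear algebra type
                    std::vector<double> data;
                };

                double inner_prod(const MyVec& a, const MyVec& b) {
                    double s = 0.0;
                    for (std::size_t i = 0; i < a.data.size(); ++i) s += a.data[i] * b.data[i];
                    return s;
                }

                double norm_2(const MyVec& v) { return std::sqrt(inner_prod(v, v)); }

                // Generic algorithm: compiles for any type providing inner_prod/norm_2
                // overloads, whether the data lives in host memory or on a GPU.
                template <typename VectorType>
                double cosine_of_angle(const VectorType& a, const VectorType& b) {
                    return inner_prod(a, b) / (norm_2(a) * norm_2(b));
                }

                int main() {
                    MyVec a{{1.0, 0.0}}, b{{0.0, 2.0}};
                    assert(std::fabs(cosine_of_angle(a, a) - 1.0) < 1e-12);  // parallel vectors
                    assert(std::fabs(cosine_of_angle(a, b)) < 1e-12);        // orthogonal vectors
                    return 0;
                }
                ```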

                As with all calculations, performance for dense matrices depends on the type of operation and the sizes involved (i.e. whether the OpenCL overhead is amortized). I remember getting good performance even for partly serial operations such as LU factorization, but I can't give you exact numbers right now. Benchmark results are soon to follow.
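
                To show what "partly serial" means here, this is a textbook in-place LU factorization without pivoting (Doolittle), in plain C++ for illustration only, not ViennaCL's kernel: the outer loop over k is inherently sequential, while the inner trailing-submatrix update parallelizes well:

                ```cpp
                #include <cassert>
                #include <cmath>
                #include <vector>

                using Mat = std::vector<std::vector<double>>;

                void lu_inplace(Mat& A) {
                    const std::size_t n = A.size();
                    for (std::size_t k = 0; k < n; ++k) {        // serial outer loop
                        for (std::size_t i = k + 1; i < n; ++i) {
                            A[i][k] /= A[k][k];                  // multiplier L(i,k)
                            for (std::size_t j = k + 1; j < n; ++j)
                                A[i][j] -= A[i][k] * A[k][j];    // parallelizable update
                        }
                    }
                }

                int main() {
                    Mat A = {{4.0, 3.0}, {6.0, 3.0}};
                    lu_inplace(A);
                    // Result holds L below the diagonal and U on/above it:
                    // L = [[1,0],[1.5,1]], U = [[4,3],[0,-1.5]]
                    assert(std::fabs(A[1][0] - 1.5) < 1e-12);
                    assert(std::fabs(A[1][1] + 1.5) < 1e-12);
                    return 0;
                }
                ```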

                @dominik_g:

                At present, we don't have a document about the internals apart from the standard Doxygen output, which I admit is not very satisfactory. We try to collect as many operations as possible into a single kernel by using expression templates. For example, the operation
                              vec1 += alpha * vec2
                is fused into a single OpenCL kernel using expression templates. This is fine in many cases. The more complicated operation
                              vec1 = alpha * vec2 - beta * vec3 + gamma * vec4
                still creates a few temporaries, but fewer than naive C++ would. We have already thought about generating compute kernels on the fly from the expression templates, but the cost of compiling a custom OpenCL kernel at runtime is often too high for this to pay off.
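
                For readers unfamiliar with the technique, here is a minimal expression-template sketch in plain C++ (an illustration of the idea, not ViennaCL's code): "alpha * vec2" builds a lightweight proxy instead of a temporary vector, and operator+= evaluates it element-wise in a single pass, which is the CPU analogue of fusing the whole operation into one OpenCL kernel:

                ```cpp
                #include <cassert>
                #include <cstddef>
                #include <vector>

                struct Vec;  // forward declaration

                struct ScaledVec {                 // proxy representing alpha * v
                    double alpha;
                    const Vec& v;
                };

                struct Vec {
                    std::vector<double> data;
                    Vec& operator+=(const ScaledVec& e);  // fused evaluation, defined below
                };

                ScaledVec operator*(double alpha, const Vec& v) { return ScaledVec{alpha, v}; }

                Vec& Vec::operator+=(const ScaledVec& e) {
                    for (std::size_t i = 0; i < data.size(); ++i)
                        data[i] += e.alpha * e.v.data[i];   // one loop, no temporary vector
                    return *this;
                }

                int main() {
                    Vec v1{{1.0, 2.0}};
                    Vec v2{{10.0, 20.0}};
                    v1 += 2.0 * v2;    // evaluated in a single pass
                    assert(v1.data[0] == 21.0 && v1.data[1] == 42.0);
                    return 0;
                }
                ```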

                As for multi-core CPU support in ViennaCL, we aim to provide a fall-back solution (if no GPU is present) in version 1.0.3 (scheduled for no later than next week). In later versions, we plan to provide support for multiple devices. This is, however, really tricky, because operands may reside on different devices.

                Cheers,
                Karli