8 Replies Latest reply on Apr 11, 2011 8:06 PM by yngvesl

    Vectorization of exp and pow for CPUs (App SDK)

    yngvesl

      Hi.

      I am currently comparing App SDK/OpenCL and OpenMP. The app I'm studying spends most of its time computing exps and pows. With icpc (the Intel C++ compiler) and OpenMP, all of these functions are translated into their vectorized equivalents (SVML). Judging by the benchmark results (the OpenCL version takes roughly 2x the time), the OpenCL compiler does not translate these functions into vectorized versions. Is there any way to achieve this? My experiments have been on Intel processors so far; would it make a difference if I tried AMD CPUs?

      Note that I am using the vector features in App SDK for the code in general, and I've seen that, e.g., the addition examples achieve a 2x speedup. However, this does not help much when maybe 40-50% of the time is spent doing exps and pows.

      If this is not possible, I think it is a huge limitation of the App SDK. One could argue for the portability benefit, but you often need to write different kernels for different devices anyway, and then I could just cook up some OpenMP code that does the same thing under some abstraction.

      Best regards,

      Yngve Sneen Lindal

       

       

        • Vectorization of exp and pow for CPUs (App SDK)
          himanshu.gautam

          yngvesl,

          The AMD APP SDK is capable of translating instructions into vectorized form. In some cases, when it is hard to determine which parts are vectorizable, loop unrolling can be used to make the code vectorizable.

           

            • Vectorization of exp and pow for CPUs (App SDK)
              yngvesl

              Yes, I know that it's capable of that. And yes, unrolling is possible. I don't have a loop in my kernel (and no branches), but I would assume that calling exp on the double2 data type should use a vectorized version of the exp function? See the attached code. As of now, it does not.

               

               

              double2 x = vload2(offset, data);
              double2 result = exp(-0.5*pow((x-mu)/sigma,2.0));
              vstore2(result, i, results);

                • Vectorization of exp and pow for CPUs (App SDK)
                  yngvesl

                  By replacing the pow you see in the code above with a multiplication (x*x), I achieved large speedups. I'm in the process of optimizing, and will update this post as soon as I know more.

                    • Vectorization of exp and pow for CPUs (App SDK)
                      eugenek

                      Out of curiosity, what assembly code does the Intel compiler generate for exp(double2)? CPUs don't have any native instructions to do more than one exp() at a time. Vectorization should not help here.

                        • Vectorization of exp and pow for CPUs (App SDK)
                          yngvesl

                          No, maybe not native instructions, but we always have vector units that can be used to compute approximations of these functions.

                          SVML has a range of different functions utilizing SSE. Take a look here for example: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/index.htm#intref_cls/common/intref_svml_overview.htm

                          There you have e.g. 

                           

                          extern __m128d _mm_exp_pd(__m128d v1);

                          which is the one I'm talking about.

                           

                           

                            • Vectorization of exp and pow for CPUs (App SDK)
                              Jawed

                              You might like to try Intel's OpenCL implementation.

                              I couldn't find statements of the accuracy of the SVML functions. AMD's implementation of double has no statements of accuracy, either (I can't find them, anyway). That's because AMD is currently offering a vendor-specific extension to implement double, rather than cl_khr_fp64, which is the official extension.

                              So you'll have to deal with varying questions of accuracy.

                              AMD's double support is still evolving. Presumably at some point there will be full support for cl_khr_fp64, which would achieve the accuracies stated in section 9.3.9.

                              Instead of pow() you might prefer to use log/exp. See the attached examples. There's obviously a question of accuracy with log/exp. cl_khr_fp64 states <= 3 ULP for each of log and exp, but <= 16 ULP for pow. So you might get better accuracy.

                              The log/exp combination is also faster on AMD GPUs. No idea what happens on CPU though.

                              #pragma OPENCL EXTENSION cl_amd_fp64 : enable

                              kernel void test(global double *A, global double *B)
                              {
                                  int pos = get_global_id(0);
                                  B[pos] = pow(A[pos], A[pos + 1]);
                              }

                              kernel void test_ln_exp(global double *A, global double *B)
                              {
                                  int pos = get_global_id(0);
                                  double C = log(A[pos]);
                                  B[pos] = exp(C * A[pos + 1]);
                              }

                              kernel void test2(global double2 *A, global double2 *B)
                              {
                                  int pos = get_global_id(0);
                                  B[pos] = pow(A[pos], A[pos + 1]);
                              }

                              kernel void test_ln_exp2(global double2 *A, global double2 *B)
                              {
                                  int pos = get_global_id(0);
                                  double2 C = log(A[pos]);
                                  B[pos] = exp(C * A[pos + 1]);
                              }

                                • Vectorization of exp and pow for CPUs (App SDK)
                                  golgo_13

                                  Many of the double math built-in functions for the CPU device don't yet take full advantage of the "PD" SSE2 instructions.  I expect this to change over the next few releases.  I also expect to see advantage taken of AVX+FMA{3,4} on processors that offer these features at some point as well.

                                  By the way, I think yngvesl is taking a huge performance risk betting that the compiler will transform pow(x, 2.0) or powr(x, 2.0), or even pown(x, 2) into x*x.   Which compilers currently offer this transformation?  Why not just write x*x in the first place and avoid the risk?

                                  I'm curious.  How many also expect pow(x, 0.5) or powr(x, 0.5) or rootn(x, 2) to be transformed to sqrt(x), or rootn(x, 3) to be transformed into cbrt(x)?

                                    • Vectorization of exp and pow for CPUs (App SDK)
                                      yngvesl

                                      Hi golgo_13.

                                      You're absolutely right, the code is stupid. But I've already pointed that out. If you look at my 3rd post you see that I replaced this obvious performance killer.

                                      Actually, the Intel compiler v. 12 (and maybe earlier versions) does perform this optimization. I'm not defending writing the code that way, since one obviously should use multiplication instead; I'm just answering your question.