3 Replies Latest reply on Nov 30, 2011 10:40 AM by eugenek

    Native sine/cosine

    eugenek

      I want to squeeze some performance out of code that does many sines/cosines.

      Consider attached code.

      The compiler from APP SDK 2.5 generates the following ISA code for 6970, with and without -cl-fast-relaxed-math:

            2  z: MUL_e       R0.z,  R0.x,  (0x3DFCB924, 0.1234000027f).x     
            3  y: MUL         R0.y,  PV2.z,  (0x3E22F983, 0.1591549367f).x     
            4  x: SETGT       ____,  |PV3.y|,  (0x42480000, 50.0f).x     
               w: FRACT       ____,  PV3.y     
            5  z: CNDE_INT    ____,  PV4.x,  R0.y,  PV4.w     
            6  y: CNDE        ____,  R0.z,  R0.z,  PV5.z     
            7  x: SIN         R0.x,  PV6.y     
               y: SIN         ____,  PV6.y     
               z: SIN         ____,  PV6.y   

      Evidently, the native SIN instruction on the 6970 really calculates sin(2*pi*x), and its argument must lie within the range of -50 to 50. The instruction set architecture document confirms this but says that the valid range is -256 to 256. No matter.

      Here's what's going on:

      Line 2, I multiply the argument by 0.1234

      Line 3, the compiler divides it by 2 pi

      Lines 4-6, the compiler reduces the argument into the valid range

      Line 7, we finally get to compute the sine.

      What I want is to write code that will generate just three instructions: MUL, FRACT and SIN. The constants could be divided by 2 pi at compile time (in fact, I'm surprised that the compiler does not merge lines 2 and 3), and I'd always execute FRACT because... well, it's always executed as is, so what's the point of having a conditional on top of that?

      Unfortunately, there's no native_sinpi or native_fract in the OpenCL standard. modf(float) and fract(float) generate bulky code.


      Is there any way to do this without rewriting everything in ISA?

      __kernel void test(__global float* p)
      {
          const float a = 0.1234f;
          p[0] = native_sin(a * p[0]);
      }

        • Native sine/cosine
          eugenek

          I'd settle for a way to do this from IL! Trying to do that now, I've managed to get rid of the conditionals and to insert FRACT manually. But the second MUL is still there and I can't find a way to get rid of it. (And, without that, being able to insert FRACT is useless, because FRACT must go after MUL.)

          IL:

              mul r254.x___, r254.xxxx, l9.xxxx
              frc r254.x___, r254.xxxx
              sin r254.x___, r254.xxxx

          ISA:

                2  w: MUL         ____,  R0.x,  (0x3DFCB924, 0.1234000027f).x
                3  z: FRACT       ____,  PV2.w
                4  y: MUL         ____,  PV3.z,  (0x3E22F983, 0.1591549367f).x
                5  x: SIN         R0.x,  PV4.y
                   y: SIN         ____,  PV4.y
                   z: SIN         ____,  PV4.y

           

           

            • Native sine/cosine
              himanshu.gautam

              Hi eugenek,

              I have asked the concerned people about the sine function's range ambiguity.

              Thanks for reporting that.

                • Native sine/cosine
                  eugenek

                  himanshu.gautam,

                  Range ambiguity is the least of my problems! I want to be able to generate a SIN (or COS) instruction without also generating a multiplication by 0.1591549367f, either from CL or from IL. The fundamental problem seems to be that the native SIN instruction in the 6970 instruction set does not agree with the SIN instruction in the IL specification, and the IL compiler is not smart enough. (And it's not even a new problem, because I've checked the Evergreen/5xxx instruction set, and things worked exactly the same way there, too.)

                  AMD has a significant edge over NVIDIA in performance of its single-precision floating-point sine and cosine. The 6970 should be able to peak at 338G sines/second. NVIDIA's 580 can only do 99G. But this compiler silliness goes some way to reverse the advantage.

                  And I also want to be able to generate an instruction FRACT from CL. But that is lower priority.


                  Overall, my sentiments are well expressed by this thread:

                  http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=149806&highlight_key=y&keyword1=direct

                  Ideally, I want to be able to write "a=b*c; d=native_amd_fract(a); e=native_amd_sin(d);" and expect that my code will be compiled into three native instructions without any silly multiplications, conditionals, or other code that the compiler might consider necessary to inject into my bottleneck.