13 Replies Latest reply on May 30, 2009 5:09 PM by anushkagamage

    Problem with Pi calculation

    johnnyb

      Hi,

      I'd like to calculate Pi with Gregory-Leibniz series with this kernel:

      kernel void pi_GregoryLeibniz(float size, out float output<>)
      {
          float2 pos = indexof(output).xy;
          float i = size * pos.x + pos.y;
          output = pow(-1.0f, i) * (4.0f / (1.0f + 2.0f * i));
      }

      and then sum the result with a reduce kernel:

      kernel void pi_sum(float input<>, reduce float output)
      {
          output += input;
      }

      I get only garbage output, -1.#J  -1.#J  -1.#J  -1.#J  -1.#J  -1.#J  -1.#J  -1.#J  -1.#J  -1.#J, when I print the output to the screen. The result of the reduce kernel is always 0.0, regardless of the content of the input stream.

      What's the matter with my code?

      My second question: how can I use multiple kernels in my program? I've tried to use the output of the first kernel as the input for the second, like this:

      pi_GregoryLeibniz(size, outputStream);

      pi_sum(outputStream);

       

      Thanks for any advice.

        • Problem with Pi calculation
          ryta1203

          1) Your "reduce" kernel needs to be declared as such. Right now it's just a regular kernel with a reduce variable as its output; the function (kernel) itself has to be declared reduce. You can look at the reduce example in the SDK.

          2) Your second thing shouldn't be a problem. Fix the first thing first and see if that is still a problem.

            • Problem with Pi calculation
              johnnyb

              Hi ryta,

              thanks for the answer. You're right, replacing 'kernel' with 'reduce' did the trick, but the output is still the same mess. Maybe something's wrong with the output = ... line. If I change the right-hand side to just 'i' (output = i;), everything works fine. Is it a bug in brcc, or am I missing something?

                • Problem with Pi calculation
                  ryta1203

                  1) When you call pi_sum(outputStream) where is the output variable?

                  2) There is something wrong with the "pow()" function call. If you comment that out and look at the outputStream before you call pi_sum, you get valid output. If you include the "pow()" call, you don't. 

                  3) -1^n = -1. If you look at the docs, pow(x,y) = x^y, so you are taking -1 to the i. Is that what you intended? I'm unfamiliar with the algorithm, but if that is what you want and -1^i = -1, just don't do this step.

                   

                   

                    • Problem with Pi calculation
                      johnnyb

                      Hi,

                      1) outputStream is in 'gpu memory', I don't write it out, just simply use the output of 1st kernel as input for the 2nd one.

                      2) Exactly. Something's wrong with pow() function ...

                      3) (-1)^n = -1 if n = 2k+1 (odd), and (-1)^n = 1 if n = 2k (even). I've looked into the docs and pow() is what I want. Unfortunately it doesn't work for me.

                        • Problem with Pi calculation
                          gaurav.garg

                          Does the CPU backend produce the correct output?

                              • Problem with Pi calculation
                                ryta1203

                                johnnyb,

                                   Just do the "pow()" yourself.

                                 

                                kernel void pi_GregoryLeibniz(float size, out float output<>)
                                {
                                    float2 pos = indexof(output).xy;
                                    float i = size * pos.x + pos.y;
                                    int x;
                                    float temp = 1.0f;

                                    // Original line, disabled because pow() misbehaves here:
                                    //output = (pow(-1.0f, i)) * (4.0f / (1.0f + 2.0f * i));

                                    // Flip the sign i times to get (-1)^i without calling pow().
                                    for (x = 0; x < (int)i; x++)
                                    {
                                        temp *= -1.0f;
                                    }
                                    output = temp * (4.0f / (1.0f + 2.0f * i));
                                }

                                 

                                With a size of 1.0f and an outputStream size of 1000 (1000 iterations according to the algorithm), I get 3.14..... (so that's pretty close) as the result using this kernel.

                                Each thread will take the same path, so this shouldn't affect performance much, if at all (this is just an educated guess; someone from AMD could be more precise).

                                This DOES affect your GPR usage, ALU:Fetch ratio, throughput, etc.

                                Your original kernel appears to be much better according to the SKA, but this might do the trick until you figure out what is wrong.
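To sanity-check the "3.14..." result, the same 1000-term sum can be computed on the CPU. This is a minimal plain C reference sketch, not Brook+; the function name leibniz_pi is made up for illustration. Because the series alternates with shrinking terms, the error after 1000 terms is bounded by the first omitted term, 4/2001, which is about 0.002.

```c
/* CPU reference: sum the first n terms of 4 * sum((-1)^k / (2k + 1)).
   The sign is flipped each iteration, so pow() is never called. */
double leibniz_pi(int n)
{
    double sum = 0.0;
    double sign = 1.0;
    int k;

    for (k = 0; k < n; k++) {
        sum += sign * 4.0 / (2.0 * k + 1.0);
        sign = -sign;
    }
    return sum;
}
```

Calling leibniz_pi(1000) gives a baseline to compare against the value returned by the reduce kernel.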



                                  • Problem with Pi calculation
                                    ryta1203

                                    This brings up a question:

                                    1) How does the "pow()" function work?

                                    2) Is it possible to use a reduce function as a solution here, since that is essentially what a pow() is? I can't think of a way right off the top of my head.

                                    • Problem with Pi calculation
                                      johnnyb

                                      Hi ryta,

                                      thanks for the tip. Unfortunately the workaround is much, much slower than pow(). It's interesting that the code works perfectly on the CPU runtime but crashes and burns on CAL. I hope we'll get a _real_ solution for this problem.

                                      BTW thanks for your time :)

                                        • Problem with Pi calculation
                                          ryta1203

                                          Yep, looks like "pow()" is BROKEN. I guess they didn't test all possible cases when rolling this thing out. This needs to be reported as a bug. There is no error with the resulting stream either.

                                          It has a problem when getting a negative number for "x". It doesn't seem to like that much at all.

                                          Out of curiosity, how much slowdown did you see on the GPU side by using the for loop, johnny?

                                            • Problem with Pi calculation
                                              johnnyb

                                              Here are my results:

                                              GFX: Radeon 4870, SKA 1.1.77

                                              Your code:

                                              GPR: 5, Min: 1.20, Max: 77.40, Avg: 17.59, Est. Cycles: 17.59, ALU:Fetch: 17.59, Thread/Clock: 0.91, Throughput: 682 M Threads/Sec

                                              Your code (optimized by me - replacing int for float)

                                              GPR: 5, Min: 1.10, Max: 51.90, Avg: 12.03, Est. Cycles: 12.03, ALU:Fetch: 12.03, Thread/Clock: 1.33, Throughput: 998 M Threads/Sec

                                              Original (my) code:

                                              GPR: 2, Min: 1.00, Max: 1.00, Avg: 1.00, Est. Cycles: 1.00, ALU:Fetch: 1.00, Thread/Clock: 16.00, Throughput: 12000 M Threads/Sec

                                              I can't measure the actual slowdown for you because of the pow() bug, but the assembly code for the loop version looks more complicated than the one with pow().