Archives Discussions

johnnyb · ‎02-21-2009

Hi,

I'd like to calculate Pi with the Gregory-Leibniz series, using this kernel:

kernel void pi_GregoryLeibniz(float size, out float output<>)
{
    float2 pos = indexof(output).xy;
    float i = size * pos.x + pos.y;
    output = pow(-1.0f, i) * (4.0f / (1.0f + 2.0f * i));
}

and then sum the result with a reduce kernel:

kernel void pi_sum(float input<>, reduce float output)
{
    output += input;
}

When I print the output to the screen, I just get a mess: -1.#J -1.#J -1.#J -1.#J -1.#J -1.#J -1.#J -1.#J -1.#J -1.#J ... The result of the reduce kernel is always 0.0, regardless of the content of the input stream.

What's the matter with my code?

My second question: how can I use multiple kernels in my program? I've tried to use the output of the first kernel as the input for the second, like this:

pi_GregoryLeibniz(size, outputStream);

pi_sum(outputStream);

Thanks for any advice.

ryta1203 · ‎02-21-2009

1) Your "reduce" kernel needs to be declared as such. Right now it's just a regular kernel with a reduce output. The function (kernel) itself has to be declared with reduce. You can look at the reduce example in the SDK.

2) Your second point shouldn't be a problem. Fix the first issue and see whether it is still a problem.

johnnyb · ‎02-22-2009

Hi ryta,

Thanks for the answer. You're right, replacing 'kernel' with 'reduce' did the trick, but the output is the same mess. Maybe something's wrong with the output = ... line. If I change the right-hand side to just 'i' (output = i;), everything works fine. Is it a bug in brcc, or am I missing something?

ryta1203 · ‎02-22-2009

1) When you call pi_sum(outputStream), where is the output variable?

2) There is something wrong with the "pow()" function call. If you comment that out and look at the outputStream before you call pi_sum, you get valid output. If you include the "pow()" call, you don't.

3) -1^n = -1. If you look at the docs, pow(x,y) = x^y, so you are taking -1 to the i. Is that what you intended? I'm unfamiliar with the algorithm, but if that is what you want and -1^i = -1, just don't do this step.

johnnyb · ‎02-22-2009

Hi,

1) outputStream is in GPU memory. I don't write it out; I simply use the output of the 1st kernel as input for the 2nd one.

2) Exactly. Something's wrong with the pow() function ...

3) (-1)^n = -1 if n = 2k+1 (odd), and (-1)^n = 1 if n = 2k (even). I've looked at the docs and pow() is what I want. Unfortunately it doesn't work for me.
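Since the sign of each term depends only on whether the index is odd or even, it can be picked from the parity of i without calling pow() at all. The following is a minimal CPU-side sketch in Python, not Brook+ code; the function name leibniz_term is hypothetical and used only for illustration.

```python
def leibniz_term(i: int) -> float:
    """Return the i-th Gregory-Leibniz term, 4 * (-1)^i / (2i + 1).

    The sign is chosen from the parity of i, so no pow() call is needed.
    """
    sign = 1.0 if i % 2 == 0 else -1.0  # (-1)^i for a non-negative integer i
    return sign * 4.0 / (1.0 + 2.0 * i)


# First four terms: +4, -4/3, +4/5, -4/7
print([leibniz_term(i) for i in range(4)])
```

Whether the same parity test can be written efficiently inside a Brook+ kernel (for example with a floating-point modulo) is an assumption that would need checking against the Brook+ documentation.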

gaurav_garg · ‎02-22-2009

Does the CPU backend produce correct output?

johnnyb · ‎02-22-2009

Yes.

ryta1203 · ‎02-22-2009

johnnyb,

Just do the "pow()" yourself.

kernel void pi_GregoryLeibniz(float size, out float output<>)
{
    float2 pos = indexof(output).xy;
    float i = size * pos.x + pos.y;
    int x;
    float temp = 1.0f;

    //output = (pow(-1.0f, i)) * (4.0f / (1.0f + 2.0f * i));

    // Compute (-1)^i by repeated multiplication instead of pow()
    for (x = 0; x < (int)i; x++)
    {
        temp *= -1.0f;
    }

    output = temp * (4.0f / (1.0f + 2.0f * i));
}

With a size of 1.0f and an outputStream size of 1000 (1000 iterations according to the algorithm), I get 3.14... as the result using this kernel, so that's pretty close.

Each thread will take the same path, so this shouldn't affect performance much, if at all (this is just an educated guess; someone from AMD could be more precise).

This DOES affect your GPR usage, ALU:Fetch ratio, throughput, etc.

Your original kernel appears to be much better according to the KSA, but this might do the trick until you figure out what is wrong.
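As a CPU-side cross-check of the 1000-term figure quoted above, the same series can be summed in plain Python. This is only a reference sketch in double precision, not the Brook+ kernel; the name leibniz_pi is hypothetical, and a single-precision GPU result may differ in the last digits.

```python
import math


def leibniz_pi(n_terms: int) -> float:
    """Sum the first n_terms of the Gregory-Leibniz series for pi."""
    total = 0.0
    sign = 1.0  # flips each iteration, so pow() is never needed
    for i in range(n_terms):
        total += sign * 4.0 / (1.0 + 2.0 * i)
        sign = -sign
    return total


approx = leibniz_pi(1000)
# Print the partial sum and how far it is from math.pi
print(approx, math.pi - approx)
```

The series converges slowly: the error after N terms is roughly 1/N, so 1000 terms should land about 0.001 below pi.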

ryta1203 · ‎02-22-2009

This brings up a question:

1) How does the "pow()" function work?

2) Is it possible to use a reduce function as a solution here, since that is essentially what a pow() is? I can't think of a way right off the top of my head.

johnnyb · ‎02-22-2009

Hi ryta,

Thanks for the tip. Unfortunately the workaround is much, much slower than pow(). It's interesting that the code works perfectly on the CPU runtime and crashes and burns on CAL. I hope we'll get a _real_ solution for this problem.

BTW thanks for your time 🙂

ryta1203 · ‎02-22-2009

Yep, looks like "pow()" is BROKEN. I guess they didn't test all possible cases when rolling this thing out. This needs to be reported as a bug. No error is reported on the resulting stream either.

It has a problem when getting a negative number for "x". It doesn't seem to like that much at all.

Out of curiosity, how much slowdown did you see on the GPU side by using the for loop, johnny?
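On the negative "x" observation above: one plausible explanation, which nobody in this thread confirms for Brook+/CAL, is that GPU toolchains often compute pow(x, y) as exp2(y * log2(x)). Under that assumption, a negative x produces NaN from the logarithm, and the NaN propagates to the result. The Python sketch below only models that behavior; gpu_log2 and pow_via_log are hypothetical names, not real Brook+ or CAL functions.

```python
import math


def gpu_log2(x: float) -> float:
    """Hypothetical model of a hardware log2: NaN or -inf instead of an exception."""
    if x < 0.0:
        return math.nan  # log of a negative number is undefined
    if x == 0.0:
        return -math.inf
    return math.log2(x)


def pow_via_log(x: float, y: float) -> float:
    """Assumed lowering of pow(x, y) to exp2(y * log2(x))."""
    return 2.0 ** (y * gpu_log2(x))


print(pow_via_log(2.0, 10.0))   # positive base behaves as expected
print(pow_via_log(-1.0, 3.0))   # negative base yields NaN under this model
```

If this is what happens on the GPU backend, it would explain why the CPU backend (which can use a full C library pow()) gives correct results while the CAL backend prints NaN-like values.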

johnnyb · ‎02-22-2009

Here are my results:

GFX: Radeon 4870, SKA 1.1.77

Your code:

GPR: 5, Min: 1.20, Max: 77.40, Avg: 17.59, Est. Cycles: 17.59, ALU:Fetch: 17.59, Thread/Clock: 0.91, Throughput: 682 M Threads/Sec

Your code (optimized by me, replacing int with float):

GPR: 5, Min: 1.10, Max: 51.90, Avg: 12.03, Est. Cycles: 12.03, ALU:Fetch: 12.03, Thread/Clock: 1.33, Throughput: 998 M Threads/Sec

Original (my) code:

GPR: 2, Min: 1.00, Max: 1.00, Avg: 1.00, Est. Cycles: 1.00, ALU:Fetch: 1.00, Thread/Clock: 16.00, Throughput: 12000 M Threads/Sec

I can't measure it for you because of the pow() bug, but the assembly code looks more complicated than with pow().

ryta1203 · ‎02-22-2009

Yes, I'm aware of the KSA stats, as I mentioned that earlier.

HOWEVER, does the KSA take into account the code for the actual "pow()" function?

anushkagamage · ‎05-30-2009

Hi johnnyb,

I am working on a similar problem using an array. I finally got the output stream I wanted, but then I ran into a problem when I tried to get the max, min, and avg values of that stream. Could you tell me how to get them, please?

Thank you,

Anushka.

Problem with Pi calculation