Hi,
I'd like to calculate Pi with the Gregory-Leibniz series using this kernel:
kernel void pi_GregoryLeibniz(float size, out float output<>)
{
    float2 pos = indexof(output).xy;
    float i = size * pos.x + pos.y;
    output = pow(-1.0f, i) * (4.0f / (1.0f + 2.0f * i));
}
and then sum it with a reduce kernel:
kernel void pi_sum(float input<>, reduce float output)
{
output += input;
}
All I get is garbage: -1.#J -1.#J -1.#J -1.#J -1.#J -1.#J -1.#J -1.#J -1.#J -1.#J when I print the output to the screen. The result of the reduce kernel is always 0.0, regardless of the content of the input stream.
What's the matter with my code?
My second question: how can I use multiple kernels in my program? I've tried using the output of the first kernel as the input for the second, like this:
pi_GregoryLeibniz(size, outputStream);
pi_sum(outputStream);
Thanks for any advice.
1) Your "reduce" kernel needs to be declared as such. Right now it's just a regular kernel with a reduce output. The function (kernel) itself has to be declared with the reduce keyword. You can look at the reduce example in the SDK.
2) Your second question shouldn't be a problem. Fix the first issue and see whether it is still a problem.
Hi ryta,
Thanks for the answer. You're right, replacing 'kernel' with 'reduce' did the trick, but the output is the same mess. Maybe something's wrong with the output = ... line. If I change the right-hand side to just 'i' (output = i;), everything works fine. Is it a bug in brcc, or am I missing something?
1) When you call pi_sum(outputStream) where is the output variable?
2) There is something wrong with the "pow()" function call. If you comment that out and look at the outputStream before you call pi_sum, you get valid output. If you include the "pow()" call, you don't.
3) -1^n = -1. If you look at the docs, pow(x,y) = x^y, so you are taking -1 to the i. Is that what you intended? I'm unfamiliar with the algorithm, but if that is what you want and -1^i = -1, just don't do this step.
Hi,
1) outputStream is in 'gpu memory', I don't write it out, just simply use the output of 1st kernel as input for the 2nd one.
2) Exactly. Something's wrong with pow() function ...
3) -1^n = -1, if n = 2k+1(odd) and -1^n = 1, if n = 2k (even). I've looked into docs and pow()'s what I want. Unfortunately it doesn't work for me.
Does the CPU backend produce correct output?
Yes.
johnnyb,
Just do the "pow()" yourself.
kernel void pi_GregoryLeibniz(float size, out float output<>)
{
    float2 pos = indexof(output).xy;
    float i = size * pos.x + pos.y;
    int x;
    float temp = 1.0f;
    //output = (pow(-1.0f, i)) * (4.0f / (1.0f + 2.0f * i));
    // Flip the sign i times instead of calling pow(-1.0f, i).
    for (x = 0; x < (int)i; x++)
    {
        temp *= -1.0f;
    }
    output = temp * (4.0f / (1.0f + 2.0f * i));
}
With a size of 1.0f and an outputStream size of 1000 (1000 iterations according to the algorithm), I get 3.14..... (so that's pretty close) as the result using this kernel.
Each thread will take the same path, so this shouldn't (this is just an educated guess, someone from AMD could be more precise) affect performance much, if at all.
This DOES affect your GPR usage, ALU:Fetch ratio, Throughput, etc...
Your original kernel appears to be much better according to the SKA, but this might do the trick until you figure out what is wrong.
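As a CPU-side cross-check of the 3.14... figure quoted above for 1000 terms, here is a small reference in plain C (not Brook+). The name leibniz_pi is hypothetical, and the sum is accumulated in double, which the GPU kernels in this thread do not do, so the last digits may differ from what the reduce kernel returns.

```c
/* Partial sum of the Gregory-Leibniz series:
   4 * sum_{k=0}^{n-1} (-1)^k / (2k + 1). */
static double leibniz_pi(int n)
{
    double sum = 0.0;
    double sign = 1.0;
    for (int k = 0; k < n; ++k) {
        sum += sign * 4.0 / (2.0 * k + 1.0);
        sign = -sign;  /* alternate +, -, +, ... without pow() */
    }
    return sum;
}
```

With n = 1000 this gives roughly 3.1406. The series converges slowly: the error after n terms is on the order of 1/n.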
This brings up a question:
1) How does the "pow()" function work?
2) Is it possible to use a reduce function as a solution here, since that is essentially what a pow() is? I can't think of a way right off the top of my head.
Hi ryta,
Thanks for the tip. Unfortunately the workaround is much, much slower than pow(). It's interesting that the code works perfectly on the CPU runtime and crashes and burns on CAL. I hope we'll get a _real_ solution for this problem.
BTW thanks for your time 🙂
Yep, looks like "pow()" is BROKEN. I guess they didn't test all possible cases when rolling this thing out. This needs to be reported as a bug. There is no error with the resulting stream either.
It has a problem when getting a negative number for "x". It doesn't seem to like that much at all.
Out of curiosity, how much slowdown did you see on the GPU side by using the for loop, johnny?
Here are my results:
GFX: Radeon 4870, SKA 1.1.77
Your code:
GPR: 5, Min: 1.20, Max: 77.40, Avg: 17.59, Est. Cycles: 17.59, ALU:Fetch: 17.59, Thread/Clock: 0.91, Throughput: 682 M Threads/Sec
Your code (optimized by me, replacing int with float):
GPR: 5, Min: 1.10, Max: 51.90, Avg: 12.03, Est. Cycles: 12.03, ALU:Fetch: 12.03, Thread/Clock: 1.33, Throughput: 998 M Threads/Sec
Original (my) code:
GPR: 2, Min: 1.00, Max: 1.00, Avg: 1.00, Est. Cycles: 1.00, ALU:Fetch: 1.00, Thread/Clock: 16.00, Throughput: 12000 M Threads/Sec
I can't measure it for you because of the pow() bug, but the assembly code looks more complicated than with pow().
Yes, I'm aware of the SKA stats, as I mentioned earlier.
HOWEVER, does the SKA take into account the code for the actual "pow()" function?
Hi johnnyb;
I am working on a similar problem using an array. I finally got the output stream I wanted, but then I ran into a problem when I tried to get the max, min and avg values of that stream. Could you tell me how to get them, please?
Thank you,
Anushka.