kbrafford
Adept II

Wow..what did I really do when I switched to native_sin and native_cos?

Stream KernelAnalyzer (SKA) says that my kernel went from 158 million kernels per second to 222 million kernels per second when I switched one cos and one sin to native_cos and native_sin.

The weird part is that the basic structure of the code is this:

1. setup, including a call to cos(float4)

2. loop doing 256 iterations

3. teardown, including a call to sin(float4)

All I did was change the two trig functions.  Why does the SKA tool think I've invented a new sliced bread?  The wall clock certainly doesn't agree with that.

--Keith Brafford

MicahVillmow
Staff

sin/cos are required to meet the OpenCL specification's floating-point accuracy (ULP) requirements. The native_* versions have no such constraints and compile down to a single hardware instruction. In most cases, the regular trig functions are not single-instruction functions.
kbrafford
Adept II

Would I be better off calculating the cos and sin values on the CPU and passing them into the kernels?

I am having a hard time understanding why the SKA tool predicts such a massive speed-up that simply doesn't happen in reality.

MicahVillmow
Staff

If your kernel is not ALU bound, reducing the amount of ALU work needed to compute the results won't speed it up.
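A toy model of that statement, as a rough sketch in plain C: assume ALU work and memory traffic overlap perfectly, so the slower of the two sets the kernel time. Both function names and the overlap assumption are mine, and the time units are arbitrary; real hardware is messier.

```c
#include <assert.h>

/* Toy model: with perfect overlap, the slower of ALU work and memory
 * traffic determines the kernel time (arbitrary time units). */
static double kernel_time(double alu_time, double mem_time)
{
    assert(alu_time >= 0.0 && mem_time >= 0.0);
    return alu_time > mem_time ? alu_time : mem_time;
}

/* Speedup obtained by reducing ALU time while memory time stays fixed.
 * Assumes the resulting kernel time is non-zero. */
static double speedup_from_alu_cut(double alu_before, double alu_after,
                                   double mem_time)
{
    return kernel_time(alu_before, mem_time) /
           kernel_time(alu_after, mem_time);
}
```

With made-up numbers, `speedup_from_alu_cut(4.0, 2.0, 8.0)` gives 1.0: halving the ALU time changes nothing when memory takes 8 units. Only when the ALU side dominates, as in `speedup_from_alu_cut(8.0, 4.0, 2.0)`, does the full 2x show up.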
eugenek
Journeyman III

native_cos with float argument is 1 hardware instruction. cos with float argument is around 200 instructions if the argument is less than 3140000, and closer to 300 instructions otherwise.
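Those instruction counts can be put next to the kernel structure from the original post (one cos in setup, a 256-iteration loop, one sin in teardown) with a bit of arithmetic. The sketch below is plain C; `trig_share` is a made-up helper, the per-iteration loop cost is a pure guess because the thread never shows the loop body, and it assumes sin costs about the same as cos.

```c
#include <assert.h>

/* Fraction of one work-item's executed instructions spent in the two
 * trig calls, for a kernel shaped like: one trig call in setup, a loop,
 * one trig call in teardown.
 * trig_cost      : instructions per trig call (about 200 for cos, 1 for
 *                  native_cos, per the figures quoted above)
 * loop_iters     : loop trip count (256 in the original post)
 * loop_body_cost : instructions per iteration (hypothetical, unknown) */
static double trig_share(long trig_cost, long loop_iters, long loop_body_cost)
{
    assert(trig_cost >= 0 && loop_iters >= 0 && loop_body_cost >= 0);
    long trig = 2 * trig_cost;
    long total = trig + loop_iters * loop_body_cost;
    assert(total > 0);
    return (double)trig / (double)total;
}
```

If the loop body were, say, 20 instructions, `trig_share(200, 256, 20)` comes out at roughly 7%, so even a free cos and sin could only shave that much off the executed ALU work. A different loop cost changes the number, but with 256 iterations the two trig calls are unlikely to dominate.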

 

Originally posted by: kbrafford

Would I be better off calculating the cos and sin values on the CPU and passing them into the kernels?

Not really. The CPU takes an awful lot of time to compute sin/cos as well. The GPU can manage about 4 billion cosines per second. The CPU can do 200 million cosines per second.

Jawed
Adept II

Stand back in awe of the attached code, which results in 142 ALU instructions, 10 fetches, 3 writes, 62 GPRs and 2 scratch registers on HD5870:

kernel void test(global float *A, global float *B)
{
    int pos = get_global_id(0);
    B[pos] = powr(A[pos], A[pos + 1]);
}

himanshu_gautam
Grandmaster

Jawed,

Where did you get those numbers? I tried your code in SKA and got 178 instruction clauses, and it uses 11 GPRs. With the native version it is 11 clauses and 3 GPRs. The number of reads is 2 and the number of writes is 1 in both cases, as expected.

Jawed
Adept II

SKA 1.7 with SDK 2.3 installed.

I suspect you have a different SDK installed.

If I use native_powr() then I get 10 ALUs and 3 GPRs (I suspect you meant 10 not 11).

Why isn't native_pow() defined?

eugenek
Journeyman III

Originally posted by: himanshu.gautam

Jawed,

Where did you get those numbers? I tried your code in SKA and got 178 instruction clauses, and it uses 11 GPRs. With the native version it is 11 clauses and 3 GPRs. The number of reads is 2 and the number of writes is 1 in both cases, as expected.

Maybe you're compiling for a different GPU? I also see 62 GPRs and 10 fetches on a 5870, 63 GPRs and 8 fetches on a 6970.

 

Your CPU may be doing something similar when you ask it to compute a power; it's just not so blatantly obvious. On an Intel Core Duo, the two instructions that do the bulk of the job inside pow() take 165 clock ticks.

himanshu_gautam
Grandmaster

Can you post the IL/ISA generated without the native function?
