Stream KernelAnalyzer says that my kernel went from 158 million to 222 million kernels per second when I switched one cos call and one sin call to native_cos and native_sin.
The weird part is that the basic structure of the code is this:
1. setup, including a call to cos(float4)
2. loop doing 256 iterations
3. teardown, including a call to sin(float4)
All I did was change the two trig functions, both of which sit outside the loop. Why does the SKA tool think I've invented the next sliced bread? The wall clock certainly doesn't agree with that.
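
To put rough numbers on why this looks wrong to me, here is a small back-of-the-envelope sketch in plain C. It is not the kernel itself. The helper names are made up for illustration, and it assumes the SKA figures can be read as a simple throughput ratio. The second function is an Amdahl's-law style bound: the minimum share of the original run time the two trig calls would have had to occupy to explain the reported speedup, even if the native versions cost nothing at all.

```c
/* Hypothetical helpers, not part of any SDK or of the SKA tool. */

/* Speedup implied by two throughput readings given in the same units. */
double reported_speedup(double before, double after) {
    return after / before;
}

/* Amdahl-style lower bound: if only part of the run time was changed,
 * that part must have taken at least this fraction of the original time
 * to produce the given overall speedup (the bound assumes the changed
 * part became completely free, which is the best possible case). */
double min_fraction_changed(double speedup) {
    return 1.0 - 1.0 / speedup;
}
```

Calling reported_speedup(158.0, 222.0) gives about 1.41, and min_fraction_changed on that result gives about 0.29. So for the SKA number to be real, the one cos and one sin would need to account for at least roughly 29% of the kernel's time, next to a 256-iteration loop. That seems very unlikely, which is why I suspect the estimate rather than the kernel.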