Application: Transformation of 2D geographical coordinates in double precision
If I have understood it correctly, a thread can process at most 4 floats simultaneously, which would be equivalent to two doubles. In other words: each thread is always processed by 4 ALUs (or whatever they are called here) working together.
My kernels look something like this:
kernel void transform(double xIn<>, double yIn<>, out double xOut<>, out double yOut<>)
{
    xOut = <some function on xIn>;
    yOut = <some function on yIn>;
}
Now, if I used double2 instead of double, would that theoretically (at least for simple kernels) double the kernel throughput, since it would keep all of the thread's ALUs busy?
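To make the question concrete, here is a minimal CPU-side sketch in plain C (not Brook+) of the two layouts I am comparing: two separate double streams versus one packed stream of two-component elements. The `double2` struct, `transform_x`, `transform_y`, and their arithmetic are hypothetical stand-ins, since the real per-coordinate functions are not shown here. The sketch only illustrates the data layout and that both forms compute the same values; it says nothing about whether the hardware would actually run the packed form faster.

```c
#include <stddef.h>

/* Hypothetical stand-in for a two-component double vector type. */
typedef struct {
    double x;
    double y;
} double2;

/* Placeholder transforms (assumptions, not the real coordinate functions). */
static double transform_x(double x) { return 2.0 * x + 1.0; }
static double transform_y(double y) { return 0.5 * y - 3.0; }

/* Scalar layout: one double per input stream, as in the kernel above. */
static void transform_scalar(const double *xIn, const double *yIn,
                             double *xOut, double *yOut, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        xOut[i] = transform_x(xIn[i]);
        yOut[i] = transform_y(yIn[i]);
    }
}

/* Packed layout: each element carries both coordinates of one point. */
static void transform_packed(const double2 *in, double2 *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i].x = transform_x(in[i].x);
        out[i].y = transform_y(in[i].y);
    }
}
```

In the packed form each element holds two doubles, which is the amount of work I assume one thread could handle at once.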