I am new to OpenCL; my background is CUDA on NVIDIA GPUs.
(Excuse me for using CUDA terms here.)
My understanding was that a warp (32 threads) is executed on the 8 SPs of an SM, with 4 threads going to each SP.
I was going through the online examples published by AMD: http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study_7.aspx
It says that using vectors in OpenCL increases throughput on the GPU. Does that mean 1 thread now goes to 1 SP instead of 4 threads?
Can someone explain how this improves performance at the hardware level?
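
To make the question concrete, here is a rough sketch of what "using vectors" might mean, assuming it refers to OpenCL vector types such as float4. This is plain C that only emulates the two kernel styles on the CPU; it is not real OpenCL and it is not taken from the AMD article. The float4 struct and all function names are hypothetical stand-ins. It shows only the change in work per thread, not what the hardware does with it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's built-in float4 vector type. */
typedef struct { float x, y, z, w; } float4;

/* Scalar style: one work-item (thread) adds one float. */
static void add_scalar_item(size_t gid, const float *a, const float *b, float *out)
{
    out[gid] = a[gid] + b[gid];
}

/* Vector style: one work-item adds one float4, i.e. four floats. */
static void add_vector_item(size_t gid, const float4 *a, const float4 *b, float4 *out)
{
    out[gid].x = a[gid].x + b[gid].x;
    out[gid].y = a[gid].y + b[gid].y;
    out[gid].z = a[gid].z + b[gid].z;
    out[gid].w = a[gid].w + b[gid].w;
}

/* Emulated launch: a loop plays the role of the work-items.
   Returns the number of work-items that were "launched". */
size_t launch_scalar(const float *a, const float *b, float *out, size_t n)
{
    assert(a && b && out);
    for (size_t gid = 0; gid < n; ++gid)
        add_scalar_item(gid, a, b, out);
    return n;
}

/* Same idea, but n_vec counts float4 elements, so the same amount of
   data needs only a quarter as many work-items. */
size_t launch_vector(const float4 *a, const float4 *b, float4 *out, size_t n_vec)
{
    assert(a && b && out);
    for (size_t gid = 0; gid < n_vec; ++gid)
        add_vector_item(gid, a, b, out);
    return n_vec;
}
```

For 8 floats, launch_scalar runs 8 work-items while launch_vector runs 2, each doing four independent additions. What I would like to understand is how the hardware maps that extra per-thread work onto the SPs.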