I'm currently working on my Master's thesis and I need some help. I'm writing about possible acceleration using OpenCL.
Now I want to describe ATI Stream GPUs and NVIDIA CUDA cores, but I'm not sure I have understood them correctly.
A SIMD core contains 16 Stream Processors (SP), each of which has 4 Stream Processing Units (SPU) + 1 SFU + a branch unit + some general-purpose registers. The SFU can also act like a normal SPU. Therefore a Cypress GPU has 5 SPUs * 16 SPs * 20 SIMD cores = 1600 SPUs.
GFLOPS (SP) = cores * 2 (FMA) * clock in GHz
GFLOPS (DP) = (cores / 5) * 2 (2 SPUs = 1 DP) * clock in GHz
Is FMA possible in DP? If so, the DP figure would have to be multiplied by another factor of 2.
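To make the two formulas above concrete, here is a small Python sketch of how I would plug in numbers. It assumes that "cores" means the 1600 SPUs and that the clock is 850 MHz (the engine clock I am assuming for a Cypress-based Radeon HD 5870). The function names and the `dp_fma` switch are my own, purely for illustration; the switch only applies the extra factor of 2 I am asking about and does not claim that it is correct.

```python
# Rough peak-throughput estimate from the formulas above (a sketch, not vendor data).
# Assumptions: 1600 SPUs and an 850 MHz engine clock for a Cypress GPU.

def peak_gflops_sp(spus, clock_mhz):
    """Single precision: every SPU issues one FMA (counted as 2 FLOPs) per cycle."""
    return spus * 2 * clock_mhz / 1000.0

def peak_gflops_dp(spus, clock_mhz, dp_fma=False):
    """Double precision as in my formula: (SPUs / 5) * 2 per cycle.

    dp_fma=True applies the additional factor of 2 that I am unsure about.
    """
    per_cycle = spus / 5 * 2
    if dp_fma:
        per_cycle *= 2
    return per_cycle * clock_mhz / 1000.0

SPUS = 5 * 16 * 20   # SPUs per SP * SPs per SIMD core * SIMD cores
CLOCK_MHZ = 850      # assumed engine clock

print("SP:", peak_gflops_sp(SPUS, CLOCK_MHZ), "GFLOPS")
print("DP:", peak_gflops_dp(SPUS, CLOCK_MHZ), "GFLOPS")
print("DP with extra FMA factor:", peak_gflops_dp(SPUS, CLOCK_MHZ, dp_fma=True), "GFLOPS")
```

The DP result doubles depending on the answer to the FMA question, which is why I want to be sure before putting a number into the thesis.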
Why 5D shaders? I read that this comes from the GPU's original purpose (graphics visualisation) and is due to the 5D vectors (containing the color values RGBA and ?) needed for it. Is this true?
Instruction Level Parallelism:
It is up to the compiler to optimize code for the 5D shader units. But if a kernel contains more independent instructions, more SPUs can be used. Can using 5D vectors within the kernel improve performance?
E.g., consider a kernel with the following instructions:
A = 1+1; B = 1+1; C = 1+1; D = 1+1; E = 1+1;
As far as I understand, this is optimal for 5D units because the instructions are independent of each other, so an SP can execute them all in 1 clock cycle. On a CUDA core these instructions need 5 clock cycles.
If the kernel looks like this:
A = 1+1; B = A+1; C = B+1; ...
Then each SP can only execute 1 instruction per clock cycle.
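To check my reasoning about the two kernels above, I wrote a toy model in Python. It is only a sketch of my understanding, not a real compiler or GPU simulator: it assumes every instruction takes exactly one cycle, that a result can be read in the next cycle, and it ignores everything else (memory latency, register limits, write-after-read hazards). `count_cycles` and the instruction format are made up for this illustration; `width=5` stands for one 5-wide SP and `width=1` for a single scalar CUDA core.

```python
# Toy model of instruction-level parallelism on a VLIW unit (illustration only).
# Each instruction is (destination, list of variables it reads).

def count_cycles(instructions, width):
    """Place each instruction in the earliest cycle where all of its inputs
    are ready and fewer than `width` slots are taken; return the cycle count."""
    ready_at = {}   # variable -> first cycle in which its value can be read
    used = {}       # cycle -> number of slots already filled in that cycle
    total = 0
    for dest, sources in instructions:
        cycle = max([ready_at.get(s, 0) for s in sources], default=0)
        while used.get(cycle, 0) >= width:
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        ready_at[dest] = cycle + 1
        total = max(total, cycle + 1)
    return total

# A = 1+1; B = 1+1; C = 1+1; D = 1+1; E = 1+1;
independent = [("A", []), ("B", []), ("C", []), ("D", []), ("E", [])]
# A = 1+1; B = A+1; C = B+1; D = C+1; E = D+1;
dependent = [("A", []), ("B", ["A"]), ("C", ["B"]), ("D", ["C"]), ("E", ["D"])]

print(count_cycles(independent, width=5))  # 1
print(count_cycles(independent, width=1))  # 5
print(count_cycles(dependent, width=5))    # 5
print(count_cycles(dependent, width=1))    # 5
```

Under these assumptions the 5-wide unit only wins when the instructions are independent; for the dependent chain it needs as many cycles as the scalar core, which is the behaviour I am trying to confirm.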
I got most of my information from http://www.anandtech.com/print/2556