I'm currently working on my Master's thesis and I need some help. I'm writing about possible accelerations using OpenCL.

Now I want to describe ATI Stream GPUs and NVIDIA CUDA cores, but I'm not sure if I really got it right.

ATI GPUs:

A SIMD core contains 16 stream processors (SPs), each of which has 4 stream processing units (SPUs) + 1 SFU + a branch unit + some general-purpose registers. The SFU can also act as a normal SPU. Therefore a Cypress GPU has 5 SPUs * 16 SPs * 20 SIMD cores = 1600 SPUs.

Theoretical FLOPS:

FLOPS (SP) = cores * 2 (FMA) * GHz

FLOPS (DP) = cores/5*2 (2 SPUs = 1DP) * GHz

Is FMA possible in double precision? Because that would mean FLOPS (DP) has to be multiplied by 2.

5D-Shader-Unit:

Why 5D shaders? I read that this comes from the GPU's original purpose (graphics visualisation) and that it is due to the 5D vectors (containing the colour values RGBA and ?) needed for this. Is this true?

Instruction Level Parallelism:

It is up to the compiler to optimize code for the 5D shader units. But if there are more independent instructions within a kernel, more SPUs can be used. Can the use of 5D vectors within the kernel improve performance?

E.g. consider a kernel with the following instructions:

A = 1+1; B = 1+1; C = 1+1; D = 1+1; E = 1+1;

As far as I understand, this is optimal for the 5D units because the instructions are independent of each other, so they can be executed in 1 clock cycle by one SP. On a CUDA core these instructions need 5 clock cycles.

If the kernel looks like this:

A = 1+1; B = A+1; C = B+1; ...

Each SP can only execute 1 instruction per clock cycle, because each instruction depends on the result of the previous one.
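Regarding whether 5D vectors in the kernel can help: a sketch of what I mean, written as an OpenCL C kernel (the kernel and buffer names are made up for illustration, and this is only my understanding of how the compiler packs the VLIW slots):

```c
/* OpenCL C kernel sketch, not host code. Buffer names (a, b, out)
   and the float4 layout are illustrative assumptions. */
__kernel void vec_add(__global const float4 *a,
                      __global const float4 *b,
                      __global float4 *out)
{
    size_t i = get_global_id(0);

    /* The four component additions are independent of each other,
       so a 5D shader unit could pack them into one VLIW bundle,
       like the A..E example above. A scalar dependency chain
       (A = 1+1; B = A+1; ...) offers nothing to pack. */
    out[i] = a[i] + b[i];
}
```

My assumption is that explicit float4 arithmetic makes the independence obvious to the compiler, whereas with scalar code it has to discover the ILP itself; I'd be glad to have this confirmed or corrected.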

I got most of the information from http://www.anandtech.com/print/2556

I drew a diagram of the ATI Cypress SIMD core architecture.

An image can be found at

http://noxnet.at/wp-content/ati_cypress_simd_core2.jpg

Could anyone please look at it and correct me if something is wrong?