How short vectors are processed by stream processors
Hello,
Could someone explain one fairly simple thing about short vectors (like float4)?
I have a very simple kernel:
kernel void sum(float4 a<>, float4 b<>, out float4 c<>)
{
    c = a + b;
}
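To be clear about what I expect this kernel to compute, here is a plain C sketch of the same element-wise sum done on the CPU. The float4 struct and the names float4_add and sum_stream are my own illustrative stand-ins, not part of Brook+, and the sequential loop says nothing about how the GPU actually schedules the work.

```c
#include <stddef.h>

/* Hypothetical CPU stand-in for the Brook+ float4 type: four packed floats. */
typedef struct { float x, y, z, w; } float4;

/* Component-wise addition: what "c = a + b" means for a single stream element. */
static float4 float4_add(float4 a, float4 b)
{
    float4 c = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return c;
}

/* Illustrative sequential equivalent of the kernel applied to n elements.
   Each iteration performs four float additions (one per component). */
static void sum_stream(const float4 *a, const float4 *b, float4 *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = float4_add(a[i], b[i]);
}
```

So each stream element costs four float additions, and my question is how those four additions map onto the stream processors.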
I have a Radeon HD 4850, which has 800 stream processors.
I do not understand how many floats one stream processor can handle at a time when float4 is used. One float4 is 4 floats. Does that mean one stream processor can handle 4 floats at once?
And does it mean that the HD 4850 can operate on 800 float4 variables simultaneously? That would be 800 x 4 = 3200 floats processed at the same time, which is a lot. I find that hard to believe; processing 800 floats simultaneously seems more plausible.
Can anybody shed some light on this?
Best regards,
Poozon