    Speed up with Float4



      I have built my first project with Brook on my ATI Radeon HD 3870.

      It sums two matrices (similar to the code given in the samples).

      I have one version that uses float and one that uses float4.

      My two kernels are:

      kernel void sum(float a<>, float b<>, out float c<>)
      {
          c = exp(a) + exp(b);
      }

      kernel void sum(float4 a<>, float4 b<>, out float4 c<>)
      {
          c = exp(a) + exp(b);
      }

      With 10000 iterations of the kernel, the float4 version is no faster: both versions take approximately the same time. (On the CPU, the same computation takes about 10 times as long as on the GPU.)

          Hi Jonathan,

          There are two possibilities in this case:
          - First, you are using exp(), which is a transcendental function. As a result, you are confined to the t unit of the thread processors, so float4 does not give you more parallelism within a single thread processor: you have run out of functional units that can process the instruction.
          - Second, the compiler will sometimes do this kind of transformation for you (using float4 instead of float). It depends on how easy it is for the compiler to discover the opportunity.

          I suspect it is the first case that is your bottleneck.