Archives Discussions

jonathan81 · ‎06-02-2008

Hello,

I have made my first project with Brook on my ATI Radeon HD 3870.

It computes the sum of two matrices (the same kind of code as in the samples).

I have one version using float and one version using float4.

My two kernels are:

kernel void sum(float a<>, float b<>, out float c<>)
{
    c = exp(a) + exp(b);
}

kernel void sum(float4 a<>, float4 b<>, out float4 c<>)
{
    c = exp(a) + exp(b);
}

With 10000 iterations of the kernel, I don't see the float4 version running any faster; the two times are approximately the same (the CPU version takes about 10 times as long as the GPU).

Thanks a lot

regards

Jonathan

michael_chu · ‎06-10-2008

Hi Jonathan,

There are two possibilities in this case:
- First, you are using exp(), which is a transcendental function. As a result, you are confined to the T unit of the thread processors. This means float4 won't get you more parallelism within a single thread processor, because you have run out of functional units that can process that instruction.
- Second, the compiler will sometimes do this kind of transformation for you (using float4 instead of float). It depends on how easy it is for the compiler to discover the opportunity.

I suspect it is the first case that is your bottleneck.
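As a rough sketch of the first point, the arithmetic can be worked through with a deliberately simplified model. Everything in it is an assumption made for illustration: each thread processor is modeled as five lanes issuing one operation per cycle, only one of those lanes (the T unit) can execute a transcendental, each exp() counts as a single T-unit operation, and memory traffic is ignored. The function name min_cycles is hypothetical and not part of Brook.

```python
def min_cycles(transcendental_ops, simple_ops, lanes=5):
    """Lower bound on cycles for one kernel invocation on one thread processor.

    Assumed model (not measured hardware behaviour): `lanes` ALU lanes,
    one op per lane per cycle, and only a single lane (the T unit)
    able to execute transcendental ops such as exp().
    """
    # Every transcendental has to go through the single T unit.
    t_bound = transcendental_ops
    # All ops together cannot issue faster than `lanes` per cycle.
    total_ops = transcendental_ops + simple_ops
    lane_bound = -(-total_ops // lanes)  # ceiling division
    return max(t_bound, lane_bound)


# c = exp(a) + exp(b): 2 exp + 1 add per element.
float_cycles = min_cycles(2, 1)    # float kernel, 1 element per invocation
float4_cycles = min_cycles(8, 4)   # float4 kernel, 4 elements per invocation

print("exp kernel, float :", float_cycles / 1, "cycles per element")
print("exp kernel, float4:", float4_cycles / 4, "cycles per element")

# For contrast, c = a + b has no transcendentals, so the other lanes can help.
print("add kernel, float :", min_cycles(0, 1) / 1, "cycles per element")
print("add kernel, float4:", min_cycles(0, 4) / 4, "cycles per element")
```

Under these assumptions the exp kernel costs the same number of cycles per element in both versions, because the T unit is the limit either way, while a plain add kernel would benefit from float4. Real timings also depend on memory bandwidth and on how the compiler schedules the instructions, so this only illustrates the reasoning.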

Michael.

jonathan81 · ‎06-10-2008

Thanks a lot

However, when I compare the two sample projects simple_matmult and optimized_matmult (which uses float4 and four blocks) over 100 iterations, simple_matmult is faster than optimized_matmult. That is very strange.

Thanks

regards

Jonathan

Speed up with Float4