
jonathan81
Journeyman III

Gflops with Simple_matmult and optimized_matmult

Hello,

I ran some tests to measure GFLOPS on my Radeon 3870.

In the simple_matmult sample I used the arguments -t -i 10000 and added a line to print Width, Height, Time, and Gflops (the same output as optimized_matmult).

I have with simple_matmult :

Width   Height  Iterations      Time            Gflops
64      64          10000           0.718681        6.794130

with optimized_matmult:

Width   Height  Iterations      Time            Gflops
64      64          10000           1.084024        4.504341

CPU Time : 15.086446

Can you give me your results with these two sample projects?

I find these GFLOPS results poor, and optimized_matmult takes more time than simple_matmult, which is very strange.

Thanks in advance

Jonathan

Edit :

Better results with 1024 width and height and 100 iterations:

simple_matmult :

Width   Height  Iterations      Time            Gflops
1024    1024    100             20.184839       9.908427

optimized_matmult

Width   Height  Iterations      Time            Gflops
1024    1024    100             2.112175        94.689126

 

 

 

Remotion
Journeyman III

Hi,


Here are results on my new Radeon 4870:

with simple_matmult :

Width   Height  Iterations      Time            Gflops         
64      64      10000           0.571594        8.542454 

Width   Height  Iterations      Time            Gflops         
640     640     100             2.423211        20.150175  

 

with optimized_matmult:

Width   Height  Iterations      Time            Gflops         
64      64      10000           1.075558        4.539794  

Width   Height  Iterations      Time            Gflops         
640     640     100             0.852667        57.265160

As you can see, in the second test the results are quite different.

I think this is because, over the 10000 iterations, a very small 64x64 kernel is launched from the CPU each time, and that is slow.

 

Edit

with simple_matmult :

Width   Height  Iterations      Time            Gflops         
1024    1024    100             10.006094       19.987819

with optimized_matmult:

Width   Height  Iterations      Time            Gflops         
1024    1024    100             1.401922        142.661312

Remo


Thanks

Initialization in optimized_matmult is more expensive than in simple_matmult, so with a high iteration count and a small problem size we lose the benefit of the optimization, because each kernel is so small.

 


Originally posted by: Remotion
Here are results on my new Radeon 4870:



Great card

Originally posted by: Remotion
I think this is because every 10000 iteration very small kernel 64*64 will be called from CPU and this is slow.



Yep, those streams are too small for GPUs.

Originally posted by: Remotion
Width   Height  Iterations      Time            Gflops
1024    1024    100             1.401922        142.661312



But 1024 seems to be too small for the 4870 as well. Just to find the peak, could you run a single iteration with 8192x8192?

Thanks in advance.

Here is a test with 4096x4096 using optimized_matmult:

Width   Height  Iterations      Time            Gflops         
4096    4096    100             50.357503       254.182580     

I cannot run 8192x8192 or even 7168x7168; only 6144x6144 works for me.

Width   Height  Iterations      Time            Gflops         
6144    6144    1               2.919084        147.991647

This last test was pretty long and it appeared that my system was hanging, but here is the result:
Width   Height  Iterations      Time            Gflops          
6144    6144    100             168.628834      256.183946   

So it seems that 256 GFLOPS is the peak.


Originally posted by: Remotion
I cannot run 8192x8192 or even 7168x7168; only 6144x6144 works for me.

I suggested 8192x8192 due to the texture size limit and forgot about installed memory...
7168x7168x4x3 = 588MB
6144x6144x4x3 = 432MB

Originally posted by: Remotion
So it seems that 256 Gflops is the peak.


Much higher than any CPU, but still far below the 1.2 TFLOPS maximum. I am trying to find what is limiting it; maybe texture fetches? In that case a more_optimized_matmult would solve it...
Looking at the disassembly, it doesn't look like it's ALU limited.
BTW, looking at the disassembly I found what seems to be a poor job by the optimizer: it generates a good number of MULs and ADDs. Replacing:
accumulator1 += A11.xxxx * B11.xyzw + A11.yyyy * B22.xyzw + A11.zzzz * B33.xyzw + A11.wwww * B44.xyzw;
with
accumulator1 = accumulator1 + A11.xxxx * B11.xyzw + A11.yyyy * B22.xyzw + A11.zzzz * B33.xyzw + A11.wwww * B44.xyzw;
was enough to replace all MULs and some ADDs with MULADDs, even with fast math enabled. I also removed i0 and modified the while condition, but the compiler still generated 3 MOVs at the end which I didn't understand. My loop:

while(index.w < loopVar0)
{
// Fetching values from A
float4 A11 = A1[index.wy];
float4 A22 = A2[index.wy];
float4 A33 = A3[index.wy];
float4 A44 = A4[index.wy];
float4 A55 = A5[index.wy];
float4 A66 = A6[index.wy];
float4 A77 = A7[index.wy];
float4 A88 = A8[index.wy];

// Fetching values from B
float4 B11 = B1[index.xw];
float4 B22 = B2[index.xw];
float4 B33 = B3[index.xw];
float4 B44 = B4[index.xw];

accumulator1 = accumulator1 + A11.xxxx * B11.xyzw + A11.yyyy * B22.xyzw + A11.zzzz * B33.xyzw + A11.wwww * B44.xyzw;
accumulator2 = accumulator2 + A22.xxxx * B11.xyzw + A22.yyyy * B22.xyzw + A22.zzzz * B33.xyzw + A22.wwww * B44.xyzw;
accumulator3 = accumulator3 + A33.xxxx * B11.xyzw + A33.yyyy * B22.xyzw + A33.zzzz * B33.xyzw + A33.wwww * B44.xyzw;
accumulator4 = accumulator4 + A44.xxxx * B11.xyzw + A44.yyyy * B22.xyzw + A44.zzzz * B33.xyzw + A44.wwww * B44.xyzw;
accumulator5 = accumulator5 + A55.xxxx * B11.xyzw + A55.yyyy * B22.xyzw + A55.zzzz * B33.xyzw + A55.wwww * B44.xyzw;
accumulator6 = accumulator6 + A66.xxxx * B11.xyzw + A66.yyyy * B22.xyzw + A66.zzzz * B33.xyzw + A66.wwww * B44.xyzw;
accumulator7 = accumulator7 + A77.xxxx * B11.xyzw + A77.yyyy * B22.xyzw + A77.zzzz * B33.xyzw + A77.wwww * B44.xyzw;
accumulator8 = accumulator8 + A88.xxxx * B11.xyzw + A88.yyyy * B22.xyzw + A88.zzzz * B33.xyzw + A88.wwww * B44.xyzw;

index.w += 1.0f;
// Reducing iterator
//i0 = i0 - 1.0f;
}


eduardoschardong,

I tried your suggestion above ("accumulator1 +=" to "accumulator1 = accumulator1 +") but was not able to see the MULs and ADDs change to MULADDs. Would you let me know how you tested it?


An older version of the compiler... kind of.
I mixed environment variables from the alpha and the beta and got those results; with the current compiler and the right variables I didn't see a single MULADD.

Try using the alpha; if that doesn't work, try mixing the versions like I did.