Hello,
I ran some tests to see the GFlops on my Radeon 3870.
For the simple_matmult sample I used the arguments -t -i 10000, and I added a line to print Width, Height, Time and Gflops (the same output as in optimized_matmult).
I have with simple_matmult :
Width Height Iterations Time Gflops
64 64 10000 0.718681 6.794130
with optimized_matmult:
Width Height Iterations Time Gflops
64 64 10000 1.084024 4.504341
CPU Time : 15.086446
Can you give me your results with these two sample projects?
I find these GFlops results poor, and optimized_matmult takes more time than simple_matmult, which is very weird.
Thanks in advance
Jonathan
Edit:
Better results with 1024 width and height and 100 iterations:
simple_matmult :
Width Height Iterations Time Gflops
1024 1024 100 20.184839 9.908427
optimized_matmult
Width Height Iterations Time Gflops
1024 1024 100 2.112175 94.689126
Hi,
Here are results on my new Radeon 4870:
with simple_matmult :
Width Height Iterations Time Gflops
64 64 10000 0.571594 8.542454
Width Height Iterations Time Gflops
640 640 100 2.423211 20.150175
with optimized_matmult:
Width Height Iterations Time Gflops
64 64 10000 1.075558 4.539794
Width Height Iterations Time Gflops
640 640 100 0.852667 57.265160
As you can see, in the second test the results are quite different.
I think this is because in the 10000-iteration run a very small 64*64 kernel is launched from the CPU on every iteration, and that is slow.
Edit:
with simple_matmult :
Width Height Iterations Time Gflops
1024 1024 100 10.006094 19.987819
with optimized_matmult:
Width Height Iterations Time Gflops
1024 1024 100 1.401922 142.661312
Remo
Thanks
Initialization in optimized_matmult is more expensive than in simple_matmult, so with many iterations and a small size we lose the benefit of the optimization because the kernels are so small.
Originally posted by: Remotion
Here are results on my new Radeon 4870:
Originally posted by: Remotion
I think this is because every 10000 iteration very small kernel 64*64 will be called from CPU and this is slow.
Originally posted by: Remotion
Width Height Iterations Time Gflops
1024 1024 100 1.401922 142.661312
Here is a test with 4096*4096 using optimized_matmult:
Width Height Iterations Time Gflops
4096 4096 100 50.357503 254.182580
I cannot run 8192x8192 or even 7168x7168; only 6144x6144 works for me.
Width Height Iterations Time Gflops
6144 6144 1 2.919084 147.991647
This last test was pretty long, and it appeared that my system was hanging, but here is the result.
Width Height Iterations Time Gflops
6144 6144 100 168.628834 256.183946
So it seems that 256 Gflops is the peak.
Originally posted by: Remotion
I can not run 8192x8192 or even 7168x7168 only 6144x6144 will work for me.
Originally posted by: Remotion
So it seems that 256 Gflops is the peak.
eduardoschardong,
I tried your suggestion above (changing "accumulator1 +=" to "accumulator1 = accumulator1 +"), but I was not able to see the MULs and ADDs change to MULADDs. Would you let me know how you tested it?