
jonathan81
Journeyman III

Gflops with Simple_matmult and optimized_matmult

Hello,

I ran some tests to measure GFLOPS on my Radeon 3870.

In the simple_matmult sample I used the arguments -t -i 10000 and added a line to print Width, Height, Time, and Gflops (the same output as optimized_matmult).

I have with simple_matmult :

Width   Height  Iterations      Time            Gflops
64      64          10000           0.718681        6.794130

with optimized_matmult:

Width   Height  Iterations      Time            Gflops
64      64          10000           1.084024        4.504341

CPU Time : 15.086446

Can you give me your results with these two sample projects?

I find these GFLOPS results poor, and optimized_matmult takes more time than simple_matmult, which is very strange.

Thanks in advance

Jonathan

Edit :

Better results with 1024 width and height and 100 iterations:

simple_matmult :

Width   Height  Iterations      Time            Gflops
1024    1024    100             20.184839       9.908427

optimized_matmult

Width   Height  Iterations      Time            Gflops
1024    1024    100             2.112175        94.689126

 

 

 

Remotion
Journeyman III

Hi,


Here are results on my new Radeon 4870:

with simple_matmult :

Width   Height  Iterations      Time            Gflops         
64      64      10000           0.571594        8.542454 

Width   Height  Iterations      Time            Gflops         
640     640     100             2.423211        20.150175  

 

with optimized_matmult:

Width   Height  Iterations      Time            Gflops         
64      64      10000           1.075558        4.539794  

Width   Height  Iterations      Time            Gflops         
640     640     100             0.852667        57.265160

As you can see, in the second test the results are quite different.

I think this is because, over the 10000 iterations, a very small 64x64 kernel is launched from the CPU each time, and that is slow.

 

Edit

with simple_matmult :

Width   Height  Iterations      Time            Gflops         
1024    1024    100             10.006094       19.987819

with optimized_matmult:

Width   Height  Iterations      Time            Gflops         
1024    1024    100             1.401922        142.661312

Remo


Thanks

Initialization in optimized_matmult is more expensive than in simple_matmult, so with a high iteration count and a small problem size we lose the benefit of the optimization, because each kernel is so small.

 


Originally posted by: Remotion
Here are results on my new Radeon 4870:



Great card

Originally posted by: Remotion
I think this is because every 10000 iteration very small kernel 64*64 will be called from CPU and this is slow.



Yep, those streams are too small for GPUs.

Originally posted by: Remotion
Width   Height  Iterations      Time            Gflops
1024    1024    100             1.401922        142.661312



But 1024 seems to be too small for the 4870 as well. Just to find the peak, could you run a single iteration with 8192x8192?

Thanks in advance.

Here is a test with 4096x4096 using optimized_matmult:

Width   Height  Iterations      Time            Gflops         
4096    4096    100             50.357503       254.182580     

I cannot run 8192x8192 or even 7168x7168; only 6144x6144 works for me.

Width   Height  Iterations      Time            Gflops         
6144    6144    1               2.919084        147.991647

This last test was pretty long and it appeared that my system was hanging, but here is the result:
Width   Height  Iterations      Time            Gflops          
6144    6144    100             168.628834      256.183946   

So it seems that 256 GFLOPS is the peak.


Originally posted by: Remotion
I cannot run 8192x8192 or even 7168x7168; only 6144x6144 works for me.

I suggested 8192x8192 due to the texture size limit and forgot about installed memory...
7168x7168x4x3 = 588MB
6144x6144x4x3 = 432MB

Originally posted by: Remotion
So it seems that 256 Gflops is the peak.


Much higher than any CPU, but still far below the 1.2 TFLOPS maximum. I am trying to find what is limiting it; maybe texture fetches? In that case a more_optimized_matmult would solve it...
Looking at the disassembly, it doesn't look like it's ALU limited.
BTW, looking at the disassembly I found what seems to be a poor job by the optimizer: it generates a good number of MULs and ADDs. Replacing:
accumulator1 += A11.xxxx * B11.xyzw + A11.yyyy * B22.xyzw + A11.zzzz * B33.xyzw + A11.wwww * B44.xyzw;
with
accumulator1 = accumulator1 + A11.xxxx * B11.xyzw + A11.yyyy * B22.xyzw + A11.zzzz * B33.xyzw + A11.wwww * B44.xyzw;
was enough to replace all MULs and some ADDs with MULADDs, even with fast math enabled. I also removed i0 and modified the while condition, but the compiler still generated 3 MOVs at the end which I didn't understand. My loop:

while(index.w < loopVar0)
{
// Fetching values from A
float4 A11 = A1[index.wy];
float4 A22 = A2[index.wy];
float4 A33 = A3[index.wy];
float4 A44 = A4[index.wy];
float4 A55 = A5[index.wy];
float4 A66 = A6[index.wy];
float4 A77 = A7[index.wy];
float4 A88 = A8[index.wy];

// Fetching values from B
float4 B11 = B1[index.xw];
float4 B22 = B2[index.xw];
float4 B33 = B3[index.xw];
float4 B44 = B4[index.xw];

accumulator1 = accumulator1 + A11.xxxx * B11.xyzw + A11.yyyy * B22.xyzw + A11.zzzz * B33.xyzw + A11.wwww * B44.xyzw;
accumulator2 = accumulator2 + A22.xxxx * B11.xyzw + A22.yyyy * B22.xyzw + A22.zzzz * B33.xyzw + A22.wwww * B44.xyzw;
accumulator3 = accumulator3 + A33.xxxx * B11.xyzw + A33.yyyy * B22.xyzw + A33.zzzz * B33.xyzw + A33.wwww * B44.xyzw;
accumulator4 = accumulator4 + A44.xxxx * B11.xyzw + A44.yyyy * B22.xyzw + A44.zzzz * B33.xyzw + A44.wwww * B44.xyzw;
accumulator5 = accumulator5 + A55.xxxx * B11.xyzw + A55.yyyy * B22.xyzw + A55.zzzz * B33.xyzw + A55.wwww * B44.xyzw;
accumulator6 = accumulator6 + A66.xxxx * B11.xyzw + A66.yyyy * B22.xyzw + A66.zzzz * B33.xyzw + A66.wwww * B44.xyzw;
accumulator7 = accumulator7 + A77.xxxx * B11.xyzw + A77.yyyy * B22.xyzw + A77.zzzz * B33.xyzw + A77.wwww * B44.xyzw;
accumulator8 = accumulator8 + A88.xxxx * B11.xyzw + A88.yyyy * B22.xyzw + A88.zzzz * B33.xyzw + A88.wwww * B44.xyzw;

index.w += 1.0f;
// Reducing iterator
//i0 = i0 - 1.0f;
}


eduardoschardong,

I tried your suggestion above ("accumulator1 +=" to "accumulator1 = accumulator1 +") but was not able to see the MULs and ADDs change to MULADDs. Would you let me know how you tested it?


An older version of the compiler... kind of.
I mixed environment variables from the alpha and the beta and got those results; with the current compiler and the right variables I didn't see a single MULADD.

Try using the alpha; if that doesn't work, try mixing the versions like I did.