Hi
I am using a laptop with a Radeon HD 6290. The rated peak performance of the GPU is 44 GFLOPS (checked on Wikipedia). When I run the OpenCL sample examples (provided by AMD) on the GPU, I get a throughput of only 5-6 G instructions/sec. Why is there such a large difference between peak and actual capacity?
Please find attached file for more detail.
The second-to-last column is instructions/sec (calculated as total work items * ALU instructions * 1000 / time (ms)).
The last column is instructions/sec normalized to 100% ALU busy.
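As a sanity check, those two columns can be reproduced from the profiler counters. Here is a minimal Python sketch; the function names and the example numbers are mine for illustration, not part of the APP Profiler output:

```python
def instructions_per_sec(work_items, alu_instructions, time_ms):
    """Total work items * ALU instructions * 1000 / time (ms)."""
    return work_items * alu_instructions * 1000 / time_ms

def normalized_to_full_alu(ips, alu_busy_percent):
    """Scale a measured rate to what it would be at 100% ALU busy."""
    return ips * 100.0 / alu_busy_percent

# Hypothetical counters: 1024 work items, 500 ALU instructions, 0.5 ms
rate = instructions_per_sec(1024, 500, 0.5)
print(rate / 1e9, "G instructions/sec")
print(normalized_to_full_alu(rate, 50.0) / 1e9, "G instructions/sec at 100% ALU busy")
```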
The samples are not running ALU instructions alone.
When you access memory, you lose time unless the latencies are hidden completely.
Even ALU-intensive operations like matrix multiplication hardly reach 50 to 60% utilization in single precision.
Hi Himanshu,
APP Profiler detail for the matrix multiplication program (from the AMD OpenCL samples):
GlobalWorkSize: 256
ALU Instructions: 513
Time: 0.075 ms
Fetch Instructions: 128
Instructions/sec = 256*(513+128)*1000/0.075 ≈ 2.19 G instructions/sec
What are the possible reasons for not achieving 50 to 60% of capacity (as you mentioned, matrix multiplication can reach 50-60%)?
Are you running the matrixmultiplication sample with a global size of 256? That is too small for GPU acceleration.
You should try the sample with options like -x 2048 -y 2048 -z 2048 -i 50. Even then, the APP SDK samples have not been optimized to the limit. You can also try AMD's BLAS library (the GEMM routine) and report your results.
Tried with -x 2048 -y 2048 -z 2048 -i 50.
Global Work Size: 512 x 512
ALU Instructions: 15377.77
Fetch Instructions: 4094
Time: 850 ms
Instructions/sec: 512*512*(15377+4094)*1000/850 ≈ 6 G instructions/sec
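Putting the numbers from both runs through the same formula and comparing against the 44 GFLOPS peak quoted at the top of the thread (treating one instruction as roughly one FLOP, which is only a crude approximation):

```python
PEAK = 44e9  # rated peak from Wikipedia, as quoted earlier in the thread

def ips(work_items, total_instructions, time_ms):
    # Total work items * instructions per work item * 1000 / time (ms)
    return work_items * total_instructions * 1000 / time_ms

small = ips(256, 513 + 128, 0.075)         # first profiler run
large = ips(512 * 512, 15377 + 4094, 850)  # run with -x 2048 -y 2048 -z 2048 -i 50

print(f"small run: {small / 1e9:.2f} G inst/sec, {100 * small / PEAK:.0f}% of peak")
print(f"large run: {large / 1e9:.2f} G inst/sec, {100 * large / PEAK:.0f}% of peak")
```

So even the larger run reaches only about 14% of the rated peak, which lines up with the low kernel occupancy figure.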
Will update you on results with AMD's BLAS library.
Note: the kernel occupancy here is 25%. Will 50% kernel occupancy double my instructions/sec?
A kernel occupancy of 25% is quite bad, but you seem to have enough workgroups already. Try CodeXL to find out the reason for such low kernel occupancy. You can also try the matrixmulImage sample, which is known to give better performance.