Hi
I am using a laptop with a Radeon HD 6290. The rated peak performance of the GPU is 44 GFLOPS (checked on Wikipedia). When I run the OpenCL sample examples (provided by AMD) on the GPU, I get a throughput of only 5-6 G instructions/sec. Why is there such a large difference between peak and actual capacity?
Please find attached file for more detail.
The second-to-last column is instructions/sec (calculated as total work items * ALU instructions * 1000 / time (ms)).
The last column is instructions/sec normalized to 100% ALU busy.
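As a sanity check, those two columns can be reproduced from the profiler counters. Here is a minimal Python sketch; the function names and the example numbers are mine for illustration, not part of the APP Profiler output:

```python
def instructions_per_sec(work_items, alu_instructions, time_ms):
    """Total work items * ALU instructions * 1000 / time (ms)."""
    return work_items * alu_instructions * 1000 / time_ms

def normalized_to_full_alu(ips, alu_busy_percent):
    """Scale a measured rate to what it would be at 100% ALU busy."""
    return ips * 100.0 / alu_busy_percent

# Hypothetical counters: 1024 work items, 500 ALU instructions, 0.5 ms
rate = instructions_per_sec(1024, 500, 0.5)
print(rate / 1e9, "G instructions/sec")
print(normalized_to_full_alu(rate, 50.0) / 1e9, "G instructions/sec at 100% ALU busy")
```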
The samples are not running ALU instructions alone.
When you access memory, you lose time unless the latencies are hidden completely.
Even ALU-intensive operations like matrix multiplication hardly reach 50 to 60% utilization in single precision.
Hi Himanshu,
APP Profiler detail for the matrix multiplication program (from the AMD OpenCL samples):
GlobalWorkSize: 256
ALU Instructions: 513
Time: 0.075 ms
Fetch Instructions: 128
Instructions/sec = 256*(513+128)*1000/0.075 ≈ 2.19 G instructions/sec
What are the possible reasons for not achieving 50 to 60% of capacity (as you mentioned, matrix multiplication can reach 50-60%)?
Are you running the matrixmultiplication sample with a global size of 256? That is too small for GPU acceleration.
You should try the sample with options like -x 2048 -y 2048 -z 2048 -i 50. Even then, the APP SDK samples have not been optimized to the limit. You can also try AMD's BLAS library (the GEMM routine) and report your results.
Tried with -x 2048 -y 2048 -z 2048 -i 50.
Global Work Size: 512 x 512
ALU Instructions: 15377.77
Fetch Instructions: 4094
Time: 850 ms
Instructions/sec: 512*512*(15377+4094)*1000/850 ≈ 6 G instructions/sec
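Putting the numbers from both runs through the same formula and comparing against the 44 GFLOPS peak quoted at the top of the thread (treating one instruction as roughly one FLOP, which is only a crude approximation):

```python
PEAK = 44e9  # rated peak from Wikipedia, as quoted earlier in the thread

def ips(work_items, total_instructions, time_ms):
    # Total work items * instructions per work item * 1000 / time (ms)
    return work_items * total_instructions * 1000 / time_ms

small = ips(256, 513 + 128, 0.075)         # first profiler run
large = ips(512 * 512, 15377 + 4094, 850)  # run with -x 2048 -y 2048 -z 2048 -i 50

print(f"small run: {small / 1e9:.2f} G inst/sec, {100 * small / PEAK:.0f}% of peak")
print(f"large run: {large / 1e9:.2f} G inst/sec, {100 * large / PEAK:.0f}% of peak")
```

So even the larger run reaches only about 14% of the rated peak, which lines up with the low kernel occupancy figure.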
Will update you on results with AMD's BLAS library.
Note: the kernel occupancy here is 25%. Will 50% kernel occupancy double my instructions/sec?
A kernel occupancy of 25% is quite bad, but you seem to have enough workgroups already. Try CodeXL to find out the reason for such low kernel occupancy. You can also try the matrixmulImage sample, which is known to give better performance.