The samples are not running ALU instructions alone.
When you access memory, you will lose time unless the latencies are completely hidden.
Even ALU-intensive operations like matrix multiplication hardly reach 50-60% utilization in single precision.
APP Profiler details for the matrix multiplication program (from the AMD OpenCL samples):
ALU instructions: 513
Fetch instructions: 128
Time: 0.075 ms
Instructions/sec = 256 * (513 + 128) * 1000 / 0.075 ≈ 2.19 G instructions/sec
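The estimate above can be sketched as a small helper, assuming the profiler counters are per-work-item and the time is the kernel's wall-clock time in milliseconds:

```python
def instructions_per_sec(global_work_items, alu_insts, fetch_insts, time_ms):
    """Total instructions issued across all work-items, divided by kernel time.

    Assumes alu_insts and fetch_insts are per-work-item counts, as the
    APP Profiler reports them, and time_ms is the kernel time in ms.
    """
    total_insts = global_work_items * (alu_insts + fetch_insts)
    return total_insts * 1000.0 / time_ms  # convert ms to seconds

# Figures from the profiler output above: 256 work-items,
# 513 ALU + 128 fetch instructions, 0.075 ms kernel time.
rate = instructions_per_sec(256, 513, 128, 0.075)
print(f"{rate / 1e9:.2f} G instructions/sec")  # -> 2.19
```

This makes explicit that the rate scales with the global work size, which is why a 256-item launch cannot come close to saturating the device.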
What are the possible reasons for not achieving the 50-60% of capacity that you mentioned matrix multiplication can reach?
Are you running the matrixmultiplication sample with a global size of 256? That is too small for GPU acceleration.
You should try the sample with options like -x 2048 -y 2048 -z 2048 -i 50. Even then, the APP SDK samples are not optimized to the limit. You can also try AMD's BLAS library (the GEMM routine) and report your results.
Tried with -x 2048 -y 2048 -z 2048 -i 50.
Global work size: 512 x 512
ALU instructions: 15377.77
Fetch instructions: 4094
Instructions/sec = 512 * 512 * (15377 + 4094) * 1000 / 850 ≈ 6.0 G instructions/sec
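Re-checking the second run's figure with the same formula (profiler values taken from the post above; the kernel time is assumed to be 850 ms, and the fractional ALU count is truncated as in the original calculation):

```python
# Assumed figures from the profiler run: 512x512 global work size,
# 15377 ALU + 4094 fetch instructions per work-item, 850 ms kernel time.
global_items = 512 * 512
total_insts = global_items * (15377 + 4094)
rate = total_insts * 1000.0 / 850  # convert ms to seconds
print(f"{rate / 1e9:.2f} G instructions/sec")  # -> 6.00
```

So the larger launch roughly triples the instruction rate of the 256-item run, even before any tuning.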
Will update you on results with AMD's BLAS library.
Note: Here the kernel occupancy is 25%. Will 50% kernel occupancy double my instructions/sec?
A kernel occupancy of 25% is quite bad, but you already seem to have enough work-groups. Try CodeXL to find the reason for such low occupancy. You can also try the matrixmulImage sample, which is known to give better performance.
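For context, kernel occupancy is roughly the ratio of active wavefronts to the hardware maximum, limited by whichever resource (registers, local memory, work-group size) runs out first. The sketch below is purely illustrative: the resource limits are hypothetical defaults, not values queried from any real device or from CodeXL.

```python
def estimate_occupancy(vgprs_per_wave, lds_per_group, waves_per_group,
                       vgpr_file=256, lds_bytes=32768, max_waves_per_simd=24):
    """Active wavefronts / maximum wavefronts, limited by the scarcest resource.

    All hardware limits here (register file size, LDS size, wavefront cap)
    are illustrative assumptions, not real device parameters.
    """
    waves_by_vgpr = vgpr_file // vgprs_per_wave
    if lds_per_group:
        waves_by_lds = (lds_bytes // lds_per_group) * waves_per_group
    else:
        waves_by_lds = max_waves_per_simd  # no LDS use: not the limiter
    active = min(max_waves_per_simd, waves_by_vgpr, waves_by_lds)
    return active / max_waves_per_simd

# A hypothetical kernel using 42 registers per wavefront and no LDS:
# 256 // 42 = 6 active waves, 6 / 24 = 25% occupancy.
print(estimate_occupancy(42, 0, 4))  # -> 0.25
```

Note that doubling occupancy only doubles instructions/sec if the kernel is latency-bound; higher occupancy improves latency hiding, not raw issue rate.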