Kazushige Goto's BLAS (GotoBLAS) is a set of hand-crafted, ISA-specific math kernels. It is likely to be close to the maximum performance achievable in practice. It reaches about 67% utilization at 2048 on your chart.
Demo and sample code is generally not highly optimized. It is written to be easy to understand, so its performance is lower.
Tuning and optimization are required for high performance. This is not unique to GPUs. High-performance CPU BLAS implementations such as GotoBLAS (hand-crafted for the ISA) and ATLAS (auto-tuning) represent large investments in tuning and optimization to fit the hardware and compiler. It's not simple. Producing high-performance kernels is not a trivial problem that translators such as GCC or the OpenCL compilers can solve on their own. Optimizations at a higher level are required.
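To make "optimizations at a higher level" concrete, here is a minimal sketch in C of one such restructuring: cache blocking (tiling) of a matrix multiply. This is not GotoBLAS or ATLAS code. The function names `matmul_naive` and `matmul_blocked` and the `block` parameter are made up for illustration, and the best block size is an assumption you would have to tune for each machine.

```c
#include <assert.h>
#include <stddef.h>

/* Textbook triple loop: C = A * B for n x n row-major matrices.
   Easy to read, but the inner loop strides through B column-wise. */
void matmul_naive(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Same arithmetic, restructured into block x block tiles so that the
   working set of each tile can stay in cache. "block" is a hypothetical
   tuning parameter; a good value depends on the hardware. */
void matmul_blocked(size_t n, size_t block,
                    const float *a, const float *b, float *c)
{
    assert(block > 0);

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += block) {
        size_t i_end = (ii + block < n) ? ii + block : n;
        for (size_t kk = 0; kk < n; kk += block) {
            size_t k_end = (kk + block < n) ? kk + block : n;
            for (size_t jj = 0; jj < n; jj += block) {
                size_t j_end = (jj + block < n) ? jj + block : n;

                /* Multiply one tile of A by one tile of B and
                   accumulate into the matching tile of C. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product. The compiler will not turn the first into the second for you, and even the blocked version is typically still well short of a tuned BLAS, which adds further layers such as register blocking, data packing, and ISA-specific inner kernels.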