I simply tried the matmul samples from the ATI Stream SDK 1.4 and found that the Brook+ sample "optimized matmult" achieves only 0.146 GFLOPS with 256x256 matrices. Increasing the size to 1024x1024 raised the result to 12.7 GFLOPS.
The CAL sample "simple matmult" runs at 55.4 GFLOPS (size 256x256), which is more than 4 times faster than the best Brook+ result.
Can anyone explain why Brook+ suffers from such high overhead, or what else limits Brook+ efficiency?
I am using a Radeon HD 2600 XT with 120 stream processors and 192 GFLOPS peak performance (Windows XP SP3, 32-bit). The CAL sample "compute_matmul" didn't start, because the GPU doesn't support compute kernels.
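For reference, here is a minimal sketch of how these numbers relate to the card's peak, assuming the usual convention that a dense n x n matrix multiply costs 2*n^3 floating-point operations (the function names and the 192 GFLOPS default are my own illustration, not from the SDK):

```python
def matmul_gflops(n, seconds):
    """GFLOPS for an n x n matmul, counting 2*n^3 flops (standard convention)."""
    return 2 * n ** 3 / seconds / 1e9

def efficiency(achieved_gflops, peak_gflops=192.0):
    """Fraction of theoretical peak actually achieved (HD 2600 XT peak assumed)."""
    return achieved_gflops / peak_gflops

# The results above, as a fraction of the 192 GFLOPS peak:
print(f"Brook+ 1024x1024: {efficiency(12.7):.1%}")  # ~6.6% of peak
print(f"CAL     256x256:  {efficiency(55.4):.1%}")  # ~28.9% of peak
```

So even the fastest CAL result reaches well under a third of the card's theoretical peak, and the Brook+ result is far below that again.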