Brook+ 1.4 performance seems to be very limited, CAL is much faster

hendrix on Mar 19, 2009
ryta1203 on May 26, 2009

I simply tried the matmul samples from the ATI Stream SDK 1.4 and found that the Brook+ sample "optimized matmult" achieves only 0.146 GFLOPS with 256x256 matrices. Increasing the size to 1024x1024 raised the result to 12.7 GFLOPS.

The CAL sample "simple matmult" runs at 55.4 GFLOPS (size 256x256), which is more than 4 times faster than the best Brook+ result.
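To put these figures in perspective, here is a rough back-of-envelope sketch that converts each reported GFLOPS value into an implied run time. It assumes the samples count 2*N^3 floating-point operations per N x N multiply and that the figure covers a single multiply; neither is confirmed from the sample sources, and the helper names `matmul_flops` and `implied_seconds` are made up for illustration.

```python
# Rough sketch: convert a reported GFLOPS figure into an implied run time.
# Assumption (not verified against the SDK samples): an N x N matrix
# multiply is counted as 2*N^3 floating-point operations, i.e. one
# multiply and one add per inner-loop step, for a single multiply.

def matmul_flops(n: int) -> int:
    """Operation count for one N x N matrix multiply under the 2*N^3 model."""
    return 2 * n ** 3

def implied_seconds(n: int, gflops: float) -> float:
    """Time one multiply would take at the reported GFLOPS rate."""
    return matmul_flops(n) / (gflops * 1e9)

# (sample name, matrix size, reported GFLOPS) as measured above
results = [
    ("Brook+ optimized matmult", 256, 0.146),
    ("Brook+ optimized matmult", 1024, 12.7),
    ("CAL simple matmult", 256, 55.4),
]

for name, n, gflops in results:
    ms = implied_seconds(n, gflops) * 1e3
    print(f"{name:26s} N={n:4d}  {gflops:7.3f} GFLOPS  ~{ms:8.2f} ms")
```

If the 2*N^3 assumption holds, the 256x256 Brook+ run works out to roughly 230 ms and the 1024x1024 run to roughly 170 ms, while the CAL run takes well under 1 ms. Similar run times for very different problem sizes would hint at a fixed per-call cost rather than slow arithmetic, but that is only a guess from these numbers.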

Can anyone explain why Brook+ suffers from such high overhead, or what else limits Brook+ efficiency?

I am using a Radeon HD 2600 XT with 120 stream processors and 192 GFLOPS peak performance (Windows XP SP3, 32-bit). As a result, the CAL sample "compute_matmul" did not start, because this GPU doesn't support compute kernels.