Benchmarking OpenCL vs. CAL [with Hazeman's C++ bindings]

Discussion created by blelump on May 8, 2010
Latest reply on May 9, 2010 by LeeHowes


First of all, I really appreciate the tremendous amount of work Hazeman has put into the C++ CAL bindings.

For those who don't know what I'm talking about:

As Hazeman mentioned in an earlier topic, porting CAL++ code to OpenCL is supposed to be really straightforward. So I started with the Peekflops program, which gives good results on my 4850 card (about 960 Gflops in single precision). However, after porting it to OpenCL, performance drops dramatically: the best I have achieved is about 150 Gflops in single precision, which is more than 6 times slower than the CAL version. The kernel code looks quite similar:


__kernel void benchmark1(
      __global float4 *result) {

  float4 a, b;
  a = (float4)(4.2f);  // float literal; a bare 4.2 is a double constant
  b = (float4)(4.2f);

  for(uint i=0;i
    for(uint k=0;k<(NR_MAD_INST/2);++k) {
      a = mad(a,a,a);
      b = mad(b,b,b);
      a = mad(a,a,a);
    }

  result[get_global_id(0)] = a+b;
}
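
For anyone who wants to check the numbers, here is a rough sketch of how a Gflops figure can be derived for a kernel like this one. It is not the code Peekflops actually uses; the function and parameter names are made up for illustration. The only assumptions are that each mad() on a float4 counts as 8 FLOPs (a multiply and an add in each of 4 lanes) and that the theoretical peak is one MAD per stream processor per clock. With the commonly quoted 800 stream processors at 625 MHz for the HD 4850, that peak comes out at 1000 Gflops, so ~960 Gflops is about 96% of peak and ~150 Gflops is about 15%.

```c
#include <assert.h>

/* Hypothetical helper: FLOPs executed by a single work-item.
   Assumes every mad() operates on a float4: 4 lanes x (1 mul + 1 add) = 8 FLOPs. */
static double flops_per_work_item(double outer_iters,
                                  double inner_iters,
                                  double mads_per_inner_iter)
{
    const double flops_per_float4_mad = 4.0 * 2.0;
    return outer_iters * inner_iters * mads_per_inner_iter * flops_per_float4_mad;
}

/* Hypothetical helper: achieved Gflops for a launch of `work_items`
   work-items that took `seconds` of kernel time. */
static double gflops(double work_items, double flops_per_item, double seconds)
{
    assert(seconds > 0.0);
    return work_items * flops_per_item / seconds / 1e9;
}

/* Hypothetical helper: theoretical single-precision peak in Gflops,
   assuming one MAD (2 FLOPs) per stream processor per clock cycle. */
static double peak_gflops(double stream_processors, double clock_ghz)
{
    return stream_processors * clock_ghz * 2.0;
}

/* Hypothetical helper: fraction of the theoretical peak that was reached. */
static double fraction_of_peak(double achieved_gflops, double peak)
{
    assert(peak > 0.0);
    return achieved_gflops / peak;
}
```

To use it, feed in the loop counts and the number of mad() calls per inner iteration from the kernel, the global work size, and the kernel time measured on the host; fraction_of_peak(gflops(...), peak_gflops(800, 0.625)) then tells you how close a run gets to the card's limit.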

Has anyone ever tried such a benchmark with OpenCL? I have also checked another one, and there too the OpenCL implementation seems to be much slower. Why is that?