Archives Discussions

blelump · ‎05-08-2010

Hi,

First of all, I really appreciate such tremendous amount of work with C++ CAL bindings, which Hazeman has done.

For those, who don't know what I'm talking about: https://sourceforge.net/projects/calpp/

As Hazeman mentioned in this topic: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=130837 , porting CAL++ to OpenCL is really straightforward. If so, my adventure started with Peekflops program, which points good results on my 4850 card [actually it gives like ~960Gflops for single precision FLOP]. However while porting it to OpeCL, the performance decreases dramatically and it's like 10 times worse than CAL one [max Gflops I achieved is ~150 for single precision FLOP]. Kernel code looks quite similar:

__kernel void benchmark1(
      __global float4 *result) {

float4 a,b;
a = (float4)(4.2);
b = (float4)(4.2);

for(uint i=0;i
    for(uint k=0;k<(NR_MAD_INST/2);++k) {
    a = mad(a,a,a);
    b = mad(b,b,b);
    }
    a = mad(a,a,a);
}

result[get_global_id(0)] = a+b;
}

Has anyone ever tried such benchmark with OpenCL? I have also checked another one and it seems that OpenCL implementation is really much slower. Why is that?

Lev · ‎05-08-2010

Is it possible that opencl uses smaller thread group size? I may suspect that in one case inner loop is unrolled.

n0thing · ‎05-08-2010

Unroll the loop! I changed your kernel to following and the ISA is much better!

__kernel void benchmark1(float4 a, float4 b, __global float4 *result) { a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); a = mad(a,a,a); b = mad(b,b,b); result[get_global_id(0)] = a+b; }

blelump · ‎05-08-2010

Well, indeed that is the point, but it doesn't solve the problem. It is due to benchmark execution, which is now too fast. Furthermore the core of kernel more or less looks like below:

...
FMAD128(a, b); //it is a macro, which defines 128 mul and add operations
...

Up to 200 such macros specified in kernel results in execution of kernel takes just a few miliseconds. Measurement time is too small to gain correct result of benchmark.

in case of loops within SIMD
On the other hand, as I understand SIMD idiom, it says that single instruction is executed by the work-items at the same time. So in case of 'loops' - how does SIMD apply to them? How to define single instruction within loop in context of SIMD? I mean, in this case the whole loop is treated as a single instruction or single instruction is just a part of loop iteration, which executes untill loop ends. Second choice seems to be more appriopriate, but then loops should not cause performance drop. But they do, why is that? I don't get it 😕
Besides that, CAL kernel execution feels fine with loops. With Hazeman's C++ bindings, it looks almost identical to the kernel put in my first post of this topic.

LeeHowes · ‎05-09-2010

So I think Hazeman's explanation of the C++ meta-cal magic suggests that his CAL++ stuff is actually generating CAL code rather than taking a simple CAL kernel and executing it, as the OpenCL version would be doing, am I right? That would explain the performance difference. The OpenCL compiler is still immature so features like loop unrolling may not be as good as you'd like them to be. In many ways OpenCL is a low level language, though, if you really want good performance you'll have to unroll yourself - even though CUDA has had a few years of development you still have to do it there to get good performance, so it may never reach the level you're after in the CL compilers.

Loops will always cause a performance drop. On all architectures there is a substantial drop from control flow. Partly that's because you have to execute control flow instructions which creates overhead and wastes your ALU cycles, partly it might be due to a lack of branch prediction so you end up doing false work. On Evergreen the main problem is largely a branch prediction one, in that Evergreen has no branch predictor, instead relying on a high degree of multithreading to obtain high performance.

All control flow code is executed by a separate execution unit called, on marketing slides, the "ultra-threaded dispatch processor". If you are executing nothing but control flow you will sit on that unit and eventually be control-flow bound. This is largely the same as continuously running control flow code on a CPU and never having any instructions to place in ALU execution slots.

When the dispatcher has a chance, that is if the next instruction it sees is a block of ALU code it will dispatch that to one of the SIMDs. That SIMD will then execute the entire block of ALU instructions, known as a clause, and return to the dispatcher when it is complete.

The higher your ratio of ALU instructions to control-flow instructions, the larger your clauses will be and the higher the occupancy of your SIMD units - ie ALUs. This architecture is very efficient in many ways, but if you write code that has very short clauses (add, test, branch, add, test, branch, for example) you will be constantly switching between the dispatcher and the ALUs and start to suffer from latency that multi-threading cannot cover. The architecture is efficient because when you are using multi-threading correctly in a well balanced implementation you can be executing control-flow code from any work group (or, indeed, any kernel) at the same time as your ALUs are working on completely different units of work. That's not really any different from a CPU hitting pipeline stalls thanks to failed branch prediction, but it's far more visible in your ISA code - which is advantageous because you can use the kernel analyzer to view your ISA code and at a glance see roughly how efficiently it will execute.

I don't know if these will help, and you may have seen the architectural diagrams before, but these are some slides I presented a few times just before Easter which describe the architecture a little:

http://developer.amd.com/gpu_assets/Heterogeneous_Computing_OpenCL_and_the_ATI_Radeon_HD_5870_Archit...

Again, the runtime obviously proceeds when it would be better to return an error (however, I prefer to have a "permissive" runtime that allows maximum performance rather than a "safe" runtime that imposes limits).

Remember as well that OpenCL dispatches are asycnchronous, so it's not possible to have an error occur at call time. If you call clGetEventInfo with the CL_EVENT_COMMAND_EXECUTION_STATUS flag you can obtain the execution status of a command and that should tell you whether it was successful or not.

If that isn't correctly reporting, then it's possible there's a bug. You can also register call backs to get error information through an alternative method.

jcpalmer · ‎05-08-2010

FYI, I show that nVidia has an extension, cl_nv_pragma_unroll. I have not investigated it, but it sounds like something that could be useful for the Cal based implementation too. Everyone wants to write elegant source code, and not resort to lengths shown here. Switching on / off is all that is needed to investigate whether the technique is helpful.

hazeman · ‎05-08-2010

CAL++ has powerful preprocessing language ( C++ itself ). Anytime that for/while/if occurs in CAL++ kernel ( not il_while, il_if ) it should be by "hand" executed during translation to OpenCL. Otherwise OpenCL will be much, much slower. Unfortunately ATI OpenCL doesn't do loop unroling and stuff like that.

As for much more advanced benchmark there is pyrit-calpp core ( http://code.google.com/p/pyrit/ ) . On 5870 OpenCL version does 41K passwords per sec and CAL++ based core 82K .

Also the advantage of using CAL++ is that it works on devices not supported by OpenCL ( like 3xxx cards ) - btw 3850 does ~5500 p/s.

PS. For GTX 480 result is 28K.

cjang · ‎05-08-2010

Lev: Is it possible that opencl uses smaller thread group size?

The number of threads has a big impact on performance. In my experience with matrix multiplication, it is the first order effect. On my HD 5870, highest performance is almost always with 64 threads or the largest number of threads compatible with the outer/inner blocking of the problem. My experience is that it is impossible to predict a priori the optimal number of threads for a given kernel and problem size. There are too many variables. Experimentation is the only way.

blelump: Up to 200 such macros specified in kernel results in execution of kernel takes just a few miliseconds. Measurement time is too small to gain correct result of benchmark.

I too ran into this problem and many others. The OpenCL runtime will often fail silently. The only way to be sure a kernel has run correctly is to verify the output. For benchmarking, I typically use a test pattern of some kind to detect when a kernel has not run.

When the OpenCL kernel is "too big" (has too many variables) or if there are too many threads, I find that kernels can hang. Again, the runtime obviously proceeds when it would be better to return an error (however, I prefer to have a "permissive" runtime that allows maximum performance rather than a "safe" runtime that imposes limits). In order to automate tuning, the safe regions must be determined as you can't rely on the runtime to protect you.

Archives Discussions

Benchmarking OpenCL vs. CAL [with Hazeman's C++ bindings]