FYI: I tried OpenCL. It looks very promising, but some integer operations (XOR and multiplication) were very slow. Brook+ gave much better performance.

I suspect that OpenCL doesn't use the MULT and XOR instructions directly, but falls back to software implementations instead.

Could you please paste both the Brook+ kernel and the OpenCL kernel, and give the input and output data sizes?