hi all..
I'm porting a prefix scan algorithm to run on the CPU. It doesn't work as-is because the CPU has an effective wavefront size of 1, and the wavefront is quite critical for performance on the GPU. Is it possible to write similar OpenCL code for both devices and still get optimal performance on both? I don't know enough about AVX/SSE at this point to know whether this question even makes sense :>(.
thanks!
Digging into "Writing Optimal OpenCL Code with the Intel OpenCL SDK" (apologies all around.. I'm sure there is an AMD equivalent), the emphasis seems to be on writing code that maps cleanly onto the 128-bit SSE registers. There doesn't appear to be an equivalent to the GPU wavefront, so I now have separate OpenCL code for CPU and GPU.
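For what it's worth, the style of code that guide seems to be steering toward looks roughly like this. This is only a sketch (the kernel name and buffer layout are made up), but the idea is that working in float4 lets the CPU compiler map each work-item's arithmetic straight onto a 128-bit SSE register, while a GPU compiler still vectorizes it fine:

```c
/* OpenCL C kernel (sketch, untested). The name `scale4` and the
   single-buffer layout are just for illustration. */
__kernel void scale4(__global const float4 *in,
                     __global float4 *out,
                     const float factor)
{
    size_t i = get_global_id(0);
    /* One float4 multiply == one 128-bit SSE op on the CPU. */
    out[i] = in[i] * factor;
}
```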
It is true that wavefronts do not exist on CPUs, so you could say the effective size is 1. Still, you shouldn't bother writing two versions of the code, one for CPU and one for GPU.
The compiler will redistribute your thread launching in a manner you won't even recognize. My cellular automaton used some 64k threads, and it compiled into a 24-thread program on a dual-core Turion. The numbers have little in common, so you really shouldn't try to tune thread counts for the CPU. All you need to care about is having at least as many work-groups as cores. (One compute unit of a CPU is one core.)
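On the host side that rule of thumb is easy to apply: query the compute-unit count and size the NDRange from it. A rough, untested sketch (assumes `device`, `queue` and `kernel` are already set up; the work-group size of 64 is arbitrary):

```c
/* Host-side sketch: at least one work-group per CPU core. */
cl_uint cores = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(cores), &cores, NULL);

size_t local  = 64;                    /* arbitrary work-group size */
size_t global = (size_t)cores * local; /* >= one work-group per core */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global, &local, 0, NULL, NULL);
```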
As for the data side, GPU code is quite good for CPUs too, since most CPUs have some kind of vector execution unit (take a closer look at Bulldozer or Sandy Bridge, for example). GPU kernels suit CPUs pretty well. (You can try to optimize further, but it's not worth the time.)
I had a physics simulation which I implemented in CUDA, OpenCL (multi-GPU in both) and pure MPI (same algorithm, but optimized for CPUs), and the OpenCL kernel written for GPUs but compiled for the CPU ran more than 2x as fast as the CPU-optimized, pure C++ version on an Intel Xeon. (I used StreamSDK to compile for the CPU.)
These are my experiences.
ok - thanks. I was assuming in the GPU-specific code that the number of warps in a group would never exceed the size of a warp, which breaks trivially on a CPU, but I've worked around it.
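In case it helps anyone else hitting the same thing: the barrier-style (Hillis-Steele) inclusive scan makes no assumptions about warp or group sizes at all. Here's a plain-C model of it (my own sketch, not the kernel itself), where the inner loop stands in for the work-items and the buffer swap stands in for `barrier(CLK_LOCAL_MEM_FENCE)`:

```c
#include <stdlib.h>
#include <string.h>

/* Plain-C model of a Hillis-Steele inclusive scan.
   The loop over i plays the role of the work-items; swapping src/dst
   between passes plays the role of the barrier. No warp-size
   assumption anywhere. Function name is made up. */
void inclusive_scan(int *data, size_t n)
{
    int *src = malloc(n * sizeof *src);
    int *dst = malloc(n * sizeof *dst);
    memcpy(src, data, n * sizeof *src);

    for (size_t offset = 1; offset < n; offset *= 2) {
        for (size_t i = 0; i < n; i++)            /* "work-items" */
            dst[i] = (i >= offset) ? src[i] + src[i - offset]
                                   : src[i];
        int *tmp = src; src = dst; dst = tmp;     /* "barrier" */
    }
    memcpy(data, src, n * sizeof *src);
    free(src);
    free(dst);
}
```

The same doubling loop works for any group size, which is why it ports to the CPU without the warp-count assumption.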
Meteorhead: Intel's benchmarks from "Optimizing OpenCL on CPUs" show that hand-tailored OpenCL code runs just a bit slower than hand-tailored SSE/MT code and 25x faster than naive C. Perhaps your C++ simulation was not using SSE originally?
It might not have. I don't know what mpic++ uses by default, but given that it's meant for HPC, it should at least produce SSE2 code, just like the AMD OpenCL compiler.