
david_aiken
Journeyman III

wavefronts and cpus

Tags: cpu, wavefront, sse, avx

hi all..

I'm porting a prefix scan algorithm to run on the CPU. It doesn't work as-is because the CPU has an effective wavefront size of 1, and the wavefront size is quite critical to performance on the GPU. Is it possible to write similar OpenCL code for both devices yet ensure optimal performance on both? I don't know enough about AVX/SSE at this point to know whether this question even makes sense :>(.
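
To give an idea of the pattern in question, here is a simplified sketch (illustrative only, not my actual kernel): a wavefront-local Hillis-Steele scan that omits barriers because the work-items of one wavefront run in lockstep, which is exactly the assumption that fails when the wavefront size is 1.

// Illustrative sketch: intra-wavefront inclusive scan. Assumes the
// work-group size equals WAVE_SIZE and that all WAVE_SIZE work-items
// execute in lockstep, so no barriers are needed.
#define WAVE_SIZE 64   // wavefront size on the GPU; effectively 1 on a CPU

__kernel void wave_scan(__global int *data, __local volatile int *tmp)
{
    int lid = get_local_id(0);
    tmp[lid] = data[get_global_id(0)];

    for (int offset = 1; offset < WAVE_SIZE; offset <<= 1) {
        if (lid >= offset)
            tmp[lid] += tmp[lid - offset];
        // no barrier(CLK_LOCAL_MEM_FENCE) here: correctness relies on
        // lockstep execution within a single wavefront
    }

    data[get_global_id(0)] = tmp[lid];
}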

thanks!

david_aiken
Journeyman III

Digging into "Writing Optimal OpenCL Code with the Intel OpenCL SDK" (apologies all around.. I'm sure there is an AMD equivalent), the emphasis seems to be on writing code that transitions easily onto the 128-bit SSE registers. There doesn't appear to be an equivalent of the GPU wavefront, so I now have separate OpenCL code for the CPU and the GPU.
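
As I read it, the advice boils down to things like using the built-in vector types so that each work-item fills a 128-bit register. A trivial sketch of the idea (my own example, not taken from the guide):

// SAXPY on float4: one vector operation per work-item maps naturally
// onto a 128-bit SSE register on the CPU.
__kernel void saxpy4(__global const float4 *x,
                     __global float4 *y,
                     const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}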


It is true that wavefronts do not exist on CPUs, so you could say the effective size is 1. Still, you should not bother writing two versions of the code, one for the CPU and one for the GPU.

The runtime will redistribute your work-items in a manner you won't even recognize. My cellular automaton launched some 64k threads, yet it compiled down to a 24-thread program on a dual-core Turion. The numbers have little in common, so you really should not try to tune thread counts for the CPU. All you should care about is having at least as many work-groups as cores (one compute unit of a CPU is one core).
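
A minimal host-side sketch of what I mean, assuming the device, kernel and queue are already set up (the factor of 4 is just a guess to give the scheduler some slack):

cl_uint cus = 1;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(cus), &cus, NULL);

size_t local  = 64;                       /* whatever suits the kernel  */
size_t global = (size_t)cus * 4 * local;  /* >= one work-group per core */

clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global, &local, 0, NULL, NULL);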

Considering data, GPU code is quite good for CPUs, since most CPUs have some kind of vector execution unit (take a closer look at Bulldozer or Sandy Bridge, for example). GPU kernels suit CPUs pretty well. (You can try to optimize further, but it's not worth the time.)

I had a physics simulation which I implemented in CUDA, in OpenCL (both multi-GPU) and in pure MPI as well (same algorithm, but optimized for CPUs), and the OpenCL kernel written for GPUs but compiled for the CPU ran more than 2x faster than the CPU-optimized, pure C++ version on an Intel Xeon. (I used the Stream SDK to compile for the CPU.)

These are my experiences.


Meteorhead is correct. Write for the GPU; the CPU should work pretty well with that code.

ok - thanks. I was assuming in the GPU-specific code that the number of warps in a group would never be greater than the size of a warp, which breaks trivially on a CPU, but I've worked around it.
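
For anyone hitting the same thing, one portable way to handle this is to query the size per kernel at runtime instead of hard-coding it (a sketch, assuming OpenCL 1.1, with the device and kernel already created):

size_t wave = 1;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(wave), &wave, NULL);
/* Reports the wavefront/warp size on GPUs and 1 on current CPU
   implementations. It can then be fed back into the kernel, e.g. by
   rebuilding with a "-DWAVE_SIZE=..." build option. */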

Meteorhead: Intel's benchmarks from "Optimizing OpenCL on CPUs" show that hand-tailored OpenCL code runs just a bit slower than hand-tailored SSE/MT code and 25x faster than naive C. Perhaps your C++ simulation was not using SSE originally?


It might not have. I don't know what mpic++ uses by default, but given that it's meant for HPC, it should produce at least SSE2 code, just like the AMD OpenCL compiler.
