5 Replies Latest reply on Jan 25, 2011 9:07 PM by Meteorhead

    wavefronts and cpus

      cpu wavefront sse avx

      hi all..

I'm porting a prefix scan algorithm to run on the CPU. It doesn't work well because the CPU has an effective wavefront size of 1, and the wavefront-synchronous behaviour is quite critical to performance on the GPU. Is it possible to write similar OpenCL code for both devices yet ensure optimal performance on both? I don't know enough about AVX/SSE at this point to know whether this question makes sense :>(.
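For contrast, here is a minimal sketch (plain C, my own example) of the serial scan a CPU is perfectly happy with; there is no wavefront to exploit, so the wavefront-synchronous tree phases of a GPU scan buy nothing here:

```c
#include <stddef.h>

/* Serial exclusive prefix scan: out[i] = in[0] + ... + in[i-1].
 * With an effective wavefront size of 1, the straightforward
 * sequential loop is already close to optimal on a CPU. */
static void scan_exclusive(const int *in, int *out, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = sum;
        sum += in[i];
    }
}
```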


        • wavefronts and cpus

          Digging into "Writing Optimal OpenCL Code with the Intel OpenCL SDK" (apologies all around.. i'm sure there is an AMD equivalent), the emphasis seems to be on writing code that maps easily onto the 128-bit SSE registers. There doesn't appear to be an equivalent of the GPU wavefront, so i now have separate OpenCL code for CPU and GPU.
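The kind of loop the Intel guide is after looks roughly like this (a plain-C sketch of my own, not taken from the guide): contiguous, unit-stride float arrays with no aliasing, so the compiler can pack four floats at a time into one 128-bit SSE register.

```c
#include <stddef.h>

/* SAXPY-style loop: unit stride, no aliasing (restrict), no
 * branches in the body -- the shape that auto-vectorizes to SSE,
 * four floats per 128-bit register. */
static void saxpy(float a, const float *restrict x,
                  float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```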

            • wavefronts and cpus

              It is true that wavefronts do not exist on CPUs, so you could say the effective size is 1. But you shouldn't bother writing two versions of the code, one for CPU and one for GPU.

              The compiler will redistribute your thread launching in a manner that even you won't recognize. My cellular automaton used some 64k threads, and it compiled into a 24-thread program on a dual-core Turion. The numbers have little in common, so you really shouldn't try to tune thread counts for the CPU. All you should care about is having at least as many workgroups as cores. (One compute unit of a CPU is one core.)
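On the host side, that "at least as many workgroups as cores" rule might be sized like this (a sketch of mine; real OpenCL code would query CL_DEVICE_MAX_COMPUTE_UNITS via clGetDeviceInfo, but I use POSIX sysconf here so the sketch runs without an OpenCL runtime):

```c
#include <stddef.h>
#include <unistd.h>

/* Round the requested number of workgroups up to at least the core
 * count. (A stand-in for querying CL_DEVICE_MAX_COMPUTE_UNITS.) */
static size_t pick_num_groups(size_t wanted)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores < 1)
        cores = 1;
    return wanted > (size_t)cores ? wanted : (size_t)cores;
}
```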

              Considering data, GPU code is quite good for CPUs, since most CPUs have some kind of vector execution unit (take a closer look at Bulldozer or Sandy Bridge, for example). GPU kernels suit CPUs pretty well. (You can try to optimize further, but it's not worth the time.)

              I had a physics simulation which I implemented in CUDA, in OpenCL (both multi-GPU) and in pure MPI as well (same algorithm, but optimized for CPUs), and the OpenCL kernel written for GPUs but compiled for the CPU ran more than 2x faster than the CPU-optimized, pure C++ version on an Intel Xeon. (I used the Stream SDK to compile for the CPU.)

              These are my experiences.

            • wavefronts and cpus
              Meteorhead is correct. Write for the GPU, the CPU should work pretty well with that code.
                • wavefronts and cpus

                  ok - thanks. I was assuming in the GPU-specific code that the number of wavefronts in a group would never be greater than the wavefront size, which breaks trivially on a CPU where the wavefront size is 1, but i've worked around it.
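The arithmetic behind that broken assumption, as a tiny sketch of my own: with a group size of 256 and a wavefront size of 64 there are 4 wavefronts per group (well under 64), but with a wavefront size of 1 there are 256, far more than the "wavefront size".

```c
#include <stddef.h>

/* Number of wavefronts needed to cover one workgroup, rounded up.
 * On a GPU (wf = 64, group = 256) this stays below wf; on a CPU
 * (wf = 1) it equals the whole group size. */
static size_t wavefronts_per_group(size_t group_size, size_t wf_size)
{
    return (group_size + wf_size - 1) / wf_size;
}
```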

                  Meteorhead: Intel's benchmarks from "Optimizing OpenCL on CPUs" show that hand-tailored OpenCL code runs just a bit slower than hand-tailored SSE/MT code and 25x faster than naive C. Perhaps your C++ simulation was not using SSE originally?